What is Beam?
  1. a model for defining parallel data processing pipelines
    1. batch
    2. streaming
  2. supports many SDKs, including the Java SDK
  3. can be executed by many backends, including Dataflow
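A minimal sketch of that model with the Java SDK: the runner (DirectRunner, DataflowRunner, ...) is chosen at launch time via the options, while the pipeline definition stays the same; the output path below is just a placeholder, not a real bucket.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class MinimalBeamPipeline {
      public static void main(String[] args) {
        // The runner comes from the args (e.g. --runner=DataflowRunner);
        // the pipeline definition below does not change.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply(Create.of("hello", "beam"))
         .apply(MapElements.into(TypeDescriptors.strings()).via((String s) -> s.toUpperCase()))
         .apply(TextIO.write().to("gs://your-bucket/output"));  // placeholder output path

        p.run().waitUntilFinish();
      }
    }
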
What is Dataflow?
  1. a managed service
  2. used for executing many data processing patterns
    1. Apache Beam pipelines among them
  3. processes data from
    1. many GCP data stores
      1. BigQuery
      2. Cloud Storage
    2. messaging services
      1. Pub/Sub
  4. stores output data and caches pipeline code in a Cloud Storage bucket.
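Since the notes below keep debugging Pub/Sub-to-BigQuery flows, here is a minimal Beam Java sketch of that kind of streaming pipeline; the project, subscription, table, and the single "payload" column are placeholders/assumptions, not the real setup.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.options.StreamingOptions;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class PubsubToBigQuerySketch {
      public static void main(String[] args) {
        StreamingOptions options = PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
        options.setStreaming(true);
        Pipeline p = Pipeline.create(options);

        p.apply("ReadFromPubSub", PubsubIO.readStrings()
                .fromSubscription("projects/your-project/subscriptions/your_sub"))  // placeholder
         .apply("ToTableRow", ParDo.of(new DoFn<String, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              // Hypothetical mapping: dump the raw message into a single column.
              c.output(new TableRow().set("payload", c.element()));
            }
          }))
         .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
                .to("your-project:your_dataset.your_table")  // placeholder
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        p.run();
      }
    }
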
What is a Cloud Storage bucket?
  1. gsutil mb gs://artful-shelter-271912 -- this command creates a Cloud Storage bucket.... still need to get to know it better
  2. create a project and download the Apache Beam SDK:
     mvn archetype:generate \
       -DarchetypeGroupId=org.apache.beam \
       -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
       -DgroupId=com.example \
       -DartifactId=dataflow-intro \
       -Dversion="0.1" \
       -DinteractiveMode=false \
       -Dpackage=com.example
  3. launch the pipeline on the Dataflow service:
     mvn compile exec:java \
       -Dexec.mainClass=com.example.WordCount \
       -Dexec.args="--project=artful-shelter-271912 \
         --gcpTempLocation=gs://artful-shelter-271912/tmp/ \
         --output=gs://artful-shelter-271912/output \
         --runner=DataflowRunner \
         --jobName=dataflow-intro" \
       -Pdataflow-runner
-- Sam 22:35 22/03/2020

 Today let's nail this thing once and for all.
Let's see how this apply(String, PTransform) actually works.
  1. Is it the case that applying any kind of PTransform is ultimately about calling some method inside that PTransform?
  2. At one point we called the ParDo.of() method to get back a PTransform instance.
    1. What does ParDo.of() take as a parameter? A DoFn instance.
    2. So in that case the core method we eventually call is the processElement method of the DoFn instance inside the PTransform instance. Does that mean every apply(PTransform) exists to call the processElement method of the DoFn instance inside it?
    3. Does every PTransform instance really contain a DoFn instance? -------- Maybe DoFn is an interface? No, DoFn is not an interface and nothing is inherited here; it is a dedicated instance bound to ParDo.
    4. So we can say ParDo's of() method returns a ParDo.SingleOutput instance; the reason we don't new that instance directly but use of() is to stuff a DoFn instance into the returned SingleOutput. I don't know the invocation mechanism of the methods inside PTransform, but I'm sure this SingleOutput makes whichever PTransform method is necessarily invoked call processElement.
    5. Based on the debugging results, if you pass an instance with a processElement method to a PTransform instance (e.g.
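A small sketch of that mechanism as I understand it (the Upcase DoFn and the element values are made up): ParDo.of(DoFn) returns a ParDo.SingleOutput, which is itself a PTransform; at execution time the runner calls the DoFn method annotated with @ProcessElement once per input element.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    public class ParDoSketch {
      // The DoFn that ParDo.of() stuffs into the returned ParDo.SingleOutput.
      static class Upcase extends DoFn<String, String> {
        @ProcessElement
        public void processElement(ProcessContext c) {
          // Called by the runner for every element of the input PCollection.
          c.output(c.element().toUpperCase());
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        PCollection<String> words = p.apply("CreateWords", Create.of("beam", "dataflow"));
        // apply(name, PTransform): ParDo.of(...) is the PTransform being applied here.
        words.apply("Upcase", ParDo.of(new Upcase()));
        p.run().waitUntilFinish();
      }
    }
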
-- Sam 09:58 27/04/2020
First of all, why is this Pub/Sub to BigQuery flow so unreliable? Before, the data merely wasn't displayed immediately, but a select count(*) would still show the change; now nothing works at all, no matter what.
  1. First, one thing is certain: the BigQuery preview is definitely unreliable. select count(*) and select * from pixel.raw_events both return 29 records, while the preview shows only 19. Refreshing doesn't help. This is confirmed to be a bug.
  2. Yesterday I found the old Dataflow job no longer worked. I created a new sub + flow + table, which worked; then I created a new sub + flow + old table, which also worked (the only suspicion at that time was that if messages already existed before the flow was started, it would not work). Who would have thought that when I ran the experiment again, nothing worked anymore. Now I can only use zzz to experiment with the two-column table again.
  3. Now, I deleted everything, created a new topic, created some new topics; whatever I do, it all works, it's just that the preview sometimes doesn't tell the truth. After running select * from [table] order by stamp desc, it is also OK.
  4. God knows if it will fail again after a few days...
-- Sam 14:23 24/04/2020
(actually everything is good; just running select count(*) will trigger the batch and let you see the change.)

 let's start from the beginning again:
  1. new topic aaa + new table (with only 1 column) = works!    added more columns, still works.
  2. if you create a BigQuery table that existed before but was already deleted, don't be surprised if you find unknown records in the newly created table!!!!

existing topic: pixel-raw-events, + new subscription for it: pixel_raw_events_sub, + new table pixel_raw_events -------> send an event, no data appears in BQ -----------> go to the subscription + do a pull --------------> no message appears, but the data in BQ suddenly appears! -------------> send an event again, works!
  1. is doing a pull the key that makes the data appear? ------- no, it just needs time; the case I hit took around 3 minutes. You could try whether a pull makes it happen faster, though I think it shouldn't.
  2. if I switch idsync to send messages to another topic, then switch back, it should still work, right? ---- yes.
  3. can I have multiple subscriptions? --------- yes, but need to prove that when each has a job consuming it, they will not interfere with each other.
    1. I have two subscriptions: aaa_sub3 and aaa_sub4.
    2. make a job for aaa_sub3 and stream data to table aaa, to see if it can save the message into it; should be no. (NO! The new message is consumed after a while, but there is no record in aaa; it seems the aaa table is not good anymore?)
    3. make a job for aaa_sub4 and stream data to a new table fff; this should be OK. Then check whether the other tables wake up and refresh? --------------- before, it was all OK, but this time the record didn't appear in fff, and no other table woke up :(, don't know why.... I noticed that the message appeared in the subscription for a while, then disappeared.... but it did not, as expected, appear in the fff table, so where did it go?
    4. where 
  4. is a new database table necessary?
  5. if there are already messages under a subscription, and you then create a new table for it, will it not work?
wait for hours, then found that the pipeline does not work anymore; send an event, nothing appears in the BQ table (also nothing displayed in the subscription, which should mean the message was consumed?)
  1. could it be because the pipeline went to sleep when there was no incoming data? ------ No, it might actually work, it just does not refresh quickly..... because as I remember, the record count in pixel.pixel_raw_events was 10 this afternoon, and I tried to send a new message to the pixel_raw_events topic, which didn't work at that time............!!!!! But when I checked again at 23:30, it had become 12 records!!!!!! (confirmed again at 2:03 4/18/2020)
  2. besides the record count in pixel.pixel_raw_events changing, it seems all the other jobs stopped working, so I'll write down the current record numbers to compare after a few hours, to see if it's really because BigQuery doesn't refresh quickly when there has been no operation for a long time.
    1. aaa 0 rows
    2. bbb 2 rows
    3. ccc 20 rows
    4. ddd  36 rows
    5. eee: 5 rows
    6. pixel_raw_events  12 rows
    7. test 0 rows
    8. I sent at least 10 messages to ddd, eee and pixel_raw_events! ---- wait a few hours to see if it changes.. ------ (YES, changed after hours, but pixel_raw_events is still 12, don't know why... then after more hours, pixel_raw_events finally changed to 22.)
  3. I created a new subscription under topic aaa, then found the relevant (newly created) tables showed no response, why???? -------- (I'll leave these tables there and check again tomorrow.....)
  4. Is it really the case that BigQuery stops responding quickly soon after the job is created, or even as soon as the job is created? Then how can we test an existing running job? (we send a request, then see no response in BigQuery?????)
-- Sam 13:10 20/04/2020
 why the hell can it not persist into BigQuery?
  1. I created a table test and a new topic; it works.
  2. a second pipeline saving to the same table? does NOT work.
  3. same topic to another new table with the same structure? ------- does NOT work.
  4. delete the first two jobs; could the last job work? ----------- still does NOT work. Maybe I should delete the last two jobs first, to see if the first one could start working again?................... or....... maybe the job dies if it idles too long? ...no... nuts!
  5. delete all the jobs and the topic, then create a new subscription; works, it can receive messages.
  6. delete the message, and create a dataflow using the test table; does not work.
Fuck up!
  1. let's do it again: create a new topic bbb and a new table b, and create a new dataflow from bbbb to bbb - works, but takes a pretty long time.
  2. created a second subscription bbb-s; it also got the messages. Now I delete this subscription to see if the other one will be affected.
  3. fffffuuuucccccckkkkk!!!!! All of a sudden the test table started to display 4 records!!!!!! 4 old records! (the new records would have shown data in the proper format)..................... Meanwhile I tried to create a new pipeline to write info into it, and nothing displays....... how can I know whether it will show up some time in the future?
  4. I created a new table ccc and created a new pipeline bbb-ccc. It works.
  5. delete bbb-s, to see if bbb-ccc still works. It works.
Fuck up!!!!!!!
  1. it doesn't work again!!!!!! I created a new topic aaa and a new table aaa, then created a default dataflow; it stopped working.
  2. I first found out that with the old topic it works with a subscription, but it does not persist into the newly created raw_events table, nor into the test table....
-- city hunter 11:06 17/04/2020
 The problems when working with Beam:
  1. parameters:
    1. table name or dataset?:
      1. according to the discussion with Mikayal, the query could be too complex,
      2. so if the parameter is a dataset, maybe we need to make the data ready in one BigQuery table or view first, and then read the rows from that table or view.
      3. it should be done by another step or another job (we already have the job that updates the segment every day).
        1. is it possible to do it with only another step (an apply, not another job)?
    2. start date
      1. the start date column in the segment table changes every day and will always be today's date, so the start date parameter will not work.
  2. CassandraIO works on a collection of entities, so it's not possible to give each row a different TTL; we can only put an expiry_date. (see the sketch after this list)
    1. should it be an update? Is it an update, or an append, or truncate and regenerate?
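A rough sketch of the Cassandra write side with Beam's CassandraIO, assuming a Person entity with an expiry_date column (the hosts, keyspace, and field names are made up for illustration). The write transform is applied to the whole PCollection of entities, which is why a per-row TTL has to be modelled as a column like expiry_date rather than set on the write itself.

    import java.io.Serializable;
    import java.util.Arrays;

    import com.datastax.driver.mapping.annotations.Column;
    import com.datastax.driver.mapping.annotations.PartitionKey;
    import com.datastax.driver.mapping.annotations.Table;
    import org.apache.beam.sdk.io.cassandra.CassandraIO;
    import org.apache.beam.sdk.values.PCollection;

    public class CassandraWriteSketch {

      // Hypothetical entity; CassandraIO maps it via the DataStax object mapper annotations.
      @Table(keyspace = "segments", name = "person")
      public static class Person implements Serializable {
        @PartitionKey @Column(name = "id") public String id;
        @Column(name = "segment") public String segment;
        @Column(name = "expiry_date") public String expiryDate; // per-row expiry lives in a column, not a TTL
      }

      // The write takes the whole PCollection<Person>; there is no per-element TTL hook here.
      // Hosts and keyspace are placeholders.
      public static void writePersons(PCollection<Person> persons) {
        persons.apply("WriteToCassandra",
            CassandraIO.<Person>write()
                .withHosts(Arrays.asList("cassandra-host"))
                .withPort(9042)
                .withKeyspace("segments")
                .withEntity(Person.class));
      }
    }
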
-- Sam 11:37 09/04/2020
 Here come the questions:
  1. For a dead-simple BigQuery table with just two rows of data, why does this flow take so many seconds to process????
  2. Now we have the product table in BigQuery, so we don't save to Cassandra any more?
  3. create a pull request for 161
  4. code review for 2 PRs from francis.
  5. why do we need a Spring Boot project for segment?
  6. I need to add a document in the wiki under segment to introduce the export from BigQuery to Cassandra.
  7. if we use BigTable, we will not use Cassandra, so we are still comparing, right?
So we can understand it like this:
  1. we will get data from adtool, from atlas, from tj, and from other sources like webpages.
  2. where do we save those data? Cassandra or BigTable (we need to check the latest commit from francis)
  3. then eventually we will append them into BigQuery (from BigTable or from Cassandra), and will generate the profiles and segments from it.
  4. then I export the segments from BigQuery to Cassandra again, for the segment API to use.
  5. the question is: why does the segment API want to access Cassandra and not BigTable?
Really should think about one question: what will the world look like?
-- Sam 17:14 03/04/2020
 Why keep a log? It's to stay clear about the big picture; otherwise, once you lose your bearings, there is nowhere to put the information you have just taken in, and information that has not been organized and placed somewhere is especially easy to lose.
-- Sam 10:28 27/03/2020
 About Beam
  1. I have created the pipeline from Options,
  2. have applied a read from BigQuery (not confident about how it works; maybe after debugging I will be more confident),
  3. applied a transform to translate each row into a Person object,
  4. applied a writer to write the Person objects into Cassandra (a rough sketch is below).
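A rough skeleton of the pipeline described above, reusing the hypothetical Person entity and writePersons helper from the CassandraIO sketch earlier in these notes; the table name and the row field names are assumptions.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.SerializableCoder;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    public class BigQueryToCassandraSketch {
      public static void main(String[] args) {
        // 1: create the pipeline from options.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        // 2: read rows from BigQuery (placeholder table name).
        PCollection<TableRow> rows = p.apply("ReadFromBigQuery",
            BigQueryIO.readTableRows().from("your-project:your_dataset.segment"));

        // 3: translate each row into a Person object (field names are assumptions).
        PCollection<CassandraWriteSketch.Person> persons = rows
            .apply("ToPerson", ParDo.of(new DoFn<TableRow, CassandraWriteSketch.Person>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                TableRow row = c.element();
                CassandraWriteSketch.Person person = new CassandraWriteSketch.Person();
                person.id = (String) row.get("id");
                person.segment = (String) row.get("segment");
                person.expiryDate = (String) row.get("expiry_date");
                c.output(person);
              }
            }))
            .setCoder(SerializableCoder.of(CassandraWriteSketch.Person.class));

        // 4: write the Person objects into Cassandra (see the CassandraIO sketch above).
        CassandraWriteSketch.writePersons(persons);

        p.run().waitUntilFinish();
      }
    }
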
So next:
  1. read about how to debug, to see how much of the code actually works
  2. figure out how to join many BigQuery tables
  3. figure out how to calculate the TTL
  4. understand the ticket again.

-- Sam 10:45 25/03/2020
