It's confusing that every Google document for Dataflow now says it is based on Apache Beam and directs me to the Beam website. Likewise, when I look for the GitHub project, the Google Dataflow repository is empty and just points to the Apache Beam repo. Say I need to create a pipeline. From what I read in the Apache Beam docs, I would import from `apache_beam.options.pipeline_options`. However, if I go with google-cloud-dataflow, I get the error: no module named 'options'. It turns out I should import from `apache_beam.utils.pipeline_options` instead. So it looks like google-cloud-dataflow ships an older Beam version and is going to be deprecated?
Which one should I pick to develop my Dataflow pipeline?
I ended up finding the answer in the Google Cloud Dataflow release notes:
The Cloud Dataflow SDK distribution contains a subset of the Apache Beam ecosystem. This subset includes the necessary components to define your pipeline and execute it locally and on the Cloud Dataflow service, such as:
- The core SDK
- DirectRunner and DataflowRunner
- I/O components for other Google Cloud Platform services
The Cloud Dataflow SDK distribution does not include other Beam components, such as:
- Runners for other distributed processing engines
- I/O components for non-Cloud Platform services
Version 2.0.0 is based on a subset of Apache Beam 2.0.0
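If the same code has to run against both the older google-cloud-dataflow install and a newer Apache Beam install, one option is to try each module path in turn. The following is only a minimal sketch: the two module paths are the ones mentioned in the question, while the helper name `first_importable` is hypothetical and not part of either SDK.

```python
import importlib

# Candidate module paths, newest first. Both are taken from the question.
CANDIDATES = (
    "apache_beam.options.pipeline_options",  # path shown in the Apache Beam docs
    "apache_beam.utils.pipeline_options",    # path that worked with the older install
)


def first_importable(module_names):
    """Return the first module in module_names that imports, or None.

    Hypothetical helper for illustration; not part of any SDK.
    """
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too, e.g. "no module named 'options'".
            continue
    return None


pipeline_options = first_importable(CANDIDATES)
if pipeline_options is None:
    print("Neither module path could be imported; is an SDK installed?")
else:
    print("Using", pipeline_options.__name__)
```

This only papers over the renamed module. Pinning a single SDK version is probably the cleaner fix when that is possible.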