
google-cloud-dataflow vs apache-beam

It's confusing that every Google document for Dataflow now says it is based on Apache Beam and directs me to the Beam website. Likewise, when I looked for the GitHub project, the Google Dataflow repository was empty and simply pointed to the Apache Beam repo. Say I need to create a pipeline. From what I read in the Apache Beam docs, I would write: from apache_beam.options.pipeline_options. However, with google-cloud-dataflow installed I get the error: no module named 'options'. It turns out I should use from apache_beam.utils.pipeline_options instead. So it looks like google-cloud-dataflow ships an older Beam version and is going to be deprecated?

Which one should I pick to develop my Dataflow pipeline?
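To illustrate the import mismatch described above, here is a minimal sketch that probes both module paths and reports which one the installed package provides. Only the two module paths come from the question; the helper name `load_first_available` is hypothetical, and the script assumes nothing about which package (if any) is installed.

```python
import importlib

# Module paths quoted in the question, newer layout first.
CANDIDATE_MODULES = (
    "apache_beam.options.pipeline_options",
    "apache_beam.utils.pipeline_options",
)


def load_first_available(candidates=CANDIDATE_MODULES):
    """Return the first module in `candidates` that imports, or None.

    Hypothetical helper for illustration; not part of Beam or Dataflow.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too: try the next candidate.
            continue
    return None


module = load_first_available()
if module is None:
    print("Neither import path is available (is apache_beam installed?)")
else:
    print("Pipeline options module found at:", module.__name__)
```

Running this in the environment that raised the error should show which of the two layouts the installed SDK uses.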

foxwendy asked Sep 11 '25 15:09


1 Answer

I ended up finding the answer in the Google Cloud Dataflow release notes:

The Cloud Dataflow SDK distribution contains a subset of the Apache Beam ecosystem. This subset includes the necessary components to define your pipeline and execute it locally and on the Cloud Dataflow service, such as:

  • The core SDK
  • DirectRunner and DataflowRunner
  • I/O components for other Google Cloud Platform services

The Cloud Dataflow SDK distribution does not include other Beam components, such as:

  • Runners for other distributed processing engines
  • I/O components for non-Cloud Platform services

Version 2.0.0 is based on a subset of Apache Beam 2.0.0.
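Since the two distributions track each other by version, it can help to check which of them is actually installed. The following is a rough sketch using only the standard library (Python 3.8+); the helper name `installed_version` is hypothetical, and the distribution names are assumed to be the PyPI names `google-cloud-dataflow` and `apache-beam`.

```python
from importlib import metadata

# Assumed PyPI distribution names for the two SDK packages.
DISTRIBUTIONS = ("google-cloud-dataflow", "apache-beam")


def installed_version(dist_name):
    """Return the installed version string for `dist_name`, or None if absent.

    Hypothetical helper for illustration only.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


for name in DISTRIBUTIONS:
    print(name, "->", installed_version(name) or "not installed")
```

If both appear, the reported versions indicate which Beam release the installed Dataflow SDK corresponds to.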

foxwendy answered Sep 13 '25 12:09