
Passing AWS credentials to Google Cloud Dataflow, Python

I use the Python implementation of Google Cloud Dataflow on Google Cloud Platform. My idea is to read input from AWS S3.

Google Cloud Dataflow (which is based on Apache Beam) supports reading files from S3. However, I cannot find in the documentation the best way to pass credentials to a job. I tried adding AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables within the setup.py file. This works locally, but when I package the Cloud Dataflow job as a template and trigger it to run on GCP, it sometimes works and sometimes fails, raising a NoCredentialsError exception and causing the job to fail.
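
For reference, a minimal sketch of the setup.py workaround described above; the credential values, package name, and dependency list are placeholders, not part of the original question:

    # setup.py -- sketch of exporting AWS credentials as environment variables
    # from the pipeline's setup.py; all values below are placeholders.
    import os
    import setuptools

    os.environ["AWS_ACCESS_KEY_ID"] = "AKIA..."        # placeholder
    os.environ["AWS_SECRET_ACCESS_KEY"] = "..."        # placeholder

    setuptools.setup(
        name="my-dataflow-job",                        # assumed package name
        version="0.0.1",
        install_requires=["apache-beam[gcp,aws]"],     # assumed dependencies
        packages=setuptools.find_packages(),
    )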

Is there any coherent, best-practice way to pass AWS credentials to a Python Google Cloud Dataflow job on GCP?

Asked Dec 19 '25 by Stanisław Smyl

1 Answer

The options to configure this have finally been added. They are available in Beam versions after 2.26.0.

The pipeline options are --s3_access_key_id and --s3_secret_access_key.
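
For illustration, a minimal sketch of setting these options when constructing the pipeline in Python, assuming Beam 2.26.0 or later with the AWS extras (apache-beam[aws]) installed; the project, region, bucket names, and credential values are placeholders:

    # Sketch: pass the S3 credential pipeline options programmatically.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-gcp-project",              # placeholder GCP project
        region="us-central1",                  # placeholder Dataflow region
        temp_location="gs://my-bucket/tmp",    # placeholder staging bucket
        s3_access_key_id="AKIA...",            # AWS access key id (placeholder)
        s3_secret_access_key="...",            # AWS secret access key (placeholder)
    )

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromS3" >> beam.io.ReadFromText("s3://my-aws-bucket/input/*.csv")
            | "WriteToGCS" >> beam.io.WriteToText("gs://my-bucket/output/part")
        )

Equivalently, the same values can be supplied on the command line when launching the job, e.g. --s3_access_key_id=... --s3_secret_access_key=... alongside the usual Dataflow options.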


Unfortunately, Beam 2.25.0 and earlier have no good way of doing this, other than the following workaround:

In this thread, a user figured out how to do it in the setup.py file that they provide to Dataflow with their pipeline.

Answered Dec 21 '25 by Pablo