I am writing a custom sink with the Python SDK and want to store data in AWS S3. Connecting to S3 requires credentials such as a secret key, but putting them in the code is bad for security. I would like to pass them to the Dataflow workers as environment variables. How can I do this?
To set and get environment variables in Python, you can use the os module:

    import os

    # Set environment variables
    os.environ['API_USER'] = 'username'
    os.environ['API_PASSWORD'] = 'secret'

    # Get environment variables
    USER = os. ...
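Python code can set and modify environment variables. A variable set through os.environ exists only in the current process and in the child processes it starts; it does not change the system environment, and it does not reach Dataflow workers, which run on other machines. Hard-coding a secret this way also still leaves it in the source.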
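A minimal sketch of that scope, assuming Python 3.7 or later. The variable name API_USER is taken from the snippet above; the child process is only there to show inheritance and is not part of any Dataflow API.

```python
import os
import subprocess
import sys

# Set a variable in the current process only.
os.environ['API_USER'] = 'username'

# Read it back; .get() returns None instead of raising KeyError when unset.
user = os.environ.get('API_USER')

# A child process started from this one inherits the variable.
child_output = subprocess.run(
    [sys.executable, '-c', "import os; print(os.environ.get('API_USER'))"],
    capture_output=True,
    text=True,
    check=True,
).stdout.strip()

# Removing it affects this process from here on; the parent shell and
# any other machine never saw the variable in the first place.
del os.environ['API_USER']
```

Because the change is local to the launching process and its children, this alone is not enough to get a value onto remote workers.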
Generally, for transmitting information to workers that you don't want to hard-code, you should use PipelineOptions - please see Creating Custom Options. Then, when constructing the pipeline, just extract the parameters from your PipelineOptions object and put them into your transform (e.g. into your DoFn or a sink).
However, for something as sensitive as a credential, passing it in a command-line argument might not be a great idea. I would recommend a more secure approach: put the credential into a file on GCS, and pass the name of the file as a PipelineOption. Then programmatically read the file from GCS whenever you need the credential, using GcsIO.
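The approach in this answer could be sketched roughly as follows. This is a stdlib-only sketch, not a full pipeline: the option name --credentials_file, the helpers add_credential_args and load_credentials, and the JSON layout of the credential file are all assumptions made for illustration. The file opener is passed in as a parameter so the example runs without GCS access.

```python
import argparse
import io
import json


def add_credential_args(parser):
    """Register an option that names the credential file, not the secret itself."""
    parser.add_argument(
        '--credentials_file',  # hypothetical option name
        required=True,
        help='Path of a JSON file holding the S3 credentials.',
    )


def load_credentials(path, open_fn):
    """Read and parse the credential file using the supplied opener.

    open_fn(path) must return a binary file-like object. On Dataflow this
    would be a GCS reader; here any callable with that shape works.
    """
    handle = open_fn(path)
    try:
        raw = handle.read()
    finally:
        handle.close()
    return json.loads(raw.decode('utf-8'))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    add_credential_args(parser)
    args = parser.parse_args(
        ['--credentials_file', 'gs://example-bucket/creds.json'])

    # Stand-in opener with made-up contents so the sketch runs locally.
    def fake_open(path):
        return io.BytesIO(
            b'{"access_key_id": "EXAMPLE", "secret_access_key": "example-secret"}')

    creds = load_credentials(args.credentials_file, fake_open)
    print(sorted(creds))
```

In a Beam pipeline, the add_argument call would typically go inside the _add_argparse_args classmethod of a PipelineOptions subclass, and open_fn would be something like GcsIO().open from apache_beam.io.gcp.gcsio. Calling load_credentials inside the sink or DoFn, rather than at pipeline construction time, means only the file name travels with the job and the secret is fetched on the worker.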