I have a couple of DAGs that create temporary AWS EMR clusters and terminate them once they finish running. I would like to create a new DAG that runs daily, generates a report of every EMR cluster created that day along with how long it ran, and emails the report to various people.
I need to store the EMR cluster ID, though, so that my report generator has a list of every EMR cluster ID for that day. I'm wondering whether I could modify an Airflow Variable to store this information, e.g. a Variable whose key is "EMR_CLUSTERS" and whose value is a JSON string with all the data I want to record. Or could I use the Airflow metadata database that's already in place and write this information to a new table there?
What are my options for storing permanent data in Airflow?
Either of the options you mentioned would work:
A third option would be network storage. If you are running Airflow distributed, you may already be storing DAGs on network storage and mounting it into the workers/scheduler/webserver. In that case, writing a file-based report to this storage (and possibly emailing it out, etc.) would be a solid bet.
You could write a plugin that works with any of these three options and displays what was written or sent, and when.
Variables
Easy to read and write, but a bit sloppy to overwrite every day, in my opinion.
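As a sketch of this approach: the read-merge-write logic can live in a helper so each cluster-creating DAG appends its record to a single JSON list stored under one Variable key. The "EMR_CLUSTERS" key comes from your question; the record fields are assumptions.

```python
import json


def append_cluster(raw_value, cluster_id, runtime_seconds):
    """Merge one cluster record into the JSON list stored in the Variable."""
    clusters = json.loads(raw_value or "[]")
    clusters.append({"cluster_id": cluster_id, "runtime_seconds": runtime_seconds})
    return json.dumps(clusters)


def record_cluster(cluster_id, runtime_seconds):
    # Requires a running Airflow environment; call this from the DAG that
    # terminates the cluster. Field names are illustrative assumptions.
    from airflow.models import Variable

    raw = Variable.get("EMR_CLUSTERS", default_var="[]")
    Variable.set("EMR_CLUSTERS", append_cluster(raw, cluster_id, runtime_seconds))
```

Note that concurrent DAG runs could race on the read-merge-write cycle; that is part of why overwriting a Variable every day is sloppy compared to inserting rows into a table.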
MetadataDB
Use SQLAlchemy to create tables and read/write this information. You can get a session on the Airflow metadata DB with:
from airflow import settings
session = settings.Session()
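A minimal sketch of the table approach, assuming a hypothetical `emr_cluster_runs` table (name and columns are mine, not Airflow's). For a self-contained demo it uses an in-memory SQLite engine; in a real deployment you would swap that for the Airflow session shown above:

```python
import datetime

from sqlalchemy import Column, Date, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class EmrClusterRun(Base):
    # Hypothetical table for one EMR cluster run per row.
    __tablename__ = "emr_cluster_runs"
    id = Column(Integer, primary_key=True)
    cluster_id = Column(String(64), nullable=False)
    run_date = Column(Date, nullable=False)
    runtime_seconds = Column(Integer)


# Demo engine; in Airflow you would instead do:
#   from airflow import settings
#   session = settings.Session()
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Each cluster-creating DAG inserts a row when the cluster terminates.
session.add(EmrClusterRun(cluster_id="j-ABC123",
                          run_date=datetime.date(2023, 1, 1),
                          runtime_seconds=3600))
session.commit()

# The daily report DAG then queries the rows for that day.
rows = session.query(EmrClusterRun).all()
```

Because rows are inserted independently, this avoids the read-modify-write race you would have with a single Variable.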
Network Storage
In this case, just read and write files normally.