Airflow Persistent Data Storage Across DAGs

Tags:

airflow

I have a couple of DAGs that create temporary AWS EMR Clusters and then terminate them once they're finished running. I would like to create a new DAG that runs daily, generates a report of every EMR Cluster created that day along with how long it ran, and sends this report to various people by email.

I need to store the EMR Cluster ID values, though, so that my report generator has a list of every EMR Cluster ID for that day. I'm wondering if it's possible to modify an Airflow Variable to store this information; e.g., I could have an Airflow Variable whose key is "EMR_CLUSTERS" and whose value is a JSON string with all the data I want to record. Or could I create a new table in the Airflow metadata database that's already in use and write this information there?

What are my options for storing permanent data in Airflow?

Asked Oct 19 '25 by Kyle Bridenstine


1 Answer

Either of the options you mentioned would work:

  1. Airflow Variable
  2. Metadata DB

A third option would be network storage. If you are running Airflow in a distributed setup, you may already be storing DAGs on network storage and mounting it into the workers/scheduler/webserver. In that case, putting a file-based report on this storage (and possibly emailing it out, etc.) would be a solid bet.

You could write a plugin that would work with any of these three options, and it could display what got written/sent and when.

Variables

Easily read and written, but it's a bit sloppy to overwrite the same Variable every day, IMO.
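
As a minimal sketch of this approach, a cluster-creation task could append its cluster ID to a JSON list stored in a Variable. The key "EMR_CLUSTERS" and the record fields come from the question; the helper name and runtime field are illustrative assumptions:

from airflow.models import Variable

def record_cluster(cluster_id, runtime_seconds):
    # Read the current list; default to an empty list if the Variable doesn't exist yet
    clusters = Variable.get("EMR_CLUSTERS", default_var=[], deserialize_json=True)
    clusters.append({"cluster_id": cluster_id, "runtime_seconds": runtime_seconds})
    # Write the updated list back; note this read-modify-write is not atomic,
    # so concurrent tasks could overwrite each other's updates
    Variable.set("EMR_CLUSTERS", clusters, serialize_json=True)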

Metadata DB

Use SQLAlchemy to create and read/write tables storing this information. You can get a session on the Airflow metadata DB by doing:

from airflow import settings

# SQLAlchemy session bound to the Airflow metadata database
session = settings.Session()
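
From there, a sketch of what this could look like, assuming a hypothetical emr_cluster_report table defined with SQLAlchemy's declarative base (the table, class, and column names are illustrative, not part of Airflow's schema):

from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.ext.declarative import declarative_base

from airflow import settings

Base = declarative_base()

class EmrClusterRecord(Base):
    # Hypothetical table for tracking EMR clusters; not part of Airflow's own schema
    __tablename__ = "emr_cluster_report"
    id = Column(Integer, primary_key=True)
    cluster_id = Column(String(64), nullable=False)
    started_at = Column(DateTime)
    ended_at = Column(DateTime)

# Create the table in the metadata DB if it doesn't exist yet
Base.metadata.create_all(settings.engine)

# Record a cluster run
session = settings.Session()
session.add(EmrClusterRecord(
    cluster_id="j-ABC123",
    started_at=datetime(2025, 10, 19, 9, 0),
    ended_at=datetime(2025, 10, 19, 11, 30),
))
session.commit()
session.close()

The daily report DAG can then query this table for all clusters created that day.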

Network Storage

In this case, just read and write files on the shared mount as you normally would.
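
For instance, each cluster task could append a JSON line to a shared file, and the report DAG reads it back. This is a sketch; the path assumes a mount visible to all workers, and the helper names are made up for illustration:

import json

REPORT_PATH = "/mnt/airflow-shared/emr_clusters.jsonl"  # assumed shared mount

def record_cluster(cluster_id, runtime_seconds):
    # Append one JSON record per line; plain appends are not locked, so
    # concurrent writers on some network filesystems can interleave
    with open(REPORT_PATH, "a") as f:
        f.write(json.dumps({"cluster_id": cluster_id, "runtime_seconds": runtime_seconds}) + "\n")

def read_clusters():
    # Read every recorded cluster back as a list of dicts
    with open(REPORT_PATH) as f:
        return [json.loads(line) for line in f]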

Answered Oct 22 '25 by jhnclvr