 

ETL in Airflow aided by Jupyter Notebooks and Papermill

My issue is that I build ETL pipelines in Airflow, but I actually develop and test the Extract, Transform and Load functions in Jupyter notebooks first. As a result, I end up copy-pasting back and forth all the time between my Airflow Python operator code and the notebooks, which is pretty inefficient! My gut tells me that all of this can be automated.

Basically, I would like to write my Extract, Transform and Load functions in Jupyter and have them stay there, while still running the pipeline in Airflow and having the extract, transform and load tasks show up, with retries and all the good stuff that Airflow provides out of the box.

Papermill is able to parameterize notebooks, but I really can't think of how that would help in my case. Can someone please help me connect the dots? 🙏🏻

rimkashox asked Nov 02 '25 14:11



1 Answer

A single master Jupyter notebook that executes any number of slave notebooks (used as templates) in sequence with papermill.execute_notebook should be sufficient to automate any ML pipeline.
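As a rough sketch of what the master could look like (the notebook names, the `runs/...` directory and the `run_date` parameter are placeholders I made up, not something from the question), the sequence can be built as a plain list and then handed to papermill one stage at a time:

```python
from pathlib import Path

# Hypothetical stage notebooks; replace with your own template notebooks.
STAGES = ["extract.ipynb", "transform.ipynb", "load.ipynb"]


def build_plan(stages, run_dir, parameters):
    """Return one (input_path, output_path, parameters) tuple per stage, in order."""
    run_dir = Path(run_dir)
    return [
        (stage, str(run_dir / f"{Path(stage).stem}_output.ipynb"), dict(parameters))
        for stage in stages
    ]


def run_pipeline(plan):
    """Execute each notebook in sequence; papermill raises if a cell fails."""
    # Imported lazily so the plan can be built and inspected without papermill installed.
    import papermill as pm

    for input_path, output_path, params in plan:
        Path(output_path).parent.mkdir(parents=True, exist_ok=True)
        # The executed copy (with outputs) is written to output_path;
        # the template notebook itself is left untouched.
        pm.execute_notebook(input_path, output_path, parameters=params)


plan = build_plan(STAGES, "runs/2025-11-04", {"run_date": "2025-11-04"})
# run_pipeline(plan)  # uncomment once the stage notebooks exist
```

Each template needs a cell tagged `parameters` for the injected values to override its defaults. Because a failing cell raises an exception, the loop stops at the first broken stage instead of carrying on with bad data.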

To pass information between pipeline stages (from one slave notebook to the next), you can use another Netflix package, scrapbook. It lets you record Python objects in the slave notebooks as papermill executes them, and then retrieve those objects in the pipeline master: scrapbook.glue saves a value and scrapbook.read_notebook reads it back.
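A minimal sketch of that hand-off, assuming a scrap called `raw_path` (a name invented for illustration) is recorded by one stage and needed by the next. The helper names `read_scraps` and `parameters_for_next_stage` are mine, only `glue` and `read_notebook` come from scrapbook:

```python
# Inside a stage notebook (e.g. the extract stage), the last cell would record a result:
#     import scrapbook as sb
#     sb.glue("raw_path", "/tmp/raw.parquet")  # "raw_path" is a hypothetical scrap name


def read_scraps(executed_notebook_path):
    """Return {scrap_name: data} for a notebook already executed by papermill."""
    # Imported lazily: only needed when a real executed notebook is read.
    import scrapbook as sb

    nb = sb.read_notebook(executed_notebook_path)
    return {name: scrap.data for name, scrap in nb.scraps.items()}


def parameters_for_next_stage(scraps, wanted):
    """Pick the recorded values the next stage needs, failing loudly if one is missing."""
    missing = [name for name in wanted if name not in scraps]
    if missing:
        raise KeyError(f"previous stage did not record: {missing}")
    return {name: scraps[name] for name in wanted}
```

In the master, the dictionary returned by `parameters_for_next_stage` can be passed as the `parameters` argument of the next `papermill.execute_notebook` call. Note that scraps are read from the executed output notebook, not from the template.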

Resuming from any completed stage is also possible, but it requires storing the inputs produced by the previous stage(s) in a predictable place reachable from the master (e.g. in a local JSON file kept by the master, or in MLflow).
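For the local JSON variant, something along these lines might be enough. It is only a sketch: the file layout (a `completed` mapping from stage name to its recorded outputs) and the function names are assumptions of mine, not a papermill or scrapbook feature:

```python
import json
from pathlib import Path


def load_state(state_path):
    """Read the master's state file, or start with an empty state if it does not exist."""
    path = Path(state_path)
    if path.exists():
        return json.loads(path.read_text())
    return {"completed": {}}


def mark_completed(state_path, stage, outputs):
    """Record a finished stage together with the outputs later stages will need."""
    state = load_state(state_path)
    state["completed"][stage] = outputs
    Path(state_path).write_text(json.dumps(state, indent=2))
    return state


def remaining_stages(stages, state):
    """Stages still to run, in their original order, skipping the completed ones."""
    return [stage for stage in stages if stage not in state["completed"]]
```

On restart, the master would call `load_state`, run only `remaining_stages(...)`, and feed each stage the outputs stored for its predecessors. The stored outputs have to be JSON-serializable, so in practice they tend to be paths or IDs rather than the data itself.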

The master notebook can also be scheduled with a cron job (e.g. from Kubernetes).

  • Alternatives

Airflow is probably overkill for most ML teams due to its admin costs (5 containers, incl. 2 databases). Other (non-Netflix) Python packages would either require more boilerplate (Luigi) or extra privileges and custom Docker images for executors (Elyra), while Ploomber would expose you to the risk of depending on a project with few maintainers.

mirekphd answered Nov 04 '25 07:11



