I am a data engineer and work with airflow regularly.
When redeploying DAGs with a new start date, the best practice is as follows:
Don't change start_date + interval: When a DAG has been run, the scheduler database contains instances of the runs of that DAG. If you change the start_date or the interval and redeploy it, the scheduler may get confused because the intervals are different or the start_date is far in the past. The best way to deal with this is to change the version of the DAG as soon as you change the start_date or interval, i.e. my_dag_v1 becomes my_dag_v2. This way, historical information about the old version is also kept.
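The versioning convention above might look like the following sketch of a DAG file (the dag id my_dag_v2, the dates, and the task are all hypothetical; the point is that the id suffix is bumped together with the start_date change):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Bump this suffix (v1 -> v2) whenever start_date or the interval changes,
# so the scheduler treats the redeployed DAG as brand new and the old
# dag_run history stays intact under the old dag id.
DAG_VERSION = "v2"

with DAG(
    dag_id=f"my_dag_{DAG_VERSION}",       # my_dag_v2
    start_date=datetime(2024, 1, 1),      # the *new* start date
    schedule_interval=timedelta(days=1),
    catchup=False,
) as dag:
    EmptyOperator(task_id="noop")
```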
However, after deleting all previous DAG and task runs, I tried to redeploy a DAG with a new start date. It worked as expected (with the new start date) for a day, then started working with the old one again.
What are the reasons for this? In depth, if you can.
Airflow maintains all of the information regarding past runs in a table called dag_run.
When you clear the previous DAG runs, these entries are dropped from the database. Hence, Airflow treats this DAG as a new DAG and starts at the specified time.
Airflow checks the last DAG execution time (the start_date of the last run) and adds the timedelta object which you have specified in schedule_interval.
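That scheduling rule can be sketched in plain Python (a simplified model of the behavior described above, not Airflow's actual implementation):

```python
from datetime import datetime, timedelta

def next_execution_date(last_execution_date, schedule_interval):
    """Simplified model: the next run is the last recorded run's
    execution date plus the schedule_interval timedelta."""
    return last_execution_date + schedule_interval

# If the last recorded run in dag_run was 2024-01-01 with a daily interval,
# the scheduler targets 2024-01-02 next -- regardless of any new
# start_date in the redeployed DAG file, which is why stale dag_run
# entries make the DAG appear to "revert" to the old schedule.
last_run = datetime(2024, 1, 1)
print(next_execution_date(last_run, timedelta(days=1)))  # 2024-01-02 00:00:00
```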
If you are having difficulties even after clearing the DAG runs, rename the DAG (version it, as described above) whenever you change the start_date or schedule_interval.