I'm somewhat inexperienced with both Docker and Airflow, so this might be a silly question. I have a Dockerfile that builds on the apache/airflow image together with some of my own DAGs. I would like to launch the Airflow webserver together with the scheduler, and I'm having trouble with this. I can get it working, but I feel I'm approaching it incorrectly.
Here is what my Dockerfile looks like:
FROM apache/airflow
COPY airflow/dags/ /opt/airflow/dags/
RUN airflow initdb
Then I run docker build -t learning/airflow . (note the trailing dot). Here is the tough part: I then run docker run --rm -tp 8080:8080 learning/airflow:latest webserver, and in a separate terminal I run docker exec `docker ps -q` airflow scheduler. The trouble is that in practice this generally happens on a VM somewhere, so opening a second terminal is just not an option, and multiple machines will probably not have access to the same Docker container. Running webserver && scheduler does not seem to work: the webserver blocks, and I still see the message "The scheduler does not appear to be running" in the Airflow UI.
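To be concrete, the combined attempt looks something like this (the bash -c wrapping is my best guess at how the two commands would be chained):

docker run --rm -tp 8080:8080 learning/airflow:latest \
    bash -c "airflow webserver && airflow scheduler"
# airflow webserver runs in the foreground and never exits on its own,
# so execution never reaches "&& airflow scheduler"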
Any ideas on what the right way to run server and scheduler should be?
Many thanks!
First, thanks to @Alex and @abestrad for suggesting docker-compose here -- I think this is the best solution. I finally managed to get it working by referring to this great post. So here is my solution:
First, my Dockerfile looks like this now:
FROM apache/airflow
RUN pip install --upgrade pip
RUN pip install --user psycopg2-binary
COPY airflow/airflow.cfg /opt/airflow/
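The airflow.cfg being copied in isn't reproduced here; the settings that matter for this setup boil down to something like the following (an excerpt, assuming a connection string that matches the Postgres service defined in the compose file below):

[core]
# point Airflow at the postgres service below; the compose service name doubles as the hostname
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
# LocalExecutor runs tasks in scheduler subprocesses, so no separate worker service is needed
executor = LocalExecutor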
Note that I'm no longer copying the DAGs into the image; they will be mounted through volumes instead. I then build the image via docker build -t learning/airflow . My docker-compose.yaml looks like this:
version: "3"
services:
postgres:
image: "postgres:9.6"
container_name: "postgres"
environment:
- POSTGRES_USER=airflow
- POSTGRES_PASSWORD=airflow
- POSTGRES_DB=airflow
ports:
- "5432:5432"
volumes:
- ./data/postgres:/var/lib/postgresql/data
initdb:
image: learning/airflow
entrypoint: airflow initdb
depends_on:
- postgres
webserver:
image: learning/airflow
restart: always
entrypoint: airflow webserver
healthcheck:
test: ["CMD-SHELL", "[ -f /opt/airflow/airflow-webserver.pid ]"]
interval: 30s
timeout: 30s
retries: 3
ports:
- "8080:8080"
depends_on:
- postgres
volumes:
- ./airflow/dags:/opt/airflow/dags
- ./airflow/plugins:/opt/airflow/plugins
- ./data/logs:/opt/airflow/logs
scheduler:
image: learning/airflow
restart: always
entrypoint: airflow scheduler
healthcheck:
test: ["CMD-SHELL", "[ -f /opt/airflow/airflow-scheduler.pid ]"]
interval: 30s
timeout: 30s
retries: 3
depends_on:
- postgres
volumes:
- ./airflow/dags:/opt/airflow/dags
- ./airflow/plugins:/opt/airflow/plugins
- ./data/logs:/opt/airflow/logs
To use it, first run docker-compose up postgres, then docker-compose up initdb and then docker-compose up webserver scheduler. That's it!
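In script form, that start-up sequence looks like this (the -d flags are my addition, assuming you want the long-running services detached):

docker-compose up -d postgres             # start the metadata database
docker-compose up initdb                  # one-off job: initializes the Airflow DB, then exits
docker-compose up -d webserver scheduler  # the two long-running Airflow services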
Spinning up two Docker containers alone may not achieve your goal, because the containers need to communicate with each other. You can manually set up a Docker network between your containers, although I haven't tried this approach personally.
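For reference, the manual-network approach would look roughly like this (untested, as noted; the network name is a placeholder, and the image name is taken from the question):

docker network create airflow-net                     # user-defined bridge network
docker run -d --network airflow-net --name webserver \
    -p 8080:8080 learning/airflow:latest webserver
docker run -d --network airflow-net --name scheduler \
    learning/airflow:latest scheduler
# containers on the same user-defined network can reach each other by container name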
An easier way is to use docker-compose, which lets you define your resources in a YAML file and have docker-compose create them for you.
version: '2.1'
services:
  webserver:
    image: puckel/docker-airflow:1.10.4
    restart: always
    ...
  scheduler:
    image: puckel/docker-airflow:1.10.4
    restart: always
    depends_on:
      - webserver
    ...
You can find the complete file here.