I have successfully developed a very simple ETL process locally (called load_staging below) that extracts data from a remote location and writes the unprocessed data into a MongoDB container on my local Windows machine. Now I want to schedule this process with Apache Airflow using the DockerOperator for every task, i.e. I want to build a Docker image of my source code and then execute the source code inside that image using the DockerOperator. Since I am working on a Windows machine, I can only run Airflow from inside a Docker container.
I started the Airflow container (called webserver below) and the MongoDB container (called mongo below) with docker-compose up and manually triggered the DAG in Airflow's GUI. According to Airflow, the task is executed successfully, but the code inside the Docker image does not seem to run: the task finishes far too quickly, and right after the container is started from my image, the task exits with return code 0 without producing any logging output of its own. See the logs below:
[2020-01-20 17:09:44,444] {{docker_operator.py:194}} INFO - Starting docker container from image myaccount/myrepo:load_staging_op
[2020-01-20 17:09:50,473] {{logging_mixin.py:95}} INFO - [2020-01-20 17:09:50,472] {{local_task_job.py:105}} INFO - Task exited with return code 0
So, my two questions are: why does the code inside the image apparently not get executed, and what do I need to change in my setup to make it run?
Below you can find further information about how I set up the DockerOperator, how I define the image that is supposed to be executed by the DockerOperator, the docker-compose.yml file starting the webserver and mongo containers and the Dockerfile used to create the webserver container.
In my DAG definition file, I specified the DockerOperator like so:
CONFIG_FILEPATH = "/configs/docker_execution.ini"
data_object_name = "some_name"
task_id_ = "{}_task".format(data_object_name)
cmd = "python /src/etl/load_staging_op/main.py --config_filepath={} --data_object_name={}".format(CONFIG_FILEPATH, data_object_name)
staging_op = DockerOperator(
    command=cmd,
    task_id=task_id_,
    image="myaccount/myrepo:load_staging_op",
    api_version="auto",
    auto_remove=True
)
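(For completeness: this snippet sits inside an otherwise ordinary DAG definition file. The scaffolding around it looks roughly like the sketch below; the dag_id, start_date and schedule_interval are just placeholders, not my actual values.)

from datetime import datetime

from airflow import DAG
from airflow.operators.docker_operator import DockerOperator  # Airflow 1.10.x import path

with DAG(
    dag_id="load_staging",            # placeholder dag_id
    start_date=datetime(2020, 1, 1),  # placeholder start date
    schedule_interval=None,           # I trigger the DAG manually via the GUI
    catchup=False,
) as dag:
    # ... the DockerOperator definition shown above goes here ...
    pass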
The Dockerfile for the image load_staging_op referenced above looks as follows:
# Inherit from Python image
FROM python:3.7
# Install environment
USER root
COPY ./src/etl/load_staging_op/requirements.txt ./
RUN pip install -r requirements.txt
# Copy source code files into container
COPY ./configs /configs
COPY ./wsdl /wsdl
COPY ./src/all_constants.py /src/all_constants.py
COPY ./src/etl/load_staging_op/utils.py /src/etl/load_staging_op/utils.py
COPY ./src/etl/load_staging_op/main.py /src/etl/load_staging_op/main.py
# Extend python path so that custom modules are found
ENV PYTHONPATH "${PYTHONPATH}:/src"
ENTRYPOINT [ "sh", "-c"]
The relevant parts of the docker-compose.yml file are as follows:
version: '2.1'
services:
  webserver:
    build: ./docker-airflow
    restart: always
    privileged: true
    depends_on:
      - mongo
      - mongo-express
    volumes:
      - ./docker-airflow/dags:/usr/local/airflow/dags
      # source code volume
      - ./src:/src
      - ./docker-airflow/workdir:/home/workdir
      # Mount the docker socket from the host (currently my laptop) into the webserver container
      # so that we can build docker images from inside the webserver container.
      - //var/run/docker.sock:/var/run/docker.sock # the two "//" are needed for windows OS
      - ./configs:/configs
      - ./wsdl:/wsdl
    ports:
      # Change port to 8081 to avoid Jupyter conflicts
      - 8081:8080
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
    networks:
      - mynet
  mongo:
    container_name: mymongo
    image: mongo
    restart: always
    ports:
      - 27017:27017
    networks:
      - mynet
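(Not shown above: the docker-compose.yml file also declares the mynet network at the top level, roughly like the sketch below; the driver is shown only for illustration. Docker Compose prefixes this name with the project directory, which matters for the fix described further down.)

networks:
  mynet:
    driver: bridge  # default driver, shown for illustration only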
The Dockerfile for the webserver container referenced in the above docker-compose.yml file looks as follows:
FROM puckel/docker-airflow:1.10.4
# Adds DAG folder to the PATH
ENV PYTHONPATH "${PYTHONPATH}:/src:/usr/local/airflow/dags"
# Install the optional packages
# (make sure something like docker==4.1.0 is in this requirements.txt file!)
COPY requirements.txt requirements.txt
USER root
RUN pip install -r requirements.txt
# Install docker inside the webserver container
RUN curl -sSL https://get.docker.com/ | sh
ENV SHARE_DIR /usr/local/share
# Install simple text editor for debugging
RUN ["apt-get", "update"]
RUN ["apt-get", "-y", "install", "vim"]
Thanks for your help, I highly appreciate it!
My sincere thanks to everyone who took the time to help me with my problem. I needed to implement the following changes to make it work:
DockerOperator:
I had to set network_mode to the network that the webserver container is running in. This was difficult for me, as I am new to Docker and couldn't find many tutorials about it online. To find the name of that network, I listed all currently active networks on my host (= my Windows laptop) with docker network ls. In the list of displayed networks I saw one called something like project_root_dirname_mynet, i.e. a combination of my project's root directory and the network name specified in the docker-compose.yml file (Docker Compose prefixes the network name with the project name). You can then inspect that network with docker network inspect project_root_dirname_mynet, which returns a JSON document with a "Containers" section listing all the containers specified in the docker-compose.yml file. The code for the DockerOperator then becomes:
cmd = "--config_filepath {} --data_object_name {}".format(CONFIG_FILEPATH.strip(), data_object_name.strip())
print("Command: {}".format(cmd))
staging_op = DockerOperator(
    command=cmd,
    task_id=task_id_,
    image="myaccount/myrepo:load_staging_op",
    api_version="auto",
    auto_remove=True,
    network_mode="project_root_dirname_mynet"
)
Dockerfile of the load_staging_op task:
ENTRYPOINT [ "sh", "-c"] to ENTRYPOINT [ "python", "/src/etl/load_staging_op/main.py"]. I think the "python" argument will open a Python console in the container and the second argument is just the path to the script you want to execute inside the docker container. Then, at run time (or build time or however this is called), the command line arguments from cmd above will be passed on. In the source code of the image, you can then use a library like argparse to retrieve these commands. If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!