
How to use the DockerOperator from Apache Airflow on a Windows host

I have successfully developed, locally, a very simple ETL process (called load_staging below) which extracts data from some remote location and then writes that unprocessed data into a MongoDB container on my local Windows machine. Now, I want to schedule this process with Apache Airflow using the DockerOperator for every task, i.e. I want to create a Docker image of my source code and then execute the source code in that image using the DockerOperator. Since I am working on a Windows machine, I can only use Airflow from inside a Docker container.

I have started the Airflow container (called webserver below) and the MongoDB container (called mongo below) with docker-compose up, and I manually triggered the DAG in Airflow's GUI. According to Airflow, the task is executed successfully, but it seems that the code inside the Docker image is not actually run: the task finishes far too quickly, and right after the container is started from my image, the task exits with return code 0, i.e. I don't see any logging output from the task itself. See the logs below:

[2020-01-20 17:09:44,444] {{docker_operator.py:194}} INFO - Starting docker container from image myaccount/myrepo:load_staging_op
[2020-01-20 17:09:50,473] {{logging_mixin.py:95}} INFO - [2020-01-20 17:09:50,472] {{local_task_job.py:105}} INFO - Task exited with return code 0

So, my two questions are:

  1. Is my conclusion correct, or what else could be the root cause of this problem?
  2. How can I make sure that the code inside the image is always executed?

Below you can find further information about how I set up the DockerOperator, how I define the image that the DockerOperator is supposed to execute, the docker-compose.yml file that starts the webserver and mongo containers, and the Dockerfile used to create the webserver container.

In my DAG definition file, I specified the DockerOperator like so:

CONFIG_FILEPATH = "/configs/docker_execution.ini"
data_object_name = "some_name"
task_id_ = "{}_task".format(data_object_name)
cmd = "python /src/etl/load_staging_op/main.py --config_filepath={} --data_object_name={}".format(CONFIG_FILEPATH, data_object_name)
staging_op = DockerOperator(
    command=cmd,
    task_id=task_id_,
    image="myaccount/myrepo:load_staging_op",
    api_version="auto",
    auto_remove=True
)

The Dockerfile for the image load_staging_op referenced above looks as follows:

# Inherit from Python image
FROM python:3.7

# Install environment
USER root
COPY ./src/etl/load_staging_op/requirements.txt ./
RUN pip install -r requirements.txt

# Copy source code files into container
COPY ./configs /configs
COPY ./wsdl /wsdl
COPY ./src/all_constants.py /src/all_constants.py
COPY ./src/etl/load_staging_op/utils.py /src/etl/load_staging_op/utils.py
COPY ./src/etl/load_staging_op/main.py /src/etl/load_staging_op/main.py

# Extend python path so that custom modules are found
ENV PYTHONPATH "${PYTHONPATH}:/src"

ENTRYPOINT [ "sh", "-c"]

The relevant parts of the docker-compose.yml file are as follows:

version: '2.1'
services:
    webserver:
        build: ./docker-airflow
        restart: always
        privileged: true
        depends_on:
            - mongo
            - mongo-express
        volumes:
            - ./docker-airflow/dags:/usr/local/airflow/dags
            # source code volume
            - ./src:/src
            - ./docker-airflow/workdir:/home/workdir
            # Mount the docker socket from the host (currently my laptop) into the webserver container
            # so that we can build docker images from inside the webserver container.
            - //var/run/docker.sock:/var/run/docker.sock  # the two "//" are needed for windows OS
            - ./configs:/configs
            - ./wsdl:/wsdl
        ports:
            # Change port to 8081 to avoid Jupyter conflicts
            - 8081:8080
        command: webserver
        healthcheck:
            test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
            interval: 30s
            timeout: 30s
            retries: 3
        networks:
            - mynet

    mongo:
        container_name: mymongo
        image: mongo
        restart: always
        ports:
            - 27017:27017
        networks:
            - mynet

The Dockerfile for the webserver container referenced in the above docker-compose.yml file looks as follows:

FROM puckel/docker-airflow:1.10.4

# Adds DAG folder to the PATH
ENV PYTHONPATH "${PYTHONPATH}:/src:/usr/local/airflow/dags"

# Install the optional packages
# Make sure something like docker==4.1.0 is in this requirements.txt file!
COPY requirements.txt requirements.txt
USER root
RUN pip install -r requirements.txt

# Install docker inside the webserver container
RUN curl -sSL https://get.docker.com/ | sh
ENV SHARE_DIR /usr/local/share

# Install simple text editor for debugging
RUN ["apt-get", "update"]
RUN ["apt-get", "-y", "install", "vim"]

Thanks for your help, I highly appreciate it!


1 Answer

My sincere thanks to everyone who took the time to help me with my problem. I needed to implement the following changes to make it work:

DockerOperator:

  • Adjust the command passed to the container at run time so that it only contains the command-line arguments; the Python interpreter and the script path move into the image's entrypoint (see the Dockerfile change below)
  • Add the parameter network_mode, set to the network the webserver container is running in. This was difficult for me, as I am new to Docker and couldn't find many tutorials about this online. To find the name of the network the webserver container is running in, I listed all currently active networks on my host (= my Windows laptop) with docker network ls. In that list I saw a network called something like project_root_dirname_mynet, i.e. a combination of my project's root directory name and the network name specified in the docker-compose.yml file. You can then inspect that network with docker network inspect project_root_dirname_mynet, which returns a JSON document with a "Containers" section listing all containers defined in your docker-compose.yml file (a small sketch doing the same thing with the docker Python package follows right after this list).
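
If you prefer to do this from Python instead of on the command line, the docker package installed in the webserver image (e.g. docker==4.1.0) can list and inspect networks as well. A minimal sketch, using the example network name from above:

import docker

client = docker.from_env()

# Roughly equivalent to `docker network ls`
for network in client.networks.list():
    print(network.name)

# Roughly equivalent to `docker network inspect project_root_dirname_mynet`
net = client.networks.get("project_root_dirname_mynet")
print(net.attrs["Containers"])  # the containers attached to this network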

The code for the DockerOperator then becomes:

cmd = "--config_filepath {} --data_object_name {}".format(CONFIG_FILEPATH.strip(), data_object_name.strip())
print("Command: {}".format(cmd))
staging_op = DockerOperator(
    command=cmd,
    task_id=task_id_,
    image="myaccount/myrepo:load_staging_op",
    api_version="auto",
    auto_remove=True,
    network_mode="project_root_dirname_mynet"
)
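
With network_mode set like this, the code inside the load_staging_op image can reach the MongoDB container by its container name, because both containers are attached to the same Docker network. A rough sketch of what that connection could look like (assuming pymongo is used; the database and collection names are placeholders):

from pymongo import MongoClient

# "mymongo" is the container_name from docker-compose.yml; it is resolvable
# because the task container runs in the same Docker network.
client = MongoClient("mongodb://mymongo:27017")
staging_db = client["staging"]  # placeholder database name
staging_db["raw_data"].insert_one({"data_object_name": "some_name"})  # placeholder write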

Dockerfile of the load_staging_op task:

  • Change the last line from ENTRYPOINT [ "sh", "-c"] to ENTRYPOINT [ "python", "/src/etl/load_staging_op/main.py"]. The "python" entry starts the Python interpreter inside the container, and the second entry is the path to the script you want to execute. At container run time, the command-line arguments from cmd above are appended to this entrypoint. Inside the script you can then use a library like argparse to retrieve them, as sketched below.
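
For completeness, a minimal sketch of how /src/etl/load_staging_op/main.py can pick up these arguments with argparse (the argument names match the cmd string above, everything else is simplified):

import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Load raw data into the staging area")
    parser.add_argument("--config_filepath", required=True, help="Path to the .ini config file inside the container")
    parser.add_argument("--data_object_name", required=True, help="Name of the data object to extract and load")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print("Loading {} using config {}".format(args.data_object_name, args.config_filepath))
    # ... actual extract and load logic goes here ...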