
How to run a Python project (package) on AWS EMR serverless?

I have a Python project with several modules, classes, and a dependencies file (requirements.txt). I want to package it into a single file with all of its dependencies and give that file's path to AWS EMR Serverless, which will run it.

The problem is that I don't understand how to package a Python project with all of its dependencies, what kind of file EMR Serverless can consume, and so on. All the examples I have found use a single Python file.

In simple terms: what should I do if my Python project is not a single file but something more complex?

asked Oct 15 '25 by nirkov

1 Answer

There are a few ways to do this with EMR Serverless. Regardless of which approach you choose, you will need to provide a main entrypoint Python script to the EMR Serverless StartJobRun command.

Let's assume you've got a job structure like the one below, where main.py is your entrypoint that creates a Spark session and runs your jobs, and job1 and job2 are your local modules.

├── jobs
│   ├── job1.py
│   └── job2.py
├── main.py
└── requirements.txt
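
For context, a minimal main.py for this layout might look like the sketch below. The run(spark) functions in job1 and job2 are hypothetical names used here for illustration, not something EMR Serverless requires.

from pyspark.sql import SparkSession

from jobs import job1, job2


def main():
    # EMR Serverless supplies the Spark runtime; we just get or create the session
    spark = SparkSession.builder.appName("my-spark-job").getOrCreate()

    # Hypothetical entrypoints exposed by the local modules
    job1.run(spark)
    job2.run(spark)

    spark.stop()


if __name__ == "__main__":
    main()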

Option 1. Use --py-files with your zipped local modules and --archives with a packaged virtual environment for your external dependencies

  • Zip up your job files
zip -r job_files.zip jobs
  • Create a virtual environment using venv-pack with your dependencies.

Note: This has to be done on an OS and Python version similar to what EMR Serverless uses, so I prefer using a multi-stage Dockerfile with custom outputs.

# Build on Amazon Linux 2 to match the EMR Serverless runtime
FROM --platform=linux/amd64 amazonlinux:2 AS base

RUN yum install -y python3

# Create an isolated virtual environment for the job's dependencies
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

COPY requirements.txt .

RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install venv-pack==0.2.0 && \
    python3 -m pip install -r requirements.txt

# Package the populated venv into a relocatable archive
RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz

# Export only the archive to the host (BuildKit custom output)
FROM scratch AS export
COPY --from=base /output/pyspark_deps.tar.gz /

If you run DOCKER_BUILDKIT=1 docker build --output . ., you should now have a pyspark_deps.tar.gz file on your local system.

  • Upload main.py, job_files.zip, and pyspark_deps.tar.gz to a location on S3.
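
As a sketch of that upload step using boto3 (the bucket name and prefix below are placeholders I've chosen; a plain aws s3 cp works just as well):

import boto3

s3 = boto3.client("s3")

# Placeholder bucket and prefix -- replace with your own location
bucket = "YOUR_BUCKET"
prefix = "code/pyspark/"

for artifact in ["main.py", "job_files.zip", "pyspark_deps.tar.gz"]:
    # upload_file(local_path, bucket, key)
    s3.upload_file(artifact, bucket, prefix + artifact)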

  • Run your EMR Serverless job with a command like this (replacing APPLICATION_ID, JOB_ROLE_ARN, and YOUR_BUCKET):

aws emr-serverless start-job-run \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<YOUR_BUCKET>/main.py",
            "sparkSubmitParameters": "--py-files s3://<YOUR_BUCKET>/job_files.zip --conf spark.archives=s3://<YOUR_BUCKET>/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
        }
    }'

Option 2. Package your local modules as a Python library and use --archives with a packaged virtual environment

This is probably the most reliable approach, but it requires you to use setuptools. You can use a simple pyproject.toml file along with your existing requirements.txt (adding an __init__.py to the jobs directory is the easiest way to make sure setuptools packages it):

[project]
name = "mysparkjobs"
version = "0.0.1"
dynamic = ["dependencies"]
[tool.setuptools.dynamic]
dependencies = {file = ["requirements.txt"]}

You can then use a multi-stage Dockerfile with custom build outputs to package your modules and dependencies into a virtual environment.

Note: This requires Docker BuildKit to be enabled.

FROM --platform=linux/amd64 amazonlinux:2 AS base

RUN yum install -y python3

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

WORKDIR /app
COPY . .
# Installing the project itself pulls in requirements.txt (via pyproject.toml)
# and adds your local jobs modules to the virtual environment
RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install venv-pack==0.2.0 && \
    python3 -m pip install .

RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz

FROM scratch AS export
COPY --from=base /output/pyspark_deps.tar.gz /

Now you can run DOCKER_BUILDKIT=1 docker build --output . . and a pyspark_deps.tar.gz file will be generated with all your dependencies. Upload this file and your main.py script to S3.

Assuming you uploaded both files to s3://<YOUR_BUCKET>/code/pyspark/myjob/, run the EMR Serverless job like this (replacing APPLICATION_ID, JOB_ROLE_ARN, and YOUR_BUCKET):

aws emr-serverless start-job-run \
    --application-id <APPLICATION_ID> \
    --execution-role-arn <JOB_ROLE_ARN> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<YOUR_BUCKET>/code/pyspark/myjob/main.py",
            "sparkSubmitParameters": "--conf spark.archives=s3://<YOUR_BUCKET>/code/pyspark/myjob/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
        }
    }'

Note the additional sparkSubmitParameters that point Spark at your packaged dependencies and set the driver and executor environment variables so that both use the Python interpreter inside the unpacked environment archive.

answered Oct 18 '25 by dacort

