I am (very) new to AWS and Spark in general, and I'm trying to run a notebook instance in Amazon EMR. When I try to import pyspark to start a session and load data from S3, I get the error No module named 'pyspark'. The cluster I created had the Spark option selected, so what am I doing wrong?
The only solution that worked for me was to switch the notebook kernel to the PySpark kernel and add a bootstrap action that installs, for Python 3.6, the packages that are not available in the PySpark kernel by default:
#!/bin/bash
sudo python3.6 -m pip install numpy \
    matplotlib \
    pandas \
    seaborn \
    pyspark
Apparently pip installs packages for Python 2.7.16 by default, so the command finishes without any error message, yet you can't import the modules from the notebook because they land in the wrong interpreter; calling python3.6 -m pip explicitly, as above, installs them for the Python 3.6 that the PySpark kernel uses.
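To confirm which interpreter your notebook kernel is actually running, a quick check like this in a notebook cell can help (standard library only, no assumptions beyond running it inside the kernel):

# Print the Python version and executable path the kernel is using;
# if this shows 2.7.x, packages installed with python3.6 -m pip won't be visible here.
import sys
print(sys.version)
print(sys.executable)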
You can open JupyterLab and create a new Spark notebook from there; this initializes the Spark context for you automatically. Alternatively, open a Jupyter notebook and start the Spark application with the %%spark magic.
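Once the PySpark kernel (or %%spark) has started the Spark application, a SparkSession is already exposed as spark, so loading data from S3 looks roughly like the sketch below. The bucket and path are placeholders; substitute your own S3 location.

# The PySpark kernel pre-creates the `spark` SparkSession -- no import or
# SparkSession.builder call is needed.
df = spark.read.csv(
    "s3://your-bucket/path/to/data.csv",  # placeholder S3 path
    header=True,
    inferSchema=True,
)
df.show(5)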