 

No module named 'pyspark' when running Jupyter notebook inside EMR

I am (very) new to AWS and Spark in general, and I'm trying to run a notebook instance on Amazon EMR. When I try to import pyspark to start a session and load data from S3, I get the error No module named 'pyspark'. The cluster I created was launched with the Spark option selected, so what am I doing wrong?

asked Sep 06 '25 by RafaJM

2 Answers

The only solution that worked for me was to switch the notebook kernel to the PySpark kernel, and then change the bootstrap action to install the packages (under Python 3.6) that are not included by default in the PySpark kernel:

#!/bin/bash
sudo python3.6 -m pip install numpy \
    matplotlib \
    pandas \
    seaborn \
    pyspark
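For this to take effect, the script above has to be registered as a bootstrap action when the cluster is created. A minimal sketch with the AWS CLI (the bucket name, script path, release label, and instance settings are placeholders, not from the original answer):

```shell
# Sketch: create an EMR cluster that runs the install script at bootstrap.
# s3://my-bucket/install-python-libs.sh is a placeholder for wherever
# you uploaded the script above.
aws emr create-cluster \
    --name "spark-notebook-cluster" \
    --release-label emr-5.30.0 \
    --applications Name=Spark \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --bootstrap-actions Path=s3://my-bucket/install-python-libs.sh
```

Bootstrap actions run on every node before applications start, which is why the packages end up visible to the PySpark kernel.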

Apparently a plain pip install goes into Python 2.7.16 by default, so it finishes without any error message, but you still can't import the modules because the PySpark kernel runs on Python 3.6; that is why the script calls python3.6 -m pip explicitly.
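A quick way to check for this kind of mismatch is to ask the interpreter directly which pip it is bound to; a small sketch (here `python3` stands in for whichever python3.x your kernel uses):

```shell
# Show which interpreter this pip installs into; on an affected EMR node,
# a bare `pip` would report python 2.7 here instead of 3.x.
python3 -m pip --version
# Print the interpreter version itself for comparison.
python3 -c "import sys; print('%d.%d' % sys.version_info[:2])"
```

If the two versions disagree with what your notebook kernel reports, your packages are being installed into the wrong interpreter.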

answered Sep 07 '25 by RafaJM

You can open JupyterLab and create a new Spark notebook from there. This will initialize the Spark context automatically for you.


Or you can open a regular Jupyter notebook and load the Spark application with the %%spark magic.
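As a sketch of what such a cell looks like once a sparkmagic kernel is attached (the S3 path is a placeholder, not from the answer):

```python
%%spark
# Runs on the cluster via the sparkmagic %%spark cell magic;
# `spark` is the SparkSession the magic provides.
df = spark.read.csv("s3://my-bucket/some-data.csv", header=True)
df.show(5)
```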


answered Sep 07 '25 by naren