I am (very) new to AWS and Spark in general, and I'm trying to run a notebook instance in Amazon EMR. When I try to import pyspark to start a session and load data from S3, I get the error No module named 'pyspark'. The cluster I created had the Spark option selected, so what am I doing wrong?
The only solution that worked for me was to switch the notebook kernel to the PySpark kernel and add a bootstrap action that installs, for Python 3.6, the packages that are not available in the PySpark kernel by default:
#!/bin/bash
sudo python3.6 -m pip install numpy \
    matplotlib \
    pandas \
    seaborn \
    pyspark
Apparently pip installs packages for Python 2.7.16 by default, so the command finishes without any error message, yet you can't import the modules from the notebook because they land in the wrong interpreter; calling python3.6 -m pip explicitly, as above, installs them for the Python 3.6 that the PySpark kernel uses.
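To confirm which interpreter your notebook kernel is actually running, a quick check like this in a notebook cell can help (standard library only, no assumptions beyond running it inside the kernel):

# Print the Python version and executable path the kernel is using;
# if this shows 2.7.x, packages installed with python3.6 -m pip won't be visible here.
import sys
print(sys.version)
print(sys.executable)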
You can open JupyterLab and create a new Spark notebook from there; this initializes the Spark context for you automatically. Alternatively, open a Jupyter notebook and start the Spark application with the %%spark magic.
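Once the PySpark kernel (or %%spark) has started the Spark application, a SparkSession is already exposed as spark, so loading data from S3 looks roughly like the sketch below. The bucket and path are placeholders; substitute your own S3 location.

# The PySpark kernel pre-creates the `spark` SparkSession -- no import or
# SparkSession.builder call is needed.
df = spark.read.csv(
    "s3://your-bucket/path/to/data.csv",  # placeholder S3 path
    header=True,
    inferSchema=True,
)
df.show(5)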