How to allow PySpark to run code on an EMR cluster

We use Python with the PySpark API to run simple code on a Spark cluster.

from pyspark import SparkContext, SparkConf

# point the driver at the standalone Spark master and create the context
conf = SparkConf().setAppName('appName').setMaster('spark://clusterip:7077')
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4])
rdd.map(lambda x: x**2).collect()

It works when we set up a Spark cluster locally and with Docker.

We would now like to start an EMR cluster and test the same code, but it seems that PySpark can't connect to the Spark cluster on EMR.

We opened ports 8080 and 7077 from our machine to the Spark master.

We are getting past the firewall, but it seems that nothing is listening on port 7077 and we get connection refused.

We found this explaining how to submit a job using the CLI, but we need to run it directly from the PySpark API on the driver.

What are we missing here?

How can one start an EMR cluster and actually run PySpark code locally in Python against that cluster?

Edit: running this code from the master itself works. As opposed to what was suggested, when connecting to the master over SSH and running Python from the terminal, the very same code (with the master IP adjusted, given it's the same machine) works with no issues. How does this make sense given the documentation that clearly states otherwise?

asked by thebeancounter


1 Answer

You are trying to run pyspark (which calls spark-submit) from a remote computer outside the Spark cluster. This is technically possible, but it is not the intended way of deploying applications. In YARN mode it makes your computer participate in the Spark protocol as a client, so it would require opening several ports and installing exactly the same Spark jars as on AWS EMR.
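If you really wanted to go that route, the sketch below shows roughly what client-mode submission from outside the cluster would involve; the key file, hostnames and paths are placeholders, and the local Spark/Hadoop install would have to match the EMR release.

# rough sketch only: remote client-mode submission against EMR's YARN (not recommended)
# assumes a local Spark install matching the EMR release and open security-group ports
scp -i ~/dataScienceKey.pem -r hadoop@<emr-master-dns>:/etc/hadoop/conf ./emr-hadoop-conf
export HADOOP_CONF_DIR=$PWD/emr-hadoop-conf
export YARN_CONF_DIR=$PWD/emr-hadoop-conf
# the YARN ResourceManager port and the driver<->executor ports must be reachable in both directions
spark-submit --master yarn --deploy-mode client my-job.py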

From the spark-submit docs:

 A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster)

A simple deploy strategy is:

  • sync the code to the master node via rsync, scp or git

    cd ~/projects/spark-jobs  # on the local machine
    EMR_MASTER_IP='255.16.17.13'
    TARGET_DIR=spark_jobs
    rsync -avz -e "ssh -i ~/dataScienceKey.pem" --rsync-path="mkdir -p ${TARGET_DIR} && rsync" --delete ./ hadoop@${EMR_MASTER_IP}:${TARGET_DIR}

  • ssh to the master node

    ssh -i ~/dataScienceKey.pem hadoop@${EMR_MASTER_IP}

  • run spark-submit on the master node

    cd spark_jobs
    spark-submit --master yarn --deploy-mode cluster my-job.py

# my-job.py
from pyspark.sql import SparkSession

# in cluster mode the master comes from the YARN environment, so no setMaster() is needed
spark = SparkSession.builder.appName("my-job-py").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4])
res = rdd.map(lambda x: x**2).collect()
print(res)

There is also a way to submit the job directly to the EMR cluster without syncing. Spark on EMR runs Apache Livy on port 8998 by default. It is a REST web service which lets you submit jobs via a REST API; you can pass the same spark-submit parameters in a curl request from your machine. See the doc.
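For example, a minimal Livy batch submission could look like the following sketch, assuming the job script has been uploaded to S3 and port 8998 on the master is reachable; the bucket name, hostname and batch id are placeholders.

# submit my-job.py as a Livy batch (the file must be readable by the cluster, e.g. on S3)
curl -X POST -H "Content-Type: application/json" \
  -d '{"file": "s3://my-bucket/spark_jobs/my-job.py", "name": "my-job-py"}' \
  http://<emr-master-dns>:8998/batches

# poll the batch state (replace 0 with the id returned above)
curl http://<emr-master-dns>:8998/batches/0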

For interactive development we have also configured locally running Jupyter notebooks which automatically submit cell runs to Livy. This is done via the sparkmagic project.
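A minimal local setup for that could look like the sketch below, assuming sparkmagic's default config path and an unauthenticated Livy endpoint on the EMR master; the hostname is a placeholder.

pip install sparkmagic

# point sparkmagic at the Livy endpoint on the EMR master
mkdir -p ~/.sparkmagic
cat > ~/.sparkmagic/config.json <<'EOF'
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://<emr-master-dns>:8998",
    "auth": "None"
  }
}
EOF

# then, inside a notebook:
#   %load_ext sparkmagic.magics
#   %manage_spark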

answered by dre-hh


