I have an Amazon EMR Hadoop v2.6 cluster with Spark 1.4.1 and the YARN resource manager. I want to deploy Zeppelin on a separate machine so that the EMR cluster can be turned off when no jobs are running.
I tried following the instructions at https://zeppelin.incubator.apache.org/docs/install/yarn_install.html without much success.
Can somebody demystify the steps Zeppelin needs to connect to an existing YARN cluster from a different machine?
Zeppelin enables data-driven, interactive data analytics and document collaboration using a number of interpreters such as Scala (with Apache Spark), Python (with Apache Spark), Spark SQL, JDBC, Markdown, Shell and so on. Zeppelin is one of the core applications supported natively by Amazon EMR.
[1] Install Zeppelin, building it with the profiles that match the cluster (Spark 1.4, Hadoop 2.6, YARN):
git clone https://github.com/apache/incubator-zeppelin.git ~/zeppelin;
cd ~/zeppelin;
mvn clean package -Pspark-1.4 -Dhadoop.version=2.6.0 -Phadoop-2.6 -Pyarn -DskipTests
[2] Update the EMR_MASTER EC2 security group to accept incoming requests on all ports, so that the cluster can communicate with Zeppelin (this should be restricted to specific ports, but I do not yet know which).
[3] Copy the directory EMR_MASTER:/etc/hadoop/conf to MY_STANDALONE_SERVER:/home/zeppelin/hadoop-conf.
[4] zeppelin/conf/zeppelin-env.sh should contain:
export MASTER=yarn-client
export HADOOP_CONF_DIR=/home/zeppelin/hadoop-conf
Note: Spark parameters such as spark.executor.instances are taken from the Interpreter settings, so they must be specified there.
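Steps [3] and [4] can be sanity-checked before starting Zeppelin. The following is a minimal sketch, not part of the original answer: the function name check_hadoop_conf is hypothetical, and it assumes the copied directory should contain at least core-site.xml and yarn-site.xml, the files a YARN client normally reads from HADOOP_CONF_DIR.

```shell
# Hypothetical helper: check that a copied Hadoop conf directory looks usable.
# Assumption: a YARN client needs at least core-site.xml and yarn-site.xml.
check_hadoop_conf() {
    local conf_dir="$1"
    local missing=0
    local f
    for f in core-site.xml yarn-site.xml; do
        if [ ! -f "$conf_dir/$f" ]; then
            # Report each missing file on stderr and remember the failure.
            echo "missing: $conf_dir/$f" >&2
            missing=1
        fi
    done
    # Exit status 0 when both files exist, 1 otherwise.
    return "$missing"
}
```

It could be used as a guard before starting the daemon, for example: check_hadoop_conf /home/zeppelin/hadoop-conf && ~/zeppelin/bin/zeppelin-daemon.sh start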