 

spark-submit --packages is not working on my cluster. What could be the reason?

I am trying to run a sample PostgreSQL database read in a Spark application. I passed the command-line arguments as spark-submit --packages org.postgresql:postgresql:9.3-1101.jdbc41.jar, but I am still getting a ClassNotFoundException. Can you please help me solve this issue?

Asked by Sandy on Oct 28 '25

1 Answer

A similar question is posted here: spark-submit classpath issue with --repositories --packages options.

I was working with Spark 2.4.0 when I ran into this problem. I don't have a solution yet, only some observations based on experimentation and reading around; I am noting them down here in case they help someone in their investigation. I will update this answer if I find more information later.

  • The --repositories option is required only if some custom repository has to be referenced.
  • By default, the Maven Central repository is used if the --repositories option is not provided.
  • When the --packages option is specified, the submit operation looks for the packages and their dependencies in the ~/.ivy2/cache, ~/.ivy2/jars and ~/.m2/repository directories.
  • If they are not found there, they are downloaded from Maven Central using Ivy and stored under the ~/.ivy2 directory (see the example after this list).
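For illustration, the two cases above look roughly like this on the command line. The application jar, main class and internal repository URL are placeholders, and the exact PostgreSQL coordinate should be verified against Maven Central:

    # Packages are resolved from Maven Central via Ivy; the artifacts are
    # cached under ~/.ivy2/cache and ~/.ivy2/jars
    spark-submit \
      --packages org.postgresql:postgresql:9.3-1101-jdbc41 \
      --class com.example.MyApp \
      my-app.jar

    # --repositories is only needed when the package lives in a custom repository
    spark-submit \
      --packages com.example:internal-lib:1.0.0 \
      --repositories https://repo.example.com/maven \
      --class com.example.MyApp \
      my-app.jar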

In my case, I observed that:

  • spark-shell worked perfectly with the --packages option.
  • spark-submit failed to do the same. It downloaded the dependencies correctly but failed to pass the jars on to the driver and worker nodes.
  • spark-submit worked with the --packages option if I ran the driver locally using --deploy-mode client instead of cluster (see the example after this list).
  • In that case the driver ran locally in the shell where I issued the spark-submit command, while the workers ran on the cluster with the appropriate dependency jars.
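To make the last two points concrete, the only difference between the failing and the working invocation was the --deploy-mode flag. The master, main class and jar name below are placeholders for my actual setup:

    # Failed for me: cluster mode resolved the dependencies locally
    # but did not ship them to the driver running on the cluster
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --packages org.postgresql:postgresql:9.3-1101-jdbc41 \
      --class com.example.MyApp \
      my-app.jar

    # Worked for me: the driver runs locally in client mode,
    # and the executors still receive the resolved jars
    spark-submit \
      --master yarn \
      --deploy-mode client \
      --packages org.postgresql:postgresql:9.3-1101-jdbc41 \
      --class com.example.MyApp \
      my-app.jar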

I found the following discussion useful, but I still have to nail down this problem: https://github.com/databricks/spark-redshift/issues/244#issuecomment-347082455

Most people just use an uber jar to avoid running into this problem, and also to avoid conflicting jar versions where the platform provides a different version of the same dependency jar.
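As a rough sketch of that workflow, assuming an sbt project with the sbt-assembly plugin (project names and paths are placeholders), the dependencies are bundled at build time and --packages is no longer needed:

    # Build a fat jar that bundles the PostgreSQL driver and other dependencies
    sbt assembly

    # Submit the assembled jar; no --packages option is required
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.MyApp \
      target/scala-2.11/my-app-assembly-0.1.jar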

But I don't like that idea as anything more than a stopgap arrangement, and I am still looking for a proper solution.

Answered by Dhwani Katagade on Oct 29 '25