 

spark-submit --packages is not working on my cluster. What could be the reason?

I am trying to run a sample PostgreSQL database read in a Spark application. I passed the command-line arguments as spark-submit --packages org.postgresql:postgresql:9.3-1101.jdbc41.jar, but I am still getting a ClassNotFoundException. Can you please help me solve this issue?

Asked by Sandy on Oct 28 '25

1 Answer

A similar question is posted here: spark-submit classpath issue with --repositories --packages options.

I was working with Spark 2.4.0 when I ran into this problem. I don't have a solution yet, only some observations based on experimentation and reading around; I am noting them down here in case they help someone in their investigation. I will update this answer if I find more information later.

  • The --repositories option is required only if some custom repository has to be referenced.
  • By default, the Maven Central repository is used if the --repositories option is not provided.
  • When the --packages option is specified, the submit operation looks for the packages and their dependencies in the ~/.ivy2/cache, ~/.ivy2/jars and ~/.m2/repository directories.
  • If they are not found there, they are downloaded from Maven Central using Ivy and stored under the ~/.ivy2 directory (see the example after this list).
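For illustration, the two cases above look roughly like this on the command line. The application jar, main class and internal repository URL are placeholders, and the exact PostgreSQL coordinate should be verified against Maven Central:

    # Packages are resolved from Maven Central via Ivy; the artifacts are
    # cached under ~/.ivy2/cache and ~/.ivy2/jars
    spark-submit \
      --packages org.postgresql:postgresql:9.3-1101-jdbc41 \
      --class com.example.MyApp \
      my-app.jar

    # --repositories is only needed when the package lives in a custom repository
    spark-submit \
      --packages com.example:internal-lib:1.0.0 \
      --repositories https://repo.example.com/maven \
      --class com.example.MyApp \
      my-app.jar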

In my case, I observed that:

  • spark-shell worked perfectly with the --packages option.
  • spark-submit failed to do the same. It downloaded the dependencies correctly but failed to pass the jars on to the driver and worker nodes.
  • spark-submit worked with the --packages option if I ran the driver locally using --deploy-mode client instead of cluster (see the example after this list).
  • In that case the driver ran locally in the shell where I issued the spark-submit command, while the workers ran on the cluster with the appropriate dependency jars.
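To make the last two points concrete, the only difference between the failing and the working invocation was the --deploy-mode flag. The master, main class and jar name below are placeholders for my actual setup:

    # Failed for me: cluster mode resolved the dependencies locally
    # but did not ship them to the driver running on the cluster
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --packages org.postgresql:postgresql:9.3-1101-jdbc41 \
      --class com.example.MyApp \
      my-app.jar

    # Worked for me: the driver runs locally in client mode,
    # and the executors still receive the resolved jars
    spark-submit \
      --master yarn \
      --deploy-mode client \
      --packages org.postgresql:postgresql:9.3-1101-jdbc41 \
      --class com.example.MyApp \
      my-app.jar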

I found the following discussion useful, but I still have to nail down this problem: https://github.com/databricks/spark-redshift/issues/244#issuecomment-347082455

Most people just use an uber jar to avoid running into this problem, and also to avoid conflicting jar versions where the platform provides a different version of the same dependency jar.
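As a rough sketch of that workflow, assuming an sbt project with the sbt-assembly plugin (project names and paths are placeholders), the dependencies are bundled at build time and --packages is no longer needed:

    # Build a fat jar that bundles the PostgreSQL driver and other dependencies
    sbt assembly

    # Submit the assembled jar; no --packages option is required
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.MyApp \
      target/scala-2.11/my-app-assembly-0.1.jar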

But I don't like that idea as anything more than a stopgap arrangement, and I am still looking for a proper solution.

Answered by Dhwani Katagade on Oct 29 '25