 

Working with jdbc jar in pyspark

I need to read from a PostgreSQL database in PySpark. I know this has been asked before, such as here, here, and many other places; however, the solutions there either use a jar in the local running directory or copy it to all workers manually.

I downloaded the postgresql-9.4.1208 jar and placed it in /tmp/jars. I then proceeded to call pyspark with the --jars and --driver-class-path switches:

pyspark --master yarn --jars /tmp/jars/postgresql-9.4.1208.jar --driver-class-path /tmp/jars/postgresql-9.4.1208.jar

Inside pyspark I did:

df = sqlContext.read.format("jdbc").options(url="jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd", dbtable="table_name").load()
df.count()

However, while using --jars and --driver-class-path worked fine for jars I created, it failed for jdbc and I got an exception from the workers:

 java.lang.IllegalStateException: Did not find registered driver with class org.postgresql.Driver

If I copy the jar manually to all workers and add --conf spark.executor.extraClassPath and --conf spark.driver.extraClassPath, it does work (with the same jar). Incidentally, the documentation suggests using SPARK_CLASSPATH, which is deprecated and effectively just adds these two switches (but has the side effect of preventing the addition of OTHER jars with the --jars option, which I need to do).
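For reference, the manual workaround described above looks roughly like this (the /tmp/jars path is just my example location and must exist on every worker):

```shell
# Workaround: jar must be present at the SAME path on the driver and on
# every worker node before launching.
pyspark --master yarn \
  --conf spark.driver.extraClassPath=/tmp/jars/postgresql-9.4.1208.jar \
  --conf spark.executor.extraClassPath=/tmp/jars/postgresql-9.4.1208.jar
```

This is exactly what I want to avoid, since it means distributing the jar by hand instead of letting --jars ship it.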

So my question is: what is special about the JDBC driver that makes it not work, and how can I add it without manually copying it to all workers?

Update:

I did some more looking and found this in the documentation: "The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java’s DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs.".

The problem is I can't seem to find compute_classpath.sh, nor do I understand what the primordial class loader means.

I did find this, which basically explains that this needs to be done locally. I also found this, which says there is a fix, but it is not yet available in version 1.6.1.

Assaf Mendelson asked Sep 01 '25

1 Answer

I found a solution which works (I don't know if it is the best one, so feel free to keep commenting). Apparently, if I add the option driver="org.postgresql.Driver", it works properly. That is, my full line (inside pyspark) is:

df = sqlContext.read.format("jdbc").options(url="jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd", dbtable="table_name",driver="org.postgresql.Driver").load()
df.count()
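To make the fix stand out: the only difference from the failing call is the extra driver key in the options. A tiny helper (the function name is mine, purely for illustration) shows the shape of the mapping being passed:

```python
def jdbc_options(url, dbtable, driver="org.postgresql.Driver"):
    # Build the mapping passed to sqlContext.read.format("jdbc").options(**opts).
    # Naming the driver class explicitly lets Spark register it on each executor,
    # rather than relying on java.sql.DriverManager discovery, which only sees
    # drivers visible to the primordial class loader.
    return {"url": url, "dbtable": dbtable, "driver": driver}

opts = jdbc_options(
    "jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd",
    "table_name",
)
# df = sqlContext.read.format("jdbc").options(**opts).load()
```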

Another thing: if you are already using a fat jar of your own (as I am in my full application), all you need to do is add the JDBC driver to your pom file like so:

    <dependency>
      <groupId>org.postgresql</groupId>
      <artifactId>postgresql</artifactId>
      <version>9.4.1208</version>
    </dependency>

and then you don't have to add the driver as a separate jar; just use the jar with dependencies.
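For completeness, "jar with dependencies" here means a shaded/uber jar. A minimal sketch of a maven-shade-plugin configuration that bundles the driver into your application jar (the plugin version below is an assumption; adjust to your build):

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.4.3</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```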

Assaf Mendelson answered Sep 02 '25