I am trying to launch an EMR cluster that includes the Postgres driver JAR file so I can load data from Postgres and analyze it with PySpark. I have the the JAR I would like to include stored in S3. I have tried the following things:
1 - entering the following configuration:
[
  {
    "Classification": "presto-connector-postgresql",
    "Properties": {
      "connection-url": "jdbc:postgresql://example.net:5432/database",
      "connection-user": "MYUSER",
      "connection-password": "MYPASS"
    },
    "Configurations": []
  }
]
2 - Adding the JAR as a custom step (selecting the JAR from S3)
3 - Adding the JAR as a custom bootstrap action (selecting the JAR from S3)
None of these work, I can't figure out how to use the connector in Step 1 within Jupyter, and the custom step/bootstrap action both fail when I launch the cluster. How can I launch an EMR cluster with the Postgres drivers installed so I can query my data in Jupyter?
EDIT:
I used the following bootstrap script to copy the JAR to my master/worker nodes:
#!/bin/bash
aws s3 cp s3://BUCKETNAME/postgresql-42.2.8.jar /mnt1/myfolder
But still get the following error:
An error was encountered:
An error occurred while calling o90.load.
: java.lang.ClassNotFoundException: org.postgresql.Driver
With the following code:
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbcURL") \
    .option("user", "user") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .option("query", "select * from slm_files limit 100") \
    .load()
df.count()
Using this code in the first cell of my Jupyter notebook solved it for me:
%%configure -f
{ "conf":{
          "spark.jars": "s3://JAR-LOCATION/postgresql-42.2.8.jar"
         }
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With