I have set up an EMR cluster with the AWS Glue Data Catalog enabled.

I can access the Data Catalog when I use Zeppelin, but not with jobs/steps I submit like:

aws emr add-steps --cluster-id j-XXXXXX --steps "Type=Spark,Name=Test,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,2,--executor-cores,2,--executor-memory,8g,s3://XXXXXX/emr-test.py],ActionOnFailure=CONTINUE"
I cannot see my Data Catalog when I use spark.sql("USE xxx") or spark.sql("SHOW DATABASES"). Why is that? My job script is:
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession \
    .builder \
    .appName("Test") \
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .getOrCreate()

spark.sql("USE ...")
spark.sql("SHOW TABLES").show()
spark.sql("SELECT querydatetime FROM flights LIMIT 10").show(10)

sc.stop()
I get something like:

pyspark.sql.utils.AnalysisException: u"Database 'xxxxxx' not found;"
I found out from https://michael.ransley.co/2018/08/28/spark-glue.html that to access the tables from within a Spark step, you need to instantiate the Spark session with the Glue catalog:
spark = SparkSession.builder \
    .appName(job_name) \
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .enableHiveSupport() \
    .getOrCreate()
spark.catalog.setCurrentDatabase("mydatabase")
I was missing the line .enableHiveSupport(). It's quite unfortunate that this does not seem to be documented in the official docs.
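For reference, here is a minimal corrected sketch of the full job script from my question with that line added. It can only run on a cluster with the Glue catalog configured; the database name "mydatabase" and the flights table are placeholders from my example.

```python
from pyspark.sql import SparkSession

# Build the session with the Glue Data Catalog as the Hive metastore.
# .enableHiveSupport() is the line that was missing from the original script:
# without it, Spark uses its default in-memory catalog and the Glue
# databases are not visible.
spark = (
    SparkSession.builder
    .appName("Test")
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)

spark.catalog.setCurrentDatabase("mydatabase")  # placeholder database name
spark.sql("SHOW TABLES").show()
spark.sql("SELECT querydatetime FROM flights LIMIT 10").show(10)

spark.stop()
```

Note that SparkSession manages its own SparkContext, so the separate SparkContext()/sc.stop() calls from the original script are unnecessary.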