I'm trying to start using Delta Lake with PySpark.
To be able to use Delta Lake, I invoke pyspark from the Anaconda shell prompt like this:
pyspark --packages io.delta:delta-core_2.11:0.3.0
Here is the Delta Lake quick-start guide I am following: https://docs.delta.io/latest/quick-start.html
All Delta Lake commands work fine from the Anaconda shell prompt.
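For example, the first write from the quick-start runs cleanly in that shell:

# Works in the pyspark shell started with --packages above
data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")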
In a Jupyter notebook, however, any reference to a Delta Lake table gives an error. Here is the code I am running in the notebook:
df_advisorMetrics.write.mode("overwrite").format("delta").save("/DeltaLake/METRICS_F_DELTA")
spark.sql("create table METRICS_F_DELTA using delta location '/DeltaLake/METRICS_F_DELTA'")
Below is the code I use at the start of the notebook to connect to PySpark:
import findspark
findspark.init()   # point the notebook at the local Spark installation
findspark.find()   # confirm the SPARK_HOME path that was picked up
import pyspark
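The SparkSession itself is then created with a plain builder, along these lines (sketched here for completeness; note that no Delta package is configured at this point):

from pyspark.sql import SparkSession

# Plain session: no Delta jars are requested from the notebook
spark = SparkSession.builder.getOrCreate()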
Below is the error I get:
Py4JJavaError: An error occurred while calling o116.save. : java.lang.ClassNotFoundException: Failed to find data source: delta. Please find packages at http://spark.apache.org/third-party-projects.html
Any suggestions?
I have created a Google Colab/Jupyter Notebook example that shows how to run Delta Lake.
https://github.com/prasannakumar2012/spark_experiments/blob/master/examples/Delta_Lake.ipynb
It has all the steps needed to run. It uses the latest Spark and Delta versions, so please change the versions to match your setup.
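The key step, in short: the Delta package must be on the classpath when the notebook's JVM starts, so configure it on the builder before the first getOrCreate(). A minimal sketch (the coordinates, app name, and path below are examples; match the delta-core Scala suffix and version to your Spark build, e.g. delta-core_2.12 for Spark 3.x, and the two spark.sql.* settings apply from Delta 0.7.0 onward):

from pyspark.sql import SparkSession

# Request the Delta package before the JVM starts; this only takes
# effect if no SparkSession/SparkContext exists in the notebook yet.
spark = (
    SparkSession.builder
    .appName("delta-quickstart")  # example name
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Smoke test: write and read back a small Delta table.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-test")
spark.read.format("delta").load("/tmp/delta-test").show()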