My question is very basic. My code works fine, but I am not clear on these two points:
from pyspark.sql import SparkSession, SQLContext
from pyspark.conf import SparkConf

spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .appName("test") \
    .getOrCreate()
print(spark)
sqlContext = SQLContext(spark.sparkContext)  # SQLContext expects a SparkContext, not the session itself
Or can I directly access the spark session object in my script without creating it?
from pyspark.sql import SparkSession, SQLContext
from pyspark.conf import SparkConf

print(spark)  # this could be sc instead -- not sure; I am using Spark 2
sqlContext = SQLContext(spark.sparkContext)
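For reference, in the Spark 2.x pyspark shell both handles are predefined, so you can check which one you have:

print(type(spark))               # <class 'pyspark.sql.session.SparkSession'>
print(type(sc))                  # <class 'pyspark.context.SparkContext'>
print(sc is spark.sparkContext)  # True -- sc is the session's underlying context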
And if a spark session object is already available, how can I add config properties such as the one below, or enable Hive support?
spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .config(conf=SparkConf().set("spark.driver.maxResultSize", "2g")) \
    .appName("test") \
    .getOrCreate()
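Note that the builder also accepts a key/value pair directly, so the same configuration can be set without constructing a SparkConf; a minimal equivalent sketch:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .config("spark.driver.maxResultSize", "2g") \
    .appName("test") \
    .getOrCreate()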
My doubt is: if I submit the job using spark-submit and create a spark session object as mentioned above, am I ending up creating two spark sessions?
It would be very helpful if someone could explain the added advantage of using spark-submit over the method in step 2. And do I need to create a spark session object if I invoke the job using spark-submit from the command line?
When we submit any pySpark job using spark-submit, do we need to create a spark session object?
Yes, you do. It is not needed only in the case of the shells (pyspark, spark-shell), where a session is created for you automatically.
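A minimal sketch of such a job (the file name my_job.py is hypothetical; the session is created inside the script itself):

# my_job.py -- minimal standalone PySpark job
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .appName("test") \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()
spark.stop()

It is then launched with spark-submit my_job.py; nothing creates a session for you before the script runs.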
My doubt is: if I submit the job using spark-submit and create a spark session object as mentioned above, am I ending up creating two spark sessions?
If we check the code you have written:
spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .config(conf=SparkConf().set("spark.driver.maxResultSize", "2g")) \
    .appName("test") \
    .getOrCreate()
Observe getOrCreate(): it takes care that at any time only one SparkSession object (spark) exists. If a session is already running (for example, one created by a shell), getOrCreate() simply returns it instead of creating a second one, so you never end up with two sessions.
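A small sketch that demonstrates this (the appName values are arbitrary):

from pyspark.sql import SparkSession

s1 = SparkSession.builder.appName("first").getOrCreate()
s2 = SparkSession.builder.appName("second").getOrCreate()
print(s1 is s2)  # True -- the second call returned the existing session

So even inside the pyspark shell, calling getOrCreate() just hands you back the session the shell already created.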
I would recommend creating the context/session locally in your script: it keeps the code self-contained (not dependent on an object supplied from outside, such as the shell).
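For example, a common way to keep the session local to the script (a sketch; the main() layout is just one convention):

from pyspark.sql import SparkSession

def main():
    # The script owns its session: create it here, stop it when done.
    spark = SparkSession \
        .builder \
        .enableHiveSupport() \
        .appName("test") \
        .getOrCreate()
    try:
        spark.sql("SELECT 1").show()
    finally:
        spark.stop()

if __name__ == "__main__":
    main()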