Spark 2.4.4:
I want to read a CSV file into a DataFrame, and I have found two ways to do it. Why are there two, and which one should I use?
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[2]") \
    .config('spark.cores.max', '3') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.driver.memory', '1g') \
    .getOrCreate()
# Option 1: fully qualified format name
df = spark.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("data/myfile.csv")
# Option 2: short format name
df = spark.read.load("data/myfile.csv", format="csv", inferSchema="true", header="true")
As of Spark 2, the CSV reader is included in Spark itself, so you no longer need to write out the full name com.databricks.spark.csv; the short name "csv" is enough. Therefore option 2 is preferred.
Or, slightly shorter:
spark.read.csv("data/myfile.csv", inferSchema=True, header=True)
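But the spark.read.load(..., format="csv") form from option 2 would be better if you extracted the input format into some configuration file, because the format is then just a string argument you can pass in.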
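As a rough illustration of the configuration-file idea, here is a minimal sketch. The config layout (the "path", "format" and "options" keys) and the helper name read_input are made up for this example and are not part of Spark; only spark.read.load(path, format=..., **options) is the actual PySpark call.

```python
import json

# Hypothetical configuration; in practice this text might come from a JSON file.
CONFIG_TEXT = """
{
  "path": "data/myfile.csv",
  "format": "csv",
  "options": {"header": "true", "inferSchema": "true"}
}
"""


def read_input(spark, config):
    """Load a DataFrame using the format and options named in `config`.

    `read_input` and the config keys are illustrative assumptions;
    the only Spark API used is DataFrameReader.load.
    """
    return spark.read.load(
        config["path"],
        format=config["format"],
        **config.get("options", {}),
    )


config = json.loads(CONFIG_TEXT)
# df = read_input(spark, config)  # assumes `spark` is an existing SparkSession
```

Switching the input to, say, Parquet would then only mean changing "format" (and the options) in the configuration, not the code.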