Why are there two options to read a CSV file in PySpark? Which one should I use?

Question

Spark 2.4.4:

I want to import a CSV file, but there are two ways to do it. Why is that? Which one is better, and which should I use?

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[2]") \
    .config('spark.cores.max', '3') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.driver.memory','1g') \
    .getOrCreate()

Option 1

df = spark.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("data/myfile.csv")

Option 2

df = spark.read.load("data/myfile.csv", format="csv", inferSchema="true", header="true")

OneCricketeer · Accepted Answer

As of Spark 2, the CSV reader is built in, so there is no need to write out the full com.databricks.spark.csv package name; the short name csv is enough. Option 2 is therefore preferred.

Or, slightly shorter:

spark.read.csv("data/myfile.csv", inferSchema=True, header=True)

That said, Option 2 is the better choice if you want to move the input format out of the code and into a configuration file, since load() takes the format as a parameter.
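To illustrate the configuration-file idea, here is a minimal sketch. The config layout (the keys path, format, and options) and the helper name load_from_config are assumptions made up for this example, not part of Spark; the only Spark call used is spark.read.load with its format keyword, as in Option 2.

```python
import json

# Hypothetical configuration. In practice this text might live in a
# separate JSON file; the key names here are an assumption for this sketch.
CONFIG_JSON = """
{
  "path": "data/myfile.csv",
  "format": "csv",
  "options": {"header": "true", "inferSchema": "true"}
}
"""


def load_from_config(spark, config):
    """Read a DataFrame using the path, format, and options named in config.

    `spark` is expected to be a SparkSession; the format is passed through
    to spark.read.load, so no reader method name is hard-coded here.
    """
    return spark.read.load(
        config["path"],
        format=config["format"],
        **config.get("options", {}),
    )


config = json.loads(CONFIG_JSON)

# With the SparkSession from the question:
# df = load_from_config(spark, config)
```

Switching the input to another format would then mean editing the config rather than the code, which is not possible with spark.read.csv because the format is fixed by the method name.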

Tags: python, apache-spark, pyspark, apache-spark-2.0

phez1
