Why does Spark create multiple CSV files when saving a DataFrame in CSV format?

Question

I want to understand how Spark determines the number of CSV files it creates when saving a DataFrame as CSV. Does the number of partitions affect this number? And why are some empty files created? My code looks like this:

dataframe.coalesce(numPartitions).write
   .format("com.databricks.spark.csv")
   .option("delimiter", "|")
   .option("header", "true")
   .mode("overwrite")
   .save("outputpath")

koiralo · Accepted Answer

You get multiple files when you save as CSV (or in any other format) because your DataFrame is split into multiple partitions. If the DataFrame has n partitions, n files are saved in the output directory.

Does the number of partitions affect this number?

Yes, the number of partitions is equal to the number of files. When a DataFrame or RDD is saved, each partition is written as a single file.
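One way to check the relationship between partitions and files is to compare the partition count with the number of part files on disk. The following is a minimal sketch, assuming a local SparkSession and Spark 2.x or later with the built-in csv writer; the sample data, the temporary output path, and the countPartFiles helper are hypothetical and not part of the original question.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object PartitionFileCount {
  // Hypothetical helper: counts the part-* CSV data files in a directory.
  // Hidden checksum files (.part-*.crc) and _SUCCESS are not counted.
  def countPartFiles(dir: String): Int =
    Option(new File(dir).listFiles())
      .getOrElse(Array.empty[File])
      .count(f => f.getName.startsWith("part-") && f.getName.endsWith(".csv"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("partition-file-count")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample data: 1000 rows spread over 4 partitions.
    val df = (1 to 1000).toDF("id").repartition(4)
    println(s"partitions: ${df.rdd.getNumPartitions}")

    // Write to a fresh temporary directory and count the resulting part files.
    val out = Files.createTempDirectory("spark-csv-demo").resolve("out").toString
    df.write.option("header", "true").mode("overwrite").csv(out)
    println(s"part files: ${countPartFiles(out)}")

    spark.stop()
  }
}
```

With non-empty partitions, the two printed numbers should match. Calling coalesce(1) before write would reduce the output to a single part file, at the cost of writing all the data through one task.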

Why are some empty files created?

Not every partition necessarily contains data. A partition with no rows can still be written out, which produces an empty file.
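To see which partitions hold no data, you can count the rows in each one before writing. This is a rough sketch under the same assumptions as any local test (a local SparkSession, made-up sample data); the emptyCount helper is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object EmptyPartitionCheck {
  // Hypothetical helper: number of partitions that contain zero rows.
  def emptyCount(sizes: Array[Int]): Int = sizes.count(_ == 0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("empty-partition-check")
      .getOrCreate()
    import spark.implicits._

    // Three rows over eight partitions: at least five partitions must be empty.
    val df = Seq(1, 2, 3).toDF("id").repartition(8)

    // Row count per partition, in partition order, including empty partitions.
    val sizes = df.rdd
      .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
      .collect()
      .sortBy(_._1)
      .map(_._2)

    sizes.zipWithIndex.foreach { case (n, idx) => println(s"partition $idx: $n rows") }
    println(s"empty partitions: ${emptyCount(sizes)}")

    spark.stop()
  }
}
```

Whether an empty partition actually ends up as a file on disk may depend on the Spark version and output format, so treat this as a way to inspect the partitions rather than a prediction of the exact file listing.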

Hope this helps!

Tags:

csv

scala

apache-spark

apache-spark-sql

user3104078
