
Why does Spark create multiple CSV files when saving a DataFrame in CSV format?

I want to understand how Spark determines the number of CSV files it creates when saving a DataFrame as CSV. Does the number of partitions affect this number? And why are some empty files created? My code looks like this:

dataframe.coalesce(numPartitions).write
   .format("com.databricks.spark.csv")
   .option("delimiter", "|")
   .option("header", "true")
   .mode("overwrite")
   .save("outputpath")
user3104078 asked Mar 23 '26 06:03



1 Answer

Multiple files are written when you save as CSV (or any other format) because your DataFrame has multiple partitions. If the DataFrame has n partitions, you get n files in the output directory.

Does the number of partitions affect this number?

Yes, the number of partitions equals the number of files. When a DataFrame or RDD is saved, each partition is written as a single file.
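
To see the relationship for yourself, here is a minimal sketch that writes a small DataFrame and compares its partition count with the number of part files on disk. It assumes Spark 2.x or later (so it uses the built-in `csv` writer rather than `com.databricks.spark.csv`) and a local master. The object name `PartitionFileCount`, the helper `writeAndCount`, the sample data, and the temp output directory are all made up for illustration.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object PartitionFileCount {
  // Writes a sample DataFrame with `numPartitions` partitions as CSV and
  // returns (partition count, number of part files found on disk).
  def writeAndCount(spark: SparkSession, numPartitions: Int): (Int, Int) = {
    import spark.implicits._

    // Hypothetical sample data: 100 rows, spread round-robin so that
    // every partition holds some rows.
    val df = (1 to 100).toDF("id").repartition(numPartitions)

    // Assumed scratch location; replace with your own output path.
    val out = Files.createTempDirectory("csv-out").resolve("result").toString

    df.write
      .option("delimiter", "|")
      .option("header", "true")
      .mode("overwrite")
      .csv(out)

    // Count only data files; Spark also writes _SUCCESS and hidden .crc files.
    val partFiles = new File(out).listFiles().count(_.getName.startsWith("part-"))
    (df.rdd.getNumPartitions, partFiles)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-file-count")
      .master("local[2]")
      .getOrCreate()

    val (partitions, files) = writeAndCount(spark, 4)
    println(s"partitions: $partitions, part files: $files")

    spark.stop()
  }
}
```

Because every partition here contains rows, the two numbers should match. Lowering the value passed to `coalesce` or `repartition` before the write, as in the question's code, reduces the number of files accordingly.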

Why are some empty files created?

Not every partition necessarily contains data. A partition with no rows can still be written out, which produces an empty file.
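
If you want to check whether your DataFrame has empty partitions before writing, you can count the rows in each one. The following is only a sketch under the same assumptions as before (Spark 2.x or later, local master); `EmptyPartitions` and `rowsPerPartition` are hypothetical names, and the three-row DataFrame is sample data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EmptyPartitions {
  // Returns the number of rows in each partition, in partition order.
  def rowsPerPartition(df: DataFrame): Array[Long] =
    df.rdd
      .mapPartitions(rows => Iterator(rows.size.toLong))
      .collect()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("empty-partitions")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: only 3 rows.
    val df = Seq(1, 2, 3).toDF("id")

    // Hash-partitioning 3 rows into 8 partitions leaves at least 5 of them empty.
    val sparse = df.repartition(8, $"id")

    val counts = rowsPerPartition(sparse)
    counts.zipWithIndex.foreach { case (n, i) => println(s"partition $i: $n rows") }
    println(s"empty partitions: ${counts.count(_ == 0L)}")

    spark.stop()
  }
}
```

Whether an empty partition actually ends up as an empty file depends on the Spark version and the writer in use, so treat the counts as a hint about where empty files may come from. Coalescing to a partition count closer to the amount of data is the usual way to avoid them.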

Hope this helps!

koiralo answered Mar 26 '26 01:03



