I want to understand how Spark determines the number of CSV files it creates when saving a DataFrame as CSV. Does the number of partitions affect this number? And why are some empty files created? My code is as follows:
dataframe.coalesce(numPartitions).write
.format("com.databricks.spark.csv")
.option("delimiter", "|")
.option("header", "true")
.mode("overwrite")
.save("outputpath")
Multiple files appear when you save as CSV (or any other format) because your DataFrame has multiple partitions. If the DataFrame has n partitions, you get n part files in the output directory.
Does the number of partitions affect this number?
Yes, the number of files equals the number of partitions. When a DataFrame or RDD is saved, each partition is written as a single file.
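To see how many files a write is likely to produce, you can inspect the partition count before saving. The following is a minimal sketch, assuming a local SparkSession and a synthetic DataFrame; the object name PartitionFileCount and the helper partitionCount are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PartitionFileCount {
  // Number of partitions of the DataFrame; each partition is written by its own task.
  def partitionCount(df: DataFrame): Int = df.rdd.getNumPartitions

  def main(args: Array[String]): Unit = {
    // Assumption: a local session with 4 threads, just for experimenting.
    val spark = SparkSession.builder()
      .appName("partition-file-count")
      .master("local[4]")
      .getOrCreate()

    // Synthetic data, explicitly split into 8 partitions.
    val df = spark.range(0, 1000).toDF("id").repartition(8)

    println(s"original:     ${partitionCount(df)}")
    // coalesce lowers the partition count without a full shuffle.
    println(s"coalesce(2):  ${partitionCount(df.coalesce(2))}")
    // coalesce never raises the count; asking for more than exist leaves it unchanged.
    println(s"coalesce(16): ${partitionCount(df.coalesce(16))}")

    spark.stop()
  }
}
```

Note that coalesce(numPartitions) can only reduce the number of partitions, so if numPartitions is larger than the current count, the number of output files stays the same.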
Why are some empty files created?
Not every partition necessarily contains data. A partition that holds no rows can still be written out, which produces an empty file.
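To check whether empty partitions are behind the empty files, you can count the rows in each partition. This is a rough sketch under the same assumptions as before (a local SparkSession and made-up data); EmptyPartitions and rowsPerPartition are hypothetical names, not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EmptyPartitions {
  // Returns one (partitionIndex, rowCount) pair per partition.
  def rowsPerPartition(df: DataFrame): Array[(Int, Int)] =
    df.rdd
      .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
      .collect()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("empty-partitions")
      .master("local[4]")
      .getOrCreate()

    // Only 3 rows spread over 8 partitions, so most partitions hold no rows.
    val df = spark.range(0, 3).toDF("id").repartition(8)

    val counts = rowsPerPartition(df)
    counts.foreach { case (idx, n) => println(s"partition $idx: $n rows") }
    println(s"empty partitions: ${counts.count(_._2 == 0)}")

    spark.stop()
  }
}
```

If many partitions turn out to be empty, coalescing to a smaller number before writing, as in the question's code, reduces how many of them reach the output.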
Hope this helps!