
How to understand each part of the name of a Parquet file

Case: part-00000-deb4a3d4-d8c3-4983-8756-ad7e0b29e780.c000.snappy.parquet

I can't find the rules for naming a Parquet file in the code. Could someone explain them?

Code: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala

jay Wong asked Oct 17 '25 02:10


1 Answer

In this case:

"part-00000" signifies the split number, i.e. the number of the partition that this file was written from.

"-deb4a3d4-d8c3-4983-8756-ad7e0b29e780" is a random UUID that lets concurrent write processes in Spark actions avoid file name conflicts.

"c000" is a counter indicating how many files have already been written for this partition. Here it is zero, and it counts up from there. I'm not sure what happens if 999 is exceeded, to be honest.

".snappy.parquet" is the compression codec (Snappy) followed by the file format (Parquet).
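To make the breakdown above concrete, here is a minimal Python sketch that splits such a file name into its parts. It is not Spark code: the regular expression, the `PartFileName` type, and the `parse_part_file` function are hypothetical names, and the layout `part-<split>-<uuid>.c<counter>.<codec>.<format>` is an assumption inferred from the example file name, with the codec segment treated as optional.

```python
import re
from typing import NamedTuple, Optional


class PartFileName(NamedTuple):
    """Hypothetical container for the pieces of a part file name."""
    split: int            # partition/split number, e.g. 0 for "part-00000"
    job_uuid: str         # random UUID shared by files of one write
    counter: int          # per-partition file counter, e.g. 0 for "c000"
    codec: Optional[str]  # compression codec such as "snappy", if present
    fmt: str              # file format such as "parquet"


# Assumed layout: part-<split>-<uuid>.c<counter>[.<codec>].<format>
_PART_FILE = re.compile(
    r"^part-(?P<split>\d+)"
    r"-(?P<job_uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
    r"\.c(?P<counter>\d+)"
    r"(?:\.(?P<codec>[A-Za-z0-9]+))?"
    r"\.(?P<fmt>[A-Za-z0-9]+)$"
)


def parse_part_file(name: str) -> Optional[PartFileName]:
    """Return the parsed parts, or None if the name does not fit the assumed layout."""
    match = _PART_FILE.match(name)
    if match is None:
        return None
    return PartFileName(
        split=int(match.group("split")),
        job_uuid=match.group("job_uuid"),
        counter=int(match.group("counter")),
        codec=match.group("codec"),
        fmt=match.group("fmt"),
    )


info = parse_part_file(
    "part-00000-deb4a3d4-d8c3-4983-8756-ad7e0b29e780.c000.snappy.parquet"
)
print(info)
```

The counter is matched with `\d+` rather than exactly three digits, so the sketch makes no assumption about what the name looks like once the counter goes past 999.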

thebluephantom answered Oct 19 '25 08:10


