 

Parquet Binary Data type

Tags: parquet, impala

I have a question regarding the Binary data type. I am trying to write a Parquet schema for my MR job so that the job creates the Parquet file itself, rather than having Hive or Impala create it. I see some references to a Binary type, which I do not see in the Parquet format's list of types.

Is binary an alias to BYTE_ARRAY?

Also, is UTF-8 the default encoding for Binary data types?

user1971133 asked Oct 29 '25 00:10


2 Answers

Raw bytes are stored in Parquet either as a fixed-length byte array (FIXED_LEN_BYTE_ARRAY) or as a variable-length byte array (BYTE_ARRAY, also called binary). The fixed-length type is used for values with a constant size, such as a SHA-1 hash. Most of the time, the variable-length version is used.
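To illustrate why a hash suits the fixed-length type, here is a minimal sketch using only the Java standard library (no Parquet dependency). The class and method names are made up for this example. It shows that a SHA-1 digest has the same length regardless of input size, which is the property a fixed-length column relies on.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FixedVsVariable {
    // Hypothetical helper: returns the SHA-1 digest of the input's UTF-8 bytes.
    static byte[] sha1(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            return md.digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The digest length is constant, so it fits a fixed-length byte array.
        System.out.println(sha1("a").length);
        System.out.println(sha1("a much longer input string").length);
        // The inputs themselves vary in length, so they need the variable-length type.
        System.out.println("a".getBytes(StandardCharsets.UTF_8).length);
        System.out.println("a much longer input string".getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Both digest lengths print as 20, while the input byte lengths differ.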

Strings are encoded as variable-length binary with the UTF8 type annotation to indicate how to interpret the raw bytes back into a String. UTF8 is the only encoding supported in the format, but not every binary uses UTF8 because not all binary fields are storing string data.
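As a rough sketch of the points above, the snippet below holds a schema in Parquet's message-type text syntax and shows the UTF-8 round trip that the UTF8 annotation implies. The message name and field names (`example`, `name`, `payload`, `sha1`) are invented for illustration, and the code uses only the Java standard library. In an actual MR job, the schema string would presumably be handed to parquet-mr (for example via `MessageTypeParser.parseMessageType`), which is not done here.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Binary {
    // Schema text with hypothetical field names:
    // - name:    variable-length binary annotated as a UTF-8 string
    // - payload: variable-length binary with no annotation (raw bytes)
    // - sha1:    fixed-length byte array of 20 bytes
    static final String SCHEMA =
        "message example {\n" +
        "  required binary name (UTF8);\n" +
        "  optional binary payload;\n" +
        "  required fixed_len_byte_array(20) sha1;\n" +
        "}";

    // What a writer does for a UTF8-annotated field: string to raw bytes.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // What a reader does for a UTF8-annotated field: raw bytes back to a string.
    static String decode(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "h\u00e9llo";             // "héllo"
        byte[] raw = encode(original);
        System.out.println(raw.length);             // 6: the accented letter takes two bytes
        System.out.println(decode(raw).equals(original));
        System.out.println(SCHEMA);
    }
}
```

A field without the annotation, like `payload` here, carries bytes with no string interpretation attached, so it is up to the reader to decide what they mean.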

blue answered Oct 31 '25 22:10


I could not find a data type called BYTE_ARRAY in parquet-column. I looked at PrimitiveType in the latest package but did not see it there. I was also unable to write a byte[] as binary.

Mayank Thirani answered Oct 31 '25 21:10


