Is there any performance benefit to using nested data types in the Parquet file format?
AFAIK Parquet files are usually created specifically for query services such as Athena, so the process that creates them might as well flatten the values. That allows easier querying and a simpler schema, and it retains the column statistics for each column.
What benefit is there to be gained by using nested data types e.g. struct?
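To make the flattening alternative concrete, here is a minimal sketch of what such a pre-write step could look like. It is plain Python with no Parquet library involved; the `flatten` helper, the `_` separator, and the sample `event` record are all assumptions made up for illustration, not part of any real pipeline.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], parent: str = "", sep: str = "_") -> Dict[str, Any]:
    """Turn nested dicts (structs) into one level of prefixed column names."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        # Prefix the child key with its parent path, e.g. user + name -> user_name.
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into the nested struct and merge its leaf columns.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical record with a two-level struct.
event = {"id": 1, "user": {"name": "ann", "address": {"city": "Oslo"}}}
print(flatten(event))
# {'id': 1, 'user_name': 'ann', 'user_address_city': 'Oslo'}
```

Each leaf becomes its own top-level column, which is the layout the question has in mind when it talks about a simpler schema.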
There is a downside to keeping a nested structure in Parquet: Spark's predicate pushdown doesn't work properly when the Parquet file contains nested structures.
So even if you only work with a few fields of your Parquet dataset, Spark will load and materialize the entire dataset.
Here is the ticket regarding this issue, which has been open for a long time.
EDIT
The issue has been resolved in Spark 2.4.
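If you want to check this on your own data, a rough PySpark sketch follows. It assumes the relevant setting is `spark.sql.optimizer.nestedSchemaPruning.enabled`, which to my knowledge was introduced in 2.4 and was off by default there, so verify it against the documentation for your version. The function name, the app name, and the path and column in the usage note are hypothetical.

```python
# Assumption: this is the setting that controls nested schema pruning.
NESTED_PRUNING_KEY = "spark.sql.optimizer.nestedSchemaPruning.enabled"


def explain_nested_select(path: str, column: str) -> None:
    """Print the physical plan for selecting a single nested field."""
    # Imported lazily so the module loads even where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("nested-pruning-sketch")  # hypothetical app name
        .config(NESTED_PRUNING_KEY, "true")
        .getOrCreate()
    )
    df = spark.read.parquet(path)
    # Select only one leaf of the struct and inspect what Spark plans to read.
    df.select(column).explain()
```

Calling something like `explain_nested_select("/data/events", "user.name")` prints the plan; if pruning is active, the `ReadSchema` entry should list only the selected leaf rather than the whole struct.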