I want to convert my input file (XML/JSON) to Parquet. I already have one solution that works with Spark and creates the required Parquet file.
However, due to other client requirements, I might need to create a solution that does not involve the Hadoop ecosystem, such as Hive, Impala, Spark, or MapReduce.
Kite SDK uses an .avsc file to create Parquet data; kindly correct me if I am wrong. I might be short-sighted, but it looks like it needs an Avro schema file. So, is there any library that can create a Parquet file from self-describing files such as XML or JSON?
Note: If this is not a proper approach, I would like to understand why it is not recommended, so that I can learn something or understand the areas I might have missed.
I just published one written in Python:
https://github.com/blackrock/xml_to_parquet
Convert one or more XML files into Apache Parquet format. Only an XSD and an XML file are required to get started.
It uses the XSD schema file to convert everything in your XML file into an equivalent Parquet file with nested data structures that match the XML paths.
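To illustrate what "nested data structures that match XML paths" could look like, here is a minimal sketch using only the Python standard library. It is not the repository's actual code: the function name `element_to_record`, the `@` prefix for attributes, and the `#text` key are assumptions made for this example. Without an XSD every leaf value stays a string, which is one reason a schema file is useful.

```python
import xml.etree.ElementTree as ET


def element_to_record(elem):
    """Turn an XML element into nested dicts/lists keyed by child tag names.

    Hypothetical helper for illustration only; it does not read an XSD,
    so all leaf values remain strings.
    """
    children = list(elem)
    text = (elem.text or "").strip()
    if not children and not elem.attrib:
        return text  # plain leaf element

    # Attributes are stored with an "@" prefix (a convention assumed here).
    record = {f"@{name}": value for name, value in elem.attrib.items()}
    if text:
        record["#text"] = text

    for child in children:
        value = element_to_record(child)
        if child.tag in record:
            # A repeated tag becomes a list of values.
            if not isinstance(record[child.tag], list):
                record[child.tag] = [record[child.tag]]
            record[child.tag].append(value)
        else:
            record[child.tag] = value
    return record


SAMPLE_XML = (
    '<PurchaseOrder id="42">'
    "<ShipTo><Name>Alice</Name></ShipTo>"
    "<Item>pen</Item><Item>ink</Item>"
    "</PurchaseOrder>"
)

record = element_to_record(ET.fromstring(SAMPLE_XML))
print(record)
```

Each key in the resulting record corresponds to one step of an XML path (for example `ShipTo` then `Name`), which is the shape a nested Parquet column would take.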
Convert a small XML file to a Parquet file
python xml_to_parquet.py -x PurchaseOrder.xsd PurchaseOrder.xml
INFO - 2021-01-21 12:32:38 - Parsing XML Files..
INFO - 2021-01-21 12:32:38 - Processing 1 files
DEBUG - 2021-01-21 12:32:38 - Generating schema from PurchaseOrder.xsd
DEBUG - 2021-01-21 12:32:38 - Parsing PurchaseOrder.xml
DEBUG - 2021-01-21 12:32:38 - Saving to file PurchaseOrder.xml.parquet
DEBUG - 2021-01-21 12:32:38 - Completed PurchaseOrder.xml
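For the JSON half of the question, a Hadoop-free route might look like the sketch below. It assumes the third-party `pyarrow` package is installed (version 7.0 or later for `Table.from_pylist`); the helper names `load_json_records` and `write_parquet` and the output file name are made up for this example. Parquet is a typed columnar format, so some schema is always needed; here it is inferred from the records rather than supplied as an .avsc or XSD file, which can fail or guess poorly when fields have mixed types.

```python
import json


def load_json_records(text):
    """Return a list of dicts from a JSON array, one object, or newline-delimited JSON."""
    text = text.strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Several objects, one per line: parse each line separately.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]


def write_parquet(records, path):
    """Write records to a Parquet file, letting pyarrow infer the schema."""
    # Imported lazily so the JSON parsing above works without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pylist(records)  # nested lists/dicts become list/struct columns
    pq.write_table(table, path)


records = load_json_records('{"id": 1, "tags": ["a"]}\n{"id": 2, "tags": []}')
# Uncomment once pyarrow is available; "orders.parquet" is a placeholder name.
# write_parquet(records, "orders.parquet")
```

If the inferred types turn out wrong, passing an explicit `schema=` argument to `pa.Table.from_pylist` is the usual remedy, which is the same role the Avro or XSD schema plays in the other tools.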