I am trying to flatten input JSON data that has two map/dictionary fields (custom_event1 and custom_event2), which may contain arbitrary key-value pairs. To create an output table from the data frame, I need to avoid flattening custom_event1 and custom_event2 and instead store each one as a JSON string in its own column.
Following this doc, Relationalize.apply flattens the custom_event maps as well.
Sample JSON:
{
  "id": "sklfsdfskdlfsdfsdfkhsdfssdf",
  "idtype": "cookieId",
  "event": "install",
  "sub_event": null,
  "ip": "XXXXXX",
  "geo": {
    "country": "IN",
    "city": null,
    "region": null
  },
  "carrier": {
    "operator": null,
    "network": null,
    "connection_type": null
  },
  "user_agent": "Mozilla/5.0",
  "device": {
    "brand": "LYF",
    "model": null,
    "type": null
  },
  "package": {
    "pkgName": "XXXXXXXX",
    "pkgVersion": "1.5.6.3",
    "pkgRating": null,
    "timestamp": "2017-12-14 11:51:27"
  },
  "custom_event1": {
    "key1": "value1",
    "key2": "value2"
  },
  "custom_event2": {
    "key": "value"
  }
}
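For illustration, the desired output shape can be sketched in plain Python. This is not Glue's Relationalize (which names columns differently and handles arrays); the flatten helper, the dotted column names, and the keep_as_json parameter are all illustrative:

```python
import json

def flatten(record, keep_as_json=("custom_event1", "custom_event2"), prefix=""):
    """Flatten nested dicts into dotted column names, but serialize the
    dynamic map fields to JSON strings instead of flattening them."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if key in keep_as_json and isinstance(value, dict):
            # Keep the whole map as one string column
            row[name] = json.dumps(value)
        elif isinstance(value, dict):
            row.update(flatten(value, keep_as_json, prefix=f"{name}."))
        else:
            row[name] = value
    return row

event = {
    "id": "sklfsdfskdlfsdfsdfkhsdfssdf",
    "geo": {"country": "IN", "city": None},
    "custom_event1": {"key1": "value1", "key2": "value2"},
}
row = flatten(event)
# geo is flattened into geo.country / geo.city, while custom_event1
# stays a single JSON-string column
print(row)
```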
How do I store JSON data with a dynamic map field in relational storage?
The steps you would need, assuming the JSON data is in S3:
Create a crawler in AWS Glue and let it create a schema in the Data Catalog (database). This assumes you are at least a little familiar with AWS Glue.
Create a Glue job that transforms the JSON into your favorite format (e.g. Parquet), using a transform step to flatten the nested data with the Relationalize class - https://aws.amazon.com/blogs/big-data/simplify-querying-nested-json-with-the-aws-glue-relationalize-transform/ - and writes it out as Parquet.
Create a crawler for the new flattened data and create the table in AWS Glue.
Use Athena, Amazon QuickSight, or your favorite BI tool to query the Parquet data.
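To keep the custom_event maps from being flattened, you can convert them to JSON strings with Spark's to_json() before applying Relationalize, so each survives as a single string column. A sketch of such a Glue job follows; it only runs inside the AWS Glue runtime, and the database, table, and S3 paths ("mydb", "events", "s3://my-bucket/...") are placeholders:

```python
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext
from pyspark.sql.functions import to_json

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the crawled table (database/table names are placeholders)
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="mydb", table_name="events"
)

# Convert the dynamic map fields to JSON strings so Relationalize
# treats them as plain string columns instead of flattening them
df = dyf.toDF()
for col in ("custom_event1", "custom_event2"):
    df = df.withColumn(col, to_json(df[col]))
dyf = DynamicFrame.fromDF(df, glue_context, "events_with_json_strings")

# Flatten the remaining nested structs (geo, carrier, device, package)
flat = Relationalize.apply(
    frame=dyf, staging_path="s3://my-bucket/tmp/", name="root"
).select("root")

# Write out as Parquet for Athena/QuickSight
glue_context.write_dynamic_frame.from_options(
    frame=flat,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/flattened/"},
    format="parquet",
)
```

In Athena, the custom_event1 column can then be queried with the JSON functions (e.g. json_extract_scalar) without the table schema ever needing to know the map's keys.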