Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to relationalize a JSON to flat structure in AWS Glue

Trying to flatten input JSON data having two map/dictionary fields (custom_event1 and custom_event2), which may contain any key-value pair data. In order to create an output table from the data frame, will have to avoid the flattening of custom_events and store it as JSON string in the column.

Following this doc, Relationalize.apply is flattening the custom_events map also.

Sample JSON:

{
    "id": "sklfsdfskdlfsdfsdfkhsdfssdf",
    "idtype": "cookieId",
    "event": "install",
    "sub_event": null,
    "ip": "XXXXXX",
    "geo": {
        "country": "IN",
        "city": null,
        "region": null
    },
    "carrier": {
        "operator": null,
        "network": null,
        "connection_type": null
    },
    "user_agent": "Mozilla/5.0",
    "device": {
        "brand": "LYF",
        "model": null,
        "type": null
    },
    "package": {
        "pkgName": "XXXXXXXX",
        "pkgVersion": "1.5.6.3",
        "pkgRating": null,
        "timestamp": "2017-12-14 11:51:27"
    },
    "custom_event1": {
        "key1": "value1",
        "key2": "value2"
    },
    "custom_event2": {
        "key": "value"
    }
}

How to store JSON data with a dynamic map field in a Relational storage?

like image 253
Sumit Saurabh Avatar asked Sep 15 '25 01:09

Sumit Saurabh


1 Answers

The steps that you would need, assumption that JSON data is in S3

  1. Create a Crawler in AWS Glue and let it create a schema in a catalog (database). Assumption is that you are familiar with AWS Glue a little.

  2. Create a Glue job that transforms the JSON into your favorite format (parquet) that uses the transform step to flatten the data using Rationalize class - https://aws.amazon.com/blogs/big-data/simplify-querying-nested-json-with-the-aws-glue-relationalize-transform/ and writes to parquet format

  3. Create a crawler for the new flatten data and create the table in aws glue

  4. Use Athena or AWS Quick sight or your favorite BI tool to query parquet data

like image 103
Haroon Avatar answered Sep 16 '25 15:09

Haroon