Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Redshift Spectrum: Query Anonymous JSON array structure

I have a JSON array of structures in S3, that is successfully Crawled & Cataloged by Glue.

[{"key":"value"}, {"key":"value"}]

I'm using the custom Classifier:

$[*] 

When trying to query from Spectrum, however, it returns:

Top level Ion/JSON structure must be an anonymous array if and only if serde property 'strip.outer.array' is set. Mismatch occured in file...

I set that serde property manually in the Glue catalog table, but nothing changed.

Is it no possible to query an anonymous array via Spectrum?

like image 830
comfytoday Avatar asked Jun 23 '26 14:06

comfytoday


2 Answers

Naming the array in the JSON file like this:

"values":[{"key":"value"},...}

And updating the classifier:

$.values[*]

Fixes the issue... Interested to know if there is a way to query anonymous arrays though. It seems pretty common to store data like that.

Update: In the end this solution didn't work, as Spectrum would never actually return any results. There was no error, just no results, and as of now still no solution other than using individual records per line:

{"key":"value"}
{"key":"value"}
etc.

It does seem to be a Spectrum specific issue, as Athena would still work.

Interested to know if anyone else was able to get it to work...

like image 171
comfytoday Avatar answered Jun 25 '26 05:06

comfytoday


I've successfully done this, but without a data classifier. My JSON file looks like:

[
    {
        "col1": "data_from_col1",
        "col2": "data_from_col2",
        "col3": [
            {
                "col4": "data_from_col4",
                ...
            {
        ]
    },
    {
        "col1": "data_from_col1",
        "col2": "data_from_col2",
        "col3": [
            {
                "col4": "data_from_col4",
                ...
            {
        ]
    },
    ...
]

I started with a crawler to get a basic table definition. IMPORTANT: the crawler's configuration options under Output CAN'T be set to Update the table definition..., or else re-running the crawler later will overwrite the manual changes described below. I used Add new columns only.

I had to add the 'strip.outer.array' property AND manually add the topmost columns within my anonymous array. The original schema from the initial crawler run was:

anon_array array<struct<col1:string,col2:string,col3:array<struct<col4...>>>
partition_0 string

I manually updated my schema to:

col1:string
col2:string
col3:array<struct<col4...>>
partition_0 string

(And also add the serde param strip.outer.array.)

Then I had to rerun my crawler, and finally I could query in Spectrum like:

select o.partition_0, o.col1, o.col2, t.col4
from db.tablename o
LEFT JOIN o.col3 t on true;
like image 39
MargaretT Avatar answered Jun 25 '26 07:06

MargaretT



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!