 

Join or Group By VARIANT column: a good practice?

We need to deduplicate a recordset (hundreds of millions of rows) in Snowflake. The engine allows you to group by or join on a VARIANT column directly, but (obviously) it's a resource-consuming process. The problem is that the schema of the JSON content in the VARIANT column can change without prior notice, so we can't simply extract the required fields (or all fields) into typed columns for the deduplication SQL statements (which would be much, much faster).

Does anybody know if executing a join or group by on a VARIANT column is inherently wrong? Or is it possible that it would lead to wrong results?
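For context, the deduplication I have in mind is equivalent to collapsing records that are logically the same JSON document. A minimal Python sketch of that idea (this only simulates the intent outside Snowflake; the function and variable names are illustrative, not Snowflake APIs):

```python
import json

def dedup_json_records(records):
    """Keep the first occurrence of each logically-equal JSON record.

    Two records count as equal if they parse to the same value; a
    canonical serialization (sorted keys, no extra whitespace) is used
    as the dedup key, so key order and formatting don't matter.
    """
    seen = set()
    unique = []
    for raw in records:
        key = json.dumps(json.loads(raw), sort_keys=True, separators=(",", ":"))
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique

rows = ['{"a": 1, "b": 2}', '{"b": 2, "a": 1}', '{"a": 1, "b": 3}']
print(dedup_json_records(rows))  # the first two collapse into one record
```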

Regards, Babak.

Asked by Babak Tourani, Sep 14 '25

1 Answer

There's nothing "inherently" wrong here that I know of, but it won't scale well to very large tables or very large VARIANT values.

When data is parsed into a VARIANT, Snowflake does some processing on it for the purposes of indexing, null handling, and performance. Notably, a VARIANT null (the null value inside the JSON) compares equal to itself, unlike SQL NULL. Also, values such as numbers and dates may be stored as strings when they are inside a VARIANT and compared using string equality. So if your data sources format types differently, you might see a scenario where 2020-01-01 12:00:00.00 is treated as not equal to 2020-01-01 12:00:00, but I haven't tested this.
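Both pitfalls are easy to see outside Snowflake. A plain-Python illustration (this mimics the comparison semantics described above, not Snowflake's actual storage internals):

```python
import json
from datetime import datetime

a = "2020-01-01 12:00:00.00"
b = "2020-01-01 12:00:00"

# Compared as strings (how a VARIANT might compare them), they differ:
print(a == b)  # False

# Parsed as timestamps, they are the same instant:
ta = datetime.strptime(a, "%Y-%m-%d %H:%M:%S.%f")
tb = datetime.strptime(b, "%Y-%m-%d %H:%M:%S")
print(ta == tb)  # True

# JSON null parses to a value that compares equal to itself,
# analogous to VARIANT nulls grouping together (unlike SQL NULL,
# where NULL = NULL is not true):
print(json.loads("null") == json.loads("null"))  # True
```

The practical upshot: if one upstream source emits fractional seconds and another doesn't, string-based VARIANT equality would treat the same moment as two distinct group-by keys.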

This processing is why you noticed (in a separate comment) that {"a":1,"b":2} is stored the same way as {"b": 2, "a": 1}, and thus the two are "equal to" each other. That could technically count as a "false positive": a join matches when you don't expect the values to be equal. But as far as I know, this processing is consistent, so you shouldn't get false negatives.
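The key-order behavior can be mimicked with a canonical serialization (a Python sketch of the idea, not Snowflake's actual on-disk format):

```python
import json

def canonical(v):
    # Re-serialize with sorted keys and no insignificant whitespace so
    # that logically equal objects produce identical strings, mirroring
    # how {"a":1,"b":2} and {"b": 2, "a": 1} end up stored the same way.
    return json.dumps(json.loads(v), sort_keys=True, separators=(",", ":"))

print(canonical('{"a":1,"b":2}') == canonical('{"b": 2, "a": 1}'))  # True
```

Because the normalization is deterministic, equal inputs always produce equal keys, which is why false negatives shouldn't occur even though key-order "false positives" can.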

Answered by David Garrison, Sep 16 '25