We need to deduplicate a recordset (hundreds of millions of rows) in Snowflake. The engine allows you to group by or join on the VARIANT column directly, but (obviously) it's a resource-consuming process. The problem is that the schema of the JSON content in the VARIANT column can change without prior notice, so we can't simply extract the required fields (or all fields) for the deduplication SQL statements (which would be much, much faster).
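To illustrate, the deduplication statement I mean looks roughly like the sketch below (table and column names are just placeholders):

```sql
-- Placeholder names: raw_events(payload VARIANT, loaded_at TIMESTAMP_NTZ).
-- Deduplicate by grouping on the VARIANT column directly, keeping the
-- earliest load timestamp for each distinct payload.
CREATE OR REPLACE TABLE raw_events_dedup AS
SELECT
    payload,
    MIN(loaded_at) AS first_loaded_at
FROM raw_events
GROUP BY payload;
```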
Does anybody know if executing a join or group by using a VARIANT column is inherently wrong? Or is it possible that it would lead to wrong results?
Regards, Babak.
There's nothing "inherently" wrong that I know of here. It won't scale well to large tables or very large variant columns, though.
When data is parsed into a variant, Snowflake does some processing on it for indexing, null handling, and performance. Notably, variant nulls (the null value inside the JSON) compare as equal to each other, unlike SQL NULLs. Also, values that JSON doesn't represent natively, such as dates and timestamps, are stored as strings within a variant and compared with string equality. So if your data sources format values differently, you might see a scenario where 2020-01-01 12:00:00.00 is treated as not equal to 2020-01-01 12:00:00, but I haven't tested this.
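If you want to probe that yourself, a quick check along these lines (just PARSE_JSON over literal strings) should show how the comparison behaves:

```sql
-- Both timestamps end up as strings inside the VARIANT; if the comparison
-- really is string equality and the formats differ, this should return FALSE.
SELECT
    PARSE_JSON('{"ts": "2020-01-01 12:00:00.00"}')
        = PARSE_JSON('{"ts": "2020-01-01 12:00:00"}') AS payloads_equal;
```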
This processing is why you noticed (in a separate comment) that {"a":1,"b":2} is stored the same way as {"b": 2, "a": 1}, and thus they are "equal to" each other. So that could technically count as a "false positive": the join matches when you might not expect the values to be equal. But as far as I know, this processing is consistent, so you shouldn't get false negatives.
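You can see that key-order behavior with a pair of PARSE_JSON literals, for example:

```sql
-- Key order inside the object doesn't matter once the JSON is parsed into
-- a VARIANT, so this comparison should return TRUE.
SELECT
    PARSE_JSON('{"a":1,"b":2}') = PARSE_JSON('{"b": 2, "a": 1}') AS objects_equal;
```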