We need to deduplicate a recordset (hundreds of millions of rows) in Snowflake. The engine allows you to group by or join on the VARIANT column directly, but (obviously) it's a resource-consuming process. The problem is that the schema of the JSON content in the VARIANT column can change without prior notice, so we can't simply extract the required fields (or all fields) for the deduplication SQL statements (which would be much, much faster).
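To illustrate, the deduplication statement I mean looks roughly like the sketch below (table and column names are just placeholders):

```sql
-- Placeholder names: raw_events(payload VARIANT, loaded_at TIMESTAMP_NTZ).
-- Deduplicate by grouping on the VARIANT column directly, keeping the
-- earliest load timestamp for each distinct payload.
CREATE OR REPLACE TABLE raw_events_dedup AS
SELECT
    payload,
    MIN(loaded_at) AS first_loaded_at
FROM raw_events
GROUP BY payload;
```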
Does anybody know if executing a join or group by using a VARIANT column is inherently wrong? Or is it possible that it would lead to wrong results?
Regards, Babak.
There's nothing "inherently" wrong that I know of here. It won't scale well to large tables or very large variant columns, though.
When data is parsed into a variant, Snowflake does some processing on it for indexing, null handling, and performance. Notably, variant nulls (the null value inside the JSON) compare as equal to each other, unlike SQL NULLs. Also, values that JSON doesn't represent natively, such as dates and timestamps, are stored as strings within a variant and compared with string equality. So if your data sources format values differently, you might see a scenario where 2020-01-01 12:00:00.00 is treated as not equal to 2020-01-01 12:00:00, but I haven't tested this.
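If you want to probe that yourself, a quick check along these lines (just PARSE_JSON over literal strings) should show how the comparison behaves:

```sql
-- Both timestamps end up as strings inside the VARIANT; if the comparison
-- really is string equality and the formats differ, this should return FALSE.
SELECT
    PARSE_JSON('{"ts": "2020-01-01 12:00:00.00"}')
        = PARSE_JSON('{"ts": "2020-01-01 12:00:00"}') AS payloads_equal;
```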
This processing is why you noticed (in a separate comment) that {"a":1,"b":2} is stored the same way as {"b": 2, "a": 1}, and thus they are "equal to" each other. So that could technically count as a "false positive": the join matches when you might not expect the values to be equal. But as far as I know, this processing is consistent, so you shouldn't get false negatives.
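You can see that key-order behavior with a pair of PARSE_JSON literals, for example:

```sql
-- Key order inside the object doesn't matter once the JSON is parsed into
-- a VARIANT, so this comparison should return TRUE.
SELECT
    PARSE_JSON('{"a":1,"b":2}') = PARSE_JSON('{"b": 2, "a": 1}') AS objects_equal;
```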