I am having issues retrieving the latest value in a satellite table when some data is changed back to a former value.
The database is Snowflake.
As per Data Vault 2.0, I am currently using the hash diff function to assess whether to insert a new record in a satellite table, like this:
INSERT ALL
WHEN (SELECT COUNT(*) FROM SAT_ACCOUNT_DETAILS AD WHERE AD.MD5_HUB_ACCOUNT = MD5_Account AND AD.HASH_DIFF = AccHashDiff) = 0
THEN
INTO SAT_ACCOUNT_DETAILS (MD5_HUB_ACCOUNT
, HASH_DIFF
, ACCOUNT_CODE
, DESCRIPTION
, SOME_DETAIL
, LDTS)
VALUES (MD5_AE_Account
, AccHashDiff
, AccountCode
, Description
, SomeDetail
, LoadDTS)
SELECT DISTINCT
MD5(AccountId) As MD5_Account
, MD5(UPPER(COALESCE(TO_VARCHAR(AccountCode), '')
|| '^' || COALESCE(TO_VARCHAR(Description), '')
|| '^' || COALESCE(TO_VARCHAR(SomeDetail), '')
)) AS AccHashDiff
, AccountCode
, Description
, SomeDetail
, LoadDTS
FROM source_table;
The first time, a new record with AccountCode = '100000' and SomeDetail = 'ABC' is added:
| MD5_HUB_ACCOUNT | HASH_DIFF | ACCOUNT_CODE | DESCRIPTION | SOME_DETAIL | LDTS |
|---|---|---|---|---|---|
| c81e72... | 8d9d43... | 100000 | An Account | ABC | 2021-04-08 10:00 |
An hour later, an update changes the value of SomeDetail to 'DEF', this is the resulting table:
| MD5_HUB_ACCOUNT | HASH_DIFF | ACCOUNT_CODE | DESCRIPTION | SOME_DETAIL | LDTS |
|---|---|---|---|---|---|
| c81e72... | 8d9d43... | 100000 | An Account | ABC | 2021-04-08 10:00 |
| c81e72... | a458b2... | 100000 | An Account | DEF | 2021-04-08 11:00 |
A third update sets the value of SomeDetail back to 'ABC', but the record is not inserted in the satellite table, because the value of the hash diff is the same as the first inserted record (i.e. 8d9d43...).
If I query which is the latest record in the satellite table, the LDTS column tells me it's the one with 'DEF' which is not the desired result. Instead, I should have a record with SomeDetail = 'ABC' and LDTS = '2021-04-08 12:00'.
What is the correct approach to this? If I add LoadDTS to the hash diff, a new record will be created each time an update is pushed, which is not the desired result either.
As you (and also the standard) mentionned, you need to compare to the last effective record.
I'm not an expert with Snowflake, but it might look like this :
INSERT ALL
WHEN (SELECT COUNT(*) FROM SAT_ACCOUNT_DETAILS AD WHERE AD.MD5_HUB_ACCOUNT = MD5_Account AND AD.HASH_DIFF = AccHashDiff AND AD.LDTS = (SELECT MAX(LDTS) FROM SAT_ACCOUNT_DETAILS MAD WHERE MAD.MD5_HUB_ACCOUNT = AD.MD5_HUB_ACCOUNT)) = 0
THEN ....
By adding "AD.LDTS = (SELECT MAX(LDTS) FROM....." to the query, you make sure you test against the latest data and not historical data
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With