Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data Vault 2 - Hash diff and recurring data changes

I am having issues retrieving the latest value in a satellite table when some data is changed back to a former value.

The database is Snowflake.

As per Data Vault 2.0, I am currently using the hash diff function to assess whether to insert a new record in a satellite table, like this:

INSERT ALL
  WHEN (SELECT COUNT(*) FROM SAT_ACCOUNT_DETAILS AD WHERE AD.MD5_HUB_ACCOUNT = MD5_Account AND AD.HASH_DIFF = AccHashDiff) = 0
  THEN
    INTO SAT_ACCOUNT_DETAILS (MD5_HUB_ACCOUNT
                            , HASH_DIFF
                            , ACCOUNT_CODE
                            , DESCRIPTION
                            , SOME_DETAIL
                            , LDTS)
    VALUES (MD5_AE_Account
          , AccHashDiff
          , AccountCode
          , Description
          , SomeDetail
          , LoadDTS)
SELECT DISTINCT
    MD5(AccountId) As MD5_Account
  , MD5(UPPER(COALESCE(TO_VARCHAR(AccountCode), '')
    || '^' || COALESCE(TO_VARCHAR(Description), '')
    || '^' || COALESCE(TO_VARCHAR(SomeDetail), '')
    )) AS AccHashDiff
  , AccountCode
  , Description
  , SomeDetail
  , LoadDTS
FROM source_table;

The first time, a new record with AccountCode = '100000' and SomeDetail = 'ABC' is added:

MD5_HUB_ACCOUNT HASH_DIFF ACCOUNT_CODE DESCRIPTION SOME_DETAIL LDTS
c81e72... 8d9d43... 100000 An Account ABC 2021-04-08 10:00

An hour later, an update changes the value of SomeDetail to 'DEF', this is the resulting table:

MD5_HUB_ACCOUNT HASH_DIFF ACCOUNT_CODE DESCRIPTION SOME_DETAIL LDTS
c81e72... 8d9d43... 100000 An Account ABC 2021-04-08 10:00
c81e72... a458b2... 100000 An Account DEF 2021-04-08 11:00

A third update sets the value of SomeDetail back to 'ABC', but the record is not inserted in the satellite table, because the value of the hash diff is the same as the first inserted record (i.e. 8d9d43...).

If I query which is the latest record in the satellite table, the LDTS column tells me it's the one with 'DEF' which is not the desired result. Instead, I should have a record with SomeDetail = 'ABC' and LDTS = '2021-04-08 12:00'.

What is the correct approach to this? If I add LoadDTS to the hash diff, a new record will be created each time an update is pushed, which is not the desired result either.

like image 847
Roberto B. Avatar asked Oct 28 '25 17:10

Roberto B.


1 Answers

As you (and also the standard) mentionned, you need to compare to the last effective record.

I'm not an expert with Snowflake, but it might look like this :

INSERT ALL
  WHEN (SELECT COUNT(*) FROM SAT_ACCOUNT_DETAILS AD WHERE AD.MD5_HUB_ACCOUNT = MD5_Account AND AD.HASH_DIFF = AccHashDiff AND AD.LDTS = (SELECT MAX(LDTS) FROM SAT_ACCOUNT_DETAILS MAD WHERE MAD.MD5_HUB_ACCOUNT = AD.MD5_HUB_ACCOUNT)) = 0
  THEN ....

By adding "AD.LDTS = (SELECT MAX(LDTS) FROM....." to the query, you make sure you test against the latest data and not historical data

like image 181
MLeblanc Avatar answered Oct 31 '25 12:10

MLeblanc