
How can I make sure that a hash function won't produce the same digest for 2+ different entries?

Tags:

hash

Edit: some people flagged this question as a potential duplicate of this other one. While I agree that knowing how the birthday paradox applies to hashing functions is useful, the two questions (and their respective answers) address two different, albeit related, subjects. The other question asks "what are the odds of a collision?", whereas this question's main focus is "how can I make sure that a collision never happens?"

I have a data lake stored in S3 where each day an ETL script dumps additional data from the day before.

Due to how the pipeline is built, a careless user with admin access can produce duplicates in said data lake by manually tampering with the dump files coming from our OLTP database and triggering the ETL script when it's not supposed to run.

I thought that a good way to prevent data duplication would be to add a safety measure to my ETL script:

  • Produce a hash for each entry.
  • Store said hashes somewhere else (like a dynamodb table).
  • Whenever new data comes in, hash that as well and compare it with the already existing hashes.
  • If any new hash matches an existing one, reject the associated entry entirely.
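The steps above can be sketched in Python. This is a minimal sketch, not the actual pipeline: an in-memory set stands in for the external hash store (e.g. the DynamoDB table), and the entry fields are made up.

```python
import hashlib
import json

# In-memory stand-in for the external hash store (e.g. a DynamoDB table).
seen_hashes = set()

def entry_hash(entry: dict) -> str:
    """Hash an entry deterministically: canonical JSON, then SHA-256."""
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedupe(entries):
    """Yield only entries whose hash has not been seen before."""
    for entry in entries:
        h = entry_hash(entry)
        if h in seen_hashes:
            continue  # reject the duplicate entry entirely
        seen_hashes.add(h)
        yield entry

# Hypothetical batch: the third entry duplicates the first.
batch = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "a"}]
unique = list(dedupe(batch))  # only the two distinct entries survive
```

Note the serialization step matters: the same logical entry must always produce the same bytes (hence `sort_keys=True`), otherwise identical entries hash differently and slip past the check.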

However, I know very little about hashing, and I have read that, although it is unlikely, two different inputs can produce the same hash.

I understand it's very unlikely to happen in this situation, but I was wondering whether there is a way to be 100% sure about it.

Any idea is much appreciated.

wtfzambo asked Oct 29 '25 21:10

1 Answer

Long answer: what you want to study and explore is called "perfect hashing" (i.e. hashing guaranteed not to have collisions): https://en.wikipedia.org/wiki/Perfect_hash_function

Short answer: a cryptographic collision-resistant algorithm like SHA-1 is probably safe to use for all but the largest (PBs a day) datasets, and even then it's probably all right. Git uses SHA-1 internally, and code repositories probably deal with the most files on the planet, yet rarely have collisions. See for details: https://ericsink.com/vcbe/html/cryptographic_hashes.html#:~:text=Git%20uses%20hashes%20in%20two,computed%20when%20it%20was%20stored.
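To put a number on "probably safe", the birthday bound gives the approximate collision probability for n items under a b-bit hash, p ≈ 1 − e^(−n²/2^(b+1)). A quick sketch (using `math.expm1` to avoid the floating-point underflow that `1 - exp(...)` would hit at these magnitudes):

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n^2 / 2^(b+1))."""
    return -math.expm1(-(n_items ** 2) / 2 ** (hash_bits + 1))

# One billion entries hashed with 160-bit SHA-1:
p = collision_probability(10 ** 9, 160)  # on the order of 1e-31
```

At that scale, an accidental collision is far less likely than, say, hardware silently corrupting the data; a deliberately constructed collision is a different threat model entirely.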

Medium answer: this is actually a pretty hard problem overall and a frequent area of study in computer science, and a lot depends on your particular use case and the context you're operating in. Cuckoo hashing, collision-resistant algorithms, and hashing in general are all good terms to research. There's also a lot of art and science in trading space (memory) against time (compute) when picking among these methods. A good rule of thumb is that perfect hashing will generally take more space and time than a collision-resistant cryptographic hash like SHA-1.

daidoji70 answered Oct 31 '25 10:10


