I am trying to find something potentially faster then SHA256. I have over 1 billion records I need to hash and verify if they are unique. I am currently running it through an MD5 which seems pretty fast then through the sha256 to avoid collisions. Running them in that order seems to give me a little performance boost but I still need it faster. I am looking for the names or examples of some hashes done in c# or some pseudo-code so I can recreate it in c#.
There is a lot of dubious information in the answers here. You tagged your question with cryptography and only mention cryptographic hash functions, but it sounds like you don't really need cryptographic security, in particular because you say:
I have over 1 billion records I need to hash and verify if they are unique.
There are four properties to a cryptographic hash function:
- it is easy to compute the hash value for any given message
- it is infeasible to generate a message that has a given hash
- it is infeasible to modify a message without changing the hash
- it is infeasible to find two different messages with the same hash.
You're really only interested in the first quality and the uniqueness is a smaller scale requirement only partially related to the other three properties of cryptographic security.
There is overhead in cryptographic security. You don't need it, and you're interested in speed, so why not skip it? The hash width of MD5 and the SHA family are admittedly large enough for your purposes.
Check out the list of hash functions on wikipedia, or check out the article on normal hash functions. More to the point, what's wrong with the built in .NET hashing functions? Have you tried just deferring to the Object.GetHashCode() method? That MSDN reference has a lot to say about using hash functions. You don't say much about the data you're hashing, so it's hard to say whether the output would be unique between your objects or not. How are you feeding the object into the MD5 hasher? I presume you're taking the binary representation of it. A similar approach could be used to use the built-in non-crypto hash function.
You may be concerned about the uniqueness of the built in hash functions. They do only return a regular int, which is 2^32, only about 4 times larger than the data set you're working with. However, you always need to have a back up plan for hash functions. Collisions are infeasible, not impossible. The standard fallback is to perform a more expensive comparison, usually a reference comparison and a field-wise value comparison.
If you aren't prepared to do an exact comparison on your hash outputs, you're basically counting down until you get a false positive. That might not be a big deal for you: only you can judge what the downside there is.
Additionally, performing another hash function computation is probably not much faster than the direct comparison. You're better off on all counts to go with the sure thing and perform the lengthy, direct comparison.
Another common anti-collision technique is to use multiple keys. So if your data points have several large subcomponents, you hash and compare the independently. If it has some large and some small components (say some simple numeric types), you hash the large ones and do a direct comparison on the small ones. If they have some data that's easy to take the ordinal of (say the lengths of strings or the size of some containers), you can perform the direct comparison on those bits.
If that doesn't work out for you, take a look at implementations of the other hash functions listed on the wiki. Here's a pretty good reference for MurmerHash3, which can compute 32 bit or 128 bit hash values. There are other hash functions in the list that have long hash widths as well and also have C# libraries available. But as that reference points out, Murmurhash is way faster than MD5 and SHA functions, although it doesn't do a direct comparison to the Object.GetHashCode method I mentioned above.
How about doing something different?
Use a simple hashing function on each record, like one you would use when inserting the record into a hash table, perhaps mapping each record to a 32 bit INT. Then if there were a hash collision you then compare the colliding records for uniqueness.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With