I'm currently using the Linux md5sum command in a bash script on a very lightweight (low processor/low memory) Linux device to compute and record the checksums of thousands of similarly named 32MB files in a single directory.
md5sum ./file* >fingerprint.txt
The next day, I repeat the process on the same set of files and programmatically compare the results from the prior day's hashes. When I find that the fingerprint of a file has changed between day1 and day2 I take action on that specific file. If the file remained unchanged I take no action and continue my comparison.
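For illustration, the day-to-day comparison described above could be sketched as follows. This is only a minimal sketch, not the asker's actual script: the function name `changed_files` and the two fingerprint file names in the usage line are hypothetical, and it assumes the standard md5sum output format (`<hash>  <name>`) with no newlines or backslashes in the file names.

```shell
#!/bin/bash
# changed_files OLD NEW
# Print the name of every file listed in NEW whose hash differs from the
# one recorded in OLD, or that does not appear in OLD at all.
# (Hypothetical helper; assumes plain md5sum output lines.)
changed_files() {
    local old="$1" new="$2"
    awk '
        # First file: remember each hash, keyed by file name.
        FILENAME == ARGV[1] { h = $1; sub(/^[^ ]+ [ *]/, ""); seen[$0] = h; next }
        # Second file: report names whose hash is new or different.
        { h = $1; sub(/^[^ ]+ [ *]/, ""); if (seen[$0] != h) print $0 }
    ' "$old" "$new"
}
```

It might be used as `changed_files fingerprint.day1.txt fingerprint.day2.txt`, with each printed name then handed to whatever action is taken on a changed file. Note that this only speeds up the comparison step; the hashing itself remains the expensive part.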
The problem that I'm running into is that the md5 method takes a LONG time to process on each file. The comparison needs to be completed within a certain time-frame and I'm starting to bump into incidents where the entire process simply takes too long.
Is there some other method/tool I could be using to reliably perform this kind of comparison? (Note: comparing file dates is not adequate, and the file sizes remain a constant 32MB.)
MD5 is supposed to be fast among cryptographic hash functions. But any given implementation may make choices which, on a specific machine, result in suboptimal performance. What kind of hardware do you use? Processor type and L1 cache size are quite important.
You may want to have a look at sphlib: this is a library implementing many cryptographic hash functions, in C (optimized, but portable) and Java. The C code can be compiled with an additional "small footprint" flag which helps on small embedded platforms (mainly due to L1 cache size issues). Also, the code comes with an md5sum-like command-line utility, and a speed benchmark tool.
Among the hash functions, MD4 is usually the fastest, but on some platforms Panama, Radiogatun[32] and Radiogatun[64] can achieve similar or better performance. You may also want to have a look at some of the SHA-3 candidates, in particular Shabal, which is quite fast on small 32-bit systems.
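Since the fastest function depends on the platform, it may be worth timing whatever checksum tools are already installed on the device against one of the 32MB files. The following is a rough sketch only: `time_tools` is a hypothetical helper name, the tools named in the usage line are just examples, and the timing uses whole seconds from `date +%s`, so it is coarse.

```shell
#!/bin/bash
# time_tools FILE TOOL...
# Report roughly how many wall-clock seconds each named tool needs to
# process FILE. Tools that are not installed are reported and skipped.
# (Hypothetical helper; one-second resolution only.)
time_tools() {
    local file="$1" tool start end
    shift
    # Read the file once first so that every tool is measured against a
    # similarly warm page cache rather than the first one paying for disk I/O.
    cat "$file" > /dev/null
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "$tool: not installed"
            continue
        fi
        start=$(date +%s)
        "$tool" "$file" > /dev/null
        end=$(date +%s)
        echo "$tool: $((end - start)) s"
    done
}
```

For example, `time_tools ./file0001 md5sum sha1sum cksum` (the file name is a placeholder). If all tools take about the same time, the bottleneck is more likely reading the files than the hash computation, and a different hash function will not help much.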
Important note: some hash functions are "broken", in that it is possible to create collisions: two distinct input files, which hash to the same value (exactly what you want to avoid). MD4 and MD5 are thus "broken". However, a collision must be done on purpose; you will not hit one out of (bad) luck (probabilities are smaller than having a "collision" due to a hardware error during the computation). If you are in a security-related situation (someone may want to actively provoke a collision) then things are more difficult. Among those I cite, the Radiogatun and Shabal functions are currently unbroken.