I have a large file (say 10 terabytes) with stream of MD5 hashes (which contains duplicates), I am given a memory of 10MB(very limited) and unlimited hard disc space. Find all the unique hashes(eliminating duplicates) using given conditions. Please help, this is obviously not a homework question
You can sort the hashes with an external sorting algorithm (e.g. with a polyphase merge sort), after which you just need to traverse the file and skip any hashes that equal the most recent hash
hash mostRecentHash;
while(fileHasHashes) {
temp = fileWithDuplicates.readHash();
if(!hashesAreEqual(mostRecentHash, temp)) {
mostRecentHash = temp;
fileWithoutDuplicates.writeHash(mostRecentHash);
}
}
If performance doesn't matter, and your file system has no limits, then you can simply create a file for every hash. If during creation, you encounter EEXIST, then you have a duplicate, and it can be skipped.
for (each hash) {
r = open(hash_to_filename(hash), O_CREAT|O_EXCL);
if (r < 0) {
if (errno == EEXIST) continue;
perror(hash);
exit(EXIT_FAILURE);
}
close(r);
output(hash);
}
The advantage of this is that it preserves the order of the hash values first occurrence in the stream.
The actual performance of this solution depends on the performance of the file system. If the files are organized in a B-Tree, then the performance will be roughly O(N log(N)). If the file system uses a hash table to organize the files, then the performance is expected to be O(N), but it depends on how often collisions occur (and the constant factor is high, because of the disk access).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With