find unique values from a large file

I have a large file (say 10 terabytes) containing a stream of MD5 hashes, which includes duplicates. I am given 10 MB of memory (very limited) and unlimited hard disk space. Find all the unique hashes (eliminating duplicates) under these constraints. Please help; this is obviously not a homework question.

asked Dec 07 '25 by username


2 Answers

You can sort the hashes with an external sorting algorithm (e.g. a polyphase merge sort), after which you only need to traverse the sorted file and skip any hash that equals the one before it; a sketch of the sorting phase itself follows the scan below.

hash mostRecentHash;
bool seenAny = false;                        // nothing written yet
while (fileHasHashes) {
    temp = fileWithDuplicates.readHash();    // sorted, so duplicates are adjacent
    if (!seenAny || !hashesAreEqual(mostRecentHash, temp)) {
        mostRecentHash = temp;
        seenAny = true;
        fileWithoutDuplicates.writeHash(mostRecentHash);
    }
}
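
The sorting phase is the interesting part under a 10 MB budget. As a rough illustration, here is a minimal sketch in C of the run-generation step of an external merge sort, assuming the input is a flat file of 16-byte binary MD5 digests; the file names, the 10 MB buffer size, and the run naming are assumptions for illustration, and error handling is omitted:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define REC 16                               /* one binary MD5 digest */
#define BUF_RECS ((10 * 1024 * 1024) / REC)  /* ~10 MB of digests in memory */

static int cmp(const void *a, const void *b) {
    return memcmp(a, b, REC);
}

int main(void) {
    unsigned char *buf = malloc((size_t)BUF_RECS * REC);
    FILE *in = fopen("hashes.bin", "rb");    /* hypothetical input file */
    char name[32];
    size_t n;
    int run = 0;

    /* Fill the ~10 MB buffer, sort it in place, write one sorted run. */
    while ((n = fread(buf, REC, BUF_RECS, in)) > 0) {
        qsort(buf, n, REC, cmp);
        snprintf(name, sizeof name, "run%04d.bin", run++);
        FILE *out = fopen(name, "wb");
        fwrite(buf, REC, n, out);
        fclose(out);
    }
    fclose(in);
    free(buf);
    /* The runs would then be merged (pairwise, or k-way with a small
     * heap) until one sorted file remains; only then is the scan above run. */
    return 0;
}

At 10 MB per run, a 10 TB input yields roughly a million sorted runs, so a k-way merge with a large fan-in (or a polyphase scheme, as suggested above) keeps the number of passes over the data small.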
answered Dec 10 '25 by Zim-Zam O'Pootertoot


If performance doesn't matter and your file system has no limit on the number of files, you can simply create a file for every hash. If creation fails with EEXIST, you have a duplicate, and it can be skipped.

/* needs <fcntl.h>, <errno.h>, <stdio.h>, <stdlib.h>, <unistd.h> */
for (each hash) {                       /* pseudocode: iterate the stream */
    const char *path = hash_to_filename(hash);
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0) {
        if (errno == EEXIST)
            continue;                   /* file exists: duplicate, skip */
        perror(path);                   /* any other error is fatal */
        exit(EXIT_FAILURE);
    }
    close(fd);
    output(hash);                       /* first occurrence: emit it */
}
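
The hash_to_filename helper is left unspecified above; a minimal sketch, assuming 16-byte binary digests and a pre-created seen/ scratch directory (both illustrative assumptions, not part of the answer), might be:

/* Hypothetical helper: map a 16-byte digest to a file name such as
 * "seen/d41d8cd98f00b204e9800998ecf8427e". Needs <stdio.h>. */
const char *hash_to_filename(const unsigned char hash[16]) {
    static char path[sizeof "seen/" + 32] = "seen/";
    for (int i = 0; i < 16; i++)
        sprintf(path + 5 + 2 * i, "%02x", hash[i]);
    return path;
}

In practice, a single directory holding millions of entries can be slow on some file systems, so sharding by the first two hex characters (e.g. seen/d4/1d8c...) is a common refinement.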

The advantage of this approach is that it preserves the order in which each hash value first occurs in the stream.

The actual performance of this solution depends on the file system. If directory entries are organized in a B-tree, the overall cost is roughly O(N log N). If the file system uses a hash table to organize them, the expected cost is O(N), though it depends on how often collisions occur (and the constant factor is high because of the disk accesses).

answered Dec 10 '25 by jxh