find unique values from a large file

I have a large file (say 10 terabytes) containing a stream of MD5 hashes, which includes duplicates. I am given 10 MB of memory (very limited) and unlimited hard disk space. Find all the unique hashes (eliminating duplicates) under these constraints. Please help; this is obviously not a homework question.

asked Dec 07 '25 by username


2 Answers

You can sort the hashes with an external sorting algorithm (e.g. a polyphase merge sort), after which you only need to traverse the sorted file and skip any hash that equals the one before it; a sketch of the sorting phase itself follows the scan below.

hash mostRecentHash;
bool seenAny = false;                        // nothing written yet
while (fileHasHashes) {
    temp = fileWithDuplicates.readHash();    // sorted, so duplicates are adjacent
    if (!seenAny || !hashesAreEqual(mostRecentHash, temp)) {
        mostRecentHash = temp;
        seenAny = true;
        fileWithoutDuplicates.writeHash(mostRecentHash);
    }
}
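
The sorting phase is the interesting part under a 10 MB budget. As a rough illustration, here is a minimal sketch in C of the run-generation step of an external merge sort, assuming the input is a flat file of 16-byte binary MD5 digests; the file names, the 10 MB buffer size, and the run naming are assumptions for illustration, and error handling is omitted:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define REC 16                               /* one binary MD5 digest */
#define BUF_RECS ((10 * 1024 * 1024) / REC)  /* ~10 MB of digests in memory */

static int cmp(const void *a, const void *b) {
    return memcmp(a, b, REC);
}

int main(void) {
    unsigned char *buf = malloc((size_t)BUF_RECS * REC);
    FILE *in = fopen("hashes.bin", "rb");    /* hypothetical input file */
    char name[32];
    size_t n;
    int run = 0;

    /* Fill the ~10 MB buffer, sort it in place, write one sorted run. */
    while ((n = fread(buf, REC, BUF_RECS, in)) > 0) {
        qsort(buf, n, REC, cmp);
        snprintf(name, sizeof name, "run%04d.bin", run++);
        FILE *out = fopen(name, "wb");
        fwrite(buf, REC, n, out);
        fclose(out);
    }
    fclose(in);
    free(buf);
    /* The runs would then be merged (pairwise, or k-way with a small
     * heap) until one sorted file remains; only then is the scan above run. */
    return 0;
}

At 10 MB per run, a 10 TB input yields roughly a million sorted runs, so a k-way merge with a large fan-in (or a polyphase scheme, as suggested above) keeps the number of passes over the data small.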
answered Dec 10 '25 by Zim-Zam O'Pootertoot


If performance doesn't matter and your file system has no limit on the number of files, you can simply create a file for every hash. If creation fails with EEXIST, you have a duplicate, and it can be skipped.

/* needs <fcntl.h>, <errno.h>, <stdio.h>, <stdlib.h>, <unistd.h> */
for (each hash) {                       /* pseudocode: iterate the stream */
    const char *path = hash_to_filename(hash);
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0) {
        if (errno == EEXIST)
            continue;                   /* file exists: duplicate, skip */
        perror(path);                   /* any other error is fatal */
        exit(EXIT_FAILURE);
    }
    close(fd);
    output(hash);                       /* first occurrence: emit it */
}
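
The hash_to_filename helper is left unspecified above; a minimal sketch, assuming 16-byte binary digests and a pre-created seen/ scratch directory (both illustrative assumptions, not part of the answer), might be:

/* Hypothetical helper: map a 16-byte digest to a file name such as
 * "seen/d41d8cd98f00b204e9800998ecf8427e". Needs <stdio.h>. */
const char *hash_to_filename(const unsigned char hash[16]) {
    static char path[sizeof "seen/" + 32] = "seen/";
    for (int i = 0; i < 16; i++)
        sprintf(path + 5 + 2 * i, "%02x", hash[i]);
    return path;
}

In practice, a single directory holding millions of entries can be slow on some file systems, so sharding by the first two hex characters (e.g. seen/d4/1d8c...) is a common refinement.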

The advantage of this approach is that it preserves the order in which each hash value first occurs in the stream.

The actual performance of this solution depends on the file system. If directory entries are organized in a B-tree, the overall cost is roughly O(N log N). If the file system uses a hash table to organize them, the expected cost is O(N), though it depends on how often collisions occur (and the constant factor is high because of the disk accesses).

answered Dec 10 '25 by jxh