I'm very new to Java, so forgive me if I'm doing something terribly wrong.
I'm working on a project where I need to quickly scan a very large volume of data (a CSV with 50 million lines or more, 5 entries per line) for repeats. I've resorted to using a HashMap, since its containsKey() lookup is fast.
However, I end up having to store a million keys or more in the map. Each key is associated with an int[] array, which can have anywhere from 1 to 100 entries. So obviously, I end up getting an OutOfMemoryError unless I'm using a laptop with ~16 GB of RAM.
I was thinking that once the HashMap gets more than N keys, or a key gets more than N entries, I could write it out somewhere and clear it. However, not all keys or values are found at once, so I need to be able to add to the HashMap I've already written out, not overwrite it.
I've searched far and wide and still can't find a way to do it, so thanks a lot to whoever can help!
You have quite a lot of options here; I'll list some of them:
-Xmx JVM flag: e.g. -Xmx3G, as Dimitry suggests, will give you three gigabytes of heap, versus the default, which is typically much smaller.
Store Less Data: You're currently storing the whole row of "1 to 100 entries", when really all you need to know is whether a row is unique or not. Arrays.hashCode() gives you a reasonably accurate indication that a row is unique in a single int, so you can use it to limit the amount of data you need to hold in memory:
Construct two HashSet<Integer> objects, called seen and seenTwice. Loop over your data, and add each array's hash to seen, and to seenTwice if it was already in seen, like so:
int[] arr = ... // construct the row's array
int hash = Arrays.hashCode(arr);
if (!seen.add(hash)) {
    // add() returns false if we've already seen this hash
    seenTwice.add(hash);
}
Now we have a set of hashes that we saw two or more times; in theory, this will be a much smaller set than the number of rows in our file (hash collisions may let a few non-duplicates through, but the second pass sorts those out). We can let seen get garbage collected and re-read the file, using seenTwice to decide which rows are worth keeping in memory, much like you were first trying to do. One caveat: a HashSet<int[]> won't detect duplicates, because arrays use identity-based equals()/hashCode(), so store each row as a List<Integer> in a Set<List<Integer>> rows instead:
int[] arr = ... // construct the row's array
int hash = Arrays.hashCode(arr);
if (seenTwice.contains(hash)) {
    // If the hash isn't in seenTwice, we know the row can't be a duplicate
    List<Integer> row = Arrays.stream(arr).boxed().collect(Collectors.toList());
    if (!rows.add(row)) {
        System.out.println("Row " + Arrays.toString(arr) + " is a duplicate!");
    }
}
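For completeness, here's a rough sketch of the whole two-pass approach in one place. The filename ("data.csv") and the assumption that every line is a plain comma-separated list of integers are placeholders; adapt the parsing to whatever your rows actually look like:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class DuplicateFinder {

    public static void main(String[] args) throws IOException {
        String filename = "data.csv"; // placeholder path

        // Pass 1: collect the hashes that appear more than once
        Set<Integer> seen = new HashSet<>();
        Set<Integer> seenTwice = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(filename))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int hash = Arrays.hashCode(parseRow(line));
                if (!seen.add(hash)) {
                    seenTwice.add(hash);
                }
            }
        }
        seen = null; // let pass 1's large set be garbage collected

        // Pass 2: only materialize rows whose hash appeared twice, then compare contents
        Set<List<Integer>> rows = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(filename))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int[] arr = parseRow(line);
                if (seenTwice.contains(Arrays.hashCode(arr))) {
                    List<Integer> row = Arrays.stream(arr).boxed().collect(Collectors.toList());
                    if (!rows.add(row)) {
                        System.out.println("Row " + Arrays.toString(arr) + " is a duplicate!");
                    }
                }
            }
        }
    }

    // Assumes each line is a comma-separated list of integers
    private static int[] parseRow(String line) {
        return Arrays.stream(line.split(","))
                     .map(String::trim)
                     .mapToInt(Integer::parseInt)
                     .toArray();
    }
}

The point of the two passes is that the first one only keeps a single int per distinct row, and the second one only holds the rows whose hash was seen more than once, so peak memory use stays far below keeping all 50 million rows in the map at once.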
Use Bash: If you're willing to forgo Java, you can find duplicates very easily with a basic bash command:
cat filename | sort | uniq -d
Use a Database: You can, as you were alluding to, use an on-disk solution, notably a database. A good, easy-to-use Java database is H2, but covering its use in detail is outside the scope of this answer. Suffice it to say, you could load your data from the file into a database and then simply query for duplicate rows: Finding duplicate values in a SQL table
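To give a rough idea rather than a full tutorial: with the H2 driver on the classpath, it boils down to loading the file into a table and running a GROUP BY over all five columns. The JDBC URL, table name, and column names below are made up for the example, and the loading step uses H2's CSVREAD helper (check the H2 docs for its exact options):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DbDuplicates {

    public static void main(String[] args) throws SQLException {
        // File-based H2 database; URL and credentials are just examples
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./dupes", "sa", "");
             Statement stmt = conn.createStatement()) {

            // Load the CSV into a table; the second CSVREAD argument names the
            // columns, since the file has no header row (hypothetical names C1..C5)
            stmt.execute("CREATE TABLE csv_rows AS "
                       + "SELECT * FROM CSVREAD('data.csv', 'C1,C2,C3,C4,C5')");

            // Any combination of values appearing more than once is a duplicate row
            String query = "SELECT C1, C2, C3, C4, C5, COUNT(*) AS cnt FROM csv_rows "
                         + "GROUP BY C1, C2, C3, C4, C5 HAVING COUNT(*) > 1";
            try (ResultSet rs = stmt.executeQuery(query)) {
                while (rs.next()) {
                    System.out.println("Duplicate row, seen " + rs.getLong("cnt") + " times");
                }
            }
        }
    }
}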
But setting up a DB simply to find duplicates in 50 million lines is overkill. I wouldn't recommend this option.
See also: Script to find duplicates in a csv file
I'm not sure exactly what you're trying to do, but would a SQL database help? Then you could store your values externally and wouldn't need this large amount of RAM.
If that's not an option for you, that's unfortunate; when I read your question, I was fairly sure a database would solve your problem.