I'm very new to Java, so forgive me if I'm doing something terribly wrong.
I'm working on a project where I need to quickly scan a very large volume of data (a CSV with 50 million lines or more, 5 entries per line) for repeats. I've resorted to using a HashMap, since its containsKey() lookup is fast.
However, I end up having to store a million keys or more in the map. Each key is associated with an int[] array, which can have anywhere from 1 to 100 entries. So obviously, I end up getting an OutOfMemoryError unless I'm using a laptop with ~16 GB of RAM.
I was thinking that once the HashMap gets more than N keys, or a key gets more than N entries, I could write it out somewhere and clear it. However, not all keys or values are found at once, so I need to be able to add to the HashMap I've already written out, not overwrite it.
I've searched far and wide and still can't find a way to do it, so thanks a lot to whoever can help!
You have quite a lot of options here; I'll list some of them:
-Xmx JVM flag: e.g. -Xmx3G, as Dimitry suggests, will give you three gigabytes of heap, versus the default, which is typically much smaller.
Store Less Data: You're currently storing the whole row of "1 to 100 entries", when really all you need to know is whether a row is unique or not. Arrays.hashCode() gives you a reasonably accurate indication that a row is unique in a single int, so you can use it to limit the amount of data you need to hold in memory:
Construct two HashSet<Integer> objects, called seen and seenTwice. Loop over your data, and add each array's hash to seen, and to seenTwice if it was already in seen, like so:
int[] arr = ... // construct the row's array
int hash = Arrays.hashCode(arr);
if (!seen.add(hash)) {
    // add() returns false if we've already seen this hash
    seenTwice.add(hash);
}
Now we have a set of hashes that we saw two or more times; in theory, this will be a much smaller set than the number of rows in our file (hash collisions may let a few non-duplicates through, but the second pass sorts those out). We can let seen get garbage collected and re-read the file, using seenTwice to decide which rows are worth keeping in memory, much like you were first trying to do. One caveat: a HashSet<int[]> won't detect duplicates, because arrays use identity-based equals()/hashCode(), so store each row as a List<Integer> in a Set<List<Integer>> rows instead:
int[] arr = ... // construct the row's array
int hash = Arrays.hashCode(arr);
if (seenTwice.contains(hash)) {
    // If the hash isn't in seenTwice, we know the row can't be a duplicate
    List<Integer> row = Arrays.stream(arr).boxed().collect(Collectors.toList());
    if (!rows.add(row)) {
        System.out.println("Row " + Arrays.toString(arr) + " is a duplicate!");
    }
}
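For completeness, here's a rough sketch of the whole two-pass approach in one place. The filename ("data.csv") and the assumption that every line is a plain comma-separated list of integers are placeholders; adapt the parsing to whatever your rows actually look like:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class DuplicateFinder {

    public static void main(String[] args) throws IOException {
        String filename = "data.csv"; // placeholder path

        // Pass 1: collect the hashes that appear more than once
        Set<Integer> seen = new HashSet<>();
        Set<Integer> seenTwice = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(filename))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int hash = Arrays.hashCode(parseRow(line));
                if (!seen.add(hash)) {
                    seenTwice.add(hash);
                }
            }
        }
        seen = null; // let pass 1's large set be garbage collected

        // Pass 2: only materialize rows whose hash appeared twice, then compare contents
        Set<List<Integer>> rows = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(filename))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int[] arr = parseRow(line);
                if (seenTwice.contains(Arrays.hashCode(arr))) {
                    List<Integer> row = Arrays.stream(arr).boxed().collect(Collectors.toList());
                    if (!rows.add(row)) {
                        System.out.println("Row " + Arrays.toString(arr) + " is a duplicate!");
                    }
                }
            }
        }
    }

    // Assumes each line is a comma-separated list of integers
    private static int[] parseRow(String line) {
        return Arrays.stream(line.split(","))
                     .map(String::trim)
                     .mapToInt(Integer::parseInt)
                     .toArray();
    }
}

The point of the two passes is that the first one only keeps a single int per distinct row, and the second one only holds the rows whose hash was seen more than once, so peak memory use stays far below keeping all 50 million rows in the map at once.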
Use Bash: If you're willing to forgo Java, you can find duplicates very easily with a basic bash command:
cat filename | sort | uniq -d
Use a Database: You can, as you were alluding to, use an on-disk solution, notably a database. A good, easy-to-use Java database is H2, but covering its use in detail is outside the scope of this answer. Suffice it to say, you could load your data from the file into a database and then simply query for duplicate rows: Finding duplicate values in a SQL table
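To give a rough idea rather than a full tutorial: with the H2 driver on the classpath, it boils down to loading the file into a table and running a GROUP BY over all five columns. The JDBC URL, table name, and column names below are made up for the example, and the loading step uses H2's CSVREAD helper (check the H2 docs for its exact options):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DbDuplicates {

    public static void main(String[] args) throws SQLException {
        // File-based H2 database; URL and credentials are just examples
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./dupes", "sa", "");
             Statement stmt = conn.createStatement()) {

            // Load the CSV into a table; the second CSVREAD argument names the
            // columns, since the file has no header row (hypothetical names C1..C5)
            stmt.execute("CREATE TABLE csv_rows AS "
                       + "SELECT * FROM CSVREAD('data.csv', 'C1,C2,C3,C4,C5')");

            // Any combination of values appearing more than once is a duplicate row
            String query = "SELECT C1, C2, C3, C4, C5, COUNT(*) AS cnt FROM csv_rows "
                         + "GROUP BY C1, C2, C3, C4, C5 HAVING COUNT(*) > 1";
            try (ResultSet rs = stmt.executeQuery(query)) {
                while (rs.next()) {
                    System.out.println("Duplicate row, seen " + rs.getLong("cnt") + " times");
                }
            }
        }
    }
}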
But setting up a DB simply to find duplicates in 50 million lines is overkill. I wouldn't recommend this option.
See also: Script to find duplicates in a csv file
I'm not sure exactly what you're trying to do, but would a SQL database help? Then you could store your values externally and wouldn't need this large amount of RAM.
If that's not an option for you, that's unfortunate; when I read your question, I was fairly sure a database would solve your problem.