There are around 3 millions of arrays - or Python lists\tuples (does not really matter). Each array consists of the following elements:
['string1', 'string2', 'string3', ...]  # totally, 10000 elements
These arrays should be stored in some kind of key-value storage. Let's assume now it's a Python's dict, for a simple explanation.
So, 3 millions of keys, each key represents a 10000-elements array.
Lists\tuples or any other custom thing - it doesn't really matter. What matters is that arrays should consist strings - utf8 or unicode strings, from 5 to about 50 chars each. There are about 3 millions of possible strings as well. It is possible to replace them with integers if it's really needed, but for more efficient further operations, I would prefer to have strings.
Though it's hard to give you a full description of the data (it's complicated and odd), it's something similar to synonyms - let's assume we have 3 millions of words - as the dict keys - and 10k synonyms for each of the word - or element of the list.
Like that (not real synonyms but it will give you the idea):
{
    'computer': ['pc', 'mac', 'laptop', ...],  # (10k totally)
    'house': ['building', 'hut', 'inn', ...],  # (another 10k)
     ...
}
Elements - 'synonyms' - can be sorted if it's needed.
Later, after the arrays are populated, there's a loop: we go thru all the keys and check if some var is in its value. For example, user inputs the words 'computer' and 'laptop' - and we must quickly reply if the word 'laptop' is a synonym of the word 'computer'. The issue here is that we have to check it millions of time, probably 20 millions or so. Just imagine we have a lot of users entering some random words - 'computer' and 'car', 'phone' and 'building', etc. etc. They may 'match', or they may not 'match'.
So, in short - what I need is to:
I should be able to keep memory usage below 30GB. Also I should be able to perform all the iterations in less than 10 hours on a Xeon CPU.
It's ok to have around 0.1% of false answers - both positive and negative - though it would be better to reduce them or don't have them at all.
What is the best approach here? Algorithms, links to code, anything is really appreciated. Also - a friend of mine suggested using bloom filters or marisa tries here - is he right? I didn't work with none of them.
I would map each unique string to a numeric ID then associate a bloom filter with around 20 bits per element for your <0.1% error rate. 20 bits * 10000 elements * 3 million keys is 75GB so if you are space limited, then store a smaller less accurate filter in memory and the more accurate filter on disk which is called up if the first filter says it might be a match.
There are alternatives, but they will only reduce the size from 1.44·n·ln2(1/ε) to n·ln2(1/ε) per key, in your case ε=0.001 so the theoretical limit is a data structure of 99658 bits per key, or 10 bits per element, which would be 298,974,000,000 bits or 38 GB.
So 30GB is below the theoretical limit for a data structure with the performance and number of entries that you require, but within the ball park.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With