HyperLogLog estimates cardinality of a multiset. Is it possible to extend it to handle multiple multisets? Like, instead of just supporting query estimateCardinality(), it would support estimateCardinality(multiset_id). I'm trying to avoid having a dictionary of HyperLogLog values for each multiset_id.
Is there another way (data structure) to achieve this?
The following idea might help out when you have a largish number of multiesets with high-variance in their cardinalities; that is, some have large size, and some have small size. It does not require you to estimate in advance which will be small and which will be large.
You can build a Linear Probabilistic Counter, with a small change. The original data structure has a (logical) boolean in each position. Here, each position would itself be a classial set. Instead of setting a bit on an
insert(element) 
op if it falls in this position, you would insert id into the set on an 
insert(element, id)
There are some common-sense tricks you ould do to conserve space. E.g., you could decide that if id appears in over a certain fraction of the bins, then it is not stored in bin sets, but rather in a separate bitmap over all bins. 
Overall, then, if you have both small and large sets, you would end up with the following:
a bitmap for each large set (this is the same cost per item for your dictionary of counters idea)
entries in some of the bits' sets for each small set (possibly much smaller than your dictionary of counters idea)
Since the data structure can switch, for a specific multiset, from the latter to the former - it might save space relative to the dictionary of counters idea, which might be considered a premature pessimization.
YMMV.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With