Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most efficient way of storing exact set membership?

I have N slots. There are M slots occupied.

I want to be able to tell exactly whether each slot is occupied. (No Bloom filter answers, please.)

What is the absolute most storage-space efficient way of storing this information for M << N?

  • Guess 0: A bitmap of N bits.
  • Guess 1: An array of the positions of the occupied slots. Reasonably good for small M.
  • Guess 2: p0 + (M-1)p1 + (M-1)(M-2)p2 + ... where pX is the position of an occupied slot, among the remaining unoccupied slots. This is slightly more efficient than guess 1, as the choice of unoccupied slot narrows as slots are filled.

Guess 2 still has a lot of waste; it includes the order in which the slots were filled, which is information that is not required.

What method is more efficient than Guess 2?

like image 791
fadedbee Avatar asked Dec 17 '25 06:12

fadedbee


2 Answers

If M and N are known, then one way of achieving the best compression is to store the index of the combination.

There are t= N!/((N-M)!*M!) ways of choosing the M slots to be filled, so you will always need at least log2(t) bits to represent this information.

Storing the index of the combination allows you to use exactly ceil(log2(t))bits.

like image 162
Peter de Rivaz Avatar answered Dec 19 '25 22:12

Peter de Rivaz


Assuming that there is no other information about the distribution of the values (i.e., every possible sample of M values is equally probable) then the optimal compression technique is that given by @PeterdeRivaz, which is to simply using the ordinal of the enumeration of the sample out of the set of possible samples.

However, that is not trivial to compute, since the enumeration requires arithmetic on very large numbers.

Instead, it is possible to use a variant on Golomb compression, with only a small impact on the compression ratio.

Assume the numbers in the sample are in order. We start by computing the successive differences. Because we will never have two equal numbers, the sequence of differences will never include a 0; to gain a tiny additional compression, we start with one more than the first element of the sample, rather than the first element itself -- which means that the sequence never contains a 0 -- and then subtract one from each value. We now select some convenient number of bits k, and encode each value δ in the sequence as follows:

  1. While δ > 2k, send a 1 bit and subtract 2k from δ

  2. Now δ can be written in k bits. Send a 0 followed by the k-bit value of δ

We can choose k as ⌊log2(N/M)⌋, which means that N < 2k+1M. (Taking the ceiling would be another possibility.) Consequently, the number of 1 bits sent in all of the iterations of step 1 in the above algorithm is less than 2M (because each 1 accounts for 2k of the cumulative sum of the series, which is less than N). Each step 2 sends exactly k + 1 bits, and there are exactly M executions of step 2, one for each value in the sample. Thus, the total number of bits sent is somewhere between M × (k + 1) and M × (k + 2). But since we know k < log2(N/M) + 1, the total size of the transmission is certainly less than M log2 N - M log2 M + 3M). (We also have to send the parameters k and M, so there is a bit of overhead.)

Now, let's consider the optimal transmission size. There are N choose M possible samples, so the size of the enumeration index in bits will be log2(N choose M). If N ≫ M, we can approximate N choose M as NM/M!, and then using Stirling's approximation we get:

log(N choose M) ≈ M log N − M log M + M

(That's actually a slight overestimate, but it is asymptotically correct.)

Thus, the difference between the compressed sequence and the information-theoretic limit is less than 2 bits per value. (In practice, it is generally around one bit per value, because step 1 executes far less than the maximum number of times.)

like image 42
rici Avatar answered Dec 19 '25 23:12

rici



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!