Most efficient way of storing exact set membership?

Question

I have N slots. There are M slots occupied.

I want to be able to tell exactly whether each slot is occupied. (No Bloom filter answers, please.)

What is the absolute most storage-space efficient way of storing this information for M << N?

Guess 0: A bitmap of N bits.
Guess 1: An array of the positions of the occupied slots. Reasonably good for small M.
Guess 2: p0 + (M-1)p1 + (M-1)(M-2)p2 + ... where pX is the position of an occupied slot, among the remaining unoccupied slots. This is slightly more efficient than guess 1, as the choice of unoccupied slot narrows as slots are filled.

Guess 2 still has a lot of waste; it includes the order in which the slots were filled, which is information that is not required.

What method is more efficient than Guess 2?

Peter de Rivaz · Accepted Answer

If M and N are known, then one way of achieving the best compression is to store the index of the combination.

There are t= N!/((N-M)!*M!) ways of choosing the M slots to be filled, so you will always need at least log2(t) bits to represent this information.

Storing the index of the combination allows you to use exactly ceil(log2(t))bits.

rici · Answer

Assuming that there is no other information about the distribution of the values (i.e., every possible sample of M values is equally probable) then the optimal compression technique is that given by @PeterdeRivaz, which is to simply using the ordinal of the enumeration of the sample out of the set of possible samples.

However, that is not trivial to compute, since the enumeration requires arithmetic on very large numbers.

Instead, it is possible to use a variant on Golomb compression, with only a small impact on the compression ratio.

Assume the numbers in the sample are in order. We start by computing the successive differences. Because we will never have two equal numbers, the sequence of differences will never include a 0; to gain a tiny additional compression, we start with one more than the first element of the sample, rather than the first element itself -- which means that the sequence never contains a 0 -- and then subtract one from each value. We now select some convenient number of bits k, and encode each value δ in the sequence as follows:

While δ > 2^k, send a 1 bit and subtract 2^k from δ
Now δ can be written in k bits. Send a 0 followed by the k-bit value of δ

We can choose k as ⌊log₂(N/M)⌋, which means that N < 2^k+1M. (Taking the ceiling would be another possibility.) Consequently, the number of 1 bits sent in all of the iterations of step 1 in the above algorithm is less than 2M (because each 1 accounts for 2^k of the cumulative sum of the series, which is less than N). Each step 2 sends exactly k + 1 bits, and there are exactly M executions of step 2, one for each value in the sample. Thus, the total number of bits sent is somewhere between M × (k + 1) and M × (k + 2). But since we know k < log₂(N/M) + 1, the total size of the transmission is certainly less than M log₂ N - M log₂ M + 3M). (We also have to send the parameters k and M, so there is a bit of overhead.)

Now, let's consider the optimal transmission size. There are N choose M possible samples, so the size of the enumeration index in bits will be log₂(N choose M). If N ≫ M, we can approximate N choose M as N^M/M!, and then using Stirling's approximation we get:

log(N choose M) ≈ M log N − M log M + M

(That's actually a slight overestimate, but it is asymptotically correct.)

Thus, the difference between the compressed sequence and the information-theoretic limit is less than 2 bits per value. (In practice, it is generally around one bit per value, because step 1 executes far less than the maximum number of times.)

Most efficient way of storing exact set membership?

Tags:

algorithm

serialization

set

membership

fadedbee

2 Answers

Peter de Rivaz

rici

Recent Activity

Donate For Us

Most efficient way of storing exact set membership?

Tags:

algorithm

serialization

set

membership

fadedbee

2 Answers

Peter de Rivaz

rici

Related questions

Recent Activity

Donate For Us