Best-suited data structure for prefix-based searches

I have to maintain an in-memory data structure of key-value pairs. I have the following constraints:

  1. Keys and values are text strings of length 256 and 1024 respectively. A key generally looks like k1k2k3k4k5, with each k(i) being a 4-8 byte string in itself.
  2. As far as possible, the in-memory data structure should use contiguous memory. I have 400 MB worth of key-value pairs and am allowed 120% of that for allocation (the additional 20% is for metadata, only if needed).
  3. The DS will have the following operations:
  4. Add [Infrequent Operation]: a typical signature looks like void add_kv(void *ds, char *key, char *value);
  5. Delete [Infrequent Operation]: a typical signature looks like void del_kv(void *ds, char *key);
  6. Lookup [MOST FREQUENT OPERATION]: a typical signature looks like char *lookup(void *ds, char *key);
  7. Iterate [MOST FREQUENT OPERATION]: this operation is prefix-based. It allocates an iterator, i.e. it walks the whole DS and returns the list of key-value pairs whose keys match prefix_key (e.g. "k1k2k3.*", with k(i) defined as above). Each iteration then runs over this iterator (list), and freeing the iterator frees the list. A typical iterator returns about 100 KB worth of key-value pairs out of the 400 MB DS (100 KB : 400 MB :: 1 : 4000). A typical signature looks like void *iterate(void *ds, char *prefix_key);
  8. Bullets 6 and 7, being the most frequent operations, need to be optimized for; a sketch of the implied interface follows this list.
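
For concreteness, here is a minimal sketch of the interface implied by the signatures above. The kv_pair/kv_iter layouts and the kv_iter_next/kv_iter_free helpers are assumptions made for illustration; the question only fixes the four signatures.

```c
#include <stddef.h>

/* One key-value pair; sizes per the constraints above (assumed layout). */
typedef struct kv_pair {
    char *key;              /* text string, length 256 */
    char *value;            /* text string, length 1024 */
} kv_pair;

/* Iterator returned by iterate(): a materialized list of matches. */
typedef struct kv_iter {
    kv_pair *items;         /* pairs whose keys match the prefix */
    size_t   count;         /* typically ~100 KB worth of data */
    size_t   pos;           /* cursor for kv_iter_next() */
} kv_iter;

void     add_kv(void *ds, char *key, char *value);   /* infrequent */
void     del_kv(void *ds, char *key);                /* infrequent */
char    *lookup(void *ds, char *key);                /* most frequent */
void    *iterate(void *ds, char *prefix_key);        /* most frequent; returns kv_iter* here */

kv_pair *kv_iter_next(kv_iter *it);   /* NULL when exhausted (assumed helper) */
void     kv_iter_free(kv_iter *it);   /* frees the materialized list (assumed helper) */
```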

My question is: what is the best-suited data structure for the above constraints?

I have considered a hash table. Add/delete/lookup could be done in O(1) as I have sufficient memory, but it is not optimal for iteration. A hash-of-hashes (hash on k1, then on k2, then on k3, ...) or an array of hashes could be used, but that violates bullet 2. What other options do I have?
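
To make the iteration problem concrete, here is a hedged sketch of what iterate() ends up doing over a plain hash table: with no key ordering, every bucket and every key must be tested, so each call costs O(total keys) even though only about 1/4000 of the data matches. The chained-bucket layout and names (hash_ds, entry) are hypothetical.

```c
#include <string.h>

struct entry   { char *key; char *value; struct entry *next; };
struct hash_ds { struct entry **buckets; size_t nbuckets; };

/* Emit every pair whose key starts with `prefix` -- requires a full scan. */
static void scan_prefix(struct hash_ds *h, const char *prefix,
                        void (*emit)(const char *key, const char *value))
{
    size_t plen = strlen(prefix);
    for (size_t b = 0; b < h->nbuckets; b++)            /* walk the whole table... */
        for (struct entry *e = h->buckets[b]; e != NULL; e = e->next)
            if (strncmp(e->key, prefix, plen) == 0)     /* ...for a handful of hits */
                emit(e->key, e->value);
}
```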

asked Oct 27 '25 by Subhajit Kundu
1 Answer

I would probably use something like a B+tree for this: https://en.wikipedia.org/wiki/B%2B_tree

Since memory efficiency is important to you, when a leaf block gets full you should redistribute keys among several blocks where possible, so that blocks stay at least 85% full. The block size should be large enough that the overhead from internal nodes is only a few percent.
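
A rough sketch of why a B+tree fits the iterate() pattern: leaf blocks are sorted and chained left to right, so a prefix scan is "descend once, then walk right until keys stop matching". The node layout and fanout below are illustrative assumptions, not a prescribed format.

```c
#include <string.h>

#define LEAF_CAP 128                      /* assumed leaf fanout */

struct leaf {
    int          nkeys;
    char        *keys[LEAF_CAP];          /* kept sorted within the block */
    char        *values[LEAF_CAP];
    struct leaf *next;                    /* right-sibling pointer */
};

/* `start` is the first leaf that could contain `prefix`, found by the usual
 * root-to-leaf B+tree descent (not shown). */
static void prefix_scan(struct leaf *start, const char *prefix,
                        void (*emit)(const char *key, const char *value))
{
    size_t plen = strlen(prefix);
    for (struct leaf *lf = start; lf != NULL; lf = lf->next) {
        for (int i = 0; i < lf->nkeys; i++) {
            int cmp = strncmp(lf->keys[i], prefix, plen);
            if (cmp < 0) continue;        /* still before the prefix range */
            if (cmp > 0) return;          /* past the range: scan is done */
            emit(lf->keys[i], lf->values[i]);
        }
    }
}
```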

You can also optimize storage in the leaf blocks, since most of the keys in a block will have a long common prefix that you can figure out from the blocks in the higher levels. You can therefore remove all the copies of the common prefix from the keys in the leaf blocks, and your 400 MB of key-value pairs will take substantially less than 400 MB of RAM. This will complicate the insert process somewhat.
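
As a sketch of that leaf-level prefix stripping under assumed names (packed_leaf, its fields, and LEAF_CAP are all illustrative): the block stores its common prefix once, entries keep only their suffixes, and full keys are rebuilt on demand.

```c
#include <stdio.h>
#include <string.h>

#define LEAF_CAP 128

struct packed_leaf {
    char  common[256];          /* shared prefix of all keys in this block */
    int   common_len;           /* length of that prefix, < 256 */
    int   nkeys;
    char *suffixes[LEAF_CAP];   /* key with the shared prefix removed */
    char *values[LEAF_CAP];
};

/* Rebuild the i-th full key into `out` (caller provides >= 256 bytes). */
static void leaf_full_key(const struct packed_leaf *lf, int i, char out[256])
{
    memcpy(out, lf->common, lf->common_len);
    snprintf(out + lf->common_len, 256 - lf->common_len, "%s", lf->suffixes[i]);
}
```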

There are other things you can do to compress this structure further, but that gets complicated fast and it doesn't sound like you need it.

answered Oct 29 '25 by Matt Timmermans