Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

kdb q - efficiently count tables in flatfiles

Tags:

kdb+

I have a lot of tables stored in flat files (in a directory called basepath) and I want to check their number of rows. The best I can so right now is:

c:([] filename:system "ls ",basepath; 
      tablesize:count each get each hsym `$basepath,/:system "ls ",basepath)

which loads each table entirely into memory and then performs the count (that's quite slow). Is saving as splayed tables the only way to make this faster (because I would only load 1 column and count that) or is there a trick in q that I can use?

Thanks for the help

like image 316
tenticon Avatar asked Dec 10 '25 17:12

tenticon


1 Answers

If you have basepath defined as a string of the path to directory where all your flat tables are stored then you can create a dictionary of the row counts as follows:

q)cnt:{count get hsym x}
q)filename:key hsym `$basepath
q)filename!cnt each filename
t| 2
g| 3

This is where I have flat tables t and g saved in my basepath directory. This stops you from having to use system commands which are often less effiecient. The function cnt takes the path of each flat table (as a symbol) and returns the number of rows without saving them into memory.

The best solution if you have control of the process of saving such files down is to add an extra step of saving the meta information of the row count somewhere seperate at the same time of saving the raw data. This would allow you to quickly access the table size from this file instead of reading the full tbale in each time.

However, note that to avoid pulling them into memory at all you would have to instead use read1 and look at the headers on the binary data. As you said it would be better to save as a splayed table and read in one column.

UPDATE: I would not recommend doing this and strongly suggest doing the above but for curiosity after looking into using read1 here's an example what what a hacky solution might look like:

f:{
  b:read1(y;0;x);
  if[not 0x62630b~b[2 4 5];'`$"not a table"];
  cc:first first((),"i";(),4)1:b 7+til 4;
  if[null ce:first where cc=sums 0x0=11 _ b;:.z.s[x*2;y]];
  c:`$"\000" vs "c"$b[11+til ce];
  n:first first((),"i";(),4)1:b[(20+ce)+til 4];
  :`columns`rows!(c;n);
  }[2000]

The q binary file format isn’t documented anywhere, the only way to figure it out is to save different things and see how the bytes change. It’s also subject to changes between versions - the above is written for 3.5 and is probably valid for 3.0-3.5 only, not the latest 3.6 release or anything 2.X.

The given code works in the following way:

  1. reads a chunk from the front of the file

  2. validates that it looks like a flat unkeyed table (flip[98] of a dict[99] with symbol[11] keys)

  3. reads the count of symbols in the list of columns as a little endian 4 byte int
  4. scans through the null terminated strings for that many zero bytes
  5. if the columns are so numerous or verbose that we don’t have them all in this chunk it will double the size of the chunk and try again
  6. turn the strings into symbols
  7. using the offset we get from the end of the column list, skip a bit more of the header for the mixed list of columns
  8. then read the count from the header of the first column

Hope this answers your question!

like image 136
Fiona Morgan Avatar answered Dec 12 '25 21:12

Fiona Morgan