I have a system where two programs are concurrently reading and writing a file containing a large number of pairs of unsigned integers. The key can have a value [0,2^16) and the value can have a value [0,2^32). I am currently storing these integers in the file as characters, where each pair is on its own line, the key and value are separated by a single whitespace, and all values have 10 characters.
The reason for the 10 characters is two-fold: 1. the largest unsigned integer, interpreted as characters, is 10 characters long, and 2. I am using mmap to map the file into both programs' memory, using the MAP_SHARED flag. When the program that writes to the file is first run, it will linearly scan the file and construct a hash map mapping the keys to pointers to the values. When the writer wants to put a new key-value pair into the file, it will search for the key in the hash map, and if that key exists, it will overwrite the value stored in the address associated with that key. I ensure that all values are stored as 10 characters in the file, where smaller values are padded on the right with whitespaces, so that the writer can always simply overwrite the value at that address and not have to do some crazy shuffling of memory and munmap and mmap again.
However, storing integers as strings in a file is extremely inefficient, and I would like to know how to more efficiently do three things:
Store the pairs of integers as, well, pairs of integers and not strings. I want the file to be as small as possible.
Not have to use an additional large data structure like the hash map in the writing program.
Allow the reading program to search for and retrieve a key-value pair in log(n) time, instead of what it is doing now, which is reading through the file linearly for a matching string.
In general, can a file be structured to behave like a data structure that can be sorted by something like binary search, and if so, how would I go about doing that?
Data Storage: I'd suggest that you store your data in binary form, this will somewhat reduce the file size. To keep platform-independence, you should use something like msgpack.
Data Access: I think mmap() is a good idea (especially if you use a cross-platform solution like boost memory-mapped file)
Search: Since mmap()ing will give some pointer to handle your data, you should be able to use it as a raw C-style array or wrap it in an array_view and use it with the <algorithm> header facilities.
I ended up implementing a solution that I was looking for, which does the following things:
Key-value pairs are stored in the file in binary format, 4 bytes for the key followed by 4 bytes for the value.
The file contains only key-value pairs, so the file is just a stream of key-value pairs with no delimiters or extra fluff.
Both the reading and writing programs can search for a key and retrieve the corresponding value in log(n) time. This was achieved by reinterpreting the binary as an array of 8 byte chunks, reinterpreting each 8 byte chunk as two 4 byte chunks (key and value), and performing binary search on the mapped file.
The code I came up with is as follows:
struct Pair { uint32_t index[]; };
struct PairArray { uint64_t index[]; };
size_t getFilesize(const char* filename) {
struct stat st;
stat(filename, &st);
return st.st_size;
}
void binarySearch(const PairArray* const pairArray,
uint16_t numElements, uint32_t key, uint32_t*& value) {
int mid = numElements/2;
if (numElements == 0) return;
// interpret the pair as an array of 4 byte key and value
const Pair* pair = reinterpret_cast<const Pair*>(&(pairArray->index[mid]));
// new pointer to pass into recursive call
const PairArray* const newPairArray = reinterpret_cast<const PairArray* const>(
&(pairArray->index[mid + 1]));
// if key is found, point pointer passed by reference to value
if (key == pair->index[0]) {
value = const_cast<uint32_t*>(&pair->index[1]);
return;
}
// if search key is less than current key, binary search on left subarray
else if (key < pair->index[0]) {
binarySearch(pairArray, mid, key, value);
}
// otherwise, binary search on right subarray
else (numElements%2 == 0)
? binarySearch(newPairArray, mid - 1, key, value)
: binarySearch(newPairArray, mid, key, value);
}
int main(int argc, char** argv) {
...
// get size of the file
size_t filesize = getFilesize(argv[1]);
// open file
int fd = open(argv[1], O_RDWR, 0);
if (fd < 0) {
std::cerr << "error: file could not be opened" << std::endl;
exit(EXIT_FAILURE);
}
// execute mmap:
char* mmappedData = static_cast<char*>(
mmap(NULL, filesize, PROT_WRITE|PROT_READ, MAP_SHARED, fd, 0));
if (mmappedData == NULL) {
std::cerr << "error: could not memory map file" << std::endl;
exit(EXIT_FAILURE);
}
// interpret the memory mapped file as an array of 8 byte pairs
const PairArray* const pairArray = reinterpret_cast<PairArray*>(mmappedData);
// spin until file is unlocked, and take lock for yourself
while(true) {
int gotLock = flock(fd, LOCK_SH);
if (gotLock == 0) break;
}
// binary search for key value pair
uint32_t* value = nullptr;
binarySearch(pairArray, filesize/8, key, value);
(value == nullptr)
? std::cout << "null" << std::endl
: std::cout << *value << std::endl;
// release lock
flock(fd, LOCK_UN);
...
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With