Introduction
I have a C++ process called MyProcess that I call nbLines times, where nbLines is the number of lines of a big file called InputDataFile.txt in which input data are to be found. For example the call
./MyProcess InputDataFile.txt 142
Inform MyProcess that the input data are to be found at the line 142 of the InputDataFile.txt file.
Issue
The issue is that InputDataFile.txt is so big (~ 150 GB) that the time for searching the correct line is not negligible. Inspired form this post, here is my (possibly not optimal) code
int line = 142;
int N = line - 1;
std::ifstream inputDataFile(filename.c_str());
std::string inputData;
for(int i = 0; i < N; ++i)
    std::getline(inputDataFile, inputData);
std::getline(inputDataFile,inputData);
Goal
My goal is to make the search of inputData faster for MyProcess.
Possible solution
It would be handy to match once the index of the first character of every line with the line number in bash. This way instead of giving 142 to MyProcess, I could give directly the index of the first character of interest. MyProcess could then directly jump to this position without having to search and count the '\n' characters. It would then read the data until a '\n' character is encounter. Is something like this feasible? How could this be implemented?
Of course, I welcome any other solution that would reduce the overall computational time for importing those input data.
As Suggested in other answers it could be a good idea to build a map of the file. The way I would do this (in pseudocode) would be:
let offset be a unsigned 64 bit int =0;
for each line in the file 
    read the line
    write offset to a binary file (as 8 bytes rather as chars)
    offset += length of line in bytes
Now you have a "Map" file that is a list of 64 bit ints (one for each line in the file). To read the map you just compute where in the map the entry for the line you desire is located:
offset = desired_line_number * 8 // where line number starts at 0
offset2 = (desired_line_number+1) * 8
data_position1 = load bytes [offset through offset + 8] as a 64bit int from map
data_position2 = load bytes [offset2 through offset2 + 8] as a 64bit int from map
data = load bytes[data_position1 through data_position2-1] as a string from data.
The idea is that you read through the data file once and record the byte offset in the file where each line starts and then store the offsets sequentially in a binary file using a fixed size integer type. The map file should then have a size of number_of_lines * sizeof(integer_type_used). You then just have to seek into the map file by calculating the offset of where you stored the line number offset and read that offset as well as the next lines offset. From there you have a numerical range in bytes of where your data should be located.
Example:
Data:
hello\n 
world\n
(\n newline at end of file)
Create map.
Map: each grouping [number] will represent an 8 byte length in the file
[0][7][14]
//or in binary
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000111
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00001110
Now say I want line 2:
line offset = 2-1 * 8 // offset is 8 
So since we are using a base 0 system that would be the 9th byte in the file. So out number is made up of bytes 9 - 17 which are :
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000111
//or as decimal
7
So now we know that out line should start at offset 7 in our data file (This offset is base 1, it would be 6 if we started counting at 0).
We then do the same process to get the start offset of the next line which is 14.
Finally we look up the byte range 7-14 (base 1, 6-13 base 0) and store that as a string and get world\n.
C++ implementation:
#include <iostream>
#include <fstream>
int main(int argc, const char * argv[]) {
    std::string filename = "path/to/input.txt";
    std::ifstream inputFile(filename.c_str(),std::ios::binary);
    std::ofstream outfile("path/to/map/file.bin",std::ios::binary|std::ios::ate);
    if (!inputFile.is_open() || !outfile.is_open()) {
        //use better error handling than this
        throw std::runtime_error("Error opening files");
    }
    std::string inputData;
    std::size_t offset = 0;
    while(std::getline(inputFile, inputData)){
        //write the offset as binary
        outfile.write((const char*)&offset, sizeof(offset));
        //increment the counter
        offset+=inputData.length()+2;
        //add one becuase getline strips the \n and add one to make the index represent the next line
    }
    outfile.close();
    offset=0;
    //from here on we are reading the map
    std::ifstream inmap("/Users/alexanderzywicki/Documents/xcode/textsearch/textsearch/map",std::ios::binary);
    std::size_t line = 2;//your chosen line number
    std::size_t idx = (line-1) * sizeof(offset); //the calculated offset
    //seek into the map
    inmap.seekg(idx);
    //read the binary at that location
    inmap.read((char*)&offset, sizeof(offset));
    std::cout<<offset<<std::endl;
    //from here you just need to lookup from the data file in the same manor
    return 0;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With