Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to read 64bit int file with markers the fastest?

Tags:

c++

c++11

fstream

I'm trying to read some data from a file to fill the following data structures:

typedef uint64_t t_feat; 

typedef pair<vector<t_feat>, int> t_featfam;

Each log file contains several such families, so I want to save them all in a vector. The logfiles that have a very simple formatting:

line = "-": start a new family

line = "#": family ends here

line = 64bit unsigned integer (as string): add this value to the family

line = "!": mark the following integer as important (exactly one is marked like this in each family), the marking is done by setting the the second value of the family to the index of the important element

There are no mistakes in the files, so every ! is followed by an integer, all families start and end properly and there are no additional spaces or anything (only exception is a possible empty line at the file end).

Right now I'm using the following code:

void read_data_from_file(const string &fname, vector<t_featfam> &data)
{
    ifstream f;
    f.open(fname, ios::in);
    while (!f.eof())
    {
        string currentline;
        getline(f, currentline);
        if (currentline == "" || currentline == "#")
            continue;
        else if (currentline == "-")
            data.push_back(t_featfam());
        else if (currentline == "!")
            data.back().second = data.back().first.size();
        else
        {
            istringstream iss(currentline);
            t_feat value;
            iss >> value;
            data.back().first.push_back(value);
        }
    }
}

This works, but feels horribly inefficient and probably is... If it would be just numbers, I would certainly use only fstreams, but as it is, I'm not sure how to do that properly. Can anyone hint me the right direction? This should be possible somehow. I'm using Visual Studio, and don't mind VS-specific solutions, but don't wanna include boost.

edit2: now a really working version, using steves codes, and improved by the ideas from luk32... 4 times faster then aboves code...

void read_data_from_file(const string &fname, vector<t_featfam> &data)
{
    ifstream f;
    f.open(fname, ios::in);
    char* currentline = new char [30];
    while (!f.eof())
    {
        f.getline(currentline, 30);
        switch (currentline[0])
        {
        case '\0':
        case '#':
            break;
        case '-':
            data.push_back(t_featfam());
            break;
        case '!':
            data.back().second = data.back().first.size();
            break;
        default:
            data.back().first.push_back(stoull(currentline));
            break;
        }
    }
    delete currentline;
}
like image 202
Mahrgell Avatar asked Dec 29 '25 12:12

Mahrgell


2 Answers

I would probably do something along the lines of the following:

  • move currentline outside the loop - prevents reallocs every time around the loop
  • use a switch statement on the first char of currentline so we jump instead of multiple if/else statements
  • use std::stoull instead of a stringstream to convert currentline to uint64_t

Here's the function (not tested to see if it compiles, just wrote it up)

void read_data_from_file(const string &fname, vector<t_featfam> &data)
{
    ifstream f;
    f.open(fname, ios::in);
    string currentline;
    while (!f.eof())
    {
        getline(f, currentline);
        switch (currentline.c_str()[0])
        {
            case '\0':
            case '#':
                break;
            case '-':
                data.push_back(t_featfam());
                break;
            case '!':
                data.back().second = data.back().first.size();
                break;
            default:
                data.back().first.push_back(std::stoull(currentline));
                break;
        }
    }
}
like image 92
Steve Lorimer Avatar answered Jan 01 '26 03:01

Steve Lorimer


Most of the time is lost in the memory allocations. You have one allocation when you call getline() and another one when you construct the istringstream. Each allocation costs roughly 250 cycles on my system. So, you can save roughly 500 cycles per line that you read.

You can eliminate the allocations altogether if you use mmap() to map the entire file into your address space. Once you have everything in a single large array of char, you can relatively easily parse it without any need to copy lines out of that large array.

like image 42
cmaster - reinstate monica Avatar answered Jan 01 '26 03:01

cmaster - reinstate monica



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!