In my application I'm trying to merge sorted files (keeping them sorted of course), so I have to iterate through each element in both files to write the minimal to the third one. This works pretty much slow on big files, as far as I don't see any other choice (the iteration has to be done) I'm trying to optimize file loading. I can use some amount of RAM, which I can use for buffering. I mean instead of reading 4 bytes from both files every time I can read once something like 100Mb and work with that buffer after that, until there will be no element in buffer, then I'll refill the buffer again. But I guess ifstream is already doing that, will it give me more performance and is there any reason? If fstream does, maybe I can change size of that buffer?
added
My current code looks like that (pseudocode)
// this is done in loop
int i1 = input1.read_integer();
int i2 = input2.read_integer();
if (!input1.eof() && !input2.eof())
{
   if (i1 < i2)
   {
      output.write(i1);
      input2.seek_back(sizeof(int));
   } else
      input1.seek_back(sizeof(int));
      output.write(i2);
   }
} else {
   if (input1.eof())
      output.write(i2);
   else if (input2.eof())
      output.write(i1);
}
What I don't like here is
Can you suggest improvement for that?
Thanks.
Without getting into the discussion on stream buffers, you can get rid of the seek_back and generally make the code much simpler by doing:
using namespace std;
merge(istream_iterator<int>(file1), istream_iterator<int>(),
           istream_iterator<int>(file2), istream_iterator<int>(),
           ostream_iterator<int>(cout));
Added binary capability
#include <algorithm>
#include <iterator>
#include <fstream>
#include <iostream>
struct BinInt
{
    int value;
    operator int() const { return value; }
    friend std::istream& operator>>(std::istream& stream, BinInt& data)
    {
        return stream.read(reinterpret_cast<char*>(&data.value),sizeof(int));
    }
};
int main()
{
    std::ifstream   file1("f1.txt");
    std::ifstream   file2("f2.txt");
    std::merge(std::istream_iterator<BinInt>(file1), std::istream_iterator<BinInt>(),
               std::istream_iterator<BinInt>(file2), std::istream_iterator<BinInt>(),
               std::ostream_iterator<int>(std::cout));
}
In decreasing order of performance (best first):
ReadFile or read calls.fread into a large bufferifstream.read into a large bufferifstream and extractorsA program like this should be I/O bound, meaning it should be spending at least 80% of it's time waiting for completion of reading or writing a buffer, and if the buffers are reasonably big, it should be keeping the disk heads busy. That's what you want.
Don't assume it is I/O bound, without proof. A way to prove it is by taking several stackshots. If it is, most of the samples will show the program waiting for I/O completion.
It is possible that it is not I/O bound, meaning you may find other things going on in some of the samples that you never expected. If so, then you know what to fix to speed it up. I have seen some code like this spending much more time than necessary in the merge loop, testing for end-of-file, getting data to compare, etc. for example.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With