Reading binary files, Linux Buffer Cache

Tags:

c++

linux

io

I am busy writing something to test the read speeds for disk IO on Linux.

At the moment I have something like this to read the files:

Edited to change code to this:

  const int segsize = 1048576;
  char buffer[segsize];
  ifstream file;
  file.open(sFile.c_str());
  while(file.readsome(buffer,segsize)) {}

For foo.dat, which is 150GB, the first time I read it in, it takes around 2 minutes. However if I run it within 60 seconds of the first run, it will then take around 3 seconds to run. How is that possible? Surely the only place that could be read from that fast is the buffer cache in RAM, and the file is too big to fit in RAM.

The machine has 50GB of RAM, and the drive is an NFS mount with all the default settings. Where could I look to confirm that this file is actually being read at this speed? Is my code wrong? It appears to take a correct amount of time the first time the file is read.

Edited to Add: Found out that my files were only reading up to a random point. I've managed to fix this by changing segsize down to 1024 from 1048576. I have no idea why changing this allows the ifstream to read the whole file instead of stopping at a random point.

Thanks for the answers.

Salgar asked Mar 19 '26 08:03


2 Answers

On Linux, you can do this for a quick throughput test:

$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.863904 s, 243 MB/s

$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.0748273 s, 2.8 GB/s

$ sync && echo 3 > /proc/sys/vm/drop_caches

$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.919688 s, 228 MB/s

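Writing 3 to /proc/sys/vm/drop_caches (as root) drops the page cache along with dentries and inodes. Running sync first writes dirty pages back so that they can be dropped too. That is why the third run above is served from disk again rather than from RAM.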

Nicolas Viennot answered Mar 20 '26 21:03


  • in_avail doesn't give the length of the file, but a lower bound on what is available (in particular, once the buffer has been used, it returns the number of characters still available in the buffer). Its purpose is to tell you what can be read without blocking.

  • unsigned int is most probably unable to hold a length of more than 4GB, so the amount actually read may well be small enough to fit in the cache.

  • C++0x Stream Positioning may be interesting to you if you are using large files
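
The points above can be sketched in code. readsome, like in_avail, only deals with characters that are immediately available in the stream buffer, so a return value of 0 does not necessarily mean end of file, which would be consistent with the loop in the question stopping at a random point. A minimal sketch, assuming the goal is simply to read every byte and count them: it uses read together with gcount, keeps the buffer on the heap, and counts in a 64-bit integer so that totals above 4GB do not wrap. The name count_bytes and its chunk parameter are hypothetical and chosen for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads the whole file in fixed-size chunks and
// returns the number of bytes actually read (0 if the file cannot be opened).
std::uint64_t count_bytes(const std::string& path, std::size_t chunk = 1048576) {
    assert(chunk > 0);

    std::ifstream file(path.c_str(), std::ios::binary);
    if (!file) {
        return 0;
    }

    std::vector<char> buffer(chunk);  // heap allocation instead of a 1 MiB stack array
    std::uint64_t total = 0;          // 64-bit, so it can hold sizes above 4GB

    // read() fails on the final short chunk, but gcount() still reports
    // how many bytes that last call extracted, so they are counted too.
    while (file.read(buffer.data(), static_cast<std::streamsize>(buffer.size())) ||
           file.gcount() > 0) {
        total += static_cast<std::uint64_t>(file.gcount());
    }
    return total;
}
```

Comparing the returned count against the file size (for example from ls -l) is a quick way to confirm that a suspiciously fast run really read the whole file rather than a cached prefix of it.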

AProgrammer answered Mar 20 '26 22:03


