I have a server with a RAID50 configuration of 24 drives (two groups of 12), and if I run:
dd if=/dev/zero of=ddfile2 bs=1M count=1953 oflag=direct
I get:
2047868928 bytes (2.0 GB) copied, 0.805075 s, 2.5 GB/s
But if I run:
dd if=/dev/zero of=ddfile2 bs=1M count=1953
I get:
2047868928 bytes (2.0 GB) copied, 2.53489 s, 808 MB/s
I understand that O_DIRECT causes the page cache to be bypassed. But as I understand it, bypassing the page cache basically means avoiding a memcpy. Testing on my desktop with the bandwidth tool, I see a worst-case sequential memory write bandwidth of 14 GB/s, and I imagine that on the newer, much more expensive server the bandwidth must be even better. So why would an extra memcpy cause a >2x slowdown? Is there really a lot more involved when using the page cache? Is this atypical?
In the oflag=direct case:
More generally, that giant block size (1 MByte) is likely bigger than the RAID's block size, so the I/O will be split up within the kernel and the smaller pieces submitted in parallel. It is therefore already big enough that the coalescing you get from buffered writeback with tiny I/Os won't be worth much. (The exact point at which the kernel starts splitting I/Os depends on a number of factors. Further, while RAID stripe sizes can be larger than 1 MByte, the kernel isn't always aware of this for hardware RAID. In the case of software RAID the kernel can sometimes optimize for stripe size: e.g. the kernel I'm on knows the md0 device has a 4 MByte stripe size and expresses a hint that it prefers I/O of that size via /sys/block/md0/queue/optimal_io_size.)
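To see which size hints the kernel exposes for a particular block device, the queue attributes under sysfs can be read directly. This is only a minimal sketch, assuming a Linux system: the function name show_io_hints and its optional second argument (an alternative sysfs root, there purely so it can be pointed at a test tree) are my own inventions, and md0 is simply the device on my machine.

```shell
#!/bin/sh
# Print the I/O size hints the kernel exposes for a block device.
# Usage: show_io_hints DEVICE [SYSFS_ROOT]
# SYSFS_ROOT defaults to /sys (hypothetical parameter, only useful for testing).
show_io_hints() {
    dev=$1
    root=${2:-/sys}
    # minimum_io_size / optimal_io_size are in bytes, max_sectors_kb is in KiB.
    for attr in minimum_io_size optimal_io_size max_sectors_kb; do
        path="$root/block/$dev/queue/$attr"
        if [ -r "$path" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$path")"
        else
            printf '%s=unavailable\n' "$attr"
        fi
    done
}

# Example: show_io_hints md0
```

An optimal_io_size of 0 means the device did not report a preference, which is common for hardware RAID volumes.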
Given all the above, IF you were maxing out a single CPU during the original buffered copy AND your workload doesn't benefit much from caching/coalescing BUT the disk could handle more throughput THEN doing the O_DIRECT copy should go faster as there's more CPU time available for userspace/servicing disk I/Os due to the reduction in kernel overhead.
So why would an extra memcpy cause a >2x slowdown? Is there really a lot more involved when using the page cache?
It's not just an extra memcpy per I/O that is involved - think about all the extra cache machinery that must be maintained. There is a nice explanation about how copying a buffer to the kernel isn't instantaneous and how page pressure can slow things down in an answer to the Linux async (io_submit) write v/s normal (buffered) write question. However, unless your program can generate data fast enough AND the CPU is so overloaded it can't feed the disk quickly enough then it usually doesn't show up or matter.
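One way to watch part of that cache machinery at work is to look at how much dirty data the kernel is holding while a buffered write is running. A rough sketch, assuming Linux's /proc/meminfo; the function name dirty_kb and its optional file argument are hypothetical conveniences rather than anything standard.

```shell
#!/bin/sh
# Print the Dirty and Writeback counters (in kB) from /proc/meminfo.
# Dirty     = data in the page cache that has not been written out yet.
# Writeback = data that is being written to disk right now.
# The optional argument is an alternative meminfo-style file (useful for testing).
dirty_kb() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { print $1, $2 }' "${1:-/proc/meminfo}"
}

# Example: sample once a second while a buffered dd or fio run is in progress.
#   while sleep 1; do dirty_kb; done
```

During a buffered run the Dirty figure should climb and then drain, whereas with O_DIRECT it should stay roughly flat because the data never sits in the page cache.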
Is this atypical?
No, your result is quite typical with the sort of workload you were using. I'd imagine it would be a very different outcome if the blocksize were tiny (e.g. 512 bytes) though.
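If you want to try the tiny block size case yourself, something along these lines should do. It is a sketch that assumes GNU dd; the function name dd_write, the file name and the counts are made up for illustration. Note that conv=fsync makes dd flush the file before it reports, so data still sitting in the page cache is included in the timing.

```shell
#!/bin/sh
# Write COUNT blocks of BS bytes of zeroes to FILE with dd and print dd's summary line.
# MODE is "buffered" (default) or "direct" (adds oflag=direct, i.e. O_DIRECT).
dd_write() {
    file=$1 bs=$2 count=$3 mode=${4:-buffered}
    flags="conv=fsync"
    if [ "$mode" = direct ]; then
        flags="$flags oflag=direct"
    fi
    # dd prints its statistics on stderr; the last line is the throughput summary.
    # $flags is left unquoted on purpose so that it splits into separate arguments.
    LC_ALL=C dd if=/dev/zero of="$file" bs="$bs" count="$count" $flags 2>&1 | tail -n 1
}

# Example (hypothetical file name and sizes):
#   dd_write ddfile_small 512 200000 buffered
#   dd_write ddfile_small 512 200000 direct
```

Be aware that not every filesystem accepts O_DIRECT, so the direct variant may fail with "Invalid argument" on some of them.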
Let's compare some of fio's output to help us understand this:
$ fio --bs=1M --size=20G --rw=write --filename=zeroes --name=buffered_1M_no_fsync
buffered_1M_no_fsync: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=2511MiB/s][r=0,w=2510 IOPS][eta 00m:00s]
buffered_1M_no_fsync: (groupid=0, jobs=1): err= 0: pid=25408: Sun Aug 25 09:10:31 2019
  write: IOPS=2100, BW=2100MiB/s (2202MB/s)(20.0GiB/9752msec)
[...]
  cpu          : usr=2.08%, sys=97.72%, ctx=114, majf=0, minf=11
[...]
Disk stats (read/write):
    md0: ios=0/3, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
So using buffering we wrote at about 2.1 GBytes/s but used up a whole CPU to do so. However, the block device (md0) says it barely saw any I/O (ios=0/3, i.e. only three write I/Os), which likely means most of the I/O was cached in RAM! As this particular machine could easily buffer 20 GBytes in RAM, we shall do another run with end_fsync=1. This forces any data that was only in the kernel's RAM cache at the end of the run to be pushed to disk, ensuring we record the time it took for all the data to actually reach non-volatile storage:
$ fio --end_fsync=1 --bs=1M --size=20G --rw=write --filename=zeroes --name=buffered_1M
buffered_1M: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]      
buffered_1M: (groupid=0, jobs=1): err= 0: pid=41884: Sun Aug 25 09:13:01 2019
  write: IOPS=1928, BW=1929MiB/s (2023MB/s)(20.0GiB/10617msec)
[...]
  cpu          : usr=1.77%, sys=97.32%, ctx=132, majf=0, minf=11
[...]
Disk stats (read/write):
    md0: ios=0/40967, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/2561, aggrmerge=0/2559, aggrticks=0/132223, aggrin_queue=127862, aggrutil=21.36%
OK, now the speed has dropped to about 1.9 GBytes/s and we still use all of a CPU, but the disks in the RAID device claim they had capacity to go faster (aggrutil=21.36%). Next up direct I/O:
$ fio --end_fsync=1 --bs=1M --size=20G --rw=write --filename=zeroes --direct=1 --name=direct_1M 
direct_1M: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=3242MiB/s][r=0,w=3242 IOPS][eta 00m:00s]
direct_1M: (groupid=0, jobs=1): err= 0: pid=75226: Sun Aug 25 09:16:40 2019
  write: IOPS=2252, BW=2252MiB/s (2361MB/s)(20.0GiB/9094msec)
[...]
  cpu          : usr=8.71%, sys=38.14%, ctx=20621, majf=0, minf=83
[...]
Disk stats (read/write):
    md0: ios=0/40966, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/5120, aggrmerge=0/0, aggrticks=0/1283, aggrin_queue=1, aggrutil=0.09%
Going direct we use just under 50% of a CPU to do 2.2 GBytes/s (but notice how I/Os weren't merged and how we did far more userspace/kernel context switches). If we were to push more I/O per syscall things change:
$ fio --bs=4M --size=20G --rw=write --filename=zeroes --name=buffered_4M_no_fsync
buffered_4M_no_fsync: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=2390MiB/s][r=0,w=597 IOPS][eta 00m:00s]
buffered_4M_no_fsync: (groupid=0, jobs=1): err= 0: pid=8029: Sun Aug 25 09:19:39 2019
  write: IOPS=592, BW=2370MiB/s (2485MB/s)(20.0GiB/8641msec)
[...]
  cpu          : usr=3.83%, sys=96.19%, ctx=12, majf=0, minf=1048
[...]
Disk stats (read/write):
    md0: ios=0/4667, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/292, aggrmerge=0/291, aggrticks=0/748, aggrin_queue=53, aggrutil=0.87%
$ fio --end_fsync=1 --bs=4M --size=20G --rw=write --filename=zeroes --direct=1 --name=direct_4M
direct_4M: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=5193MiB/s][r=0,w=1298 IOPS][eta 00m:00s]
direct_4M: (groupid=0, jobs=1): err= 0: pid=92097: Sun Aug 25 09:22:39 2019
  write: IOPS=866, BW=3466MiB/s (3635MB/s)(20.0GiB/5908msec)
[...]
  cpu          : usr=10.02%, sys=44.03%, ctx=5233, majf=0, minf=12
[...]
Disk stats (read/write):
    md0: ios=0/4667, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/292, aggrmerge=0/291, aggrticks=0/748, aggrin_queue=53, aggrutil=0.87%
With a massive block size of 4 MBytes buffered I/O became bottlenecked at "just" 2.3 GBytes/s (even when we didn't force the cache to be flushed) due to the fact that there's no CPU left. Direct I/O used around 55% of a CPU and managed to reach 3.5 GBytes/s so it was roughly 50% faster than buffered I/O.
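The arithmetic behind those figures can be checked with a couple of awk one-liners. The helper names ratio and mib_per_cpu are hypothetical; the numbers are taken straight from the two 4 MByte fio runs above (BW in MiB/s, usr and sys in percent).

```shell
#!/bin/sh
# ratio A B: print A / B to two decimal places.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# mib_per_cpu BW USR SYS: MiB/s delivered per 1% of a CPU (usr + sys).
mib_per_cpu() {
    awk -v bw="$1" -v u="$2" -v s="$3" 'BEGIN { printf "%.1f\n", bw / (u + s) }'
}

ratio 3466 2370                # direct vs buffered throughput: prints 1.46
mib_per_cpu 2370 3.83 96.19    # buffered_4M_no_fsync: prints 23.7
mib_per_cpu 3466 10.02 44.03   # direct_4M: prints 64.1
```

So in that run direct I/O moved a bit under 1.5 times the data per second while getting well over twice as much throughput out of each percent of CPU time.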
Summary: Your I/O pattern doesn't really benefit from buffering (I/Os are huge, data is not being reused, I/O is streaming sequential) so you're in an optimal scenario for O_DIRECT being faster. See these slides by the original author of Linux's O_DIRECT (longer PDF document that contains an embedded version of most of the slides) for the original motivation behind it.