I am trying to understand CPU caches and cache lines using a C program, as I do for most C concepts. The program that I use is given below. I got the idea from the blog.
http://igoro.com/archive/gallery-of-processor-cache-effects/
Now the output of the below program on my machine is as given below. This is the output with CFLAGS="-g -O0 -Wall".
./cache
CPU time for loop 1 0.460000 secs.
CPU time for loop 2 (j = 8) 0.050000 secs.
CPU time for loop 2 (j = 9) 0.050000 secs.
CPU time for loop 2 (j = 10) 0.050000 secs.
CPU time for loop 2 (j = 11) 0.050000 secs.
CPU time for loop 2 (j = 12) 0.040000 secs.
CPU time for loop 2 (j = 13) 0.050000 secs.
CPU time for loop 2 (j = 14) 0.050000 secs.
CPU time for loop 2 (j = 15) 0.040000 secs.
CPU time for loop 2 (j = 16) 0.050000 secs.
CPU time for loop 2 (j = 17) 0.040000 secs.
CPU time for loop 2 (j = 18) 0.050000 secs.
CPU time for loop 2 (j = 19) 0.040000 secs.
CPU time for loop 2 (j = 20) 0.040000 secs.
CPU time for loop 2 (j = 21) 0.040000 secs.
CPU time for loop 2 (j = 22) 0.040000 secs.
CPU time for loop 2 (j = 23) 0.040000 secs.
CPU time for loop 2 (j = 24) 0.030000 secs.
CPU time for loop 2 (j = 25) 0.040000 secs.
CPU time for loop 2 (j = 26) 0.030000 secs.
CPU time for loop 2 (j = 27) 0.040000 secs.
CPU time for loop 2 (j = 28) 0.030000 secs.
CPU time for loop 2 (j = 29) 0.040000 secs.
CPU time for loop 2 (j = 30) 0.030000 secs.
CPU time for loop 2 (j = 31) 0.030000 secs.
The output with optimization (CFLAGS=-g -O3 -Wall)
CPU time for loop 1 0.130000 secs.
CPU time for loop 2 (j = 8) 0.040000 secs.
CPU time for loop 2 (j = 9) 0.050000 secs.
CPU time for loop 2 (j = 10) 0.050000 secs.
CPU time for loop 2 (j = 11) 0.040000 secs.
CPU time for loop 2 (j = 12) 0.040000 secs.
CPU time for loop 2 (j = 13) 0.050000 secs.
CPU time for loop 2 (j = 14) 0.050000 secs.
CPU time for loop 2 (j = 15) 0.040000 secs.
CPU time for loop 2 (j = 16) 0.040000 secs.
CPU time for loop 2 (j = 17) 0.050000 secs.
CPU time for loop 2 (j = 18) 0.040000 secs.
CPU time for loop 2 (j = 19) 0.050000 secs.
CPU time for loop 2 (j = 20) 0.040000 secs.
CPU time for loop 2 (j = 21) 0.040000 secs.
CPU time for loop 2 (j = 22) 0.040000 secs.
CPU time for loop 2 (j = 23) 0.030000 secs.
CPU time for loop 2 (j = 24) 0.040000 secs.
CPU time for loop 2 (j = 25) 0.030000 secs.
CPU time for loop 2 (j = 26) 0.040000 secs.
CPU time for loop 2 (j = 27) 0.030000 secs.
CPU time for loop 2 (j = 28) 0.030000 secs.
CPU time for loop 2 (j = 29) 0.030000 secs.
CPU time for loop 2 (j = 30) 0.030000 secs.
CPU time for loop 2 (j = 31) 0.030000 secs.
It is pointed out in the blog that
The first loop multiplies every value in the array by 3, and the second loop multiplies > only every 16-th. The second loop only does about 6% of the work of the first loop, but on modern machines, the two for-loops take about the same time: 80 and 78 ms respectively on my machine.
This doesn't seem to be the case on my machine. As you can see, the time for the execution of
loop 1 is 0.46 seconds.
and that for
loop 2 is 0.03 seconds or 0.04 seconds or 0.05 seconds
for different values of j.
Why is this happening?
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>
#define MAX_SIZE (64*1024*1024)
int main()
{
clock_t start, end;
double cpu_time;
int i = 0;
int j = 0;
/* MAX_SIZE array is too big for stack. This is an unfortunate rough edge of the way the stack works.
It lives in a fixed-size buffer, set by the program executable's configuration according to the
operating system, but its actual size is seldom checked against the available space. */
/* int arr[MAX_SIZE]; */
int *arr = (int*)malloc(MAX_SIZE * sizeof(int));
/* CPU clock ticks count start */
start = clock();
/* Loop 1 */
for (i = 0; i < MAX_SIZE; i++)
arr[i] *= 3;
/* CPU clock ticks count stop */
end = clock();
cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("CPU time for loop 1 %.6f secs.\n", cpu_time);
for (j = 8 ; j < 32 ; j++)
{
/* CPU clock ticks count start */
start = clock();
/* Loop 2 */
for (i = 0; i < MAX_SIZE; i += j)
arr[i] *= 3;
/* CPU clock ticks count stop*/
end = clock();
cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("CPU time for loop 2 (j = %d) %.6f secs.\n", j, cpu_time);
}
return 0;
}
I modified the code a little bit. First a summary of modifications:
malloc doesn't actually map in the memory into the process' address space, so in the first loop, we get some extra overhead for mapping in the memory). It also ensures that the CPU is running at "full speed" when performing the other loops, rather than being that "power-save" speed.j value more rapidly in the second loop by multiplying by 2 ( <<= 1 is the same as *= 2 in this case - old habit to use shift)+= 3 instead of *= 3. (Multiply is a little slower than +=, but makes little difference in this case.loop3 which does exactly the same number of operations as loop2, but over a small memory range [using & with a 2n-1 value to restrict the range].I compiled the code with gcc -Wall -O3 -sdc=c99, using version 4.6.3 and running on a quad-core Athlon 965, Fedora Core 16 x86-64 and 16 GB of RAM.
Here's the code:
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>
#define MAX_SIZE (512*1024*1024)
int main()
{
clock_t start, end;
double cpu_time;
int i = 0;
int j = 0;
/* MAX_SIZE array is too big for stack.This is an unfortunate rough edge of the way the stack works.
It lives in a fixed-size buffer, set by the program executable's configuration according to the
operating system, but its actual size is seldom checked against the available space. */
/* int arr[MAX_SIZE]; */
int *arr = (int*)malloc(MAX_SIZE * sizeof(int));
/* CPU clock ticks count start */
for(int k = 0; k < 3; k++)
{
start = clock();
/* Loop 1 */
for (i = 0; i < MAX_SIZE; i++)
arr[i] += 3;
/* CPU clock ticks count stop */
end = clock();
cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("CPU time for loop 1 %.6f secs.\n", cpu_time);
}
for (j = 1 ; j <= 1024 ; j <<= 1)
{
/* CPU clock ticks count start */
start = clock();
/* Loop 2 */
for (i = 0; i < MAX_SIZE; i += j)
arr[i] += 3;
/* CPU clock ticks count stop */
end = clock();
cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("CPU time for loop 2 (j = %d) %.6f secs.\n", j, cpu_time);
}
// Third loop, performing the same operations as loop 2,
// but only touching 16KB of memory
for (j = 1 ; j <= 1024 ; j <<= 1)
{
/* CPU clock ticks count start */
start = clock();
/* Loop 3 */
for (i = 0; i < MAX_SIZE; i += j)
arr[i & 0xfff] += 3;
/* CPU clock ticks count stop */
end = clock();
cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("CPU time for loop 3 (j = %d) %.6f secs.\n", j, cpu_time);
}
return 0;
}
And the results:
CPU time for loop 1 2.950000 secs.
CPU time for loop 1 0.630000 secs.
CPU time for loop 1 0.630000 secs.
CPU time for loop 2 (j = 1) 0.780000 secs.
CPU time for loop 2 (j = 2) 0.700000 secs.
CPU time for loop 2 (j = 4) 0.610000 secs.
CPU time for loop 2 (j = 8) 0.540000 secs.
CPU time for loop 2 (j = 16) 0.560000 secs.
CPU time for loop 2 (j = 32) 0.280000 secs.
CPU time for loop 2 (j = 64) 0.140000 secs.
CPU time for loop 2 (j = 128) 0.090000 secs.
CPU time for loop 2 (j = 256) 0.060000 secs.
CPU time for loop 2 (j = 512) 0.030000 secs.
CPU time for loop 2 (j = 1024) 0.040000 secs.
CPU time for loop 3 (j = 1) 0.470000 secs.
CPU time for loop 3 (j = 2) 0.240000 secs.
CPU time for loop 3 (j = 4) 0.120000 secs.
CPU time for loop 3 (j = 8) 0.050000 secs.
CPU time for loop 3 (j = 16) 0.030000 secs.
CPU time for loop 3 (j = 32) 0.020000 secs.
CPU time for loop 3 (j = 64) 0.010000 secs.
CPU time for loop 3 (j = 128) 0.000000 secs.
CPU time for loop 3 (j = 256) 0.000000 secs.
CPU time for loop 3 (j = 512) 0.000000 secs.
CPU time for loop 3 (j = 1024) 0.000000 secs.
As you can see, the first few of the loop2 takes the same amount of time - once we get to 32 the time starts to go down, because the processor doesn't need EVERY cache-line, but in the loop3 case, the number of operations in each loop affects the total time rather directly.
Edit:
The multiply (*=3) vs add (+=3) doesn't really make that much difference EXCEPT in the loop3 case, where it adds approximately 30% on the loop time.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With