Why are 4 processes faster than 4 threads?

Question

void task1(void* arg) {
    static volatile long res = 1;
    for (long i = 0; i < 100000000; ++i) {
        res ^= (i + 1) * 3 >> 2;
    }
}

4 threads, working simultaneously, perform task1 193 times in 30 seconds. But 4 processes, working simultaneously, perform task1 348 times in 30 seconds. Why is there such a big difference? I tested it on Mac OS X 10.7.5 with an Intel Core i5 (4 logical cores). I expect the same difference on Windows and Linux.

David Rodríguez - dribeas · Accepted Answer

The res variable is static, which means that it is shared by all of the threads in the same process. In the case of four threads, each modification of res in one thread has to be made available to the other threads, which usually involves some sort of locking on the bus, invalidation of the copy held in the level 1 cache of every other CPU, and a reload on those CPUs.

In the case of four processes, the variable is not really shared: each process has its own copy, so the processes can truly run in parallel without interfering with each other.
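To make the "not shared between processes" point concrete, here is a minimal sketch assuming a POSIX system (fork and waitpid). The function name parent_value_after_child_write and the value 42 are made up for illustration and are not part of the question.

```cpp
#include <sys/wait.h>  // waitpid (POSIX)
#include <unistd.h>    // fork, _exit (POSIX)

static volatile long res = 1;

// Hypothetical helper: forks a child that overwrites its own copy of res,
// waits for the child to finish, and returns the value the parent still sees.
// Returns -1 if fork() fails.
long parent_value_after_child_write() {
    res = 1;
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        res = 42;   // changes only the child's copy of the variable
        _exit(0);   // leave the child immediately
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return res;     // the parent's copy is unaffected by the child's write
}
```

The parent still reads 1 after the child has written 42, because after fork each process has its own copy of the static variable.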

Note that the main difference is not thread/process, but the fact that in one case everyone accesses the same variable while in the other they access different ones. Also, in the case of threads, the real issue is not the performance, but the fact that the final result will probably be incorrect:

res ^= x;

That is not an atomic operation: the processor will load the old value of res, modify it in a register, and write it back. Without synchronization primitives, multiple threads can load the same value, modify it independently and write it back to the variable, in which case the work of some of the threads will be overwritten by the others. The end result will depend on the execution pattern of the different threads, not on the code of your program.
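One way to make the shared update well defined is an atomic read-modify-write. The following is a minimal sketch using std::atomic and std::thread from C++11, which the question does not use; the helper names xor_sequential and xor_shared_atomic and the iters parameter are assumptions added for illustration. It fixes correctness only: all threads still contend for the same cache line, so it is not expected to run any faster.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: the loop from the question, single-threaded,
// with a configurable iteration count.
long xor_sequential(long iters) {
    long r = 1;
    for (long i = 0; i < iters; ++i) {
        r ^= (i + 1) * 3 >> 2;
    }
    return r;
}

// Hypothetical helper: every thread XORs into one shared atomic variable.
long xor_shared_atomic(int threads, long iters) {
    std::atomic<long> res(1);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&res, iters] {
            for (long i = 0; i < iters; ++i) {
                // Atomic load-xor-store: no update from another thread is lost.
                res.fetch_xor((i + 1) * 3 >> 2, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return res.load();
}
```

Because XOR is commutative and associative, the atomic version gives the same result regardless of how the threads interleave. With an even number of threads each value is XORed in an even number of times, so the contributions cancel and res ends at its initial value.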

To simulate the non-sharing of the variables you will need to make sure that the threads access different cache lines. The simplest change is to drop the static qualifier from the variable, so that each thread will update a variable on its own stack, which will be at a different memory address than the variables of the other threads, and hopefully map to a different cache line. Another option is creating the four variables together, but adding enough padding between them so that they are spread across different cache lines:

struct padded_long {
    volatile unsigned long res;
    // Padding so that consecutive array elements are a full cache line apart.
    char padding[CACHE_LINE_SIZE - sizeof(unsigned long)]; // Find CACHE_LINE_SIZE in your processor documentation
};
void f(void *arg) {
    static padded_long res[4];
    // detect which thread is running based on the argument and use res[0]..res[3]
    // for the different threads
}

Mats Petersson · Answer

This is one variable for all threads within one process:

static volatile long res = 1;

So if you only run one thread in each of four processes, you have four different "res" variables that live in different parts of memory. In the threading case, "res" is the same variable for all four threads, so each time it gets updated, the other three processors have to invalidate (get rid of) their copies and fetch a new one from the processor that last updated it. This slows everything down. And if you actually want to update one variable per thread, I would suggest doing something like this:

void task1(void* arg) {
    // arg points to a long owned by this thread
    volatile long* res = static_cast<volatile long*>(arg);
    for (long i = 0; i < 100000000; ++i) {
        *res ^= (i + 1) * 3 >> 2;  // dereference: update the value, not the pointer
    }
}

and pass in a different long from a different section of memory (e.g. use new long to generate a unique address per thread).
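A caller for this could look like the sketch below. It assumes std::thread rather than whatever thread API the question used, and the names task1_n, run_per_thread and xor_sequential, as well as the extra iters parameter, are hypothetical additions to keep the example short. Separate new long allocations get unique addresses, but depending on the allocator they may still land on the same cache line, so explicit padding is the more reliable way to avoid sharing.

```cpp
#include <memory>
#include <thread>
#include <vector>

// Hypothetical variant of task1 with a configurable iteration count.
void task1_n(void* arg, long iters) {
    volatile long* res = static_cast<volatile long*>(arg);
    for (long i = 0; i < iters; ++i) {
        *res = *res ^ ((i + 1) * 3 >> 2);
    }
}

// Hypothetical helper: the same loop, single-threaded, for comparison.
long xor_sequential(long iters) {
    long r = 1;
    for (long i = 0; i < iters; ++i) {
        r ^= (i + 1) * 3 >> 2;
    }
    return r;
}

// Hypothetical helper: gives each thread its own heap-allocated long.
std::vector<long> run_per_thread(int threads, long iters) {
    std::vector<std::unique_ptr<long>> cells;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        cells.emplace_back(new long(1));  // unique address per thread
        pool.emplace_back(task1_n, static_cast<void*>(cells.back().get()), iters);
    }
    for (auto& th : pool) {
        th.join();
    }
    std::vector<long> out;
    for (const auto& cell : cells) {
        out.push_back(*cell);
    }
    return out;
}
```

Each thread finishes with the same value a single-threaded run would produce, because no other thread touches its long.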

Tags: c++, operating-system, multithreading, intel

dizel3d
