I am using a pthread condition variable as a synchronization primitive in a ping-pong test. The test consists of two threads that execute alternately. Each thread writes to the other thread's memory and wakes it up with a signal, then waits on its own memory, which the other thread will write later. Here is my first version. It works fine when I loop the ping-pong test 10,000 times, but with 100,000 iterations it hangs occasionally, and with N=1,000,000 it hangs every time. I tried to debug by printing the loop counter on each iteration, but the program never hangs once the print statement is added, which is annoying. Here is the ping-pong test code:
    for(i=0; i<N+1; i++)
    {
        if(i==N)
        {
            pthread_cond_signal(&cond[dest]);
            break;
        }
        pthread_mutex_lock(&mutex[dest]);
        messages[dest]=my_rank;
        pthread_cond_signal(&cond[dest]);
        pthread_mutex_unlock(&mutex[dest]);
        pthread_mutex_lock(&mutex[my_rank]);
        while(pthread_cond_wait(&cond[my_rank], &mutex[my_rank]) && messages[my_rank]!=dest);
        messages[my_rank]=my_rank;
        pthread_mutex_unlock(&mutex[my_rank]);
        printf("rank=%ld i=%ld messages[%ld]=%ld\n", my_rank, i, my_rank, messages[my_rank]);
    }
Then I tried a second version, which works and never hangs even when I set N to 1,000,000. I changed from two mutexes to a single mutex shared by the two condition variables. I am not sure this is the right approach, but this version never hangs. Here is the code:
    for(i=0; i<N+1; i++)
    {
        if(i==N)
        {
            pthread_cond_signal(&cond[dest]);
            break;
        }
        pthread_mutex_lock(&mutex[0]);
        messages[dest]=my_rank;
        pthread_cond_signal(&cond[dest]);
        while(pthread_cond_wait(&cond[my_rank], &mutex[0]) && messages[my_rank]!=dest);
        messages[my_rank]=my_rank;
        pthread_mutex_unlock(&mutex[0]);
    }
I am very confused. Could somebody explain why the first version hangs but the second does not? Is it correct for two condition variables to share a single mutex?
Thanks.
Thanks to everyone, especially caf. Here is my final code, which works without hanging.
    for(i=0; i<N+1; i++)
    {
        pthread_mutex_lock(&mutex[dest]);
        messages[dest]=my_rank;
        pthread_cond_signal(&cond[dest]);
        pthread_mutex_unlock(&mutex[dest]);
        if(i!=N)
        {
            pthread_mutex_lock(&mutex[my_rank]);
            while(messages[my_rank]!=dest)
                pthread_cond_wait(&cond[my_rank], &mutex[my_rank]);
            messages[my_rank]=my_rank;
            pthread_mutex_unlock(&mutex[my_rank]);
        }
    }
The problem is in this line:
    while (pthread_cond_wait(&cond[my_rank], &mutex[my_rank]) && messages[my_rank]!=dest);
If the 'dest' thread gets scheduled after you unlock mutex[dest] and before you lock mutex[my_rank], it will set messages[my_rank] and signal the condition variable before this thread calls pthread_cond_wait(). A signal sent while no thread is waiting has no effect, so the wakeup is lost and this thread will wait forever.
The fix for this is very easy: test messages[my_rank] before waiting on the condition variable. You also don't want && here, because you always want to keep looping as long as messages[my_rank] != dest is true. pthread_cond_wait() returns zero on success, so with && the loop exits after the first successful wait regardless of the message value, which also leaves it exposed to spurious wakeups. So if you want to ignore errors from pthread_cond_wait() (as your original does, and this is perfectly fine if you aren't using error-checking or robust mutexes, since those are the only times pthread_cond_wait() is allowed to fail), use:
    while (messages[my_rank] != dest)
        pthread_cond_wait(&cond[my_rank], &mutex[my_rank]);
Your alternate version doesn't have this bug because the lock is held continuously between signalling the dest thread and waiting on the condition variable, so the dest thread cannot proceed until we are definitely waiting.
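To try the predicate-loop version end to end, here is a minimal sketch of a harness around it. The question does not show its declarations, so the array definitions, the `received` counter, and the `run_ping_pong()` helper are my assumptions, not your actual code. Note that messages[] has to start out as "no message pending" (each slot holding its own rank), otherwise thread 1 would see a stale 0 as a message from thread 0.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NUM_THREADS 2

/* Assumed declarations: the question does not show how these are defined. */
static pthread_mutex_t mutex[NUM_THREADS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};
static pthread_cond_t cond[NUM_THREADS] = {
    PTHREAD_COND_INITIALIZER, PTHREAD_COND_INITIALIZER
};
static long messages[NUM_THREADS];
static long received[NUM_THREADS];  /* hypothetical counter, for checking only */
static long N;

static void *ping_pong(void *arg)
{
    long my_rank = (long)(intptr_t)arg;
    long dest = 1 - my_rank;
    long i;

    for (i = 0; i < N + 1; i++) {
        /* Post a message to the other thread and wake it up. */
        pthread_mutex_lock(&mutex[dest]);
        messages[dest] = my_rank;
        pthread_cond_signal(&cond[dest]);
        pthread_mutex_unlock(&mutex[dest]);

        if (i != N) {
            pthread_mutex_lock(&mutex[my_rank]);
            /* Check the predicate first: if the message already arrived,
               we never wait, so an early signal cannot be lost. */
            while (messages[my_rank] != dest)
                pthread_cond_wait(&cond[my_rank], &mutex[my_rank]);
            messages[my_rank] = my_rank;  /* mark the slot as empty again */
            received[my_rank]++;
            pthread_mutex_unlock(&mutex[my_rank]);
        }
    }
    return NULL;
}

/* Hypothetical helper: runs both threads, returns total messages consumed. */
long run_ping_pong(long n)
{
    pthread_t threads[NUM_THREADS];
    long rank;

    N = n;
    for (rank = 0; rank < NUM_THREADS; rank++) {
        messages[rank] = rank;  /* own rank means "no message pending" */
        received[rank] = 0;
    }
    for (rank = 0; rank < NUM_THREADS; rank++) {
        int rc = pthread_create(&threads[rank], NULL, ping_pong,
                                (void *)(intptr_t)rank);
        assert(rc == 0);
        (void)rc;
    }
    for (rank = 0; rank < NUM_THREADS; rank++)
        pthread_join(threads[rank], NULL);
    return received[0] + received[1];
}
```

Calling run_ping_pong(1000000) from main() and building with -pthread should return 2,000,000 rather than hang, since each thread completes N waits.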
As for your supplementary question, "Is it correct for two condition variables to share a single mutex?":
Yes, this is allowed, but the converse is not (you cannot have two threads waiting on the same condition variable at the same time, using different mutexes).
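A common case of two condition variables sharing one mutex is a bounded queue, where producers wait on "not full" and consumers wait on "not empty", both guarded by the same lock. This is only an illustrative sketch: the queue, CAPACITY, and all function names are made up here and have nothing to do with your ping-pong code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define CAPACITY 4  /* arbitrary size chosen for the sketch */

/* One mutex protects the queue state; both condition variables use it. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static long buffer[CAPACITY];
static int count, head, tail;

static void put(long value)
{
    pthread_mutex_lock(&lock);
    while (count == CAPACITY)              /* predicate loop, as above */
        pthread_cond_wait(&not_full, &lock);
    buffer[tail] = value;
    tail = (tail + 1) % CAPACITY;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

static long get(void)
{
    long value;

    pthread_mutex_lock(&lock);
    while (count == 0)
        pthread_cond_wait(&not_empty, &lock);
    value = buffer[head];
    head = (head + 1) % CAPACITY;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return value;
}

static void *producer(void *arg)
{
    long n = (long)(intptr_t)arg;
    long i;

    for (i = 1; i <= n; i++)
        put(i);
    return NULL;
}

/* Hypothetical driver: pushes 1..n through the queue and sums them. */
long sum_through_queue(long n)
{
    pthread_t thread;
    long i, sum = 0;
    int rc;

    count = head = tail = 0;
    rc = pthread_create(&thread, NULL, producer, (void *)(intptr_t)n);
    assert(rc == 0);
    (void)rc;
    for (i = 0; i < n; i++)
        sum += get();
    pthread_join(thread, NULL);
    return sum;
}
```

Every wait on either condition variable passes the same mutex, which is the arrangement that is allowed.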