An obscure lock-up between two threads seems to be related to the Global Interpreter Lock or some other "behind-the-scenes-lock", and I am at a loss how to continue troubleshooting. Any tips how to eliminate the lock-up would be appreciated.
The issue reproduces (erratically and somewhat random) in a larger set of code. The code is strictly python. Python version is 2.6.5 (on Linux). Hours of troubleshooting has reduced the problem when the lock-up occurs to the following:
The offensive call in #5 is the function unicode.encode, which should be non-blocking. The following code in Thread 1 at the position where the thread locks will (as expected) print 'A' and 'B':
print('A')
print('B')
However, the following will just print 'A' and block the thread:
print('A')
u'hello'.encode('utf8') # This dummy (non-blocking) call locks up Thread 1
print('B')
This makes no sense to me at all. There is no logical dead-lock condition existing between the two threads. Thread 1 is being blocked by a non-blocking library call that is not in any way interfering with Thread 2, which is just silently waiting to acquire an RLock. The only cause I can think of for Thread 1 being blocked is that it is waiting for the GIL.
Any thoughts how to troubleshoot this further, or any mechanisms to somehow control or manipulate GIL operation as a workaround?
Edit: Some additional information in response to samplebias (and thank you for the reply). I had problems getting a trace as the issue seems to be very sensitive to anything that may disturb the timing between the two threads. However running strace with only the -f option, after a few iterations I was able to get a trace.
Thread 1 contains these three debug statements which should print two lines 'CHECK_IN' and 'CHECK_TEST' to the console:
print('CHECK IN')#DEBUG
u'hello'.encode('utf8')
print('CHECK TEST')#DEBUG
Here is the last page of the strace:
8605  mmap2(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb753d000
8605  mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xb6d3c000
8605  mprotect(0xb6d3c000, 4096, PROT_NONE) = 0
8605  clone(child_stack=0xb753c494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb753cbd8, {entry_number:6, base_addr:0xb753cb70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb753cbd8) = 8606
8606  set_robust_list(0xb753cbe0, 0xc <unfinished ...>
8605  futex(0xa239138, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... set_robust_list resumed> )   = 0
8606  futex(0xa239138, FUTEX_WAKE_PRIVATE, 1) = 1
8605  <... futex resumed> )             = 0
8606  gettimeofday( <unfinished ...>
8605  futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... gettimeofday resumed> {1301528807, 326496}, NULL) = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605  <... futex resumed> )             = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605  gettimeofday( <unfinished ...>
8606  <... futex resumed> )             = 0
8605  <... gettimeofday resumed> {1301528807, 326821}, NULL) = 0
8606  futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8605  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8606  <... futex resumed> )             = -1 EAGAIN (Resource temporarily unavailable)
8605  <... futex resumed> )             = 0
8606  gettimeofday( <unfinished ...>
8605  futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... gettimeofday resumed> {1301528807, 326908}, NULL) = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605  <... futex resumed> )             = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605  futex(0xa1b0d70, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... futex resumed> )             = 0
8606  stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2225, ...}) = 0
8606  fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
8606  mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb6d3b000
8606  write(1, "CHECK IN\n", 9)         = 9
8606  futex(0xa115270, FUTEX_WAIT_PRIVATE, 0, NULL
And the output of the three lines of code is just the following, before the program locks up:
CHECK IN
So the strace shows how Thread 1 (#8606) writes the 'CHECK_IN' string, and when reaching the unicode.encode call enters a waiting state which never returns.
By the way I am making a couple future imports in all modules to keep with some newer python conventions ...
from __future__ import print_function, unicode_literals
... but I cannot see they should make any difference - especially as the u'hello' string is called out explicitly as a unicode string.
The Python Global Interpreter Lock or GIL, in simple words, is a mutex (or a lock) that allows only one thread to hold the control of the Python interpreter. This means that only one thread can be in a state of execution at any point in time.
In other words, the GIL is prevents corruption of state. Python programs should never produce a segmentation fault, because only memory safe operations are permitted. The GIL extends this assurance to multi-threaded programs.
ParallelPython: if you really need to scale, ParallelPython provides a mechanism for parallel execution of python code for multiple cores and clusters. Use a different Python implementation (both Jython and IronPython run without a GIL and the PyPy STM branch may also work for your use case).
I can find nothing in the Python source which would cause unicode.encode() to block, and the dummy program I wrote to try to reproduce this runs as expected. You mentioned that Thread 1 has acquired more than 1 lock - have you eliminated those locks as the source of the lock-up?  
Does the test case below show the same lock-up in your environment?
import time
import threading
def worker(tid):
    _lock.acquire()
    if not tid:
        # wait for rest of threads to enter acquire
        time.sleep(0.5)    
    print('%d: A' % tid)
    u'hello'.encode('utf-8')
    print('%d: B' % tid)
    _lock.release()
def start(tid):
    th = threading.Thread(target=worker, args=(tid,))
    th.start()
    return th
_num = 2
_lock = threading.RLock()
workers = [start(n) for n in range(_num)]
while all(w.isAlive() for w in workers):
    time.sleep(1)
Output:
0: A
0: B
1: A
1: B
You can also run strace on your program to work out where the process is blocked. For example with the above script:
% strace -fTr -o trace.out python lockup.py
The -o trace.out flag tells strace to write the output to a file. You can omit that and strace will print to stderr.
The contents of trace.out should show you all of the system calls made by the program, with each line prefixed with thread id and the relative time between system calls. The end of the line will have the time spent inside that system call. I've annotated the last few syscalls with the corresponding Python code:
# thread 0 time.sleep(0.5) completes
24778      0.500124 <... select resumed> ) = 0 (Timeout) <0.500599>
# preparing to print()
24778      0.000071 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 <0.000017>
24778      0.000058 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8fe90a6000 <0.000018>
# print("0: A\n")..
24778      0.000079 write(1, "0: A\n", 5) = 5 <0.000023>
24778      0.000106 write(1, "0: B\n", 5) = 5 <0.000056>
# thread 0 _lock.release()
24778      0.000114 futex(0xe0f3c0, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000024>
24778      0.000108 madvise(0x7f8fe7266000, 8368128, MADV_DONTNEED) = 0 <0.000030>
# thread 0 exit
24778      0.000072 _exit(0)            = ?
# thread 1 _lock.acquire()
24779      0.000050 <... futex resumed> ) = 0 <0.500774>
# thread 1 print("1: A\n") and so on..
24779      0.000052 write(1, "1: A\n", 5) = 5 <0.000026>
24779      0.000086 write(1, "1: B\n", 5) = 5 <0.000026>
24779      0.000099 madvise(0x7f8fe6a65000, 8368128, MADV_DONTNEED) = 0 <0.000024>
24779      0.000064 _exit(0)            = ?
24777      0.499956 <... select resumed> ) = 0 (Timeout) <1.001138>
24777      0.000132 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f8fe8c7c8f0}, {0x4d9a90, [], SA_RESTORER, 0x7f8fe8c7c8f0}, 8) = 0 <0.000025>
# main thread process exit
24777      0.002349 exit_group(0)       = ?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With