First, I'm using Cython 0.18 with Python 2.7.4. I'm experiencing a rather strange bug, and I'm not sure why. Here's the toy code:
    from cpython cimport bool

    cpdef unsigned int func(char *seq1, char *seq2, bool case_sensitive=True):
        print 'seq1', seq1, len(seq1)
        print 'seq2', seq2, len(seq2)
        print

        #take care of case sensitivity
        if not case_sensitive:
            #this is kinda hacky, but I've gotta assign the lowercased string to a Python object before assigning it back to char *
            #see http://docs.cython.org/src/userguide/language_basics.html#caveats-when-using-a-python-string-in-a-c-context
            temp = seq1.lower()
            seq1 = temp
            temp = seq2.lower()
            seq2 = temp

        print 'seq1', seq1, len(seq1)
        print 'seq2', seq2, len(seq2)
        print

        #trim common characters at the beginning of the words
        while len(seq1) > 0 and len(seq2) > 0 and seq1[0] == seq2[0]:
            temp = seq1[1:]
            seq1 = temp
            temp = seq2[1:]
            seq2 = temp

        print 'seq1', seq1, len(seq1)
        print 'seq2', seq2, len(seq2)
        print

        #handle degenerate cases
        if not seq1:
            return len(seq2)
        if not seq2:
            return len(seq1)
Here's a sample call:
    >>> from func import func
    >>> print func('TUESDAYs', 'tuesday', False)
Now, what I expect to see is the following:
    seq1 TUESDAYs 8
    seq2 tuesday 7

    seq1 tuesdays 8
    seq2 tuesday 7

    seq1 s 1
    seq2 0

    1
But what I actually see is this:
    seq1 TUESDAYs 8
    seq2 tuesday 7

    seq1 tuesdays 8
    seq2 tuesday 7

    seq1 stdout 6
    seq2 tuesday 7

    0
What the hell is going on here? First of all, why is the word "stdout" showing up at all? Why am I not getting the output I expect? Is this a Cython bug, or am I just missing something trivial here?
The problem is in all of the cases like this:
    temp = seq1.lower()
    seq1 = temp
    temp = seq2.lower()
The reason you need to do this dance instead of just seq1 = seq1.lower() (as you pointed out in your question) is explained in the Cython docs under "Caveats when using a Python string in a C context".
But what you're doing isn't correct. It's just good enough to trick Cython into thinking it's correct, so it compiles garbage.
Let's step through line by line:
    temp = seq1.lower()
This creates a str out of seq1, calls its lower(), and stores the result in temp.
    seq1 = temp
This makes seq1 into a pointer to the internal buffer of the str object in temp. As the docs specifically say:
"It is then your responsibility to hold the reference p for as long as necessary."
    temp = seq2.lower()
This does the same thing for seq2, and stores the result in temp. As a consequence, it releases the old value of temp, which was the only reference you had to that str. CPython uses reference counting, so the object is freed immediately. That means seq1 is now pointing at the internal buffer of a freed object.
The first two times, you apparently get lucky, and that buffer doesn't get reused. But eventually, in the while loop, it fails, the buffer gets reused, and you end up with a pointer to some other string buffer.
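The lifetime problem can be sketched in plain Python without touching raw memory. This is only an analogy: FakeStr and rebind_temp are hypothetical names, and a weak reference stands in for the char* because, like a borrowed pointer, it does not keep the object alive. The immediate release relies on CPython's reference counting.

```python
import weakref


class FakeStr(object):
    """Hypothetical stand-in for the str whose buffer a char* would point into."""

    def __init__(self, text):
        self.text = text


def rebind_temp():
    temp = FakeStr("tuesdays")
    # Like a char*, a weak reference does not keep the object alive.
    borrowed = weakref.ref(temp)
    alive_before = borrowed() is not None

    # Rebinding temp drops the only strong reference to the first object.
    temp = FakeStr("tuesday")
    alive_after = borrowed() is not None
    return alive_before, alive_after
```

On CPython, rebind_temp() returns (True, False): the first object is gone as soon as temp is rebound. A weak reference reports that safely, whereas a char* just keeps pointing at whatever ends up in that memory.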
So, how do you solve this?
Well, you could keep all those intermediate references around as long as they're needed.
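As a rough illustration of "keeping the references around", the sketch below gives each intermediate object its own name instead of reusing temp. The names (FakeStr, keep_both_alive, lowered1, lowered2) are made up for this example, and the weak references again only play the role of the char* pointers. In the real Cython function, the equivalent would be to hold each lowered or sliced str in its own Python variable for as long as the pointer is used.

```python
import weakref


class FakeStr(object):
    """Hypothetical stand-in for the str a char* would borrow from."""

    def __init__(self, text):
        self.text = text


def keep_both_alive():
    lowered1 = FakeStr("tuesdays")
    borrowed1 = weakref.ref(lowered1)

    # A separate name, so the first object is not released here.
    lowered2 = FakeStr("tuesday")
    borrowed2 = weakref.ref(lowered2)

    # Both owners are still in scope, so both borrowed references are valid.
    return borrowed1() is not None and borrowed2() is not None
```

Note that the while loop in the question makes this awkward, since every iteration creates new intermediate strings that would each need an owner.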
But really, why do you need seq1 and seq2 to be char* values anyway? You're not getting any performance benefit out of it. In fact, you're getting an extra performance cost out of it. Every time you use seq1 as a str, it's creating a new str object out of that buffer (and copying the buffer), even though you already had a perfectly good one that you could have just retained instead, if you hadn't tricked Cython.
So, the easiest fix is to replace the first line with:
    cpdef unsigned int func(char *sequence1, char *sequence2, bool case_sensitive=True):
        seq1, seq2 = str(sequence1), str(sequence2)
(You don't really need the str calls there; the fact that you didn't cdef the variables should be enough. But I think this makes the intention clearer.)
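For reference, here is a minimal pure-Python sketch of how the body simplifies once seq1 and seq2 are ordinary string objects: the temp dance disappears, because each name owns its own string. The name func_sketch is hypothetical, the debug prints are left out, and the final return 0 is an assumption that mirrors the 0 the question's run printed when neither degenerate case matched.

```python
def func_sketch(sequence1, sequence2, case_sensitive=True):
    # Work on Python string objects, so every name holds a real reference.
    seq1, seq2 = str(sequence1), str(sequence2)

    # Take care of case sensitivity; plain reassignment is safe here.
    if not case_sensitive:
        seq1 = seq1.lower()
        seq2 = seq2.lower()

    # Trim common characters at the beginning of the words.
    while seq1 and seq2 and seq1[0] == seq2[0]:
        seq1 = seq1[1:]
        seq2 = seq2[1:]

    # Handle degenerate cases.
    if not seq1:
        return len(seq2)
    if not seq2:
        return len(seq1)

    # The toy code stops here; 0 is an assumed fall-through value.
    return 0
```

With this version, func_sketch('TUESDAYs', 'tuesday', False) returns 1, which is the result the question expected.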