unexpected output and return value with Cython

Question

First, I'm using Cython 0.18 with Python 2.7.4. I'm experiencing a rather strange bug, and I'm not sure why. Here's the toy code:

from cpython cimport bool

cpdef unsigned int func(char *seq1, char *seq2, bool case_sensitive=True):
        print 'seq1', seq1, len(seq1)
        print 'seq2', seq2, len(seq2)
        print

        #take care of case sensitivity
        if not case_sensitive:
                #this is kinda hacky, but I've gotta assign the lowercased string to a Python object before assigning it back to char *
                #see http://docs.cython.org/src/userguide/language_basics.html#caveats-when-using-a-python-string-in-a-c-context
                temp = seq1.lower()
                seq1 = temp

                temp = seq2.lower()
                seq2 = temp

        print 'seq1', seq1, len(seq1)
        print 'seq2', seq2, len(seq2)
        print

        #trim common characters at the beginning of the words
        while len(seq1) > 0 and len(seq2) > 0 and seq1[0] == seq2[0]:
                temp = seq1[1:]
                seq1 = temp

                temp = seq2[1:]
                seq2 = temp

        print 'seq1', seq1, len(seq1)
        print 'seq2', seq2, len(seq2)
        print

        #handle degenerate cases
        if not seq1:
                return len(seq2)
        if not seq2:
                return len(seq1)

Here's a sample call:

>>> from func import func
>>> print func('TUESDAYs', 'tuesday', False)

Now, what I expect to see is the following:

seq1 TUESDAYs 8
seq2 tuesday 7

seq1 tuesdays 8
seq2 tuesday 7

seq1 s 1
seq2  0

1

But what I actually see is this:

seq1 TUESDAYs 8
seq2 tuesday 7

seq1 tuesdays 8
seq2 tuesday 7

seq1 stdout 6
seq2 tuesday 7

0

What the hell is going on here? First of all, why is stdout outputting at all? Why aren't I getting the output I should be getting? Is this a Cython bug, or am I just missing something trivial here?

abarnert · Accepted Answer

The problem is in all of the cases like this:

temp = seq1.lower()
seq1 = temp

temp = seq2.lower()
seq2 = temp

As you pointed out in your question, you need this dance instead of just seq1 = seq1.lower() because of the rule the Cython docs describe under "Caveats when using a Python string in a C context".

But what you're doing isn't correct. It's just good enough to trick Cython into thinking it's correct, so it compiles garbage.

Let's step through line by line:

temp = seq1.lower()

This creates a str out of seq1, calls its lower(), and stores the result in temp.

seq1 = temp

This makes seq1 into a pointer to the internal buffer of the str object in temp. As the docs specifically say:

It is then your responsibility to hold the reference p for as long as necessary.

temp = seq2.lower()

This does the same thing for seq2 and stores the result in temp. As a consequence, it releases the old value of temp, which was the only reference you had to that str. CPython uses reference counting, so the old str is deallocated immediately. That means seq1 is now pointing at the internal buffer of a freed object.

The first two times, you apparently get lucky, and that buffer doesn't get reused. But eventually, in the while loop, your luck runs out: the buffer gets reused, and you end up with a pointer to some other string's buffer. That is how an unrelated string like stdout shows up as seq1 in your output.
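To see the "freed immediately" part in isolation, here is a minimal plain-Python sketch. It is not the Cython code from the question: Tracked and rebind_demo are hypothetical names, and Tracked merely stands in for the temporary str so that its deallocation can be observed. It relies on CPython's reference counting; other interpreters may free the object later.

```python
class Tracked(object):
    """Hypothetical stand-in for the temporary str; logs when it is destroyed."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __del__(self):
        # Called by CPython as soon as the last reference goes away.
        self.log.append(self.name)


def rebind_demo():
    log = []
    temp = Tracked("lowered seq1", log)
    # In the Cython code, seq1 would now point into this object's buffer.
    temp = Tracked("lowered seq2", log)  # rebinding drops the only reference
    # Snapshot what has been destroyed so far, before the function returns.
    return list(log)


freed = rebind_demo()
```

On CPython, freed contains the first object's name, which shows that it was destroyed the moment temp was rebound, not at the end of the function. A char* still pointing into it would be dangling from that line on.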

So, how do you solve this?

Well, you could keep all those intermediate references around as long as they're needed.

But really, why do you need seq1 and seq2 to be char* values anyway? You're not getting any performance benefit out of it. In fact, you're getting an extra performance cost out of it. Every time you use seq1 as a str, it's creating a new str object out of that buffer (and copying the buffer), even though you already had a perfectly good one that you could have just retained instead, if you hadn't tricked Cython.
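The copying cost can be illustrated by analogy with ctypes from the standard library. This is only a sketch of the general "C buffer to Python object" conversion, not what Cython generates; the variable names are made up for the example.

```python
import ctypes

# A mutable C buffer standing in for the memory a char* points at.
buf = ctypes.create_string_buffer(b"tuesday")

# Each conversion builds a brand-new Python bytes object by copying the buffer.
first = ctypes.string_at(buf)
second = ctypes.string_at(buf)
copies_are_distinct = first is not second

# Changing the C buffer afterwards does not affect the earlier copies.
buf[0] = b"T"
third = ctypes.string_at(buf)
```

Here first and second are equal but separate objects, and first keeps its old contents after the buffer changes. Holding on to one Python string avoids paying for a fresh copy on every use.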

So, the easiest fix is to rename the char* parameters and replace the function's first line (the cpdef signature) with:

cpdef unsigned int func(char *sequence1, char *sequence2, bool case_sensitive=True):
    seq1, seq2 = str(sequence1), str(sequence2)

(You don't really need the str calls there; the fact that you didn't cdef the variables should be enough. But I think this makes the intention clearer.)
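Once seq1 and seq2 are ordinary Python strings, the rest of the function needs no temp juggling at all. Here is a rough pure-Python sketch of the same logic, with the prints removed. The name func_py is made up for illustration, and the final return None just mirrors the point where the toy code in the question stops.

```python
def func_py(seq1, seq2, case_sensitive=True):
    """Pure-Python sketch of the toy function, using str objects throughout."""
    # Take care of case sensitivity; plain rebinding is safe for Python strings.
    if not case_sensitive:
        seq1 = seq1.lower()
        seq2 = seq2.lower()

    # Trim common characters at the beginning of the words.
    while seq1 and seq2 and seq1[0] == seq2[0]:
        seq1 = seq1[1:]
        seq2 = seq2[1:]

    # Handle degenerate cases.
    if not seq1:
        return len(seq2)
    if not seq2:
        return len(seq1)
    return None  # the toy code in the question has nothing past this point


result = func_py('TUESDAYs', 'tuesday', False)
```

With the sample call from the question, result is 1, which matches the expected output: every name always refers to a live string, so there is nothing left to dangle.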

Tags:

python

python-2.7

cython

Geoff

1 Answers

abarnert
