I have a large buffer of strings (basically 12GB) from a C app.
I would like to create PyString objects in C for an embedded Python interpreter without copying the strings. Is this possible?
I don't think that is possible for the basic reason that Python String objects are embedded into the PyObject structure. In other words, the Python string object is the PyObject_HEAD followed by the bytes of the string. You would have to have room in memory to put the PyObject_HEAD information around the existing bytes.
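The header-plus-payload layout can be observed from Python itself. A minimal sketch, assuming CPython 3, where the Python 2 PyString corresponds to `bytes`; the exact header size depends on the build, so only the difference between the two sizes is meaningful:

```python
import sys

# In CPython, a bytes object keeps its header and its payload in one
# allocation, so the reported size grows byte-for-byte with the payload.
empty = sys.getsizeof(b"")
three = sys.getsizeof(b"abc")

header_overhead = empty          # fixed per-object cost; build-dependent
payload_growth = three - empty   # bytes added by the 3-character payload

print(header_overhead, payload_growth)
```

Because the payload sits inside the same allocation as the header, an existing external buffer cannot simply be adopted as the payload.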
One can't use PyString without a copy, but one can use ctypes. Turns out that ctypes.c_char_p works basically like a string. For example with the following C code:
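static char* names[7] = {"a", "b", "c", "d", "e", "f", "g"};
PyObject *pFunc, *pArgs, *pValue;
/* td_py_get_callable is my own helper that looks up a Python callable */
pFunc = td_py_get_callable("my_func");
pArgs = PyTuple_New(2);
/* Pass the address of the char* array as a plain integer */
pValue = PyLong_FromSize_t((size_t) names);
PyTuple_SetItem(pArgs, 0, pValue);  /* steals the reference to pValue */
/* Pass the number of strings */
pValue = PyLong_FromLong(7);
PyTuple_SetItem(pArgs, 1, pValue);
pValue = PyObject_CallObject(pFunc, pArgs);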
One can then pass the address and the number of strings to the following Python function, my_func:
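import ctypes

def my_func(names_addr, num_strs):
    type_char_p = ctypes.POINTER(ctypes.c_char_p)
    # Treat the integer address as a char**; from_address would instead
    # reinterpret the first array element as the pointer itself.
    names = ctypes.cast(names_addr, type_char_p)
    for idx in range(num_strs):
        print(names[idx])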
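The same access pattern can be tried without a C host program. The following is only a sketch: it stands in for the C side by building the char* array with ctypes, and the `read_names` helper and the sample data are made up for illustration.

```python
import ctypes

def read_names(names_addr, num_strs):
    """Read num_strs C strings from a char*[] located at names_addr."""
    # Interpret the integer address as a char** without copying the array.
    names = ctypes.cast(names_addr, ctypes.POINTER(ctypes.c_char_p))
    # Indexing dereferences one char* and returns it as a bytes object.
    return [names[idx] for idx in range(num_strs)]

# Stand-in for the C array: a char*[3] owned by ctypes for this demo.
c_names = (ctypes.c_char_p * 3)(b"a", b"b", b"c")
result = read_names(ctypes.addressof(c_names), len(c_names))
print(result)
```

Note that each `names[idx]` access builds a new `bytes` object for that one string. What is avoided is copying the whole buffer up front.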
Of course, nobody really wants to pass an address and a length around in Python. We can wrap them in a numpy array, pass that around, and cast back when we need the strings:
import ctypes
import numpy

def my_func(names_addr, num_strs):
    type_char_p = ctypes.POINTER(ctypes.c_char_p)
    names = ctypes.cast(names_addr, type_char_p)
    # Cast to size_t pointers to be held by numpy
    p = ctypes.cast(names, ctypes.POINTER(ctypes.c_size_t))
    name_addrs = numpy.ctypeslib.as_array(p, shape=(num_strs,))
    # Pass to some numpy functions
    my_numpy_func(name_addrs)
The catch is that indexing the numpy array only yields an address, although the underlying memory is still the original C pointer array. We can cast back to ctypes.POINTER(ctypes.c_char_p) to access the values:
def my_numpy_func(name_addrs):
    names = name_addrs.ctypes.data_as(ctypes.POINTER(ctypes.c_char_p))
    for i in range(len(name_addrs)):
        print(names[i])
It's not perfect, as I can't use things like numpy.searchsorted to do a binary search at the numpy level (it would compare the addresses, not the strings they point to), but it does pass char* around without a copy well enough.
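One possible workaround is to do the binary search in Python over the ctypes pointer, dereferencing only the strings that are probed. This is a sketch, not part of the original approach: `search_sorted_names` is a hypothetical helper, and it assumes the char* array is already sorted in byte order.

```python
import ctypes

def search_sorted_names(names, num_strs, target):
    """Leftmost index at which target could be inserted to keep order.

    names is a ctypes.POINTER(ctypes.c_char_p) over a sorted char*[].
    """
    lo, hi = 0, num_strs
    while lo < hi:
        mid = (lo + hi) // 2
        # Only the probed string is turned into a bytes object.
        if names[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

# Stand-in for the sorted C array from the example above.
c_names = (ctypes.c_char_p * 7)(b"a", b"b", b"c", b"d", b"e", b"f", b"g")
names = ctypes.cast(c_names, ctypes.POINTER(ctypes.c_char_p))

idx_found = search_sorted_names(names, 7, b"d")
idx_between = search_sorted_names(names, 7, b"cc")
idx_past_end = search_sorted_names(names, 7, b"h")
print(idx_found, idx_between, idx_past_end)
```

This touches about log2(n) strings per lookup, at the cost of running the loop in Python rather than inside numpy.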