Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: efficient multi-string replace

Can this function be made more efficient? I need to process a million names...

def indian_soundex_encode(s):
    s = s.replace("aa", "a")
    s = s.replace("ee", "i")
    s = s.replace("zh", "l")
    s = s.replace("oo", "u")
    s = s.replace("bu", "b")
    s = s.replace("dh", "d")
    s = s.replace("gh", "g")
    s = s.replace("jh", "j")
    s = s.replace("kh", "k")
    s = s.replace("sh", "s")
    s = s.replace("th", "t")
    s = s.replace("ck", "k")
    s = s.replace("kk", "k")
    s = s.replace("nn", "n")
    s = s.replace("mm", "m")
    s = s.replace("pp", "p")
    s = s.replace("ll", "l")
    s = s.replace("ty", "ti")
    s = s.replace("ot", "od")
    s = s.replace("iya", "ia")
    s = s.replace("ya", "ia")
    s = s.replace("sv", "s")
    s = s.replace("sw", "s")
    s = s.replace("my", "mi")
    return s
like image 564
swami Avatar asked May 19 '26 09:05

swami


1 Answers

It will be hard to make the function more efficient using pure Python. str.replace is already fairly efficient, but it does need to scan the strings many times and, at least in some cases, create several new strings. Replacing multiple calls to replace with a smarter algorithm that only scans the string once, would likely make the function slower because you'd be doing more work in pure Python and give up the raw efficiency of str.replace.

If writing a C extension module is possible in your case, I'd recommend doing that. Measuring with timeit, the following function outperforms the original one by a factor of ~17 (0.184 usec compared to 3.28 usec for the Python version) for the sample string "foobaaar".

PyObject *
indian_soundex_encode(PyObject *ignore, PyObject *args)
{
  PyObject *py_s, *py_ret;
  bool replaced = false;
  if (!PyArg_ParseTuple(args, "S", &py_s))
    return NULL;

  const char *s = PyString_AS_STRING(py_s);
  Py_ssize_t len = PyString_GET_SIZE(py_s);
  char *ret = malloc(len + 1), *retptr = ret;
  if (!ret)
    return PyErr_NoMemory();

  while (len > 0) {
#define REPLACE(first, second, replacement)     \
    if (*s == first && *(s + 1) == second) {    \
      s += 2;                                   \
      len -= 2;                                 \
      *retptr++ = replacement;                  \
      replaced = true;                          \
      continue;                                 \
    }

    REPLACE('a', 'a', 'a');
    REPLACE('e', 'e', 'i');
    REPLACE('z', 'h', 'l');
    REPLACE('o', 'o', 'u');
    REPLACE('b', 'u', 'b');
    REPLACE('d', 'h', 'd');
    REPLACE('g', 'h', 'g');
    REPLACE('j', 'h', 'j');
    REPLACE('k', 'h', 'k');
    REPLACE('s', 'h', 's');
    REPLACE('t', 'h', 't');
    REPLACE('c', 'k', 'k');
    REPLACE('k', 'k', 'k');
    REPLACE('n', 'n', 'n');

#undef REPLACE
    *retptr++ = *s++;
    --len;
  }
  if (!replaced) {
    py_ret = py_s;
    Py_INCREF(py_ret);
  }
  else
    py_ret = PyString_FromStringAndSize(ret, retptr - ret);
  free(ret);
  return py_ret;
}

The above function can likely be further sped up with the use of switch statement or even more efficient lookup tables coded in C, but that is left as an exercise to the reader.

It would be another interesting exercise to try to code a version of this function in Cython, and compare its performance to the above hand-written C extension.

Update: The above C function corresponds to the original Python code in the question. The editor Jost snuck in a major code change along with the formatting change in his edit, which has apparently gone undetected by reviewers.

like image 85
user4815162342 Avatar answered May 20 '26 22:05

user4815162342



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!