Python: efficient multi-string replace

Question

Can this function be made more efficient? I need to process a million names...

def indian_soundex_encode(s):
    s = s.replace("aa", "a")
    s = s.replace("ee", "i")
    s = s.replace("zh", "l")
    s = s.replace("oo", "u")
    s = s.replace("bu", "b")
    s = s.replace("dh", "d")
    s = s.replace("gh", "g")
    s = s.replace("jh", "j")
    s = s.replace("kh", "k")
    s = s.replace("sh", "s")
    s = s.replace("th", "t")
    s = s.replace("ck", "k")
    s = s.replace("kk", "k")
    s = s.replace("nn", "n")
    s = s.replace("mm", "m")
    s = s.replace("pp", "p")
    s = s.replace("ll", "l")
    s = s.replace("ty", "ti")
    s = s.replace("ot", "od")
    s = s.replace("iya", "ia")
    s = s.replace("ya", "ia")
    s = s.replace("sv", "s")
    s = s.replace("sw", "s")
    s = s.replace("my", "mi")
    return s

user4815162342 · Accepted Answer

It will be hard to make the function more efficient using pure Python. str.replace is already fairly efficient, but it does need to scan the strings many times and, at least in some cases, create several new strings. Replacing multiple calls to replace with a smarter algorithm that only scans the string once, would likely make the function slower because you'd be doing more work in pure Python and give up the raw efficiency of str.replace.

If writing a C extension module is possible in your case, I'd recommend doing that. Measuring with timeit, the following function outperforms the original one by a factor of ~17 (0.184 usec compared to 3.28 usec for the Python version) for the sample string "foobaaar".

PyObject *
indian_soundex_encode(PyObject *ignore, PyObject *args)
{
  PyObject *py_s, *py_ret;
  bool replaced = false;
  if (!PyArg_ParseTuple(args, "S", &py_s))
    return NULL;

  const char *s = PyString_AS_STRING(py_s);
  Py_ssize_t len = PyString_GET_SIZE(py_s);
  char *ret = malloc(len + 1), *retptr = ret;
  if (!ret)
    return PyErr_NoMemory();

  while (len > 0) {
#define REPLACE(first, second, replacement)     \
    if (*s == first && *(s + 1) == second) {    \
      s += 2;                                   \
      len -= 2;                                 \
      *retptr++ = replacement;                  \
      replaced = true;                          \
      continue;                                 \
    }

    REPLACE('a', 'a', 'a');
    REPLACE('e', 'e', 'i');
    REPLACE('z', 'h', 'l');
    REPLACE('o', 'o', 'u');
    REPLACE('b', 'u', 'b');
    REPLACE('d', 'h', 'd');
    REPLACE('g', 'h', 'g');
    REPLACE('j', 'h', 'j');
    REPLACE('k', 'h', 'k');
    REPLACE('s', 'h', 's');
    REPLACE('t', 'h', 't');
    REPLACE('c', 'k', 'k');
    REPLACE('k', 'k', 'k');
    REPLACE('n', 'n', 'n');

#undef REPLACE
    *retptr++ = *s++;
    --len;
  }
  if (!replaced) {
    py_ret = py_s;
    Py_INCREF(py_ret);
  }
  else
    py_ret = PyString_FromStringAndSize(ret, retptr - ret);
  free(ret);
  return py_ret;
}

The above function can likely be further sped up with the use of switch statement or even more efficient lookup tables coded in C, but that is left as an exercise to the reader.

It would be another interesting exercise to try to code a version of this function in Cython, and compare its performance to the above hand-written C extension.

Update: The above C function corresponds to the original Python code in the question. The editor Jost snuck in a major code change along with the formatting change in his edit, which has apparently gone undetected by reviewers.

Python: efficient multi-string replace

Tags:

python

replace

soundex

swami

1 Answers

user4815162342

Recent Activity

Donate For Us

Python: efficient multi-string replace

Tags:

python

replace

soundex

swami

1 Answers

user4815162342

Related questions

Recent Activity

Donate For Us