Pandas string replace with regex argument for non-regex replacements

Question

Suppose I have a dataframe in which I want to replace a non-regex substring consisting only of characters (i.e. a-z, A-Z) and/or digits (i.e. 0-9) via pd.Series.str.replace. The docs state that this function is equivalent to str.replace or re.sub(), depending on the regex argument (default False).

Apart from most likely being overkill, are there any downsides to consider if the function was called with regex=True for non-regex replacements (e.g. performance)? If so, which ones? Of course, I am not suggesting using the function in this way.

Example: Replace 'Elephant' in the below dataframe.

import pandas as pd

data = {'Animal_Name': ['Elephant African', 'Elephant Asian', 'Elephant Indian', 'Elephant Borneo', 'Elephant Sumatran']}
df = pd.DataFrame(data)

# assign back to the column so that df remains a DataFrame
df['Animal_Name'] = df['Animal_Name'].str.replace('Elephant', 'Tiger', regex=True)

mozway · Accepted Answer

Special characters!

Using regular expressions with plain words is generally fine (aside from efficiency concerns). There will, however, be an issue when the pattern contains special characters. This is an often-overlooked problem, and I've seen many people not understand why their str.replace failed.

Pandas even changed the default from regex=True to regex=False, and the original reason for that (#GH24804) was that str.replace('.', '') would remove all characters, which is expected if you know regex, but not at all if you don't.

For example, let's try to replace 1.5 with 2.3 and the $ currency by £:

df = pd.DataFrame({'input': ['This item costs $1.5.', 'We need 125 units.']})

df['out1'] = df['input'].str.replace('1.5', '2.3', regex=False)
df['out1_regex'] = df['input'].str.replace('1.5', '2.3', regex=True)

df['out2'] = df['input'].str.replace('$', '£', regex=False)
df['out2_regex'] = df['input'].str.replace('$', '£', regex=True)

Output:

                   input                   out1             out1_regex  \
0  This item costs $1.5.  This item costs $2.3.  This item costs $2.3.   
1     We need 125 units.     We need 125 units.     We need 2.3 units.   

                    out2              out2_regex  
0  This item costs £1.5.  This item costs $1.5.£  
1     We need 125 units.     We need 125 units.£

Since . and $ have a special meaning in a regex, those cannot be used as is and should have been escaped (1\.5 / \$), which can be done programmatically with re.escape.
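As a minimal sketch of the re.escape approach (the variable names are only illustrative, and the data is the same as in the example above), escaping the pattern first makes the regex=True result match the literal regex=False result:

```python
import re
import pandas as pd

s = pd.Series(['This item costs $1.5.', 'We need 125 units.'])

# re.escape backslash-escapes regex metacharacters such as '.' and '$'
pat_number = re.escape('1.5')   # the string 1\.5
pat_dollar = re.escape('$')     # the string \$

# regex=True with escaped patterns: '.' and '$' are now matched literally
out_number = s.str.replace(pat_number, '2.3', regex=True)
out_dollar = s.str.replace(pat_dollar, '£', regex=True)

# reference results from the plain string replacement
literal_number = s.str.replace('1.5', '2.3', regex=False)
literal_dollar = s.str.replace('$', '£', regex=False)

print(out_number.tolist())
print(out_dollar.tolist())
```

With the escaped patterns, '125' is left untouched and '$' is replaced in place rather than '£' being appended at the end of each string.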

How does `str.replace` decide to use a regex or a plain string operation?

A pure Python replacement will be used if all of the following hold:

regex=False
pat is a string (passing a compiled regex with regex=False will trigger a ValueError)
case is not False
no flags are set
repl is not a callable

In all other cases, re.sub will be used.

The code that does this is in core/strings/object_array.py:

    def _str_replace(
        self,
        pat: str | re.Pattern,
        repl: str | Callable,
        n: int = -1,
        case: bool = True,
        flags: int = 0,
        regex: bool = True,
    ):
        if case is False:
            # add case flag, if provided
            flags |= re.IGNORECASE

        if regex or flags or callable(repl):
            if not isinstance(pat, re.Pattern):
                if regex is False:
                    pat = re.escape(pat)
                pat = re.compile(pat, flags=flags)

            n = n if n >= 0 else 0
            f = lambda x: pat.sub(repl=repl, string=x, count=n)
        else:
            f = lambda x: x.replace(pat, repl, n)
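To see the effect of this branching without pandas, here is a rough single-string sketch of the same logic. The helper name replace_one is hypothetical and not part of pandas; it only mirrors the excerpt above and skips the validation that the public str.replace performs before reaching this code.

```python
import re

def replace_one(text, pat, repl, n=-1, case=True, flags=0, regex=True):
    """Hypothetical single-string version of the branching shown above."""
    if case is False:
        flags |= re.IGNORECASE

    if regex or flags or callable(repl):
        # regex engine: a literal pattern is escaped first
        if not isinstance(pat, re.Pattern):
            if regex is False:
                pat = re.escape(pat)
            pat = re.compile(pat, flags=flags)
        # count=0 means "replace all" for re, unlike -1 for str.replace
        return pat.sub(repl, text, count=max(n, 0))

    # plain string operation
    return text.replace(pat, repl, n)

# plain branch: '.' is a literal dot
plain = replace_one('a.b axb', 'a.b', 'X', regex=False)
# regex branch: '.' matches any character
as_regex = replace_one('a.b axb', 'a.b', 'X', regex=True)
# regex=False but case=False: regex engine with an escaped pattern
ignore_case = replace_one('A.B axb', 'a.b', 'X', regex=False, case=False)

print(plain, '|', as_regex, '|', ignore_case)
```

The third call shows why case=False forces the regex path: a case-insensitive literal match is not available through str.replace, so the pattern is escaped and compiled with re.IGNORECASE instead.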

Efficiency

Considering a pattern without special characters, regex=True is about 6 times slower than regex=False in the linear regime:

[Figure: timing comparison of regex=True and regex=False for pandas str.replace]
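A comparison along these lines can be sketched with timeit. The Series size and the repeat count below are arbitrary assumptions, and the measured ratio will depend on the machine, the pandas version, and the string dtype in use, so treat the numbers as indicative only:

```python
import timeit
import pandas as pd

# Hypothetical test data: many copies of one short string
s = pd.Series(['Elephant African'] * 100_000)

def with_regex():
    return s.str.replace('Elephant', 'Tiger', regex=True)

def without_regex():
    return s.str.replace('Elephant', 'Tiger', regex=False)

# take the best of several runs to reduce noise
t_regex = min(timeit.repeat(with_regex, number=1, repeat=5))
t_plain = min(timeit.repeat(without_regex, number=1, repeat=5))

print(f'regex=True : {t_regex:.4f} s')
print(f'regex=False: {t_plain:.4f} s')
print(f'ratio      : {t_regex / t_plain:.1f}x')
```

Both calls return the same result for a pattern without special characters; only the time taken differs.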

Tags: python, replace, pandas

bproxauf
