
Pandas string replace with regex argument for non-regex replacements

Suppose I have a dataframe in which I want to replace a non-regex substring consisting only of characters (i.e. a-z, A-Z) and/or digits (i.e. 0-9) via pd.Series.str.replace. The docs state that this function is equivalent to str.replace or re.sub(), depending on the regex argument (default False).

Apart from most likely being overkill, are there any downsides to consider if the function was called with regex=True for non-regex replacements (e.g. performance)? If so, which ones? Of course, I am not suggesting using the function in this way.

Example: Replace 'Elephant' in the below dataframe.

import pandas as pd

data = {'Animal_Name': ['Elephant African', 'Elephant Asian', 'Elephant Indian', 'Elephant Borneo', 'Elephant Sumatran']}
df = pd.DataFrame(data)

df['Animal_Name'] = df['Animal_Name'].str.replace('Elephant', 'Tiger', regex=True)
bproxauf asked Dec 08 '25 08:12


1 Answer

Special characters!

Using regular expressions with plain words is generally fine (aside from efficiency concerns). There will, however, be an issue as soon as the pattern contains special characters. This is often overlooked, and I've seen many people who did not understand why their str.replace failed.

Pandas even changed the default from regex=True to regex=False. The original reason for that (#GH24804) was that str.replace('.', '') would remove all characters, which is expected if you know regex, but not at all if you don't.
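As a minimal sketch of that surprise, here is the same replacement on a single plain Python string, using the built-in str.replace and re.sub as stand-ins for the two pandas code paths (the sample string is made up for illustration):

```python
import re

text = "a.b.c"

# Literal replacement: only the dots are removed.
literal = text.replace(".", "")

# Regex replacement: "." matches any character (except a newline),
# so every character is removed.
as_regex = re.sub(".", "", text)

print(repr(literal))   # 'abc'
print(repr(as_regex))  # ''
```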

For example, let's try to replace 1.5 with 2.3 and the $ currency symbol with £:

df = pd.DataFrame({'input': ['This item costs $1.5.', 'We need 125 units.']})

df['out1'] = df['input'].str.replace('1.5', '2.3', regex=False)
df['out1_regex'] = df['input'].str.replace('1.5', '2.3', regex=True)

df['out2'] = df['input'].str.replace('$', '£', regex=False)
df['out2_regex'] = df['input'].str.replace('$', '£', regex=True)

Output:

                   input                   out1             out1_regex  \
0  This item costs $1.5.  This item costs $2.3.  This item costs $2.3.   
1     We need 125 units.     We need 125 units.     We need 2.3 units.   

                    out2              out2_regex  
0  This item costs £1.5.  This item costs $1.5.£  
1     We need 125 units.     We need 125 units.£  

Since . and $ have a special meaning in a regex, they cannot be used as-is and should have been escaped (1\.5 / \$), which can be done programmatically with re.escape.
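A short sketch of the re.escape approach, run on plain Python strings with re.sub so that it needs nothing beyond the standard library (the variable names are only illustrative):

```python
import re

texts = ["This item costs $1.5.", "We need 125 units."]

# re.escape backslash-escapes regex metacharacters such as "." and "$".
pat_number = re.escape("1.5")  # matches only the literal text "1.5"
pat_dollar = re.escape("$")    # matches only a literal dollar sign

out_number = [re.sub(pat_number, "2.3", t) for t in texts]
out_dollar = [re.sub(pat_dollar, "£", t) for t in texts]

print(out_number)  # ['This item costs $2.3.', 'We need 125 units.']
print(out_dollar)  # ['This item costs £1.5.', 'We need 125 units.']
```

With pandas, the escaped pattern would be passed the same way, e.g. df['input'].str.replace(re.escape('1.5'), '2.3', regex=True), which should give the same result as the regex=False call.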

How does str.replace decide to use a regex or a plain string operation?

A pure Python replacement will be used if all of the following hold:

  • regex=False
  • pat is a string (passing a compiled regex with regex=False will trigger a ValueError)
  • case is not False
  • no flags are set
  • repl is not a callable

In all other cases, re.sub will be used.

The code that does this is in core/strings/object_array.py:

    def _str_replace(
        self,
        pat: str | re.Pattern,
        repl: str | Callable,
        n: int = -1,
        case: bool = True,
        flags: int = 0,
        regex: bool = True,
    ):
        if case is False:
            # add case flag, if provided
            flags |= re.IGNORECASE

        if regex or flags or callable(repl):
            if not isinstance(pat, re.Pattern):
                if regex is False:
                    pat = re.escape(pat)
                pat = re.compile(pat, flags=flags)

            n = n if n >= 0 else 0
            f = lambda x: pat.sub(repl=repl, string=x, count=n)
        else:
            f = lambda x: x.replace(pat, repl, n)
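To play with this branching outside of pandas, here is a rough single-string imitation of the excerpt above. The function name replace_one is hypothetical and this is not the pandas implementation, just the same decision logic applied to one string:

```python
import re
from typing import Callable, Union


def replace_one(
    text: str,
    pat: Union[str, re.Pattern],
    repl: Union[str, Callable],
    n: int = -1,
    case: bool = True,
    flags: int = 0,
    regex: bool = True,
) -> str:
    """Hypothetical single-string version of the branching shown above."""
    if case is False:
        flags |= re.IGNORECASE

    if regex or flags or callable(repl):
        if not isinstance(pat, re.Pattern):
            if regex is False:
                # Literal pattern on the regex path: escape it first.
                pat = re.escape(pat)
            pat = re.compile(pat, flags=flags)
        # Pattern.sub uses count=0 for "replace all".
        return pat.sub(repl, text, count=n if n >= 0 else 0)

    # Plain string path; str.replace uses -1 for "replace all".
    return text.replace(pat, repl, n)


# Plain path: "$" is a literal dollar sign.
print(replace_one("Costs $5 and $6", "$", "£", regex=False))
# Regex path: "$" is the end-of-string anchor.
print(replace_one("Costs $5", "$", "£", regex=True))
# case=False forces the regex path, but the pattern is escaped first.
print(replace_one("An ELEPHANT", "elephant", "tiger", regex=False, case=False))
```

The last call shows why regex=False does not always mean a plain str.replace: asking for a case-insensitive match sets a flag, so the literal pattern is escaped and handed to re.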

Efficiency

Considering a pattern without special characters, regex=True is about 6 times slower than regex=False in the linear regime:

[Figure: timing comparison of pandas str.replace with regex=True vs. regex=False]
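If you want to get a feel for the difference yourself, a rough sketch with timeit on plain Python strings could look like the following. Note that this times the built-in str.replace against a compiled re pattern, not pandas itself, and the test data is made up, so the ratio you see will likely differ from the figure quoted above:

```python
import re
import timeit

# Hypothetical test data: many copies of strings like those in the question.
data = ["Elephant African", "Elephant Asian", "Elephant Indian"] * 10_000


def literal_replace():
    # Plain string replacement, as on the regex=False path.
    return [s.replace("Elephant", "Tiger") for s in data]


compiled = re.compile("Elephant")


def regex_replace():
    # Compiled-pattern replacement, as on the regex=True path.
    return [compiled.sub("Tiger", s) for s in data]


# Both approaches must agree before their timings are compared.
assert literal_replace() == regex_replace()

t_literal = timeit.timeit(literal_replace, number=10)
t_regex = timeit.timeit(regex_replace, number=10)
print(f"literal: {t_literal:.3f}s  regex: {t_regex:.3f}s  "
      f"ratio: {t_regex / t_literal:.1f}x")
```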

mozway answered Dec 10 '25 16:12


