How to remove text between two specific words in a dataframe column by python

Question

I have text in a column I am cleaning, I need to remove all words between the words "Original" and "Subject" wherever they appear in the column which is only some of the rows.

I am currently trying

   a = df['textcol']
   import re
   df['textcol'] =re.sub('Original.*?Subject','',str(a), flags=re.DOTALL)

this function is making every string within every tow the exact same as the first row it instead of looking at each row individually and altering it

Wiktor Stribiżew · Accepted Answer

You need to use Series.str.replace directly:

df['textcol'] = df['textcol'].str.replace(r'(?s)Original.*?Subject', '', regex=True)

Here, (?s) stands for re.DOTALL / re.S in order not to have to import re, it is their inline modifier version. The .*? matches any zero or more chars, as few as possible.

If Original and Subject need to be passed as variables containing literal text, do not forget about re.escape:

import re
# ... etc. ...
start = "Original"
end = "Subject"
df['textcol'] = df['textcol'].str.replace(fr'(?s){re.escape(start)}.*?{re.escape(end)}', '', regex=True)

How to remove text between two specific words in a dataframe column by python

Tags:

python

string

regex

Arica Christensen

1 Answers

Wiktor Stribiżew

Recent Activity

Donate For Us

How to remove text between two specific words in a dataframe column by python

Tags:

python

string

regex

Arica Christensen

1 Answers

Wiktor Stribiżew

Related questions

Recent Activity

Donate For Us