Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to remove text between two specific words in a dataframe column by python

I have text in a column I am cleaning, I need to remove all words between the words "Original" and "Subject" wherever they appear in the column which is only some of the rows.

I am currently trying

   a = df['textcol']
   import re
   df['textcol'] =re.sub('Original.*?Subject','',str(a), flags=re.DOTALL)

this function is making every string within every tow the exact same as the first row it instead of looking at each row individually and altering it

like image 946
Arica Christensen Avatar asked Oct 16 '25 03:10

Arica Christensen


1 Answers

You need to use Series.str.replace directly:

df['textcol'] = df['textcol'].str.replace(r'(?s)Original.*?Subject', '', regex=True)

Here, (?s) stands for re.DOTALL / re.S in order not to have to import re, it is their inline modifier version. The .*? matches any zero or more chars, as few as possible.

If Original and Subject need to be passed as variables containing literal text, do not forget about re.escape:

import re
# ... etc. ...
start = "Original"
end = "Subject"
df['textcol'] = df['textcol'].str.replace(fr'(?s){re.escape(start)}.*?{re.escape(end)}', '', regex=True)
like image 160
Wiktor Stribiżew Avatar answered Oct 17 '25 15:10

Wiktor Stribiżew



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!