I have text in a column I am cleaning, I need to remove all words between the words "Original" and "Subject" wherever they appear in the column which is only some of the rows.
I am currently trying
a = df['textcol']
import re
df['textcol'] =re.sub('Original.*?Subject','',str(a), flags=re.DOTALL)
this function is making every string within every tow the exact same as the first row it instead of looking at each row individually and altering it
You need to use Series.str.replace
directly:
df['textcol'] = df['textcol'].str.replace(r'(?s)Original.*?Subject', '', regex=True)
Here, (?s)
stands for re.DOTALL
/ re.S
in order not to have to import re
, it is their inline modifier version. The .*?
matches any zero or more chars, as few as possible.
If Original
and Subject
need to be passed as variables containing literal text, do not forget about re.escape
:
import re
# ... etc. ...
start = "Original"
end = "Subject"
df['textcol'] = df['textcol'].str.replace(fr'(?s){re.escape(start)}.*?{re.escape(end)}', '', regex=True)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With