I have a data frame column which contains the following strings. I want to parse everything to the left of the 4 or 5 digit number into a separate column
column name
0 129 4,029.16 08-31-13 8043 304.12 02-28-13 00T773466
1 19 179.00 03-31-12 T0707440 7794 2,898.56 01-31-13 00T757276
2 19716 3.36 01-31-15 00T729912 30593 6,041.74 12-31-16 00T770362
3 19870 131.22 02-28-15 00T709284 4 30611 199.46 02-03-23 12-31-16 00T691818
4 19984 12.25 02-28-15 00T879664 30647 199.45 12-31-16 00T691818
5 20505 1,206.13 03-31-15 00T590582 30648 199.45 12-31-16 00T691818
6 20623 84.11 04-30-15 00T621725 30759 3,325.27 12-31-16 00T580639
So the desired dataframe column would look something like this:
before
0 129 4,029.16 08-31-13
1 19 179.00 03-31-12 T0707440
2 19716 3.36 01-31-15 00T729912
3 19870 131.22 02-28-15 00T709284 4
4 19984 12.25 02-28-15 00T879664
5 20505 1,206.13 03-31-15 00T590582
6 20623 84.11 04-30-15 00T621725
I tried parsing everything to the right of the 4 or 5 digit using
df['after']=df['column_name'].str.extract('(\s\d{4,5}\s+.*)')
And it works but how do I do the same to obtain df['before']?
Any help is greatly appreciated. Thanks!
You're a step away from the expected output, you just need a lookahead :
df["before"] = df["column_name"].str.extract(r"(.*)\s(?=\b\d{4,5}\b)")
Output :
print(df["before"])
0 129 4,029.16 08-31-13
1 19 179.00 03-31-12 T0707440
2 19716 3.36 01-31-15 00T729912
3 19870 131.22 02-28-15 00T709284 4
4 19984 12.25 02-28-15 00T879664
5 20505 1,206.13 03-31-15 00T590582
6 20623 84.11 04-30-15 00T621725
Name: after, dtype: object
If you need both (before/after), you can use :
df[["before", "after"]] = df["column_name"].str.extract(r"(.*)\s+\b\d{4,5}\b\s+(.*)")
Output :
before after
0 129 4,029.16 08-31-13 304.12 02-28-13 00T773466
1 19 179.00 03-31-12 T0707440 2,898.56 01-31-13 00T757276
2 19716 3.36 01-31-15 00T729912 6,041.74 12-31-16 00T770362
3 19870 131.22 02-28-15 00T709284 4 199.46 02-03-23 12-31-16 00T691818
4 19984 12.25 02-28-15 00T879664 199.45 12-31-16 00T691818
5 20505 1,206.13 03-31-15 00T590582 199.45 12-31-16 00T691818
6 20623 84.11 04-30-15 00T621725 3,325.27 12-31-16 00T580639
Alternative, using .str.replace:
df['column_name'] = df['column_name'].str.replace(r'(?<=.{15})\s+\d{4,5}\s+.*', '', regex=True)
print(df)
Prints:
column_name
0 129 4,029.16 08-31-13
1 19 179.00 03-31-12 T0707440
2 19716 3.36 01-31-15 00T729912
3 19870 131.22 02-28-15 00T709284 4
4 19984 12.25 02-28-15 00T879664
5 20505 1,206.13 03-31-15 00T590582
6 20623 84.11 04-30-15 00T621725
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With