Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python, Regex everything before a sequence

I have a data frame column which contains the following strings. I want to parse everything to the left of the 4 or 5 digit number into a separate column

column name
    0   129 4,029.16 08-31-13 8043 304.12 02-28-13 00T773466
    1   19 179.00 03-31-12 T0707440 7794 2,898.56 01-31-13 00T757276
    2   19716 3.36 01-31-15 00T729912 30593 6,041.74 12-31-16 00T770362
    3   19870 131.22 02-28-15 00T709284 4 30611 199.46 02-03-23 12-31-16 00T691818
    4   19984 12.25 02-28-15 00T879664 30647 199.45 12-31-16 00T691818
    5   20505 1,206.13 03-31-15 00T590582 30648 199.45 12-31-16 00T691818
    6   20623 84.11 04-30-15 00T621725 30759 3,325.27 12-31-16 00T580639

So the desired dataframe column would look something like this:

before
0   129 4,029.16 08-31-13
1   19 179.00 03-31-12 T0707440
2   19716 3.36 01-31-15 00T729912
3   19870 131.22 02-28-15 00T709284 4
4   19984 12.25 02-28-15 00T879664
5   20505 1,206.13 03-31-15 00T590582
6   20623 84.11 04-30-15 00T621725

I tried parsing everything to the right of the 4 or 5 digit using

df['after']=df['column_name'].str.extract('(\s\d{4,5}\s+.*)')

And it works but how do I do the same to obtain df['before']?

Any help is greatly appreciated. Thanks!

like image 202
Nev1111 Avatar asked Mar 13 '26 14:03

Nev1111


2 Answers

You're a step away from the expected output, you just need a lookahead :

df["before"] = df["column_name"].str.extract(r"(.*)\s(?=\b\d{4,5}\b)")

​ Output :

print(df["before"])

0                129 4,029.16 08-31-13
1          19 179.00 03-31-12 T0707440
2        19716 3.36 01-31-15 00T729912
3    19870 131.22 02-28-15 00T709284 4
4       19984 12.25 02-28-15 00T879664
5    20505 1,206.13 03-31-15 00T590582
6       20623 84.11 04-30-15 00T621725
Name: after, dtype: object

If you need both (before/after), you can use :

df[["before", "after"]] = df["column_name"].str.extract(r"(.*)\s+\b\d{4,5}\b\s+(.*)")

Output :

                              before                               after
0              129 4,029.16 08-31-13           304.12 02-28-13 00T773466
1        19 179.00 03-31-12 T0707440         2,898.56 01-31-13 00T757276
2      19716 3.36 01-31-15 00T729912         6,041.74 12-31-16 00T770362
3  19870 131.22 02-28-15 00T709284 4  199.46 02-03-23 12-31-16 00T691818
4     19984 12.25 02-28-15 00T879664           199.45 12-31-16 00T691818
5  20505 1,206.13 03-31-15 00T590582           199.45 12-31-16 00T691818
6     20623 84.11 04-30-15 00T621725         3,325.27 12-31-16 00T580639
like image 176
Timeless Avatar answered Mar 15 '26 03:03

Timeless


Alternative, using .str.replace:

df['column_name'] = df['column_name'].str.replace(r'(?<=.{15})\s+\d{4,5}\s+.*', '', regex=True)
print(df)

Prints:

                         column_name
0              129 4,029.16 08-31-13
1        19 179.00 03-31-12 T0707440
2      19716 3.36 01-31-15 00T729912
3  19870 131.22 02-28-15 00T709284 4
4     19984 12.25 02-28-15 00T879664
5  20505 1,206.13 03-31-15 00T590582
6     20623 84.11 04-30-15 00T621725
like image 27
Andrej Kesely Avatar answered Mar 15 '26 04:03

Andrej Kesely