
Date extraction using regex in Python Pandas

I have a Pandas series with many text entries and I am trying to extract all the dates and sort them. The dates are written in many different formats, so the challenge is getting them all and sorting them correctly. I've been able to use 'str.findall' to successfully make a list of all dates, but it doesn't split the dates into month, day and year, so I can't really sort them. I've then tried using 'str.extractall', but it is not working how I expected.

Example with dates of type mm/dd/yyyy, mm/yyyy and so on:

import pandas as pd

df = pd.Series(['1/1994 Primary Care Doctor:\n', 'sshe plans to move as of 7/8/71 In-Home Services: None\n', 'Reports MRI of brain done 12/2004 at Gravette Medical Center was WNLPrior EEG:\n'])

wfind = df.str.findall(r'\d{1,2}[/-]\d{1,2}?[/-]?\d{2,4}')
wextract = df.str.extractall(r'(\d{1,2})[/-](\d{1,2})?[/-]?(\d{2,4})')

With extractall, the year gets split across two groups: for '1/1994' the second group captures 19 and the third captures 94. Any suggestions on how to handle this?

The output I am hoping for is a DataFrame with one column for the month, one for the day and one for the year. I expect NaN in the month or day column whenever that part is missing from the date.

JFK asked Dec 21 '25 06:12


1 Answer

It looks like you want to extract dates that look like this: an optional month comes first, then an optional day, and then a year that is either two digits long or four digits long and starting with 1 or 2.

You can use

df.str.extract(r'\b(?:(0?[1-9]|1[0-2])[/-])?(?:(0?[1-9]|[12]\d|3[01])[/-])?((?:[12]\d)?\d{2})\b', expand=False)

See the regex demo.

More details

  • \b - a word boundary
  • (?:(0?[1-9]|1[0-2])[/-])? - an optional sequence of
    • (0?[1-9]|1[0-2]) - Group 1 (month): an optional 0 and then a non-zero digit, or 1 and then either 0, 1 or 2
    • [/-] - a / or -
  • (?:(0?[1-9]|[12]\d|3[01])[/-])? - an optional sequence of
    • (0?[1-9]|[12]\d|3[01]) - Group 2 (day): an optional 0 and then a non-zero digit, or 1 or 2 and any digit, or a 3 and then either 0 or 1
    • [/-] - a / or -
  • ((?:[12]\d)?\d{2}) - Group 3 (year): an optional sequence of 1 or 2 followed by any digit, and then two digits
  • \b - a word boundary
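
To check the breakdown above, here is a small sketch that writes the same pattern with Python's re.VERBOSE flag. The group names month, day and year are added only for illustration (the original pattern uses unnamed groups), and the sample strings are the dates from the question.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
# The group names are illustrative; the original uses numbered groups.
DATE_RE = re.compile(r"""
    \b
    (?:(?P<month>0?[1-9]|1[0-2])[/-])?        # optional month 1-12, then / or -
    (?:(?P<day>0?[1-9]|[12]\d|3[01])[/-])?    # optional day 1-31, then / or -
    (?P<year>(?:[12]\d)?\d{2})                # 2-digit year, or 4-digit year starting with 1 or 2
    \b
""", re.VERBOSE)

samples = ["1/1994", "7/8/71", "12/2004"]
parsed = [DATE_RE.search(s).groupdict() for s in samples]

for sample, groups in zip(samples, parsed):
    print(sample, groups)
```

A group that does not take part in the match comes back as None here. Pandas shows the same case as NaN.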

Pandas code:

import pandas as pd
df = pd.DataFrame({'text':['1/1994 Primary Care Doctor:\n', 'sshe plans to move as of 7/8/71 In-Home Services: None\n', 'Reports MRI of brain done 12/2004 at Gravette Medical Center was WNLPrior EEG:\n']})
df[['month','day','year']] = df['text'].str.extract(r'\b(?:(0?[1-9]|1[0-2])[/-])?(?:(0?[1-9]|[12]\d|3[01])[/-])?((?:[12]\d)?\d{2})\b', expand=False)
df
                                                text month  day  year
0                      1/1994 Primary Care Doctor:\n     1  NaN  1994
1  sshe plans to move as of 7/8/71 In-Home Servic...     7    8    71
2  Reports MRI of brain done 12/2004 at Gravette ...    12  NaN  2004
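
Since the question also asks for all dates and for sorting, here is a rough sketch of one way to go further with str.extractall. It rests on assumptions that are not in the original post: two-digit years are taken to be in the 1900s, and a missing month or day is treated as 1 for sorting purposes only. The variable names and the named groups are illustrative.

```python
import pandas as pd

texts = pd.Series([
    '1/1994 Primary Care Doctor:\n',
    'sshe plans to move as of 7/8/71 In-Home Services: None\n',
    'Reports MRI of brain done 12/2004 at Gravette Medical Center was WNLPrior EEG:\n',
])

# Same regex as above; named groups become the column names.
pattern = (
    r'\b(?:(?P<month>0?[1-9]|1[0-2])[/-])?'
    r'(?:(?P<day>0?[1-9]|[12]\d|3[01])[/-])?'
    r'(?P<year>(?:[12]\d)?\d{2})\b'
)

# One row per match; the index is (original row, match number).
dates = texts.str.extractall(pattern).apply(pd.to_numeric)

# Assumption: a two-digit year such as 71 means 1971.
dates['year'] = dates['year'].where(dates['year'] >= 100, dates['year'] + 1900)

# Assumption: a missing month or day sorts as 1.
sorted_dates = (
    dates.fillna({'month': 1, 'day': 1})
         .sort_values(['year', 'month', 'day'])
)
print(sorted_dates)
```

The first level of the resulting index still points back to the original row, so the sorted order can be mapped to the source texts. Note that this pattern also accepts a bare two-digit number as a year, so real data may need extra filtering.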
Wiktor Stribiżew answered Dec 23 '25 19:12