
Parse unformatted dates in Python

I have some text, taken from different websites, that I want to extract dates from. As one can imagine, the dates vary substantially in how they are formatted, and look something like:

Posted: 10/01/2014 
Published on August 1st 2014
Last modified on 5th of July 2014
Posted by Dave on 10-01-14

What I want to know is whether anyone knows of a Python library (or API) that would help with this, other than e.g. regex, which will be my fallback. I could probably remove the "posted on" parts fairly easily, but making the rest consistent does not look easy.

asked May 01 '26 14:05 by kyrenia


1 Answer

My solution using dateutil

Following Lukas's suggestion, I used the dateutil package (it seemed far more flexible than Arrow) with its fuzzy option (fuzzy=True), which basically ignores anything in the string that is not part of a date.

Caution on Fuzzy parsing using dateutil

The main thing to note, as discussed in the thread "Trouble in parsing date using dateutil", is that if the parser cannot find a day, month or year, it fills that field in from a default value (the current date, unless you specify one). As far as I can tell, no flag is reported to indicate that the default was used.

This would result in "random text" returning today's date (2015-4-16 when I ran it), which could have caused problems.
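A minimal sketch of the silent fill-in described above, assuming python-dateutil is installed. The placeholder default is an arbitrary value chosen for illustration, and the input string has a month and a day but no year:

```python
from datetime import datetime
from dateutil.parser import parse

# Arbitrary placeholder default, chosen only for this illustration.
placeholder = datetime(2001, 1, 1)

# The text carries a month and a day, but no year.
result = parse("Posted: by dave August 1st", fuzzy=True, default=placeholder)

# The missing year is copied from the placeholder, and nothing in the
# returned datetime marks it as a filled-in value.
print(result.year == placeholder.year)
print(result.month, result.day)
```

Nothing in the returned object distinguishes the borrowed year from a parsed one, which is why a second parse with a different default is needed to spot it.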

Solution

Since I really want to know when it fails, rather than have the date filled in with a default value, I ended up running the parser twice with two different defaults and checking, field by field, whether both results simply returned their default. If they did not, I assumed the field had been parsed correctly.

from datetime import datetime
from dateutil.parser import parse

def extract_date(text):
    # Parse twice with two different defaults. A field that matches the
    # default in both runs was not found in the text.
    date = {}
    date_1 = parse(text, fuzzy=True, default=datetime(2001, 1, 1))
    date_2 = parse(text, fuzzy=True, default=datetime(2002, 2, 2))

    if date_1.day == 1 and date_2.day == 2:
        date["day"] = "XX"
    else:
        date["day"] = date_1.day

    if date_1.month == 1 and date_2.month == 2:
        date["month"] = "XX"
    else:
        date["month"] = date_1.month

    if date_1.year == 2001 and date_2.year == 2002:
        date["year"] = "XXXX"
    else:
        date["year"] = date_1.year

    return date

print(extract_date("Posted: by dave August 1st"))

Obviously this is a bit of a botch (so if anyone has a more elegant solution, please share), but it correctly parsed the four examples I had above (assuming US format for the date 10/01/2014 rather than UK format), and it returned XX appropriately when part of the date was missing.
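For what it's worth, here is a more compact variant of the same two-default trick, offered as a sketch rather than a definitive implementation. The function name extract_date_fields and the two placeholder constants are hypothetical. It returns None instead of "XX" for missing fields, and it passes dateutil's dayfirst flag through so that 10/01/2014 can be read in UK order. I believe newer dateutil releases raise ValueError when a string has no date in it at all, so the sketch catches that and returns None; treat that behaviour as an assumption and check it against your installed version.

```python
from datetime import datetime
from dateutil.parser import parse

# Two placeholder defaults that differ in every field (arbitrary values).
_DEFAULT_A = datetime(2001, 1, 1)
_DEFAULT_B = datetime(2002, 2, 2)


def extract_date_fields(text, dayfirst=False):
    """Return a dict with 'day', 'month', 'year'; None marks a missing field.

    Returns None instead of a dict if dateutil rejects the text outright.
    """
    try:
        first = parse(text, fuzzy=True, dayfirst=dayfirst, default=_DEFAULT_A)
        second = parse(text, fuzzy=True, dayfirst=dayfirst, default=_DEFAULT_B)
    except (ValueError, OverflowError):
        return None

    fields = {}
    for name in ("day", "month", "year"):
        a, b = getattr(first, name), getattr(second, name)
        # A field found in the text is identical in both runs; a field
        # copied from the defaults differs, because the defaults differ.
        fields[name] = a if a == b else None
    return fields


print(extract_date_fields("Posted: 10/01/2014"))
# {'day': 1, 'month': 10, 'year': 2014}
print(extract_date_fields("Posted: 10/01/2014", dayfirst=True))
# {'day': 10, 'month': 1, 'year': 2014}
print(extract_date_fields("Posted: by dave August 1st"))
# {'day': 1, 'month': 8, 'year': None}
```

Comparing the two runs with each other, rather than with the literal default values, keeps the check in one loop and means the placeholders can be changed without touching the logic.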

answered May 04 '26 11:05 by kyrenia


