I am trying to extract the age of a person from a sentence; this is a bit simplified, but it's all for a research project. I know that in the sentence the age is always preceded by either a colon followed by 0 or more spaces, or a colon, spaces, a few words, and some spaces. For example, given "character: a lovely eighty year old grandma", I want a regex that lets me extract 'eighty' from one of the groups. I am using Python's 're' library and my code hangs on this example (code and example below):
import re

# Note: this pattern hangs on the full sentence below.
regex_age_string = r'([:]*[ ]*)?((([a-z]*)([ -]*))+)([ -]+)(year)'
regex_age_string = re.compile(regex_age_string, re.DOTALL)
sentence = 'history: four year-old boy was really sad when he found out the toy was broken'
age_extract_string = re.search(regex_age_string, sentence)
print(age_extract_string.group())
print(age_extract_string.group(2))
However, the code works when I shorten the sentence by cutting out a few of the tail words. I read up about regex searches hanging because of catastrophic backtracking, but I am not sure how that applies here or how to fix it.
The reason your regex causes slowdown is catastrophic backtracking. It is caused by a sequence of optional patterns inside a quantified group - (([a-z]*)([ -]*))+.
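To get a feel for the blow-up, here is a rough sketch that times the pattern from the question on progressively longer prefixes of the same sentence. The cut-off points (22, 26 and 30 characters) are arbitrary choices for illustration and are kept deliberately small. On CPython's backtracking engine the time should climb steeply as more trailing words are included, although the exact numbers depend on the machine.

```python
import re
import time

# The pattern from the question, unchanged.
pattern = re.compile(r'([:]*[ ]*)?((([a-z]*)([ -]*))+)([ -]+)(year)')
sentence = 'history: four year-old boy was really sad when he found out the toy was broken'

# Illustrative cut-off points only; the full sentence may not finish
# in any reasonable time, so it is not attempted here.
results = []
for end in (22, 26, 30):
    text = sentence[:end]
    start = time.perf_counter()
    match = pattern.search(text)
    elapsed = time.perf_counter() - start
    results.append((text, match.group(2), elapsed))
    print(f'{len(text):2d} chars  {elapsed:.6f}s  group(2)={match.group(2)!r}')
```

Each extra trailing word gives the nested quantifiers more ways to split the same text between iterations, and the engine has to try them all before it backtracks far enough to find "year".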
You may actually match any letters, spaces or hyphens from a : till year:
r':\s*([a-z\s-]*?)\s*-*year'
See the regex demo.
Details
: - a colon
\s* - 0+ whitespaces
([a-z\s-]*?) - Group 1: 0+ lowercase ASCII letters, whitespaces or hyphens
\s* - 0+ whitespaces
-* - 0+ - chars
year - a substring.
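A minimal sketch of how the suggested pattern could be used with both example sentences from the question. The helper name extract_age is hypothetical, and taking the last word of Group 1 is an assumption about what counts as the age when extra words come before it (as in "a lovely eighty").

```python
import re

# Pattern from the answer: a single lazy character class replaces the nested group.
AGE_RE = re.compile(r':\s*([a-z\s-]*?)\s*-*year')

def extract_age(sentence):
    """Return the word right before 'year', or None if nothing matches.

    Hypothetical helper: assumes the age is the last word captured in Group 1.
    """
    match = AGE_RE.search(sentence)
    if match is None:
        return None
    words = match.group(1).split()
    return words[-1] if words else None

sentence = 'history: four year-old boy was really sad when he found out the toy was broken'
print(extract_age(sentence))
print(extract_age('character: a lovely eighty year old grandma'))
```

Because every character between the colon and "year" can be consumed in only one way by the character class, the engine no longer has an exponential number of splits to try, and the full sentence matches immediately.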