I'm trying to extract data from sentences such as:
"monthly payment of 525 and 5000 drive off"
using Python's regex search function, re.search().
My regex pattern for the down payment is as follows:
match1 = r"(?P<down_payment>\d+)\s*(|\$|dollars*|money)*\s*" + \
         r"(down|drive(\s|-)*off|due\s*at\s*signing|drive\s*-*\s*off)*"
My problem is that it matches the wrong numerical value as the down payment: it picks up both 525 and 5000.
How can I improve my regex string such that it only matches an element if another element is successfully matched as well?
In this case, for example, both 5000 and "drive off" matched, so we can extract 5000 as down_payment. But 525 is not followed by any of the down payment keywords, so it should not be considered at all.
Clearer explanation here
The point is that you want to match a sequence of patterns. To make sure the trailing patterns are taken into account, they cannot all be optional. In your pattern, \s*, (|\$|dollars*|money)*, \s* and (down|drive(\s|-)*off|due\s*at\s*signing|drive\s*-*\s*off)* can each match an empty string, so \d+ alone is enough for a successful match and re.search() stops at the first number it finds, 525.
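To make the problem concrete, here is a small sketch that runs the pattern from the question against the sample sentence. The variable names (sentence, original, first, all_numbers) are only illustrative and are not part of the original code.

```python
import re

sentence = "monthly payment of 525 and 5000 drive off"

# Pattern from the question: everything after \d+ may match an empty string.
original = (
    r"(?P<down_payment>\d+)\s*(|\$|dollars*|money)*\s*"
    r"(down|drive(\s|-)*off|due\s*at\s*signing|drive\s*-*\s*off)*"
)

# re.search() returns the first match, which is the bare number 525.
first = re.search(original, sentence)
print(first.group("down_payment"))

# Scanning the whole sentence shows that every number qualifies.
all_numbers = [m.group("down_payment") for m in re.finditer(original, sentence)]
print(all_numbers)
```

Because the keyword group is optional, the regex has no reason to skip 525.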
I suggest removing the final * quantifier to match exactly one occurrence of the pattern:
(?P<down_payment>\d+)\s*(?:\$|dollars*|money)?\s*(down|drive[\s-]*off|due\s*at\s*signing|drive\s*-*\s*off)
See the regex demo
Also note that I contracted the (\s|-) group into a character class, [\s-], because it only alternates single-character patterns. I also turned (|\$|dollars*|money)* into a non-capturing optional group, (?:\$|dollars*|money)?, that matches just 1 or 0 occurrences of $, dollar(s) or money.
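As a rough sketch of how the corrected pattern could be used from Python: the helper name extract_down_payment and the constant DOWN_PAYMENT are made up for this example. It also makes the keyword group non-capturing and drops the drive\s*-*\s*off alternative, on the assumption that drive[\s-]*off already covers it.

```python
import re

# The keyword group is now mandatory, so a number only matches
# when a down payment keyword follows it.
DOWN_PAYMENT = re.compile(
    r"(?P<down_payment>\d+)\s*(?:\$|dollars*|money)?\s*"
    r"(?:down|drive[\s-]*off|due\s*at\s*signing)"
)

def extract_down_payment(text):
    """Return the down payment as an int, or None if no keyword follows a number."""
    m = DOWN_PAYMENT.search(text)
    return int(m.group("down_payment")) if m else None

print(extract_down_payment("monthly payment of 525 and 5000 drive off"))
print(extract_down_payment("monthly payment of 525"))
```

With the sample sentence, 525 is skipped because it is followed by "and" rather than one of the keywords. If the text may contain mixed case (for example "Drive Off"), passing re.IGNORECASE to re.compile() is one option.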