I have 10000 descriptions and I want to use regular expressions to extract the number associated with the phrase ``arrested''.
For example:
"police arrests 4 people"
"7 people were arrested". 
The numbers range from 1-99.
I have tried the following code:
gen arrest= regexm(description, "(^[1-9][0-9]$)[ ]*(arrests|arrested)")
I cannot simply extract just the number, because the descriptions also mention numbers that have nothing to do with arrests.
You can use this regex:
(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))
It divides the search into 2 by alternation, whether the number is before or after 'arrests|arrested'.
It creates a non capturing Group, that matches a number from 1-9 (which is optional) and a number from 0-9. This is followed by matching 0 - 20 of any letter and Space (the other Words) before it matches 'arrests OR arrested. It then ORs that with the opposite situation (where the number comes last).
This will match, if the number is within 20 chars from 'arrests|arrested'.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With