Using the XML package and XPath to scrape addresses from websites, I sometimes can get only a string that has embedded in it the zip code I want. It is straightforward to extract the zip code, but sometimes there are other five-digit strings that show up.
Here are some variations on the problem in a data frame.
zips <- data.frame(id = seq(1, 5), address = c("Company, 18540 Main Ave., City, ST 12345", "Company 18540 Main Ave. City ST 12345-0000", "Company 18540 Main Ave. City State 12345", "Company, 18540 Main Ave., City, ST 12345 USA", "Company, One Main Ave Suite 18540, City, ST 12345")) 
The R statement to extract zip codes (both 5 digit and plus 4 digits) is below, but it is tricked by the faux zip codes of the street number and the suite number (and there may be other possibilities in other address strings).
regmatches(zips$address, gregexpr("\\d{5}([-]?\\d{4})?", zips$address, perl = TRUE))
An answer to a previous SO question, "Extracting a zip code from an address string", suggested that the following "regex will return the last consecutive five digit string. It uses a negative look-ahead to ensure the absence of 5-digit strings after the one being returned."
\b\d{5}\b(?!.*\b\d{5}\b)
But that question and answer deal with PHP and offer an if statement with preg_match(). I am not familiar with that language and its tools, but the idea might be right.
My question: what R code will find real zip codes and ignore false lookalikes?
This is my first regex answer (I am still learning), so hopefully I don't say anything wrong that leads you in the wrong direction.
Basically, as you hinted in your question, this regex looks for the last string that looks like a zip code, that is, one that is not followed by another string that looks like a zip code.
The basic syntax is pattern(?!.*pattern), which says to match pattern only if it is not followed by any run of characters (.*) and then pattern again. The (?! ) part is the syntax for a negative look-ahead assertion.
So we can replace pattern with what you are interested in finding:
[0-9]{5}(-[0-9]{4})?
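That is, a digit string [0-9] of exactly five characters {5}, which may optionally (?) be followed by another group defined as a hyphen and another digit string of length four: (-[0-9]{4}). Note that nothing here anchors the match to digit boundaries, so five digits inside a longer run of digits would also match.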
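As a quick check of the pattern(?!.*pattern) idea on its own, here is a minimal sketch with a toy pattern (a lowercase letter followed by a digit) and a made-up string; neither comes from your data. If the reasoning above is right, only the last letter-digit pair should survive the look-ahead.

```r
# Toy input and toy pattern, chosen only for illustration
x <- "a1 b2 c3"
pattern <- "[a-z][0-9]"

# Match `pattern` only where no later `pattern` follows it
last_only <- paste0(pattern, "(?!.*", pattern, ")")

# Without the look-ahead, every pair matches
all_hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

# With the look-ahead, only the final pair is left
last_hit <- regmatches(x, gregexpr(last_only, x, perl = TRUE))[[1]]

print(all_hits)
print(last_hit)
```

Note that perl = TRUE is needed, since look-ahead assertions are not supported by R's default regex engine.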
Putting it all together, with gregexpr to search for the matches and regmatches to extract them, I get:
zips <- data.frame(id = seq(1, 5), address = c("Company, 18540 Main Ave., City, ST 12345", "Company 18540 Main Ave. City ST 12345-0000", "Company 18540 Main Ave. City State 12345", "Company, 18540 Main Ave., City, ST 12345 USA", "Company, One Main Ave Suite 18540, City, ST 12345")) 
regmatches(zips$address,
           gregexpr('[0-9]{5}(-[0-9]{4})?(?!.*[0-9]{5}(-[0-9]{4})?)', zips$address, perl = TRUE))
# [[1]]
# [1] "12345"
# 
# [[2]]
# [1] "12345-0000"
# 
# [[3]]
# [1] "12345"
# 
# [[4]]
# [1] "12345"
# 
# [[5]]
# [1] "12345"
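If you want the result as a plain character vector (for example, to add as a column), the same idea could be wrapped in a small helper. This is only a sketch: extract_last_zip is a hypothetical name, and the digit guards (?<!\\d) and (?!\\d) are my own addition so that five digits inside a longer number are not picked up. Addresses without any candidate come back as NA.

```r
# Hypothetical helper: returns the last zip-like string in each address,
# or NA when no candidate is found
extract_last_zip <- function(x) {
  x <- as.character(x)  # data.frame columns may be factors in older R versions
  # Five digits, optional -dddd suffix, not touching other digits (an assumption)
  zip <- "(?<!\\d)\\d{5}(?:-\\d{4})?(?!\\d)"
  # Keep a candidate only if no other candidate follows it
  pattern <- paste0(zip, "(?!.*", zip, ")")
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)  # regmatches() drops the non-matches
  out
}

addresses <- c("Company, 18540 Main Ave., City, ST 12345",
               "Company 18540 Main Ave. City ST 12345-0000",
               "Company, One Main Ave Suite 18540m, City, ST 12345",
               "company 12345678")
last_zips <- extract_last_zip(addresses)
print(last_zips)
```

With the data frame above, this could be used as zips$zip <- extract_last_zip(zips$address).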
The qdapRegex package has the rm_zip function for this:
library(qdapRegex)

zips <- data.frame(id = seq(1, 5), 
    address = c("Company, 18540 Main Ave., City, ST 12345", 
    "Company 18540 Main Ave. City ST 12345-0000", 
    "Company 18540 Main Ave. City State 12345", 
    "Company, 18540 Main Ave., City, ST 12345 USA", 
    "Company, One Main Ave Suite 18540, City, ST 12345")
)
lapply(rm_zip(zips$address, extract=TRUE), tail, 1)
## [[1]]
## [1] "12345"
## 
## [[2]]
## [1] "12345-0000"
## 
## [[3]]
## [1] "12345"
## 
## [[4]]
## [1] "12345"
## 
## [[5]]
## [1] "12345"
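If installing qdapRegex is not an option, the same "extract everything, then keep the last hit" idea can be sketched in base R. The pattern below is a simplified version of the one in the question, not the regex that rm_zip uses internally, and last_match is a hypothetical helper name.

```r
addresses <- c("Company, 18540 Main Ave., City, ST 12345",
               "Company 18540 Main Ave. City ST 12345-0000",
               "Company, 18540 Main Ave., City, ST 12345 USA",
               "Company with no zip code")

# Simplified zip pattern (an assumption): five digits, optional -dddd suffix
zip_pattern <- "\\d{5}(-\\d{4})?"

# All candidates per address, as a list of character vectors
hits <- regmatches(addresses, gregexpr(zip_pattern, addresses, perl = TRUE))

# Keep the last candidate, or NA when there is none
last_match <- function(m) if (length(m) > 0) m[length(m)] else NA_character_
last_zips <- vapply(hits, last_match, character(1))
print(last_zips)
```

This relies on the zip code being the last five-digit string in the address, which holds for the examples here but may not hold for every scraped string.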
EDIT Per @lawyeR's comments:
I think that you want some regex that is more specific than the dictionary system used by qdapRegex.  The current implementation of rm_zip allows for validation purposes and thus I wouldn't alter the regular expression it uses to be more flexible.  I also wouldn't alter the function rm_zip to have additional parameters/arguments as qdapRegex attempts to have consistently operating functions.
That being said, you could create your own function using the rm_ function and supply your own regular expression.  I have done this using both of the parameters specified in your comment:
More complex data set:
zips <- data.frame(id = seq(1, 6), 
    address = c("Company, 18540 Main Ave., City, ST 12345", 
    "Company 18540 Main Ave. City ST 12345-0000", 
    "Company 18540 Main Ave. City State 12345", 
    "Company, 18540 Main Ave., City, ST 12345 USA", 
    "Company, One Main Ave Suite 18540m, City, ST 12345",
    "company 12345678")
)
Function to grab the zip even if a character follows it
## paste together a more flexible regular expression    
pat <- pastex(
    "@rm_zip", 
    "(?<!\\d)\\d{5}(?!\\d)",
    "(?<!\\d)\\d{5}-\\d{4}(?!\\d)"
)
## Create your own function with extract set to TRUE
rm_zip2 <- rm_(pattern=pat, extract=TRUE)
rm_zip2(zips$address)
## [[1]]
## [1] "18540" "12345"
## 
## [[2]]
## [1] "18540"      "12345-0000"
## 
## [[3]]
## [1] "18540" "12345"
## 
## [[4]]
## [1] "18540" "12345"
## 
## [[5]]
## [1] "18540" "12345"
## 
## [[6]]
## [1] NA
Function to extract just 5 digit zips
rm_zip3 <- rm_(pattern="(?<!\\d)\\d{5}(?!\\d)", extract=TRUE)
rm_zip3(zips$address)
## [[1]]
## [1] "18540" "12345"
## 
## [[2]]
## [1] "18540" "12345"
## 
## [[3]]
## [1] "18540" "12345"
## 
## [[4]]
## [1] "18540" "12345"
## 
## [[5]]
## [1] "18540" "12345"
## 
## [[6]]
## [1] NA
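To see what the (?<!\\d) and (?!\\d) guards buy you, here is a small base R sketch comparing them with the more common \\b word boundaries. The three strings are shortened, made-up variants of the addresses above, and the variable names are only for illustration.

```r
x <- c("Suite 18540m, City, ST 12345",
       "company 12345678",
       "City, ST 12345-0000")

# Guard against neighbouring digits only: a trailing letter is allowed
digit_guard <- "(?<!\\d)\\d{5}(?!\\d)"
# Guard with word boundaries: a trailing letter blocks the match
word_guard <- "\\b\\d{5}\\b"

by_digit <- regmatches(x, gregexpr(digit_guard, x, perl = TRUE))
by_word  <- regmatches(x, gregexpr(word_guard, x, perl = TRUE))

print(by_digit)
print(by_word)
```

Both versions skip the eight-digit run in the second string, but only the digit guards pick up "18540" when it is followed directly by a letter.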