I have a bunch of text in a dataframe (df) that usually contains three lines of an address in 1 column and my goal is to extract the district (central part of the text), eg:
73 Greenhill Gardens, Wandsworth, London
22 Acacia Heights, Lambeth, London
Fortunately for me in 95% of cases the person inputing the data has used commas to separate the text I want, which 100% of the time ends ", London" (ie comma space London). To state things clearly therefore my goal is to extract the text BEFORE ", London" and AFTER the preceding comma
My desired output is:
Wandsworth
Lambeth
I can manage to extract the part before:
df$extraction <- sub('.*,\\s*','',address)
and after
df$extraction <- sub('.*,\\s*','',address)
But not the middle part that I need. Can someone please help?
Many Thanks!
You could save yourself the headache of a regular expression and treat the vector like a CSV, using a file reading function to extract the relevant part. We can use read.csv(), taking advantage of the fact that colClasses can be used to drop columns.
address <- c(
"73 Greenhill Gardens, Wandsworth, London",
"22 Acacia Heights, Lambeth, London"
)
read.csv(text = address, colClasses = c("NULL", "character", "NULL"),
header = FALSE, strip.white = TRUE)[[1L]]
# [1] "Wandsworth" "Lambeth"
Or we could use fread(). Its select argument is nice and it strips white space automatically.
data.table::fread(paste(address, collapse = "\n"),
select = 2, header = FALSE)[[1L]]
# [1] "Wandsworth" "Lambeth"
Here are a couple of approaches:
# target ", London" and the start of the string
# up until the first comma followed by a space,
# and replace with ""
gsub("^.+?, |, London", "", address)
#[1] "Wandsworth" "Lambeth"
Or
# target the whole string, but use a capture group
# for the text before ", London" and after the first comma.
# replace the string with the captured group.
sub(".+, (.*), London", "\\1", address)
#[1] "Wandsworth" "Lambeth"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With