Regex extraction of text data between 2 commas in R

Question

I have a bunch of text in a dataframe (df) that usually contains the three lines of an address in one column, and my goal is to extract the district (the central part of the text), e.g.:

73 Greenhill Gardens, Wandsworth, London
22 Acacia Heights, Lambeth, London

Fortunately for me, in 95% of cases the person inputting the data has used commas to separate the text I want, which 100% of the time ends in ", London" (i.e. comma, space, London). To state things clearly, my goal is to extract the text BEFORE ", London" and AFTER the preceding comma.

My desired output is:

Wandsworth
Lambeth

I can manage to extract the part before:

df$extraction <- sub('.*,\\s*','',address)

and after

df$extraction <- sub('.*,\\s*','',address)

But not the middle part that I need. Can someone please help?

Many Thanks!

Rich Scriven · Accepted Answer

You could save yourself the headache of a regular expression and treat the vector like a CSV, using a file reading function to extract the relevant part. We can use read.csv(), taking advantage of the fact that colClasses can be used to drop columns.

address <- c(
    "73 Greenhill Gardens, Wandsworth, London", 
    "22 Acacia Heights, Lambeth, London"
)

read.csv(text = address, colClasses = c("NULL", "character", "NULL"), 
    header = FALSE, strip.white = TRUE)[[1L]]
# [1] "Wandsworth" "Lambeth"

Or we could use fread(). Its select argument is nice and it strips white space automatically.

data.table::fread(paste(address, collapse = "\n"), 
    select = 2, header = FALSE)[[1L]]
# [1] "Wandsworth" "Lambeth"
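
The colClasses vector above is written for three fields per address. If some rows have more or fewer commas, a split-and-index approach may be easier to adapt. The following is only a rough base R sketch, not part of the original answer: the extra sample addresses and the names parts and district are made up for illustration, and it assumes the district is always the field immediately before the last one.

```r
# Sample data: the second and third entries are invented edge cases
address <- c(
    "73 Greenhill Gardens, Wandsworth, London",
    "Flat 2, 22 Acacia Heights, Lambeth, London",
    "London"
)

# Split each address on a comma followed by optional whitespace
parts <- strsplit(address, ",\\s*")

# Take the second-to-last field; return NA when there are fewer than two fields
district <- vapply(parts, function(p) {
    n <- length(p)
    if (n >= 2L) p[n - 1L] else NA_character_
}, character(1))

district
# [1] "Wandsworth" "Lambeth"    NA
```

Returning NA for short rows makes the addresses that were entered without commas easy to find afterwards with is.na(district).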

Jota · Answer

Here are a couple of approaches:

# target ", London" and the start of the string
# up until the first comma followed by a space,
# and replace with ""
gsub("^.+?, |, London", "", address)
#[1] "Wandsworth" "Lambeth"

Or

# target the whole string, but use a capture group 
# for the text before ", London" and after the first comma.
# replace the string with the captured group.
sub(".+, (.*), London", "\\1", address)
#[1] "Wandsworth" "Lambeth"
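
One thing to keep in mind with sub() is that strings which do not match the pattern are returned unchanged, so the few addresses entered without commas would come back whole. A minimal sketch of one way to flag them instead, assuming NA is an acceptable marker; the third sample address and the names pattern and district are hypothetical:

```r
# Sample data: the last entry is an invented address without commas
address <- c(
    "73 Greenhill Gardens, Wandsworth, London",
    "22 Acacia Heights, Lambeth, London",
    "5 High Street London"
)

# Anchored version of the capture-group pattern
pattern <- "^.*, (.*), London$"

# Extract the captured district only where the pattern matches, NA otherwise
district <- ifelse(
    grepl(pattern, address),
    sub(pattern, "\\1", address),
    NA_character_
)

district
# [1] "Wandsworth" "Lambeth"    NA
```

The result can then be assigned to the data frame in the usual way, e.g. df$extraction <- district, provided address has one element per row of df.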
