Regex extraction of text data between 2 commas in R

Tags: regex, r, stringr

I have a data frame (df) with a text column that usually holds a three-part address, and my goal is to extract the district (the middle part of the text), e.g.:

73 Greenhill Gardens, Wandsworth, London
22 Acacia Heights, Lambeth, London

Fortunately, in 95% of cases the person inputting the data has used commas to separate the text I want, and the address always ends with ", London" (i.e. comma, space, London). To state it clearly: my goal is to extract the text BEFORE ", London" and AFTER the preceding comma.

My desired output is:

Wandsworth
Lambeth

I can manage to extract the part before the first comma:

df$extraction <- sub(',.*', '', address)

and the part after the last comma:

df$extraction <- sub('.*,\\s*','',address)

But not the middle part that I need. Can someone please help?

Many Thanks!

RichS asked Nov 29 '25 00:11



2 Answers

You could save yourself the headache of a regular expression and treat the vector like a CSV, using a file reading function to extract the relevant part. We can use read.csv(), taking advantage of the fact that colClasses can be used to drop columns.

address <- c(
    "73 Greenhill Gardens, Wandsworth, London", 
    "22 Acacia Heights, Lambeth, London"
)

read.csv(text = address, colClasses = c("NULL", "character", "NULL"), 
    header = FALSE, strip.white = TRUE)[[1L]]
# [1] "Wandsworth" "Lambeth"   

Or we could use fread() from data.table. Its select argument picks the column directly, and it strips surrounding white space by default.

data.table::fread(paste(address, collapse = "\n"), 
    select = 2, header = FALSE)[[1L]]
# [1] "Wandsworth" "Lambeth" 
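
Both calls above pick the second field, so they assume the district always sits in that position. As a rough sketch of the same "split on commas" idea for less tidy data, the snippet below takes the field just before a trailing "London" and returns NA otherwise. The four-part address and the comma-free row are hypothetical examples added for illustration, and `messy` and `district` are made-up names.

```r
# Hypothetical inputs: a standard row, a four-part row, and a row without commas
messy <- c(
    "73 Greenhill Gardens, Wandsworth, London",
    "Flat 2, 22 Acacia Heights, Lambeth, London",
    "10 Example Street London"
)

# Split each address on commas, then keep the field just before "London".
# Rows with fewer than three fields, or not ending in "London", give NA.
district <- vapply(strsplit(messy, ","), function(p) {
    p <- trimws(p)          # drop the space that follows each comma
    n <- length(p)
    if (n >= 3L && p[n] == "London") p[n - 1L] else NA_character_
}, character(1))

district
# [1] "Wandsworth" "Lambeth"    NA
```

Getting NA for the rows that do not follow the comma convention makes them easy to find afterwards with is.na().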
Rich Scriven answered Nov 30 '25 14:11



Here are a couple of approaches:

# target ", London" and the start of the string
# up until the first comma followed by a space,
# and replace with ""
gsub("^.+?, |, London", "", address)
#[1] "Wandsworth" "Lambeth" 

Or

# target the whole string, but use a capture group
# for the text between ", London" and the comma that precedes it
# (the leading .+ is greedy, so it runs up to that last comma).
# replace the string with the captured group.
sub(".+, (.*), London", "\\1", address)
#[1] "Wandsworth" "Lambeth" 
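
One caveat with sub() and gsub() is that a string that does not match the pattern comes back unchanged, which matters for the roughly 5% of rows without commas. A possible variation, sketched here in base R, extracts the match instead of replacing around it, so non-matching rows become NA. The third address is a hypothetical comma-free row, and `addr`, `m`, `hit` and `district` are illustrative names.

```r
# Hypothetical inputs: two standard rows and one row without commas
addr <- c(
    "73 Greenhill Gardens, Wandsworth, London",
    "22 Acacia Heights, Lambeth, London",
    "10 Example Street London"
)

# Match a run of non-comma characters that is preceded by a comma
# and followed by ", London" at the end of the string.
# Lookbehind and lookahead need perl = TRUE.
m <- regexpr("(?<=,)[^,]+(?=, London$)", addr, perl = TRUE)
hit <- as.integer(m) != -1L

# regmatches() returns only the successful matches, so fill them
# into a vector that defaults to NA for the non-matching rows.
district <- rep(NA_character_, length(addr))
district[hit] <- trimws(regmatches(addr, m))

district
# [1] "Wandsworth" "Lambeth"    NA
```

Since the question is tagged stringr: stringr::str_extract() with the same pattern should behave similarly, as it returns NA for non-matches, though the result would still need trimming.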
Jota answered Nov 30 '25 15:11



