
Efficient coding - R regular expression replicate line for each match

Tags: regex, r, reshape

I've been wrangling some data based on a free-text column. I want to identify a set of particular strings in this text, create a column holding each match, and then replicate the row when a field contains multiple matches. I have achieved this as follows (apologies to anyone not feeling festive):

# Example data frame
library(stringr)
library(reshape2)  # assumed source of melt(); the question does not load it explicitly
dats <- data.frame(ID = 1:5, text = c("rudolph", "rudolph the", "rudolph the red", "rudolph the red nosed", "rudolph the red nosed reindeer"))
dats

# Regular expression: alternation of all target strings
patt <- c("rudolph", "the", "red", "nosed", "reindeer")
reg.patt <- paste(patt, collapse = "|")
# simplify = TRUE returns a character matrix, one column per match, padded with ""
dats$matched <- str_extract_all(dats$text, reg.patt, simplify = TRUE)

# Reshape data: one row per match
dats2 <- data.frame(ID = dats$ID, text = dats$text, match1 = dats$matched[, 1], match2 = dats$matched[, 2], match3 = dats$matched[, 3], match4 = dats$matched[, 4], match5 = dats$matched[, 5])
dats3 <- melt(dats2, id.vars = c("ID", "text"))
dats3 <- dats3[dats3$value != "", ]  # drop the "" padding cells
dats3$variable <- NULL
dats3 <- dats3[order(dats3$ID), ]
dats3

This works fine, but I'm sure there is a much more efficient way to do it. Does anyone have any suggestions?

Merry Christmas!

D.Singleton asked Mar 12 '26 22:03


2 Answers

Try cSplit from the splitstackshape package:

library(splitstackshape)
dats$value <- lapply(str_extract_all(dats$text, reg.patt), toString)
cSplit(dats, 'value', direction="long")
#     ID                           text    value
#  1:  1                        rudolph  rudolph
#  2:  2                    rudolph the  rudolph
#  3:  2                    rudolph the      the
#  4:  3                rudolph the red  rudolph
#  5:  3                rudolph the red      the
#  6:  3                rudolph the red      red
#  7:  4          rudolph the red nosed  rudolph
#  8:  4          rudolph the red nosed      the
#  9:  4          rudolph the red nosed      red
# 10:  4          rudolph the red nosed    nosed
# 11:  5 rudolph the red nosed reindeer  rudolph
# 12:  5 rudolph the red nosed reindeer      the
# 13:  5 rudolph the red nosed reindeer      red
# 14:  5 rudolph the red nosed reindeer    nosed
# 15:  5 rudolph the red nosed reindeer reindeer
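
For comparison, the same long layout can be built without splitstackshape. This is a minimal base R sketch, assuming the same example data and pattern as in the question; the names hits and long are introduced here for illustration only.

```r
library(stringr)

dats <- data.frame(
  ID = 1:5,
  text = c("rudolph", "rudolph the", "rudolph the red",
           "rudolph the red nosed", "rudolph the red nosed reindeer"),
  stringsAsFactors = FALSE
)
reg.patt <- paste(c("rudolph", "the", "red", "nosed", "reindeer"), collapse = "|")

# Without simplify = TRUE, str_extract_all() returns a list:
# one character vector of matches per input row
hits <- str_extract_all(dats$text, reg.patt)

# Repeat each row once per match, then attach the matches in order
long <- dats[rep(seq_len(nrow(dats)), lengths(hits)), ]
long$value <- unlist(hits)
rownames(long) <- NULL
long
```

Rows with no match are repeated zero times, so they drop out, which mirrors the value != "" filter in the question.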
Pierre L answered Mar 15 '26 11:03


Try this:

library(quanteda)

s <- "rudolph the red nosed reindeer"

words <- strsplit(s, " ")[[1]]
do.call(rbind, lapply(words, kwic, x = s))

giving:

                        contextPre  keyword              contextPost
[text1, 1]                       [  rudolph ] the red nosed reindeer
[text1, 2]               rudolph [      the     ] red nosed reindeer
[text1, 3]           rudolph the [      red         ] nosed reindeer
[text1, 4]       rudolph the red [    nosed               ] reindeer
[text1, 5] rudolph the red nosed [ reindeer                       ] 
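
The call above passes a raw string to kwic(). As far as I know, recent quanteda releases (version 3 onward) expect a tokens object together with a pattern argument, so the snippet may need adapting there. A rough sketch under that assumption follows; the names toks and kw are illustrative, and the printed layout will differ from the output shown above.

```r
library(quanteda)

s <- "rudolph the red nosed reindeer"
words <- strsplit(s, " ")[[1]]

# Tokenise first; kwic() then accepts all keywords at once via `pattern`
toks <- tokens(s)
kw <- kwic(toks, pattern = words)

# One row per keyword occurrence, with its surrounding context
as.data.frame(kw)
```

Passing the whole vector as the pattern avoids the lapply/rbind step, since a single call returns every keyword occurrence.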
G. Grothendieck answered Mar 15 '26 13:03


