
Extract substring using regular expression in R

Tags: regex, r, gsub

I am new to regular expressions and have read the regex documentation at http://www.gastonsanchez.com/Handling_and_Processing_Strings_in_R.pdf. I know similar questions have been posted before, but I still had a hard time figuring out my case.

I have a vector of filename strings and am trying to extract a substring from each one to use as the new filename. The filenames follow the pattern below:

\w_\w_(substring to extract)_\d_\d_Month_Date_Year_Hour_Min_Sec_(AM or PM)

For example, for ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM and ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM, the substrings would be MS-15-0452-268 and SP56-01_A.

I used

map(strsplit(filenames, '_'),3)

but it failed, because the substring to extract can contain _ too.

I turned to regular expressions for more advanced matching and came up with this

gsub("^[^\n]+_\\d_\\d_\\d_\\d_(AM | PM)$", "", filenames)

but I still did not get what I needed.

Jian asked Oct 20 '25 10:10


1 Answer

You may use

filenames <- c('ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM', 'ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM')
gsub('^(?:[^_]+_){2}(.+?)_\\d+.*', '\\1', filenames)

This yields

[1] "MS-15-0452-268" "SP56-01_A"    


The pattern here is:
^             # start of the string
(?:[^_]+_){2} # one or more non-_ characters followed by _, twice
(.+?)         # capture anything, lazily
_\\d+         # up to the first _ followed by digits
.*            # consume the rest of the string

Because the pattern matches the whole string, gsub replaces it with the first captured group (\\1), which is the substring in question.
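
One caveat with the lazy match: it stops at the first underscore followed by digits, so a substring that itself contains such a piece (say, a hypothetical SP56_01) would be cut short. A minimal sketch of an alternative, assuming every filename ends in exactly eight numeric fields followed by AM or PM as in the examples above, anchors on that tail instead. The helper name extract_id and the third filename are made up for illustration; perl = TRUE is used so the quantifiers follow PCRE semantics.

```r
# Hypothetical helper: capture everything between the two leading fields
# and the fixed tail of eight numeric fields plus AM/PM.
extract_id <- function(x) {
  sub("^(?:[^_]+_){2}(.+)(?:_\\d+){8}_(?:AM|PM)$", "\\1", x, perl = TRUE)
}

filenames <- c(
  "ABC_DG_MS-15-0452-268_206_281_12_1_2017_1_53_11_PM",
  "ABC_RE_SP56-01_A_206_281_12_1_2017_1_52_34_AM",
  "ABC_XY_SP56_01_206_281_12_1_2017_1_52_34_AM"  # made-up edge case
)

extract_id(filenames)
# "MS-15-0452-268", "SP56-01_A", "SP56_01"

# The lazy pattern stops early on the made-up edge case
sub("^(?:[^_]+_){2}(.+?)_\\d+.*", "\\1", filenames[3], perl = TRUE)
# "SP56"
```

Note that sub returns a string unchanged when the pattern does not match, so it may be worth checking the inputs with grepl first if some filenames could deviate from the assumed layout.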

Jan answered Oct 22 '25 00:10


