I have the following sentence:
**I**%%AABB%&&**WANT**%%AO%**ONLY**%RA%$**THESE**
I would like to extract only those words that are defined as relevant: I, WANT, ONLY, THESE, WORDS, NEXT, STEP. All other characters (numeric, alpha, special) should be removed from the sentence.
In this case, the resulting sentence would be:
I WANT ONLY THESE.
I have thousands of lines like this one, and each has its own set of characters between the useful words. Is there an efficient way to get rid of them in R?
string <- "**I**%%AABB%&&**WANT**%%AO%**ONLY**%RA%$**THESE**"
# Extract every occurrence of a relevant word, in order of appearance
matched <- regmatches(string, gregexpr("I|WANT|ONLY|THESE|WORDS|NEXT|STEP", string))
matched
[[1]]
[1] "I" "WANT" "ONLY" "THESE"
EDIT: If you then want to convert back to a sentence, say the matches are stored in an object called matched:
sentencify <- function(sentence){
  # Join the words with single spaces and append a final period
  paste0(paste(sentence, collapse=" "), ".")
}
lapply(matched, sentencify)
[[1]]
[1] "I WANT ONLY THESE."
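One caveat with the plain alternation: it matches a keyword anywhere in the string, including inside the filler text, so a filler run such as %RIA% would contribute a stray "I". Below is a minimal sketch of one way around that. It assumes a keyword should only count when it is not touching another letter. The second input line and the names lines, keywords, pattern and sentences are made up for illustration.

```r
# Illustrative input: the second line is invented to show the pitfall
lines <- c("**I**%%AABB%&&**WANT**%%AO%**ONLY**%RA%$**THESE**",
           "**NEXT**%RIA%**STEP**")
keywords <- c("I", "WANT", "ONLY", "THESE", "WORDS", "NEXT", "STEP")

# Lookarounds reject a keyword that touches another letter on either side
pattern <- paste0("(?<![A-Za-z])(?:",
                  paste(keywords, collapse = "|"),
                  ")(?![A-Za-z])")

# perl = TRUE is needed because lookbehind is a PCRE feature
matched <- regmatches(lines, gregexpr(pattern, lines, perl = TRUE))

# One sentence per input line
sentences <- vapply(matched,
                    function(w) paste0(paste(w, collapse = " "), "."),
                    character(1))
sentences
```

Both gregexpr and regmatches are vectorized over the input, so the same call should work on a character vector holding thousands of lines.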
Here is one approach, assuming you have a list to check against:
> mystring2 <- "**I**%%AABB%&&**WANT**%%AO%**ONLY**%RA%$**THESE**"
> mystring2
[1] "**I**%%AABB%&&**WANT**%%AO%**ONLY**%RA%$**THESE**"
> temp <- strsplit(mystring2, "[^a-zA-Z]")[[1]]
> myWords <- c("I", "WANT", "ONLY", "THESE", "WORDS", "NEXT", "STEP")
> temp[temp %in% myWords]
[1] "I" "WANT" "ONLY" "THESE"
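Since the question mentions thousands of lines, here is a minimal sketch of the same split-and-filter idea applied to a whole character vector at once. The second input line and the names lines, pieces, kept and result are made up for illustration. Note that this approach only keeps a word when the entire run of letters equals a keyword.

```r
# Illustrative input: the second line is invented for the example
lines <- c("**I**%%AABB%&&**WANT**%%AO%**ONLY**%RA%$**THESE**",
           "**NEXT**%RIA%**STEP**")
myWords <- c("I", "WANT", "ONLY", "THESE", "WORDS", "NEXT", "STEP")

# strsplit is vectorized: it returns one vector of letter runs per line
pieces <- strsplit(lines, "[^a-zA-Z]+")

# Keep only the runs that are exactly one of the relevant words
kept <- lapply(pieces, function(p) p[p %in% myWords])

# Collapse each line's words back into a single string
result <- vapply(kept, paste, character(1), collapse = " ")
result
```

Splitting on "[^a-zA-Z]+" instead of "[^a-zA-Z]" avoids producing an empty string for every consecutive separator, which keeps the intermediate vectors small.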