I was wondering whether it is possible to remove duplicate sentences, or even duplicated blocks of text (a repeated set of sentences), from a data frame in R. In my specific case, imagine I have saved the posts of a forum but did not mark where a person quoted an earlier post, and I now want to remove all quotes from the cells containing the posts. Thanks for any tips or hints.
An example could look something like this:
names <- c("Richard", "Mortimer", "Elizabeth", "Jeremiah")
posts <- c("I'm trying to find a solution for a problem with my neighbour, she keeps mowing the lawn on sundays when I'm trying to sleep in from my night shift", "Personally, I like to deal with annoying neighbours by just straight up confronting them. Don't shy away. There are always ways to work things out.", "Personally, I like to deal with annoying neighbours by just straight up confronting them. Don't shy away. There are always ways to work things out. That sounds quite aggressive. How about just talking to them in a friendly way, first?", "That sounds quite aggressive. How about just talking to them in a friendly way, first? Didn't mean to sound aggressive, rather meant just being straightforward, if that makes any sense")
duplicateposts <- data.frame(names, posts)
The desired result, with the quoted text removed, would be:
posts2 <- c("I'm trying to find a solution for a problem with my neighbour, she keeps mowing the lawn on sundays when I'm trying to sleep in from my night shift", "Personally, I like to deal with annoying neighbours by just straight up confronting them. Don't shy away. There are always ways to work things out.", "That sounds quite aggressive. How about just talking to them in a friendly way, first?", "Didn't mean to sound aggressive, rather meant just being straightforward, if that makes any sense")
postsnoduplicates <- data.frame(names, posts2)
I think you need to split each post at sentence boundaries with strsplit, drop the sentences that have already appeared, then paste what is left back together. Something like:
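# Split each post after ., ? or ! whenever more text follows
spl <- strsplit(as.character(duplicateposts$posts), "(?<=[.?!])(?=.)", perl=TRUE)
# Strip the leading space left on each sentence
spl <- lapply(spl, trimws)
# Long format: one row per sentence (values), labelled by author (ind)
spl <- stack(setNames(spl, duplicateposts$names))
# Keep the first occurrence of each sentence, then paste back per author
aggregate(values ~ ind, data=spl[!duplicated(spl$values),], FUN=paste, collapse=" ")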
Resulting in:
# ind values
#1 Richard I'm trying to find a solution for a problem with my neighbour, she keeps mowing the lawn on sundays when I'm trying to sleep in from my night shift
#2 Mortimer Personally, I like to deal with annoying neighbours by just straight up confronting them. Don't shy away. There are always ways to work things out.
#3 Elizabeth That sounds quite aggressive. How about just talking to them in a friendly way, first?
#4 Jeremiah Didn't mean to sound aggressive, rather meant just being straightforward, if that makes any sense
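One caveat with the approach above: it groups by name, so two posts by the same person would be merged into one row, and a post made up entirely of quoted sentences would vanish from the result. Here is a minimal sketch that keys on row position instead, assuming the same sentence-splitting pattern is good enough for your data. `remove_quoted_sentences` is a hypothetical helper name of my own, not a base R function:

```r
# Hypothetical helper: drop every sentence that already appeared in an
# earlier post (or earlier in the same post), keeping one result per post.
remove_quoted_sentences <- function(posts) {
  # Split after ., ? or ! whenever more text follows, then trim whitespace
  sentences <- strsplit(as.character(posts), "(?<=[.?!])(?=.)", perl = TRUE)
  sentences <- lapply(sentences, trimws)

  # Flatten to one long vector and remember which post each sentence came from
  post_id <- rep(seq_along(sentences), lengths(sentences))
  flat <- unlist(sentences)

  # Keep only the first occurrence of each sentence across all posts
  keep <- !duplicated(flat)

  # Regroup by post position; posts with nothing left become ""
  groups <- factor(post_id[keep], levels = seq_along(sentences))
  pieces <- split(flat[keep], groups)
  vapply(pieces, paste, character(1), collapse = " ", USE.NAMES = FALSE)
}

# Small example with a post that is nothing but a quote
example_posts <- c("Hello there. How are you?", "Hello there.")
remove_quoted_sentences(example_posts)
```

With your data you could then write `duplicateposts$posts <- remove_quoted_sentences(duplicateposts$posts)`, which keeps the row count and the names column untouched. Note that this only matches sentences that are quoted character for character; a quote that was edited or truncated would not be caught.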