I have a dataset consisting of sentences with consecutive word repeats:
DATA:
df <- data.frame(
Turn = c("oh is that that steak i got the other night", # that that
"no no no i 'm dave and you 're alan", # no no no
"yeah i mean the the film was quite long though", # the the
"it had steve martin in it it 's a comedy")) # it it
OBJECTIVE:
What I'd like to obtain are three more columns added to this dataframe:
df$rep_Word: a column specifying the word that gets repeated
df$rep_Pos: a column specifying the first position in the sentence at which the word is repeated
df$rep_Numb: a column specifying the number of times the word gets repeated
So the expected dataframe looks like this:
EXPECTED RESULT:
df
Turn rep_Word rep_Pos rep_Numb
1 oh is that that steak i got the other night that 4 1
2 no no no i 'm dave and you 're alan no 2 2
3 yeah i mean the the film was quite long though the 5 1
4 it had steve martin in it it 's a comedy it 7 1
ATTEMPTED SOLUTION SO FAR:
My hunch is that the repeated word, its position, and the number of repeats can be obtained with strsplit and the function duplicated, e.g.:
df_split <- apply(df, 2, function(x) strsplit(x, "\\s"))
df_split
$Turn
$Turn[[1]]
[1] "oh" "is" "that" "that" "steak" "i" "got" "the" "other" "night"
$Turn[[2]]
[1] "no" "no" "no" "i" "'m" "dave" "and" "you" "'re" "alan"
$Turn[[3]]
[1] "yeah" "i" "mean" "the" "the" "film" "was" "quite" "long" "though"
$Turn[[4]]
[1] "it" "had" "steve" "martin" "in" "it" "it" "'s" "a" "comedy"
For example, for the first sentence in df, duplicated shows which word gets repeated (namely the one for which duplicated evaluates to TRUE), and both the number and the position of the repeat could also be read off that information:
duplicated(df_split$Turn[[1]])
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
The problem is that I don't know how to operationalize duplicated in such a way as to obtain the desired added columns in df. Help with that endeavor is much appreciated.
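One caveat with duplicated is that it also flags non-adjacent repeats: in the fourth sentence the "it" at position 6 is marked TRUE because "it" already occurs at position 1. A base R sketch built on rle, which only looks at runs of adjacent identical words, might look like the following. The helper name find_repeats is made up for illustration, and the sketch assumes it is enough to report the first repeated run in each sentence, which holds for the four sentences in the question.

```r
# Base R sketch: locate consecutive word repeats with rle().
# find_repeats is a hypothetical helper name, not part of any package.
df <- data.frame(
  Turn = c("oh is that that steak i got the other night",
           "no no no i 'm dave and you 're alan",
           "yeah i mean the the film was quite long though",
           "it had steve martin in it it 's a comedy"),
  stringsAsFactors = FALSE
)

find_repeats <- function(sentence) {
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  runs  <- rle(words)                                # runs of identical adjacent words
  first <- cumsum(runs$lengths) - runs$lengths + 1L  # position where each run starts
  hit   <- which(runs$lengths > 1L)                  # runs longer than one word
  if (length(hit) == 0L) {
    # No consecutive repeat in this sentence
    return(data.frame(rep_Word = NA_character_,
                      rep_Pos  = NA_integer_,
                      rep_Numb = NA_integer_,
                      stringsAsFactors = FALSE))
  }
  h <- hit[1]  # assumption: only the first repeated run is reported
  data.frame(rep_Word = runs$values[h],
             rep_Pos  = first[h] + 1L,         # position of the first repeated copy
             rep_Numb = runs$lengths[h] - 1L,  # copies beyond the original word
             stringsAsFactors = FALSE)
}

# Apply to every sentence and bind the three new columns to df
df <- cbind(df, do.call(rbind, lapply(df$Turn, find_repeats)))
df
```

Sentences with several repeated runs would need a list column or a long format instead of one row per sentence.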
Here is another way to solve your problem.
df <- data.frame(
Turn = c("oh is that that steak i got the other night", # that that
"no no no i 'm dave and you 're alan", # no no no
"yeah i mean the the film was quite long though", # the the
"it had steve martin in it it 's a comedy", # it it)
"it had steve martin in in it it 's a comedy",
"yeah i mean the film was quite long though",
"hi hi then other words and hi hi again",
"no no no i 'm dave yes yes and you 're alan no no no no")) # no no no and no no no no
library(data.table)
cols <- c("rep_Word", "rep_Pos", "rep_Numb")
setDT(df)[, (cols) := {
words <- strsplit(as.character(Turn), " ")[[1]]
idx <- rleid(words)
check <- duplicated(idx)
chg <- check - shift(check, fill = FALSE)
starts <- which(chg == 1)
aend <- if(sum(chg) == 0L) which(chg == -1) else c(which(chg == -1), length(chg) + 1L)
freq <- aend - starts
wrd <- words[starts]
no_dup_default <- .(.(NA_character_), .(NA_integer_), .(NA_integer_))
if(length(wrd)) .(.(wrd), .(starts), .(freq)) else no_dup_default
}, seq.int(nrow(df))]
df
# Turn rep_Word rep_Pos rep_Numb
# 1: oh is that that steak i got the other night that 4 1
# 2: no no no i 'm dave and you 're alan no 2 2
# 3: yeah i mean the the film was quite long though the 5 1
# 4: it had steve martin in it it 's a comedy it 7 1
# 5: it had steve martin in in it it 's a comedy in,it 6,8 1,1
# 6: yeah i mean the film was quite long though NA NA NA
# 7: hi hi then other words and hi hi again hi,hi 2,8 1,1
# 8: no no no i 'm dave yes yes and you 're alan no no no no no,yes,no 2, 8,14 2,1,3
#
# or
df[, lapply(.SD, unlist), seq.int(nrow(df))][, -1]
# Turn rep_Word rep_Pos rep_Numb
# 1: oh is that that steak i got the other night that 4 1
# 2: no no no i 'm dave and you 're alan no 2 2
# 3: yeah i mean the the film was quite long though the 5 1
# 4: it had steve martin in it it 's a comedy it 7 1
# 5: it had steve martin in in it it 's a comedy in 6 1
# 6: it had steve martin in in it it 's a comedy it 8 1
# 7: yeah i mean the film was quite long though <NA> NA NA
# 8: hi hi then other words and hi hi again hi 2 1
# 9: hi hi then other words and hi hi again hi 8 1
# 10: no no no i 'm dave yes yes and you 're alan no no no no no 2 2
# 11: no no no i 'm dave yes yes and you 're alan no no no no yes 8 1
# 12: no no no i 'm dave yes yes and you 're alan no no no no no 14 3
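To see what the intermediate vectors in the data.table code hold, it may help to step through them for a single short sentence. The sketch below reuses the same calls (rleid, duplicated, shift) on a hand-picked example that does not end in a repeat, so the extra end position appended in the aend line is not needed; the name ends is used here only for illustration.

```r
library(data.table)  # provides rleid() and shift()

words  <- c("no", "no", "no", "i", "'m", "dave")
idx    <- rleid(words)                        # run id per word: 1 1 1 2 3 4
check  <- duplicated(idx)                     # TRUE where a word repeats the one before it
chg    <- check - shift(check, fill = FALSE)  # +1 where a repeat run starts, -1 right after it ends
starts <- which(chg == 1)                     # first repeated position
ends   <- which(chg == -1)                    # first position after the repeats
freq   <- ends - starts                       # number of repeats
words[starts]                                 # the repeated word
```

Because idx only increases along the sentence, duplicated(idx) can never flag a non-adjacent repeat, which is what makes it safe to use here.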
One purrr, dplyr and tibble option could be:
library(dplyr)
library(purrr)
library(tibble)
bind_cols(df,
map_dfr(strsplit(df$Turn, " ", fixed = TRUE),
~ enframe(., value = "rep_word") %>%
group_by(rleid = with(rle(rep_word), rep(seq_along(lengths), lengths))) %>%
filter(n() > 1) %>%
summarise(rep_word = first(rep_word),
rep_pos = nth(name, 2),
rep_number = n()-1) %>%
select(-rleid) %>%
summarise_all(toString)))
Turn rep_word rep_pos rep_number
1 oh is that that steak i got the other night that 4 1
2 no no no i 'm dave and you 're alan no 2 2
3 yeah i mean the the film was quite long though the 5 1
4 it had steve martin in it it 's a comedy it 7 1