I'm working with video transcript data. The data was automatically exported with line breaks in the middle of sentences. I'd like to combine each speaker's spoken lines into a single row. The data is formatted like this:
data <- data.frame(transcript = c("00:00:03.990 --> 00:00:05.270",
"<v Bill>I'm here to take some notes. I've",
"heard this will be interesting.</v>",
"00:00:05.770 --> 00:00:07.370",
"<v Charlie>I believe you'll be correct",
"about that, Bill.</v>",
"00:00:10.810 --> 00:00:11.170",
"<v Bill>Awesome.</v>"))
Intended output:
intendedData <- data.frame(transcript = c("00:00:03.990 --> 00:00:05.270",
"<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>",
"00:00:05.770 --> 00:00:07.370",
"<v Charlie>I believe you'll be correct about that, Bill.</v>",
"00:00:10.810 --> 00:00:11.170",
"<v Bill>Awesome.</v>"))
I've tried conditional statements for rows that start with <v and end with </v>, but that didn't work. Any ideas will be greatly appreciated. Thank you!
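An approach using strsplit and paste (same idea as @Allan Cameron, but a different execution): collapse all lines into one string, split it on the <v and </v> tags, then put the tags back on every piece that is not a timestamp.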
tmp <- trimws(strsplit(paste(data$transcript, collapse=" "), "<v|<\\/v>")[[1]])
ifelse(grepl("\\d{2}:\\d{2}:\\d{2}\\.\\d{3}", tmp), tmp, paste0("<v ", tmp, "</v>"))
[1] "00:00:03.990 --> 00:00:05.270"
[2] "<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>"
[3] "00:00:05.770 --> 00:00:07.370"
[4] "<v Charlie>I believe you'll be correct about that, Bill.</v>"
[5] "00:00:10.810 --> 00:00:11.170"
[6] "<v Bill>Awesome.</v>"
Without a temporary variable (needs R 4.1 or later for |> and \(x)):
trimws(strsplit(paste(data$transcript, collapse=" "), "<v|<\\/v>")[[1]]) |>
(\(x) ifelse(grepl("\\d{2}:\\d{2}:\\d{2}\\.\\d{3}", x), x, paste0("<v ", x, "</v>")))()
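If collapsing the whole transcript into one string feels fragile, the same result can probably be reached line by line. The following is only a sketch under the assumption that every new row begins either with a timestamp or with an opening <v tag; the names lines, starts and merged are made up for illustration and are not part of the original answer.

```r
# Same sample lines as in the question, as a plain character vector.
lines <- c("00:00:03.990 --> 00:00:05.270",
           "<v Bill>I'm here to take some notes. I've",
           "heard this will be interesting.</v>",
           "00:00:05.770 --> 00:00:07.370",
           "<v Charlie>I believe you'll be correct",
           "about that, Bill.</v>",
           "00:00:10.810 --> 00:00:11.170",
           "<v Bill>Awesome.</v>")

# Assumption: a line starts a new row if it is a timestamp
# or if it opens a <v ...> tag; anything else is a continuation.
starts <- grepl("^\\d{2}:\\d{2}:\\d{2}\\.\\d{3} -->|^<v ", lines)

# cumsum() turns the start flags into a group id, so continuation
# lines inherit the id of the line they belong to.
groups <- split(lines, cumsum(starts))

# Paste each group back together with a single space.
merged <- unname(vapply(groups, paste, character(1), collapse = " "))

print(merged)
```

The merged vector can then be put back into a data frame with data.frame(transcript = merged). Unlike the strsplit version, this never removes and re-adds the tags, but it does rely on the start-of-row assumption above holding for the whole file.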