
Word count across subset of columns in one new column

Tags:

r

I have the following data frame:

structure(list(g = c("1", "2", "3"), x = c("This is text.", "This is text too.", 
"This is no text"), y = c("What is text?", "Can it eat text?", 
"Maybe I will try.")), class = "data.frame", row.names = c(NA, 
-3L))

I would like to count the number of words in columns x and y and add them up, so that one new column holds the total number of words per row. It is important that I am able to subset the data. The result should look like this:

structure(list(g = c("1", "2", "3"), x = c("This is text.", "This is text too.", 
"This is no text"), y = c("What is text?", "Can it eat text?", 
"Maybe I will try."), z = c("6", "8", "8")), class = "data.frame", row.names = c(NA, 
-3L))

I have tried using str_count(" ") with different regular expressions in combination with across or apply, but I have not managed to get a working solution.

I did not anticipate in my original question that columns with NA cells in them would be problematic, but they are. So any solution needs to be able to handle NA cells as well.

flxflks asked Nov 18 '25 15:11


2 Answers

Here is a solution using tokenizers:

library(tokenizers)

df <- 
  structure(list(g = c("1", "2", "3"), x = c("This is text.", "This is text too.", 
  "This is no text"), y = c("What is text?", "Can it eat text?", 
  "Maybe I will try.")), class = "data.frame", row.names = c(NA, 
  -3L))

df$z = tokenizers::count_words(df$x) + tokenizers::count_words(df$y)

df
#>   g                 x                 y z
#> 1 1     This is text.     What is text? 6
#> 2 2 This is text too.  Can it eat text? 8
#> 3 3   This is no text Maybe I will try. 8

If you prefer pure R:

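# Count words per cell in each selected column, then add the counts row-wise.
# gregexpr() returns -1 when nothing matches, so that case counts as 0 words.
df$z <- rowSums(
  sapply(df[, c("x", "y")], function(col)
    sapply(gregexpr("\\b\\w+\\b", col), function(m)
      if (m[[1]] > 0) length(m) else 0)))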

Note that \w+ matches a run of word characters (a word) and \b matches a word boundary. Because \w+ is greedy and already matches whole runs, I believe "\w+" alone suffices.
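
The question also asks for NA cells to be handled. The base R snippet above is likely to stop with an error on an NA cell, because gregexpr() returns NA for it and if() cannot evaluate NA > 0. Below is a minimal sketch of one way around that, assuming an NA cell should contribute zero words. The helper name count_words_na and the sample data df_na are illustrative and not part of the original answer.

```r
# Hypothetical helper: count words per element, treating NA as zero words.
count_words_na <- function(x) {
  x <- as.character(x)
  x[is.na(x)] <- ""                    # assumption: NA contributes 0 words
  matches <- gregexpr("\\w+", x)       # a single -1 marks "no match"
  vapply(matches, function(m) sum(m > 0), integer(1))
}

# Illustrative data: the question's data frame with two cells set to NA.
df_na <- data.frame(
  g = c("1", "2", "3"),
  x = c("This is text.", NA, "This is no text"),
  y = c("What is text?", "Can it eat text?", NA),
  stringsAsFactors = FALSE
)

cols <- c("x", "y")                    # subset of columns to count
# Count each selected column, then add the per-column counts element-wise.
df_na$z <- Reduce(`+`, lapply(df_na[cols], count_words_na))

df_na$z
#> [1] 6 4 4
```

Changing cols is enough to count a different subset of columns. Reduce() is used here instead of rowSums(sapply(...)) so that the result stays a plain vector even for a one-row data frame.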

Ric answered Nov 21 '25 05:11


One possible solution:

df$z = stringi::stri_count_words(paste(df$x, df$y))

  g                 x                 y z
1 1     This is text.     What is text? 6
2 2 This is text too.  Can it eat text? 8
3 3   This is no text Maybe I will try. 8

Or, in base R:

df$z = lengths(gregexpr("\\b\\w+\\b", paste(df$x, df$y)))
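
Two caveats apply to the paste() variants when cells can be NA or empty: paste() turns NA into the string "NA", which is then counted as a word, and lengths(gregexpr(...)) reports 1 for a string without any match, because gregexpr() returns a single -1 in that case. A sketch that avoids both, assuming NA cells should count as zero words (df_na, cols, txt and combined are illustrative names, not from the original answer):

```r
# Illustrative data: the question's data frame with two cells set to NA.
df_na <- data.frame(
  g = c("1", "2", "3"),
  x = c("This is text.", NA, "This is no text"),
  y = c("What is text?", "Can it eat text?", NA),
  stringsAsFactors = FALSE
)

cols <- c("x", "y")                  # subset of columns to count
txt <- df_na[cols]
txt[is.na(txt)] <- ""                # assumption: NA contributes 0 words

# Join the selected columns row by row, separated by a space.
combined <- do.call(paste, unname(as.list(txt)))

# regmatches() yields character(0) when nothing matches, so lengths() gives 0.
df_na$z <- lengths(regmatches(combined, gregexpr("\\w+", combined)))
```

With this sample data, z holds 6, 4 and 4.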

B. Christian Kamgang answered Nov 21 '25 04:11