Word count across subset of columns in one new column

Question

I have the following data frame:

structure(list(g = c("1", "2", "3"), x = c("This is text.", "This is text too.", 
"This is no text"), y = c("What is text?", "Can it eat text?", 
"Maybe I will try.")), class = "data.frame", row.names = c(NA, 
-3L))

I would like to count the number of words across the columns x and y and sum the values to get one column with the total number of words used per row. It is important that I am able to subset the data. The result should look like this:

structure(list(g = c("1", "2", "3"), x = c("This is text.", "This is text too.", 
"This is no text"), y = c("What is text?", "Can it eat text?", 
"Maybe I will try."), z = c("6", "8", "8")), class = "data.frame", row.names = c(NA, 
-3L))

I have tried using str_count(" ") with different regular expressions in combination with across or apply, but I cannot get it to work.

I did not anticipate in my original question that columns with NA cells would be problematic, but they are. Any solution therefore needs to handle NA cells as well.

Ric · Accepted Answer

Here is a solution using tokenizers:

library(tokenizers)

df <- 
  structure(list(g = c("1", "2", "3"), x = c("This is text.", "This is text too.", 
  "This is no text"), y = c("What is text?", "Can it eat text?", 
  "Maybe I will try.")), class = "data.frame", row.names = c(NA, 
  -3L))

df$z = tokenizers::count_words(df$x) + tokenizers::count_words(df$y)

df
#>   g                 x                 y z
#> 1 1     This is text.     What is text? 6
#> 2 2 This is text too.  Can it eat text? 8
#> 3 3   This is no text Maybe I will try. 8

If you prefer pure R:

df$z <- rowSums(
  sapply(df[, c("x", "y")], function(col)
    sapply(gregexpr("\\b\\w+\\b", col), function(m)
      # gregexpr gives -1 for no match and NA for an NA cell; count both as 0
      if (is.na(m[[1]]) || m[[1]] < 0) 0 else length(m))))

Note that \w+ matches runs of word characters and \b matches word boundaries, though I believe "\\w+" alone suffices. Inside an R string literal the backslashes must be doubled.
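
To address the "subset of columns" and NA requirements from the question, here is a minimal base R sketch. The helper name count_words_across and the sample data df_na are hypothetical (they do not appear in the question or the answers), and the sketch assumes that an NA cell should contribute zero words.

```r
# Hypothetical helper: total word count per row across a chosen subset of
# character columns. Assumption: NA cells count as 0 words.
count_words_across <- function(data, cols) {
  per_column <- vapply(
    data[cols],
    function(column) {
      # Replace NA with an empty string so it yields no matches
      text <- ifelse(is.na(column), "", as.character(column))
      # regmatches() returns character(0) when nothing matches, so lengths() is 0
      lengths(regmatches(text, gregexpr("\\w+", text)))
    },
    integer(nrow(data))
  )
  # Force a rows-by-columns matrix so rowSums() also works for a single row
  rowSums(matrix(per_column, nrow = nrow(data)))
}

# Sample data with NA cells (made up for illustration)
df_na <- data.frame(
  g = c("1", "2", "3"),
  x = c("This is text.", NA, "This is no text"),
  y = c("What is text?", "Can it eat text?", NA)
)

df_na$z <- count_words_across(df_na, c("x", "y"))
df_na$z
#> [1] 6 4 4
```

Passing the column names as a vector keeps the subset explicit, so further text columns can be added without changing the helper.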

B. Christian Kamgang · Answer

One possible solution:

df$z = stringi::stri_count_words(paste(df$x, df$y))

  g                 x                 y z
1 1     This is text.     What is text? 6
2 2 This is text too.  Can it eat text? 8
3 3   This is no text Maybe I will try. 8

Or

df$z = lengths(gregexpr("\\b\\w+\\b", paste(df$x, df$y)))
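
One caveat regarding the NA requirement in the question: paste() converts NA to the literal string "NA", which is then counted as a word. The following base R sketch illustrates the effect and one possible workaround, assuming NA cells should contribute zero words. The names blank_na and count_words are hypothetical helpers, not part of either answer.

```r
# Two small vectors standing in for df$x and df$y, with one NA cell
x <- c("This is text.", NA)
y <- c("What is text?", "Can it eat text?")

# Hypothetical helper: count runs of word characters; 0 when nothing matches
count_words <- function(s) lengths(regmatches(s, gregexpr("\\w+", s)))

# Hypothetical helper: turn NA into an empty string before pasting
blank_na <- function(v) ifelse(is.na(v), "", v)

naive <- paste(x, y)                     # second element: "NA Can it eat text?"
safe  <- paste(blank_na(x), blank_na(y)) # second element: " Can it eat text?"

count_words(naive)
#> [1] 6 5
count_words(safe)
#> [1] 6 4
```

Whether an NA cell should count as zero words or make the whole row NA depends on the analysis; the sketch above picks zero.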
