
Word substitution within tidy text format

Hi, I'm working with the tidy text format and am trying to replace the strings "emails" and "emailing" with "email".

library(dplyr)
library(tidytext)
library(ggplot2)

set.seed(123)
terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")
df <- data.frame(sentence = sample(terms, 100, replace = TRUE))
df
str(df)
df$sentence <- as.character(df$sentence)
tidy_df <- df %>%
  unnest_tokens(word, sentence)

tidy_df %>%
  count(word, sort = TRUE) %>%
  filter(n > 20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

This works fine, but when I use:

 tidy_df <- gsub("emailing", "email", tidy_df)

to substitute words and run the bar chart again, I get the following error message:

Error in UseMethod("group_by_") : no applicable method for 'group_by_' applied to an object of class "character"

Does anyone know how to easily substitute words in the tidy text format without changing the structure/class of the tidy data frame?

Benjamin Telkamp asked Oct 15 '25 12:10



1 Answer

Removing the ends of words like that is called stemming, and there are a couple of packages in R that will do it for you if you'd like. One is the hunspell package from rOpenSci; another is the SnowballC package, which implements Porter stemming. With SnowballC, it looks like this:

library(dplyr)
library(tidytext)
library(SnowballC)

terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")

set.seed(123)
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
        unnest_tokens(word, txt) %>%
        mutate(word = wordStem(word))
#> # A tibble: 253 × 1
#>      word
#>     <chr>
#> 1   email
#> 2       i
#> 3     fun
#> 4  broken
#> 5   modem
#> 6   email
#> 7       i
#> 8     fun
#> 9  broken
#> 10  modem
#> # ... with 243 more rows

Notice that this stems all your text, and that some of the words don't look like real words anymore (for example, "is" becomes "i"); you may or may not care about that.
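
To judge whether that matters for your data, it can help to look at the words next to their stems before overwriting anything. A minimal sketch, assuming a small hand-made vocabulary (the `vocab` and `stems` names are only for illustration):

```r
library(dplyr)
library(SnowballC)

# Hypothetical vocabulary, taken from the example sentences above
vocab <- tibble(word = c("emails", "emailing", "is", "fun", "computer", "broken"))

# Keep the original word and add the stem in a separate column
stems <- vocab %>%
  mutate(stem = wordStem(word))

# Show only the words that the stemmer actually changed
stems %>%
  filter(word != stem)
```

If the changed words are ones you care about reading in a chart, replacing only specific words is probably the better fit.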

If you don't want to stem all your text using a stemmer like SnowballC or hunspell, you can use dplyr's if_else within mutate() to replace just specific words.

set.seed(123)
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
        unnest_tokens(word, txt) %>%
        mutate(word = if_else(word %in% c("emailing", "emails"), "email", word))
#> # A tibble: 253 × 1
#>      word
#>     <chr>
#> 1   email
#> 2      is
#> 3     fun
#> 4  broken
#> 5   modem
#> 6   email
#> 7      is
#> 8     fun
#> 9  broken
#> 10  modem
#> # ... with 243 more rows
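
If the list of replacements grows, the same idea can be written with a lookup table instead of a longer if_else(). This is only a sketch; the `lookup` vector and the sample words are made up for illustration:

```r
library(dplyr)

# Hypothetical lookup: names are the words to replace, values are the replacements
lookup <- c(emails = "email", emailing = "email")

words <- tibble(word = c("emails", "are", "nice", "emailing", "is", "fun"))

replaced <- words %>%
  # lookup[word] is NA for words without an entry; coalesce() then keeps the original
  mutate(word = coalesce(unname(lookup[word]), word))

replaced
```

Adding another replacement then only means adding an entry to `lookup`, and the result is still a tibble with a single word column.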

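Or it might make more sense for you to use str_replace() from the stringr package. Note that the pattern here is unanchored, so it would also rewrite a longer token that merely contains "emails" or "emailing"; use "^email(s|ing)$" if you only want whole-word matches.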

library(stringr)
set.seed(123)
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
        unnest_tokens(word, txt) %>%
        mutate(word = str_replace(word, "email(s|ing)", "email"))
#> # A tibble: 253 × 1
#>      word
#>     <chr>
#> 1   email
#> 2      is
#> 3     fun
#> 4  broken
#> 5   modem
#> 6   email
#> 7      is
#> 8     fun
#> 9  broken
#> 10  modem
#> # ... with 243 more rows
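
As for the error in the question: gsub() expects a character vector, so passing it the whole data frame makes it coerce the data frame to character, and the result is no longer something count() can work with. A minimal base R sketch of the difference, assuming a small stand-in for tidy_df (the `broken` name is only for illustration):

```r
# Stand-in for the tidy data frame: one word per row
tidy_df <- data.frame(
  word = c("emails", "are", "nice", "emailing", "is", "fun"),
  stringsAsFactors = FALSE
)

# Passing the whole data frame returns a plain character vector
broken <- gsub("emailing", "email", tidy_df)
class(broken)

# Applying gsub() to the column keeps the data frame intact;
# ^ and $ restrict the match to whole tokens
tidy_df$word <- gsub("^email(s|ing)$", "email", tidy_df$word)
class(tidy_df)
table(tidy_df$word)
```

After this, the count() and ggplot() pipeline from the question should run on tidy_df as before.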
Julia Silge answered Oct 18 '25 07:10



