
Counting whole word/number occurrences with str_count in R

Tags: regex, r, stringr

Similar to this case, I would like to count how many times multiple words and numbers occur in a vector of sentences, using str_count from the stringr package.

But I noticed that partial numbers are counted as well as whole numbers. For example:

df <- c("honda civic 1988 with new lights","toyota auris 4x4 140000 km","nissan skyline 2.0 159000 km")
keywords <- c("honda","civic","toyota","auris","nissan","skyline","1988","1400","159")
library(stringr)
number_of_keywords_df <- str_count(df, paste(keywords, collapse='|'))

Here I receive 3, 3, 3 for number_of_keywords_df, while it should clearly be 3, 2, 2. The str_count function seems to count the partial strings "1400" and "159" within the numbers "140000" and "159000". Is there any way to prevent that?

Tshabat asked Nov 29 '25 16:11


2 Answers

Using sprintf, you can wrap each keyword in word boundaries (\b) before joining them into one pattern:

number_of_keywords_df <- str_count(df, paste(sprintf("\\b%s\\b", keywords), collapse = '|'))
number_of_keywords_df

This yields:

[1] 3 2 2
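
To see why this works, it can help to print the pattern that sprintf builds. The following is only a sketch: it uses just the numeric keywords for brevity, and the names kw and pattern are illustrative, not part of the original answer.

```r
library(stringr)

# Only the numeric keywords, to keep the printed pattern short
kw <- c("1988", "1400", "159")

# Wrap each keyword in \b ... \b, then join the pieces with |
pattern <- paste(sprintf("\\b%s\\b", kw), collapse = "|")

# writeLines prints the regex as the engine sees it (single backslashes)
writeLines(pattern)
# \b1988\b|\b1400\b|\b159\b

# "1400" no longer matches inside "140000": there is no word boundary
# between the last 0 of "1400" and the 0 that follows it
str_count("toyota auris 4x4 140000 km", pattern)
# [1] 0
```

Each alternative carries its own pair of boundaries, so a keyword only counts when it stands as a complete token.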
Jan answered Dec 02 '25 05:12


Try putting word boundaries around your keywords before collapsing them into the pattern:

keywords <- c("honda","civic","toyota","auris","nissan","skyline","1988","1400","159")
keywords <- paste0("\\b", keywords, "\\b")

In regex lingo, \bhonda\b matches honda only as an isolated word. Hence hondas would not match, because it has an extra letter at the end. The same applies to numbers: \b1400\b does not match inside 140000.
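
For completeness, here is a sketch of how the bounded keywords might be passed on to str_count, reusing the data from the question. The variable name bounded is an illustrative choice, not something from the original post.

```r
library(stringr)

df <- c("honda civic 1988 with new lights",
        "toyota auris 4x4 140000 km",
        "nissan skyline 2.0 159000 km")
keywords <- c("honda", "civic", "toyota", "auris", "nissan",
              "skyline", "1988", "1400", "159")

# Surround every keyword with word boundaries
bounded <- paste0("\\b", keywords, "\\b")

# Collapse into a single alternation and count matches per sentence
str_count(df, paste(bounded, collapse = "|"))
# [1] 3 2 2

# The boundary rule on its own: "hondas" is not counted
str_count(c("honda", "hondas"), "\\bhonda\\b")
# [1] 1 0
```

The boundaries are added to each keyword before collapsing, so every alternative in the pattern is anchored individually.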

Tim Biegeleisen answered Dec 02 '25 04:12


