Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove specific words from a column

I'm looking to remove specific words (for example "co" "INC" etc) from a column in data without removing the same letters from other words in the same column. In other words, I only want to remove these words when they are free standing.

This is a an example of what a few rows and columns of the company_name data looks like:

State Company Name number of workers
x COLGATE-PALMOLIVE CO. 10
y OLD COPPER CO INC 77
z NIKE INC -CL B 5
r COMMERCIAL METALS 23
w CARNIVAL CORPORATION & PLC 89

I used the following code to remove the words:

remove <- company_name %>% 
  mutate(Company_Name = str_remove_all(Company_Name, "-CL B|INC|CORP|CO|CO."))

What I got in return was not what was expecting. For example in the case of the company "CARNIVAL CORPORATION & PLC" I got "CARNIVAL ORATION & PLC" back where the "CO" was removed from the beginning was "CORPORATION"

What I would like to achieve is for the words to be removed only if they are full words on their own. I also tried including spaces before and after the words in the code as follows here:

remove <- company_name %>% 
  mutate(Company_Name = str_remove_all(Company_Name, " -CL B | INC | CORP | CO  | CO. "))

but I still don't get the results I'm looking for.

like image 216
bear_525 Avatar asked Feb 02 '26 21:02

bear_525


2 Answers

I think something like this will work:

strings_to_remove <- c("-CL","B","INC","CORP","CO")
regex<-paste(paste0("(^|\\s+)", strings_to_remove, "\\.?", "(?=\\s+|$)"),collapse="|")
remove <- company_name %>% 
  mutate(Company_Name = str_remove_all(Company_Name, regex))

where "(^|\\s+)" matches the beginning of the string (^ or whitespace) before the string to remove.

"\\.?" matches an optional period

"(?=\\s+|$)") matches more whitespace or the end of the string.

(This answer assumes that you also want to remove "INC." and "CORP." though this wasn't specified in your question).

like image 59
Ethan Bass Avatar answered Feb 05 '26 12:02

Ethan Bass


You can use the regular expression syntax \\b which sets word boundaries and is useful to deal with "full words on their own" as you put it. NOTE you have to list \\bCO\\b. before \\bCO\\b as the matches are done in order. I didn't bother using a word boundary for "-CL B", but if you needed to, you would put the non-letter part of the match outside of the word boundary e.g. -\\bCL B\\b:

library(dplyr)
library(stringr)
# Your sample data
company_name <- data.frame(State = c("x", "y", "z", "r", "w"),
                           `Company Name` = c("COLGATE-PALMOLIVE CO.",
                                              "OLD COPPER CO INC",
                                              "NIKE INC -CL B",
                                              "COMMERCIAL METALS",
                                              "CARNIVAL CORPORATION & PLC"),
                           `number of workers` = c(10, 77, 5, 23, 89),
                           check.names = FALSE)

# Regex with str_remove_all() using word boundaries
company_name %>% 
  mutate(Company_Name = str_remove_all(`Company Name`, 
                                       "-CL B|\\bINC\\b|\\bCORP\\b|\\bCO\\b.|\\bCO\\b"),
         Company_Name = trimws(Company_Name, which = "right")) # Remove trailing whitespaces

#   State               Company Name number of workers               Company_Name
# 1     x      COLGATE-PALMOLIVE CO.                10          COLGATE-PALMOLIVE
# 2     y          OLD COPPER CO INC                77                 OLD COPPER
# 3     z             NIKE INC -CL B                 5                       NIKE
# 4     r          COMMERCIAL METALS                23          COMMERCIAL METALS
# 5     w CARNIVAL CORPORATION & PLC                89 CARNIVAL CORPORATION & PLC

The regex in str_remove_all() moves through the list of string to match in order. As an example, see the result of

str_remove_all(c("CO.", "COA", "COAB"), "CO|CO.|COA|COAB")

versus:

str_remove_all(c("CO.", "COA", "COAB"), "COAB|COA|CO.|CO")

I copied your example column headings verbatim and because they have spaces, they need to be inside backticks when declared. If your actual column names have no spaces, backticks are not needed.

like image 33
L Tyrone Avatar answered Feb 05 '26 13:02

L Tyrone