Remove specific words from a column

Question

I'm looking to remove specific words (for example "co" "INC" etc) from a column in data without removing the same letters from other words in the same column. In other words, I only want to remove these words when they are free standing.

This is a an example of what a few rows and columns of the company_name data looks like:

State	Company Name	number of workers
x	COLGATE-PALMOLIVE CO.	10
y	OLD COPPER CO INC	77
z	NIKE INC -CL B	5
r	COMMERCIAL METALS	23
w	CARNIVAL CORPORATION & PLC	89

I used the following code to remove the words:

remove <- company_name %>% 
  mutate(Company_Name = str_remove_all(Company_Name, "-CL B|INC|CORP|CO|CO."))

What I got in return was not what was expecting. For example in the case of the company "CARNIVAL CORPORATION & PLC" I got "CARNIVAL ORATION & PLC" back where the "CO" was removed from the beginning was "CORPORATION"

What I would like to achieve is for the words to be removed only if they are full words on their own. I also tried including spaces before and after the words in the code as follows here:

remove <- company_name %>% 
  mutate(Company_Name = str_remove_all(Company_Name, " -CL B | INC | CORP | CO  | CO. "))

but I still don't get the results I'm looking for.

Ethan Bass · Accepted Answer

I think something like this will work:

strings_to_remove <- c("-CL","B","INC","CORP","CO")
regex<-paste(paste0("(^|\s+)", strings_to_remove, "\.?", "(?=\s+|$)"),collapse="|")
remove <- company_name %>% 
  mutate(Company_Name = str_remove_all(Company_Name, regex))

where "(^|\s+)" matches the beginning of the string (^ or whitespace) before the string to remove.

"\.?" matches an optional period

"(?=\s+|$)") matches more whitespace or the end of the string.

(This answer assumes that you also want to remove "INC." and "CORP." though this wasn't specified in your question).

L Tyrone · Answer

You can use the regular expression syntax \b which sets word boundaries and is useful to deal with "full words on their own" as you put it. NOTE you have to list \bCO\b. before \bCO\b as the matches are done in order. I didn't bother using a word boundary for "-CL B", but if you needed to, you would put the non-letter part of the match outside of the word boundary e.g. -\bCL B\b:

library(dplyr)
library(stringr)
# Your sample data
company_name <- data.frame(State = c("x", "y", "z", "r", "w"),
                           `Company Name` = c("COLGATE-PALMOLIVE CO.",
                                              "OLD COPPER CO INC",
                                              "NIKE INC -CL B",
                                              "COMMERCIAL METALS",
                                              "CARNIVAL CORPORATION & PLC"),
                           `number of workers` = c(10, 77, 5, 23, 89),
                           check.names = FALSE)

# Regex with str_remove_all() using word boundaries
company_name %>% 
  mutate(Company_Name = str_remove_all(`Company Name`, 
                                       "-CL B|\bINC\b|\bCORP\b|\bCO\b.|\bCO\b"),
         Company_Name = trimws(Company_Name, which = "right")) # Remove trailing whitespaces

#   State               Company Name number of workers               Company_Name
# 1     x      COLGATE-PALMOLIVE CO.                10          COLGATE-PALMOLIVE
# 2     y          OLD COPPER CO INC                77                 OLD COPPER
# 3     z             NIKE INC -CL B                 5                       NIKE
# 4     r          COMMERCIAL METALS                23          COMMERCIAL METALS
# 5     w CARNIVAL CORPORATION & PLC                89 CARNIVAL CORPORATION & PLC

The regex in str_remove_all() moves through the list of string to match in order. As an example, see the result of

str_remove_all(c("CO.", "COA", "COAB"), "CO|CO.|COA|COAB")

versus:

str_remove_all(c("CO.", "COA", "COAB"), "COAB|COA|CO.|CO")

I copied your example column headings verbatim and because they have spaces, they need to be inside backticks when declared. If your actual column names have no spaces, backticks are not needed.

Remove specific words from a column

Tags:

string

r

data-cleaning

bear_525

2 Answers

Ethan Bass

L Tyrone

Recent Activity

Donate For Us

Remove specific words from a column

Tags:

string

r

data-cleaning

bear_525

2 Answers

Ethan Bass

L Tyrone

Related questions

Recent Activity

Donate For Us