Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove duplicate within string R

Tags:

r

duplicates

I have a dataset whereby one column contains a comma separated list of values (some of which are duplicates).

I can use separate() to split this column into multiple but how do then remove any duplicated values.

Each row is unique and I want to keep all rows, just remove duplicate values in each row

This is what I have:

ID Column A
1 "A"
2 "B,C"
3 "D,D,E"
4 "A,D"

This is what I want:

ID Column A Column B
1 A NA
2 B C
3 D E
4 A D

I've tried playing around with unique and duplicated with/without strsplit and/or lapply but just can't figure out what seems like a fairly simple task

Help please?

like image 767
O Stanley Avatar asked May 14 '26 09:05

O Stanley


2 Answers

Base R

tmp <- lapply(strsplit(quux[["Column A"]], ","), unique)
tmp <- lapply(tmp, `length<-`, max(lengths(tmp)))
tmp <- do.call(rbind, tmp)
colnames(tmp) <- paste0("col", 1:ncol(tmp))
cbind(quux, tmp)
#   ID Column A col1 col2
# 1  1        A    A <NA>
# 2  2      B,C    B    C
# 3  3    D,D,E    D    E
# 4  4      A,D    A    D

Another tidyverse option:

library(dplyr)
library(tidyr)
separate_longer_delim(quux, `Column A`, delim = ",") |>
  distinct() |>
  mutate(row = row_number(), .by = ID) |>
  pivot_wider(id_cols = ID, names_from = row, values_from = `Column A`)
# # A tibble: 4 × 3
#      ID `1`   `2`  
#   <int> <chr> <chr>
# 1     1 A     NA   
# 2     2 B     C    
# 3     3 D     E    
# 4     4 A     D    

Data

quux <- structure(list(ID = 1:4, "Column A" = c("A", "B,C", "D,D,E", "A,D")), class = "data.frame", row.names = c(NA, -4L))
like image 164
r2evans Avatar answered May 17 '26 17:05

r2evans


You can deduplicate the letters using regex and then use tidyr::separate_wider_delim, which may be a bit faster:

library(dplyr)
library(tidyr)

df %>%
  mutate(Column.A = gsub("(\\w+)(,\\1)+", "\\1", Column.A)) %>%
  separate_wider_delim(Column.A, ",", names_sep = "_", too_few = "align_start")

Output:

#      ID Column.A_1 Column.A_2
#   <int> <chr>      <chr>     
# 1     1 A          NA        
# 2     2 B          C         
# 3     3 D          E         
# 4     4 A          D  

Data

df <- read.table(text = 'ID Column.A
1   "A"
2   "B,C"
3   "D,D,E"
4   "A,D"', h = TRUE)

The other answers are great as well - a quick speed comparison:

# current data size

Unit: milliseconds
    expr      min        lq      mean    median        uq       max neval cld
 jpsmith 3.117588  3.726158  4.171441  4.186303  4.570429  5.319047   100 a  
     LMc 3.681888  4.589575  4.980202  4.891460  5.250984 14.182493   100  b 
 r2evans 9.898874 12.107905 13.131033 12.811930 14.064065 22.485381   100   c

# Bigger data (10,000 observations)

Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval cld
 jpsmith  22.75258  27.32537  33.82565  30.32410  32.00375 369.73782   100  a 
     LMc 312.54950 378.78446 417.88440 421.85914 443.31213 671.26054   100   b
 r2evans  15.34635  20.68595  23.29642  22.59398  25.24625  40.68818   100  a 
like image 20
jpsmith Avatar answered May 17 '26 16:05

jpsmith



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!