I have a dataset whereby one column contains a comma separated list of values (some of which are duplicates).
I can use separate() to split this column into multiple but how do then remove any duplicated values.
Each row is unique and I want to keep all rows, just remove duplicate values in each row
This is what I have:
| ID | Column A |
|---|---|
| 1 | "A" |
| 2 | "B,C" |
| 3 | "D,D,E" |
| 4 | "A,D" |
This is what I want:
| ID | Column A | Column B |
|---|---|---|
| 1 | A | NA |
| 2 | B | C |
| 3 | D | E |
| 4 | A | D |
I've tried playing around with unique and duplicated with/without strsplit and/or lapply but just can't figure out what seems like a fairly simple task
Help please?
tmp <- lapply(strsplit(quux[["Column A"]], ","), unique)
tmp <- lapply(tmp, `length<-`, max(lengths(tmp)))
tmp <- do.call(rbind, tmp)
colnames(tmp) <- paste0("col", 1:ncol(tmp))
cbind(quux, tmp)
# ID Column A col1 col2
# 1 1 A A <NA>
# 2 2 B,C B C
# 3 3 D,D,E D E
# 4 4 A,D A D
library(dplyr)
library(tidyr)
separate_longer_delim(quux, `Column A`, delim = ",") |>
distinct() |>
mutate(row = row_number(), .by = ID) |>
pivot_wider(id_cols = ID, names_from = row, values_from = `Column A`)
# # A tibble: 4 × 3
# ID `1` `2`
# <int> <chr> <chr>
# 1 1 A NA
# 2 2 B C
# 3 3 D E
# 4 4 A D
Data
quux <- structure(list(ID = 1:4, "Column A" = c("A", "B,C", "D,D,E", "A,D")), class = "data.frame", row.names = c(NA, -4L))
You can deduplicate the letters using regex and then use tidyr::separate_wider_delim, which may be a bit faster:
library(dplyr)
library(tidyr)
df %>%
mutate(Column.A = gsub("(\\w+)(,\\1)+", "\\1", Column.A)) %>%
separate_wider_delim(Column.A, ",", names_sep = "_", too_few = "align_start")
Output:
# ID Column.A_1 Column.A_2
# <int> <chr> <chr>
# 1 1 A NA
# 2 2 B C
# 3 3 D E
# 4 4 A D
Data
df <- read.table(text = 'ID Column.A
1 "A"
2 "B,C"
3 "D,D,E"
4 "A,D"', h = TRUE)
The other answers are great as well - a quick speed comparison:
# current data size
Unit: milliseconds
expr min lq mean median uq max neval cld
jpsmith 3.117588 3.726158 4.171441 4.186303 4.570429 5.319047 100 a
LMc 3.681888 4.589575 4.980202 4.891460 5.250984 14.182493 100 b
r2evans 9.898874 12.107905 13.131033 12.811930 14.064065 22.485381 100 c
# Bigger data (10,000 observations)
Unit: milliseconds
expr min lq mean median uq max neval cld
jpsmith 22.75258 27.32537 33.82565 30.32410 32.00375 369.73782 100 a
LMc 312.54950 378.78446 417.88440 421.85914 443.31213 671.26054 100 b
r2evans 15.34635 20.68595 23.29642 22.59398 25.24625 40.68818 100 a
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With