I am working in base R (no tidyverse, please). I have the following dataframe in R:
> geneA
[1] "GNG7" "GNG7" "GNG7" "GNG12" "GNG12" "GNG12" "GNG2" "GNG2" "GNG2"
[10] "GNG5" "GNG5" "GNG5"
> geneB
[1] "GNG12" "GNG5" "GNG2" "GNG7" "GNG5" "GNG2" "GNG5" "GNG12" "GNG7"
[10] "GNG12" "GNG7" "GNG2"
(some data wrangling to create extra column GENE_PAIR)
> GNGdata
geneA geneB GENE_PAIR
1 GNG7 GNG12 GNG7;GNG12
2 GNG7 GNG5 GNG7;GNG5
3 GNG7 GNG2 GNG7;GNG2
4 GNG12 GNG7 GNG12;GNG7
5 GNG12 GNG5 GNG12;GNG5
6 GNG12 GNG2 GNG12;GNG2
7 GNG2 GNG5 GNG2;GNG5
8 GNG2 GNG12 GNG2;GNG12
9 GNG2 GNG7 GNG2;GNG7
10 GNG5 GNG12 GNG5;GNG12
11 GNG5 GNG7 GNG5;GNG7
12 GNG5 GNG2 GNG5;GNG2
As you can see, there are duplicated GENE_PAIRs once the order of the two genes is ignored (rows 1 and 4, 2 and 11, etc.). I want to keep only one row per pair. For instance, the pair GNG7;GNG12 exists, so I want to exclude the pair GNG12;GNG7 from my new dataframe.
I am expecting this result:
> GNGdata_filtered
geneA geneB GENE_PAIR
1 GNG7 GNG12 GNG7;GNG12
2 GNG7 GNG5 GNG7;GNG5
3 GNG7 GNG2 GNG7;GNG2
4 GNG12 GNG5 GNG12;GNG5
5 GNG12 GNG2 GNG12;GNG2
6 GNG2 GNG5 GNG2;GNG5
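You could create an index (idx) of duplicated gene pairs after sorting the two genes within each pair with mixedsort (from the gtools package), then index the data frame on the rows that are not duplicated (df here stands for your GNGdata):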
idx <- duplicated(unlist(lapply(lapply(strsplit(df$GENE_PAIR,";"), gtools::mixedsort),
function(x) paste(x, collapse = ";"))))
df[!idx,]
# geneA geneB GENE_PAIR
# 1 GNG7 GNG12 GNG7;GNG12
# 2 GNG7 GNG5 GNG7;GNG5
# 3 GNG7 GNG2 GNG7;GNG2
# 5 GNG12 GNG5 GNG12;GNG5
# 6 GNG12 GNG2 GNG12;GNG2
# 7 GNG2 GNG5 GNG2;GNG5
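Since the question asks for base R only, here is a minimal sketch of the same idea without gtools. It assumes the geneA and geneB columns hold plain character strings, and it builds an order-independent key with pmin and pmax; the name key is just illustrative. The sort order of the names does not matter here, only that both orderings of a pair produce the same key.

```r
# Rebuild the example data from the question
geneA <- rep(c("GNG7", "GNG12", "GNG2", "GNG5"), each = 3)
geneB <- c("GNG12", "GNG5", "GNG2", "GNG7", "GNG5", "GNG2",
           "GNG5", "GNG12", "GNG7", "GNG12", "GNG7", "GNG2")
GNGdata <- data.frame(geneA, geneB,
                      GENE_PAIR = paste(geneA, geneB, sep = ";"),
                      stringsAsFactors = FALSE)

# Order-independent key: the "smaller" gene name first, the "larger" second,
# so GNG7;GNG12 and GNG12;GNG7 map to the same string
key <- paste(pmin(GNGdata$geneA, GNGdata$geneB),
             pmax(GNGdata$geneA, GNGdata$geneB),
             sep = ";")

# Keep only the first occurrence of each unordered pair
GNGdata_filtered <- GNGdata[!duplicated(key), ]
GNGdata_filtered
```

This keeps the original row names (1, 2, 3, 5, 6, 7); use rownames(GNGdata_filtered) <- NULL if you want them renumbered from 1 to 6 as in the expected result.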