Remove rows if value in col1 is in col2 and value in col2 is in col1

I am working in base R (no tidyverse, please). I have the following two character vectors in R:

> geneA  
[1] "GNG7"  "GNG7"  "GNG7"  "GNG12" "GNG12" "GNG12" "GNG2"  "GNG2"  "GNG2"  
[10] "GNG5"  "GNG5"  "GNG5" 

> geneB  
 [1] "GNG12" "GNG5"  "GNG2"  "GNG7"  "GNG5"  "GNG2"  "GNG5"  "GNG12" "GNG7"  
[10] "GNG12" "GNG7"  "GNG2" 

(After some data wrangling to combine these into a data frame and add an extra GENE_PAIR column:)
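For reproducibility, here is a minimal sketch of one way the data frame could be built in base R. The actual wrangling is not shown in the question, so the construction below (in particular the use of `paste` with `sep = ";"`) is an assumption that simply reproduces the printed output.

```r
# The two vectors as printed in the question
geneA <- rep(c("GNG7", "GNG12", "GNG2", "GNG5"), each = 3)
geneB <- c("GNG12", "GNG5", "GNG2", "GNG7", "GNG5", "GNG2",
           "GNG5", "GNG12", "GNG7", "GNG12", "GNG7", "GNG2")

# Keep the columns as character (not factor) so string operations work
GNGdata <- data.frame(geneA, geneB, stringsAsFactors = FALSE)

# Assumed wrangling step: join the two gene names with ";"
GNGdata$GENE_PAIR <- paste(GNGdata$geneA, GNGdata$geneB, sep = ";")
```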

> GNGdata
    geneA geneB  GENE_PAIR  
1  GNG7 GNG12 GNG7;GNG12  
2  GNG7  GNG5  GNG7;GNG5  
3  GNG7  GNG2  GNG7;GNG2  
4 GNG12  GNG7 GNG12;GNG7  
5 GNG12  GNG5 GNG12;GNG5  
6 GNG12  GNG2 GNG12;GNG2  
7  GNG2  GNG5  GNG2;GNG5  
8  GNG2 GNG12 GNG2;GNG12  
9  GNG2  GNG7  GNG2;GNG7  
10  GNG5 GNG12 GNG5;GNG12  
11  GNG5  GNG7  GNG5;GNG7  
12  GNG5  GNG2  GNG5;GNG2

As you can see, there are duplicated GENE_PAIRs that differ only in the order of the two genes (rows 1 and 4, 2 and 11, etc.). I want to keep only one row per pair. For instance, the pair GNG7;GNG12 exists, so I want to exclude the pair GNG12;GNG7 from my new dataframe.

I am expecting this result:

> GNGdata_filtered  
    geneA geneB  GENE_PAIR  
1  GNG7 GNG12 GNG7;GNG12  
2  GNG7  GNG5  GNG7;GNG5  
3  GNG7  GNG2  GNG7;GNG2  
4 GNG12  GNG5 GNG12;GNG5  
5 GNG12  GNG2 GNG12;GNG2  
6  GNG2  GNG5  GNG2;GNG5  
moirasanti asked Oct 15 '25 10:10

1 Answer

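You could sort the two genes within each pair using mixedsort, so that A;B and B;A give the same string, then build an index (idx) of the duplicated pairs and keep only the rows that are not flagged. Note that mixedsort comes from the gtools package, which is not part of base R. Here df is the question's GNGdata:

df <- GNGdata  # the data frame from the question

# Sort the genes inside each pair so both orderings produce the same key
sorted_pairs <- lapply(strsplit(df$GENE_PAIR, ";"), gtools::mixedsort)
idx <- duplicated(vapply(sorted_pairs, paste, character(1), collapse = ";"))

df[!idx, ]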

#  geneA geneB  GENE_PAIR
# 1  GNG7 GNG12 GNG7;GNG12
# 2  GNG7  GNG5  GNG7;GNG5
# 3  GNG7  GNG2  GNG7;GNG2
# 5 GNG12  GNG5 GNG12;GNG5
# 6 GNG12  GNG2 GNG12;GNG2
# 7  GNG2  GNG5  GNG2;GNG5
jpsmith answered Oct 18 '25 08:10
