I have two columns and would like to retain only one of each commutative pair of rows. For the data below my output should contain only one combination of (1, 2), i.e. for my purposes (1, 2) is the same as (2, 1). Is there a simple way to do this in R? I already tried transposing and retaining the upper triangular matrix, but re-transposing the data back becomes a pain.
A B prob
1 2 0.1
1 3 0.2
1 4 0.3
2 1 0.3
2 3 0.1
2 4 0.4
My final output should be:
A B prob
1 2 0.1
1 3 0.2
1 4 0.3
2 3 0.1
2 4 0.4
We can independently sort() each row and then use !duplicated() to find which rows to preserve:
df[!duplicated(t(apply(df[1:2],1L,sort))),];
## A B prob
## 1 1 2 0.1
## 2 1 3 0.2
## 3 1 4 0.3
## 5 2 3 0.1
## 6 2 4 0.4
Data
df <- data.frame(A=c(1L,1L,1L,2L,2L,2L),B=c(2L,3L,4L,1L,3L,4L),prob=c(0.1,0.2,0.3,0.3,0.1,0.4));
The first step is to extract just the two columns of interest:
df[1:2];
## A B
## 1 1 2
## 2 1 3
## 3 1 4
## 4 2 1
## 5 2 3
## 6 2 4
Then we independently sort each row with apply() and sort():
apply(df[1:2],1L,sort);
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 1 1 1 2 2
## [2,] 2 3 4 2 3 4
As you can see, apply() returns its per-row results as columns, so we have to fix this transposition with t() to prepare for the upcoming duplicated() call:
t(apply(df[1:2],1L,sort));
## [,1] [,2]
## [1,] 1 2
## [2,] 1 3
## [3,] 1 4
## [4,] 1 2
## [5,] 2 3
## [6,] 2 4
Now we can use duplicated() to get a logical vector indicating which rows are duplicates of previous rows:
duplicated(t(apply(df[1:2],1L,sort)));
## [1] FALSE FALSE FALSE TRUE FALSE FALSE
We then invert the logical vector with a negation, to get just those rows that are not duplicates of any previous rows:
!duplicated(t(apply(df[1:2],1L,sort)));
## [1] TRUE TRUE TRUE FALSE TRUE TRUE
Finally we can use the resulting logical vector to index out just those rows of df that are not duplicates of any previous rows:
df[!duplicated(t(apply(df[1:2],1L,sort))),];
## A B prob
## 1 1 2 0.1
## 2 1 3 0.2
## 3 1 4 0.3
## 5 2 3 0.1
## 6 2 4 0.4
Therefore, the first occurrence of every set of post-sort duplicates is retained, and the remainder are removed.
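As a quick illustration of why the first occurrence survives: duplicated() flags only the second and later appearances of a value, never the first (a minimal sketch on a plain character vector):
duplicated(c("1 2","1 3","1 2"));
## [1] FALSE FALSE  TRUE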
Excellent suggestion from @RichardScriven: we can replace the t() call with the MARGIN argument of duplicated(), which will likely be slightly faster:
df[!duplicated(apply(df[1:2],1L,sort),MARGIN=2L),];
## A B prob
## 1 1 2 0.1
## 2 1 3 0.2
## 3 1 4 0.3
## 5 2 3 0.1
## 6 2 4 0.4
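If you want to verify the speed claim on your own data, a quick comparison can be run with the microbenchmark package (a sketch; using that package is an assumption here, and exact timings will vary by machine and data size):
library(microbenchmark);
microbenchmark(
    t.call=df[!duplicated(t(apply(df[1:2],1L,sort))),],
    margin=df[!duplicated(apply(df[1:2],1L,sort),MARGIN=2L),]
);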
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), group by pmin(A, B) and pmax(A, B), and, if the number of rows in a group is greater than 1, keep only the first row; otherwise return the rows unchanged.
library(data.table)
setDT(df)[, if(.N > 1) head(.SD, 1) else .SD, .(A = pmin(A, B), B = pmax(A, B))]
# A B prob
#1: 1 2 0.1
#2: 1 3 0.2
#3: 1 4 0.3
#4: 2 3 0.1
#5: 2 4 0.4
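Side note: head(.SD, 1) already returns a single-row group unchanged, so the if(.N > 1) branch can arguably be dropped, giving the same output on this data (a shortened sketch):
setDT(df)[, head(.SD, 1), .(A = pmin(A, B), B = pmax(A, B))]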
Or we can simply use duplicated() on the pmax()/pmin() output to return a logical index and subset the data based on that.
setDT(df)[!duplicated(cbind(pmax(A, B), pmin(A, B)))]
# A B prob
#1: 1 2 0.1
#2: 1 3 0.2
#3: 1 4 0.3
#4: 2 3 0.1
#5: 2 4 0.4
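For completeness, the same pmax()/pmin() trick also works in plain base R, directly on the original data.frame (a sketch, shown as if run before any setDT() conversion):
df[!duplicated(cbind(pmax(df$A, df$B), pmin(df$A, df$B))),]
## A B prob
## 1 1 2 0.1
## 2 1 3 0.2
## 3 1 4 0.3
## 5 2 3 0.1
## 6 2 4 0.4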