
Compare the information between two matrices in R

Tags: r, matrix

I have two matrices, one is generated out of the other by deleting some rows. For example:

m = matrix(1:18, 6, 3)
m1 = m[c(-1, -3, -6),]

Suppose I do not know which rows of m were removed to create m1. How can I find them by comparing the two matrices? The result I want looks like this:

1, 3, 6

The actual matrix I am dealing with is very big, so I am looking for an efficient way to do this.

user7453767 asked Jan 27 '26 10:01


2 Answers

Here are some approaches:

1) If we can assume that there are no duplicated rows in m -- this is the case in the example in the question -- then:

which(tail(!duplicated(rbind(m1, m)), nrow(m)))
## [1] 1 3 6
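
To make the one-liner in (1) easier to follow, here is a minimal sketch that splits it into steps. The intermediate names (stacked, dup, dup_m, removed) are only illustrative, and it still assumes that m has no duplicated rows.

```r
m  <- matrix(1:18, 6, 3)
m1 <- m[c(-1, -3, -6), ]

stacked <- rbind(m1, m)          # rows of m1 first, then all rows of m
dup     <- duplicated(stacked)   # TRUE for a row that already appeared above
dup_m   <- tail(dup, nrow(m))    # keep only the flags that belong to m
removed <- which(!dup_m)         # rows of m that never appeared in m1
removed
## [1] 1 3 6
```

Because m1 is stacked on top, every row of m that survives in m1 is flagged as a duplicate, so the unflagged rows of m are the deleted ones.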

2) Transpose m and m1, giving tm and tm1, since it is more efficient to work on columns than on rows.

Define match_indexes(i), which returns a vector r of the row indexes of m whose rows equal m1[i, ].

Apply it to each i in 1:n1 and remove the resulting indexes from 1:n; the indexes that remain are the deleted rows.

n <- nrow(m); n1 <- nrow(m1)
tm <- t(m); tm1 <- t(m1)

# a row of m matches m1[i, ] when all ncol(m) values agree
match_indexes <- function(i) which(colSums(tm1[, i] == tm) == ncol(m))
setdiff(1:n, unlist(lapply(1:n1, match_indexes)))
## [1] 1 3 6
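
As a quick check of approach (2) on a matrix that is not 6 x 3, here is a minimal sketch wrapped in a hypothetical helper, removed_rows() (the name is not from the answer). It compares the per-row match count with ncol(m), the number of values in a row, so it does not depend on nrow(m1) happening to equal ncol(m).

```r
removed_rows <- function(m, m1) {
  tm  <- t(m)
  tm1 <- t(m1)
  p   <- ncol(m)  # number of values that must agree for a full row match
  # For row i of m1, find every row of m (column of tm) that equals it.
  match_indexes <- function(i) which(colSums(tm1[, i] == tm) == p)
  kept <- unlist(lapply(seq_len(nrow(m1)), match_indexes))
  setdiff(seq_len(nrow(m)), kept)
}

m  <- matrix(1:24, 6, 4)      # 6 rows, 4 columns
m1 <- m[c(-1, -3, -6), ]      # 3 rows remain, so nrow(m1) != ncol(m)
removed_rows(m, m1)
## [1] 1 3 6
```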

3) Calculate an interaction vector for each matrix and then use setdiff and finally match to get the indexes:

i <- interaction(as.data.frame(m))
i1 <- interaction(as.data.frame(m1))
match(setdiff(i, i1), i)
## [1] 1 3 6

Added: If m can contain duplicated rows, then (1) and (3) return only the first occurrence of any row that occurs several times in m and is not in m1, whereas (2) returns every occurrence:

m <- matrix(1:18, 6, 3)
m1 <- m[c(2, 4, 5),]
m <- rbind(m, m[1:2, ])
# 1
which(tail(!duplicated(rbind(m1, m)), nrow(m)))
## [1] 1 3 6

# 2
n <- nrow(m); n1 <- nrow(m1)
tm <- t(m); tm1 <- t(m1)
# a row of m matches m1[i, ] when all ncol(m) values agree
match_indexes <- function(i) which(colSums(tm1[, i] == tm) == ncol(m))
setdiff(1:n, unlist(lapply(1:n1, match_indexes)))
## [1] 1 3 6 7

# 3
i <- interaction(as.data.frame(m))
i1 <- interaction(as.data.frame(m1))
match(setdiff(i, i1), i)
## [1] 1 3 6
G. Grothendieck answered Jan 28 '26 23:01


A possible way is to represent each row as a string:

x1 <- apply(m, 1, paste0, collapse = ';')
x2 <- apply(m1, 1, paste0, collapse = ';')
which(!x1 %in% x2)
# [1] 1 3 6
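
The same row keys can also be built without calling paste0 once per row. This is a minimal sketch, assuming the matrix can be converted with as.data.frame; row_keys is a hypothetical helper name, and whether it is actually faster on a given matrix has not been benchmarked here.

```r
m  <- matrix(1:18, 6, 3)
m1 <- m[c(-1, -3, -6), ]

# Paste the columns together element-wise, giving one key per row.
row_keys <- function(x) do.call(paste, c(as.data.frame(x), sep = ";"))

removed <- which(!row_keys(m) %in% row_keys(m1))
removed
## [1] 1 3 6
```

As with the apply version, numeric values are compared through their character representation, so this is a string comparison rather than an exact numeric one.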

A benchmark on a large matrix, comparing my solution with G. Grothendieck's solutions:

set.seed(123)
m <- matrix(rnorm(20000 * 5000), nrow = 20000)
m1 <- m[-sample.int(20000, 1000), ]

system.time({
    which(tail(!duplicated(rbind(m1, m)), nrow(m)))
})
#    user  system elapsed
# 339.888   2.368 342.204
system.time({
    x1 <- apply(m, 1, paste0, collapse = ';')
    x2 <- apply(m1, 1, paste0, collapse = ';')
    which(!x1 %in% x2)
})
#    user  system elapsed
# 395.428   0.568 395.955

system.time({
    n <- nrow(m); n1 <- nrow(m1)
    tm <- t(m); tm1 <- t(m1)

    # a row of m matches m1[i, ] when all ncol(m) values agree
    match_indexes <- function(i) which(colSums(tm1[, i] == tm) == ncol(m))
    setdiff(1:n, unlist(lapply(1:n1, match_indexes)))
})
# > 15 min, did not finish


system.time({
    i <- interaction(as.data.frame(m))
    i1 <- interaction(as.data.frame(m1))
    match(setdiff(i, i1), i)
})
# Ran out of memory: my machine with 32 GB of RAM crashed.
mt1022 answered Jan 28 '26 22:01


