
Which rows/columns are duplicates of which others in R matrices?

I have a matrix with many rows and columns, of the nature

x <- matrix(c(1, 1, 3, 3, 55, 55, 1, 3, 3, 1,
              1, 1, 3, 3, 55, 55, 1, 3, 9, 1), ncol = 2)

My problem

Within each group of duplicate rows (i.e. each set of identical rows), I wish to identify the first row index and assign it to all occurrences within that group. For example, there are several duplicate rows with 1 in both columns (rows 1, 2, 7, 10). On each of these rows I want the first row index, i.e. 1.

x
#       [,1] [,2]
#  [1,]    1    1 # first row of 1-1. Assign its row index, 1, to all 1-1 rows
#  [2,]    1    1
#  [3,]    3    3 # first row of 3-3. Assign its row index, 3, to all 3-3 rows
#  [4,]    3    3
#  [5,]   55   55 # first row of 55-55. Assign its row index, 5, to all 55-55 rows
#  [6,]   55   55 
#  [7,]    1    1
#  [8,]    3    3
#  [9,]    3    9 # first (and only) row of 3-9; row index 9
# [10,]    1    1

Desired result:

1 1 3 3 5 5 1 3 9 1

My attempt

The best I've come up with is a convoluted approach based on duplicated and for loops, that is neither efficient nor elegant. I'm also aware of possible solutions for data.frames; those involving concatenating rows into strings are quite resource-intensive too.

# Identify duplicates
duplicate <- duplicated(x, MARGIN = 1)

# Identify first occurrence of each duplicate
firstDup <- duplicated(x, MARGIN = 1, fromLast = TRUE) & !duplicate
indices <- which(firstDup)

# Initialize index for unique rows
index <- seq_len(dim(x)[1])

cf <- duplicate
for (i in indices) {
  # Duplicates must occur after first occurrence
  cf[seq_len(i)] <- FALSE
  dups <- apply(x[cf, , drop = FALSE], 1L, identical, x[i, ])
  index[which(cf)[dups]] <- i
}
index

Is there an elegant solution using base R?

Martin Smith asked Dec 01 '25

2 Answers

TL;DR

For integer matrices of equal size but different shapes (5e+06-by-2, 5e+05-by-20, 5000-by-2000), containing integers from 1 to 10, the fastest base answer tested was grouping/match, suggested in a comment by @alexis_laz. The fastest non-base answer was data.table::frank/match, though grouping/match was comparable in all cases, even outperforming the data.table answer in the 5000-by-2000 case.

Note that results may vary for double matrices or integer matrices with a greater range of values, and depending on the number of threads made available to data.table.


Background

@MikaelJagan's asplit/match(<list>, <list>) answer seems like "an elegant solution using base R". However, ?match warns:

Matching for lists is potentially very slow and best avoided except in simple cases.

Given that the OP has "a matrix with many rows and columns", we wanted to compare the performance of the asplit/match(<list>, <list>) answer to that of the other base answers:

  • @Onyambu's paste/match(<chr>, <chr>) answer;
  • @ThomasIsCoding's interaction/match(<int>, <int>) answer;
  • @alexis_laz's grouping/match(<int>, <int>) answer.
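For orientation before the benchmark, here is the asplit/match idea applied to the OP's 10-by-2 example (a reconstruction of the referenced answer, not a verbatim quote):

```r
# OP's example matrix, reproduced here for the check
x <- matrix(c(1, 1, 3, 3, 55, 55, 1, 3, 3, 1,
              1, 1, 3, 3, 55, 55, 1, 3, 9, 1), ncol = 2)

l <- asplit(x, 1L)  # split the matrix into a list of row vectors
match(l, l)         # for each row, the index of its first occurrence
# [1] 1 1 3 3 5 5 1 3 9 1
```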

We benchmarked these alongside some non-base answers, which we used as points of reference (recognizing that the OP asked for base only):

  • @MikaelJagan's Rcpp answer;
  • @Henrik's data.table answers:
    1. A self-join passing which = TRUE and mult = "first" to [.data.table;
    2. Two approaches based on row ranking, differing according to how ties are handled:
      • frank(ties.method = "average")/match(<dbl>, <dbl>),
      • frank(ties.method = "dense")/match(<int>, <int>).

Setup

library(microbenchmark)
library(data.table)
getDTthreads() # 4

f_asplit <- function(x) {
  l <- asplit(x, 1L)
  match(l, l) }

f_paste <- function(x) {
  s <- do.call(paste, as.data.frame(x))
  match(s, s) }

f_interaction <- function(x) {
  z <- as.integer(interaction(as.data.frame(x)))
  match(z, z) }

f_grouping <- function(x) {
  g <- do.call(grouping, as.data.frame(x))
  o <- order(g, method = "radix")
  e <- attr(g, "ends")
  z <- rep.int(seq_along(e), c(e[1L], e[-1L] - e[-length(e)]))[o]
  match(z, z) }

f_join <- function(x) {
  d <- as.data.table(x)
  d[d, on = names(d), mult = "first", which = TRUE] }

f_frank_average <- function(x) {
  d <- as.data.table(x)
  r <- frank(d, ties.method = "average")
  match(r, r) }

f_frank_dense <- function(x) {
  d <- as.data.table(x)
  r <- frank(d, ties.method = "dense")
  match(r, r) }

Rcpp::sourceCpp('<copy source code from @MikaelJagan\'s answer here>')
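f_grouping is the least obvious of these functions, so a self-contained illustration of its mechanics on the OP's example may help (variable names here are ours, not from the answer). grouping() returns a permutation that gathers identical rows together, with an "ends" attribute marking where each group stops in that permuted order:

```r
# OP's 10-by-2 example
x0 <- matrix(c(1, 1, 3, 3, 55, 55, 1, 3, 3, 1,
               1, 1, 3, 3, 55, 55, 1, 3, 9, 1), ncol = 2)

g <- do.call(grouping, as.data.frame(x0))
e <- attr(g, "ends")  # cumulative end positions of each group

# Group sizes, then a group id for every position in grouped order;
# order(g) is the inverse permutation, mapping rows back to original order
sizes <- c(e[1L], e[-1L] - e[-length(e)])
z <- rep.int(seq_along(e), sizes)[order(g, method = "radix")]

# First row index of each group, assigned to every member
match(z, z)
# [1] 1 1 3 3 5 5 1 3 9 1
```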

Benchmarking

Many rows, few columns

We first assessed performance using a 5e+06-by-2 integer matrix:

set.seed(1L)
x <- matrix(sample(10L, size = 1e+07L, replace = TRUE), ncol = 2L)

microbenchmark(
  f_asplit(x), 
  f_paste(x), 
  f_interaction(x),
  f_grouping(x),
  f_join(x),
  f_frank_average(x),
  f_frank_dense(x),
  f_rcpp(x),
  times = 10L,
  check = "identical",
  setup = gc(FALSE)
)
Unit: milliseconds
               expr         min          lq        mean     median          uq         max neval
        f_asplit(x) 17369.93905 18861.91195 19070.21298 19013.0180 19207.29194 22420.71085    10
         f_paste(x)   502.63884   507.35077   509.01823   509.2443   511.72301   515.10083    10
   f_interaction(x)   234.19311   236.52494   241.80098   238.7392   242.32923   259.75644    10
      f_grouping(x)   182.25226   182.89358   187.09642   184.6124   187.10444   208.15532    10
          f_join(x)   119.43460   120.86829   123.16607   122.9332   125.07169   128.44722    10
 f_frank_average(x)   104.40150   107.53607   111.00268   108.5597   116.80375   121.83675    10
   f_frank_dense(x)    86.60926    88.29555    91.42976    90.4716    92.32413    99.30659    10
          f_rcpp(x)   459.02304   464.79855   472.43669   468.2492   470.25508   523.06734    10

f_asplit is two orders of magnitude slower than the base alternatives. f_grouping is the fastest base answer, but f_frank_dense is faster by a factor of about 2 (and fastest overall).

Fewer rows, more columns

The results above do not generalize to all integer matrix inputs. For example, f_interaction scales very poorly with ncol(x): the number of possible interactions is u^ncol(x) if each column of x has u unique elements.
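To put that in concrete terms, with u = 10 distinct values per column and 20 columns, interaction() must enumerate an astronomically large set of candidate levels:

```r
u <- 10  # distinct values per column
k <- 20  # number of columns
u^k      # 1e+20 possible interaction levels -- far too many to materialize
```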

For this reason, we performed a second benchmark, this time considering a matrix with fewer rows (5e+05) and more columns (20).

set.seed(1L)
x <- matrix(sample(10L, size = 1e+07L, replace = TRUE), ncol = 20L)

An initial test of f_interaction resulted in a memory allocation error, so it was excluded from the benchmark.

system.time(f_interaction(x))
Error: cannot allocate vector of size 7.5 Gb
Timing stopped at: 173.2 6.05 200.4
microbenchmark(
  f_asplit(x),
  f_paste(x),
  ## f_interaction(x),
  f_grouping(x), 
  f_join(x),
  f_frank_average(x),
  f_frank_dense(x),
  f_rcpp(x),
  times = 10L,
  check = "identical",
  setup = gc(FALSE)
)
Unit: milliseconds
               expr          min           lq         mean       median           uq          max neval
        f_asplit(x)   5416.08762   5681.23523   5731.89246   5732.31779   5905.44517   5913.77141    10
         f_paste(x)    592.92990    604.15083    629.31101    623.78679    637.81814    724.83871    10
      f_grouping(x)     63.89522     64.14134     65.42723     65.11530     66.00557     68.06045    10
          f_join(x)    340.73722    342.18096    353.35774    352.08861    359.88480    382.13480    10
 f_frank_average(x)     69.90496     70.81840     72.29819     72.04409     73.11977     77.44347    10
   f_frank_dense(x)     52.58033     53.33760     54.42029     54.01672     55.63532     56.99664    10
          f_rcpp(x) 184096.21999 184816.36584 185774.76817 186218.58335 186696.31674 186781.24972    10

f_grouping remains the fastest base answer. Notably, it is now faster than f_paste by a full order of magnitude and only marginally slower than f_frank_dense.

Even fewer rows, even more columns

We performed a final benchmark excluding the slowest answers in the last round (f_asplit and f_rcpp), now considering a 5000-by-2000 integer matrix:

set.seed(1L)
x <- matrix(sample(10L, size = 1e+07L, replace = TRUE), ncol = 2000L)
microbenchmark(
  ## f_asplit(x),
  f_paste(x),
  ## f_interaction(x),
  f_grouping(x), 
  f_join(x),
  f_frank_average(x),
  f_frank_dense(x),
  ## f_rcpp(x),
  times = 10L,
  check = "identical",
  setup = gc(FALSE)
)
Unit: milliseconds
               expr        min         lq       mean     median         uq        max neval
         f_paste(x) 1067.47994 1075.45148 1083.17391 1080.72997 1089.74027 1102.45249    10
      f_grouping(x)   19.24007   19.50026   19.86404   19.79002   20.25302   20.60127    10
          f_join(x)  616.66706  621.29854  630.61460  628.16315  636.39097  650.16180    10
 f_frank_average(x)   59.82007   61.41706   62.68610   62.99318   64.56520   64.88463    10
   f_frank_dense(x)   58.03648   60.59857   63.50526   61.99278   66.03694   71.30638    10

Now f_grouping is fastest overall, and faster than f_frank_dense by a factor of about 3.

Henrik answered Dec 03 '25


If you have a large matrix, the following solution might suffice:

l <- do.call(paste, data.frame(x))
match(l, l)
[1] 1 1 3 3 5 5 1 3 9 1
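One caveat not raised in the answer: with paste's default sep = " ", rows of a character matrix can collide, since the separator may also appear inside values. For purely numeric matrices this is not an issue, but a separator unlikely to occur in the data is safer in general. A hypothetical example:

```r
# Two distinct rows, ("a b", "c") and ("a", "b c"), that paste identically
y <- matrix(c("a b", "a", "c", "b c"), ncol = 2)
do.call(paste, data.frame(y))  # both rows collapse to "a b c"

# A rarely-occurring separator keeps the rows distinguishable
l <- do.call(paste, c(data.frame(y), sep = "\r"))
match(l, l)
# [1] 1 2
```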
KU99 answered Dec 03 '25


