Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find position of matching elements of one vector in another

I have two very long vectors:

a <- sample(1e+08L, size = 1e+09L, replace = TRUE)
b <- sample(1e+08L, size = 1e+09L, replace = TRUE)

I want to generate an integer vector r of length length(a) such that r[i] is the index of a[i] in b.

I tried pmatch(a, b) but it is very slow. Is there a more efficient way?


Desired output for a small example:

a <- c(1, 3, 5, 7, 8)
b <- c(3, 1, 7, 8, 5)
f(a, b)
## [1] 2 1 5 3 4
like image 830
user18373817 Avatar asked Oct 20 '25 15:10

user18373817


2 Answers

Your question mentions pmatch, which performs partial matching of character vectors, but it seems like you want match, which performs exact matching of integer and other vectors.

match is faster, but even faster than match is fastmatch::fmatch:

match(b, a)
fastmatch::fmatch(b, a)

Adding to the benchmarks:

library(fastmatch)

set.seed(1)
a <- sample(1e5, 1e5)
b <- sample(1e5, 1e4)

microbenchmark::microbenchmark(
  match = match(b, a),
  fastmatch = fmatch(b, a),
  check = "identical", 
  times = 100L)
Unit: microseconds
      expr      min       lq      mean    median        uq       max neval
     match 9439.020 9500.602 9659.7537 9519.6260 9555.0090 12394.546   100
 fastmatch  367.606  376.134  398.4347  382.0175  399.1145   614.467   100
like image 90
Maël Avatar answered Oct 22 '25 05:10

Maël


Benchmark:

library(fastmatch)   #fmatch
library(data.table)  #merge
library(collections) #hash

Rcpp::cppFunction('IntegerVector matchC(NumericVector x, NumericVector table) {
  IntegerVector out(x.size(), NA_INTEGER);
  for(int i = 0; i < x.size(); i++) {
    for(int j = 0; j < table.size(); j++) {
      if(x[i] == table[j]) {
        out[i] = j + 1;
        break;
      }
    }
  }
  return out;
}')

Rcpp::cppFunction('IntegerVector matchC2(const std::vector<int>& x, const std::vector<int>& table) {
  IntegerVector out(x.size(), NA_INTEGER);
  std::unordered_map<int, int> lut;
  lut.max_load_factor(table.size()/(double)*max_element(table.begin(), table.end()));
  lut.reserve(table.size());
  for(int i = 0; i < table.size(); i++) lut[table[i]] = i+1;
  for(int i = 0; i < x.size(); i++) {
    auto search = lut.find(x[i]);
    if(search != lut.end()) out[i] = search->second;
  }
  return out;
}')

set.seed(1); a <- sample(1e5, 1e5); b <- sample(1e5, 1e4)
    
bench::mark(
  match = { match(a, b) },
  fmatch = { fmatch(a, b) },
  zx8754.merge = {
    merge(data.table(x = a, rnA = seq_along(a), key = "x"),
          data.table(x = b, rnB = seq_along(b), key = "x"),
          all.x = TRUE)[order(rnA), rnB] },
  sotos.Rcpp = { matchC(a, b) },
  GKi.Rcpp = { matchC2(a, b) },
  user2974951.hash = {
    h = dict(seq_along(b), b)
    sapply(a, h$get, default = NA)},
  "jblood94.[" = `[<-`(NA_integer_, b, seq_along(b))[a]
)

Result

  expression            min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 match              1.86ms   1.99ms   479.      951.3KB     7.98   240     4
2 fmatch             1.03ms   1.12ms   896.     393.34KB     6.00   448     3
3 zx8754.merge       4.71ms   5.02ms   181.       8.05MB    31.6     92    16
4 sotos.Rcpp          2.38s    2.38s     0.420    1.22MB     0        1     0
5 GKi.Rcpp         891.93µs  945.6µs  1018.     393.16KB     7.98   510     4
6 user2974951.hash 127.58ms 133.66ms     6.85     3.44MB    30.8      4    18
7 jblood94.[       227.28µs  244.8µs  3193.     800.85KB    48.0   1597    24
like image 31
10 revs, 5 users 38%GKi Avatar answered Oct 22 '25 04:10

10 revs, 5 users 38%GKi



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!