I'm trying to create dist objects from large distance matrices. I'm running out of memory using stats::as.dist.  For example, I have around 128 Gb available on current machine but as.dist runs out of memory when processing a 73,000 x 73,000 matrix (approx 42Gb).  Given that the final dist object should be less than half the size of the matrix (i.e. it is the lower triangle, stored as a vector) it seems to me that it should be possible to do this calculation in a more memory-efficient way - provided we are careful not to create large intermediate objects and just copy the relevant elements of the input directly to the output.
Looking at the code for getS3method('as.dist', 'default'), I see that it does the calculation using ans <- m[row(m) > col(m)] which requires the creation of row and col matrices of the same dimensionality as the input.
I thought I might be able to improve on this using the algorithm from here to generate the indexes of the lower triangle. Here is my attempt using this method.
as.dist.new = function(dm, diag = FALSE, upper = FALSE) {
  n = dim(dm)[1]
  stopifnot(is.matrix(dm))
  stopifnot(dim(dm)[2] == n)
  k = 1:((n^2 - n)/2)
  j <- floor(((2 * n + 1) - sqrt((2 * n - 1) ^ 2 - 8 * (k - 1))) / 2)
  i <- j + k - (2 * n - j) * (j - 1) / 2
  idx = cbind(i,j)
  remove(i,j,k)
  gc()
  d = dm[idx]
  class(d) <- "dist"
  attr(d, "Size") <- n
  attr(d, "call") <- match.call()
  attr(d, "Diag") <- diag
  attr(d, "Upper") <- upper
  d
}
This works fine on smaller matrices. Here's a simple example:
N = 10
dm = matrix(runif(N*N), N, N) 
diag(dm) = 0
x = as.dist(dm)
y = as.dist.new(dm)
However, if we create a larger distance matrix it runs into the same memory issues as as.dist.
E.g. this version crashes on my system:
N = 73000
dm = matrix(runif(N*N), N, N)
gc() 
diag(dm) = 0
gc()
as.dist.new(dm)
Does anyone have a suggestion how to perform this operation more efficiently?  R or Rcpp solutions welcome.  NB looking at this answer to a related problem (generating the full distance matrix from 2-column location data) it seems that it may be possible to do this using RcppArmadillo, but I've got no experience of using that.
Well, the logic to traverse the lower triangular is relatively simple, and if you do it in C++ then it can also be fast:
library(Rcpp)
sourceCpp(code='
// [[Rcpp::plugins(cpp11)]]
#include <cstddef> // size_t
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector as_dist(const NumericMatrix& mat) {
  std::size_t nrow = mat.nrow();
  std::size_t ncol = mat.ncol();
  std::size_t size = nrow * (nrow - 1) / 2;
  NumericVector ans(size);
  if (nrow > 1) {
    std::size_t k = 0;
    for (std::size_t j = 0; j < ncol; j++) {
      for (std::size_t i = j + 1; i < nrow; i++) {
        ans[k++] = mat(i,j);
      }
    }
  }
  ans.attr("class") = "dist";
  ans.attr("Size") = nrow;
  ans.attr("Diag") = false;
  ans.attr("Upper") = false;
  return ans;
}
')
as_dist(matrix(1:100, 10, 10))
    1  2  3  4  5  6  7  8  9
2   2                        
3   3 13                     
4   4 14 24                  
5   5 15 25 35               
6   6 16 26 36 46            
7   7 17 27 37 47 57         
8   8 18 28 38 48 58 68      
9   9 19 29 39 49 59 69 79   
10 10 20 30 40 50 60 70 80 90
My current solution is to calculate the dist object directly from lat and lon vectors, without generating the intermediate distance matrix at all. On large matrices, this is several hundred times faster than the "conventional" pipeline of geosphere::mdist() followed by stats::as.dist() and uses only as much memory as required to store the final dist object.
The following Rcpp source is based on using the functions from here to calculate haversine distance in c++, together with an adaptation of @Alexis method to iterate through the lower triangle elements in c++.
#include <Rcpp.h>
using namespace Rcpp;
double distanceHaversine(double latf, double lonf, double latt, double lont, double tolerance){
  double d;
  double dlat = latt - latf;
  double dlon =  lont - lonf;
  d = (sin(dlat * 0.5) * sin(dlat * 0.5)) + (cos(latf) * cos(latt)) * (sin(dlon * 0.5) * sin(dlon * 0.5));
  if(d > 1 && d <= tolerance){
    d = 1;
  }
  return 2 * atan2(sqrt(d), sqrt(1 - d)) * 6378137.0;
}
double toRadians(double deg){
  return deg * 0.01745329251;  // PI / 180;
}
//-----------------------------------------------------------
// [[Rcpp::export]]
NumericVector calc_dist(Rcpp::NumericVector lat, 
                        Rcpp::NumericVector lon, 
                        double tolerance = 10000000000.0) {
  std::size_t nlat = lat.size();
  std::size_t nlon = lon.size();
  if (nlat != nlon) throw std::range_error("lat and lon different lengths");
  if (nlat < 2) throw std::range_error("Need at least 2 points");
  std::size_t size = nlat * (nlat - 1) / 2;
  NumericVector ans(size);
  std::size_t k = 0;
  double latf;
  double latt;
  double lonf;
  double lont;
  
  for (std::size_t j = 0; j < (nlat-1); j++) {
    for (std::size_t i = j + 1; i < nlat; i++) {
      latf = toRadians(lat[i]);
      lonf = toRadians(lon[i]);
      latt = toRadians(lat[j]);
      lont = toRadians(lon[j]);
      ans[k++] = distanceHaversine(latf, lonf, latt, lont, tolerance);
    }
  }
  
  return ans;
}
/*** R
as_dist = function(lat, lon, tolerance  = 10000000000.0) {
  dd = calc_dist(lat, lon, tolerance)
  attr(dd, "class") = "dist"
  attr(dd, "Size") = length(lat)
  attr(dd, "call") = match.call()
  attr(dd, "Diag") = FALSE
  attr(dd, "Upper") = FALSE
  return(dd)
}
*/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With