Is there a faster way to do this Cholesky factorization in Rcpp/c++?

Question

This is a problem arose in my Markov Chain Monte Carlo (MCMC) algorithm. And I feel this problem is quite commonly encountered especially in hierarchical Gaussian models. Hence it would be great if there is a much more efficient solution. So the problem is like this:

I have many positive integer vectors xi, for i from 1 to n, a p.s.d. matrix A and a p.s.d. matrix B. For every xi I want to compute the following Cholesky factorization:

chol( kron( diagmat( xi ), A ) + B )

So kron( diagmat( xi ), A ) + B is the covariance matrix for a multivariate Gaussian and I want to sample from this Gaussian hence need the Cholesky factorization of it. The dimension for A and B are not small and I have a large n hence computing the above Cholesky factorization for all xi is really time-consuming. Below is the Rcpp function I wrote using RcppArmadillo:

#include <cmath>
#include <Rmath.h>
#include "RcppArmadillo.h"
// [[Rcpp::depends(RcppArmadillo)]]
using namespace arma;
using namespace Rcpp;

// [[Rcpp::export]]
mat test_C(mat A, mat B, mat X){
  int n_x = X.n_cols;
  int d_B = B.n_rows;
  mat sample(d_B, n_x);
  mat mS_chol_inv;
  for (int i = 0; i < n_x; i++){
    mS_chol_inv = inv(trimatu( chol(kron(diagmat( X.col(i) ),A) + B) ));
    sample.col(i) = mS_chol_inv*randn(d_B);
  }
  return(sample);
}

I also test the computational efficiency using the following code comparing it to its R counterpart:

test_R <- function(A,B,X){
  n_x <- ncol(X)
  d_B <- ncol(B)
  res <- sapply(1:n_x, function(x){
    mS_chol <- chol( kronecker( diag(X[,x]),A ) + B )
    return( mS_chol%*%as.matrix( rnorm(d_B) ) )
  })
  return(res)
}

# Simulate Data
R1 <- matrix(rnorm(24*2),24,2)
A <- R1%*%t(R1) + 0.1*diag(24)
R2 <- matrix(rnorm(264*2),264,2)
B <- R2%*%t(R2) + 0.1*diag(264)
X <- matrix(rpois(11*2178, 5),11,2178)

res <- benchmark(res_R <- test_R(A, B, X),
             res_C <- test_C(A, B, X),
             columns=c("test", "replications", "elapsed", "relative"),
             order="relative",
             replications = 2)

And the result is as follows

> print(res) 
                      test replications elapsed relative
1 res_R <- test_R(A, B, X)            2  18.920    1.000
2 res_C <- test_C(A, B, X)            2  20.724    1.095

As can be seen, a single run is approximately 10 second, and this is simply not feasible in a MCMC algorithm. Also, since the chol() dominates the computational complexity, the improvement of using Rcpp over pure R is trivial. But maybe I did not write the most efficient code? So any advice?

Since the matrix inside chol() is very structured and the only thing that is varying is xi, maybe there is some algebra trick that I do not know that can solve this efficiently? I have posted this as a linear-algebra question under Mathematics and here is the link. Unfortunately so far I have not received any solution, people do point out that this is embarrassingly parallel.

Any advice on code/algebra will be helpful! Thanks ahead for your time.

Severin Pappadeux · Accepted Answer

You may try to use Microsoft R from here, artist formerly known as RRO. It integrates with multi-threaded Intel MKL library (same place to find and install it) and on Intel hardware matrix operations are quite fast.

Disclaimer: YMMV

mkln · Answer

(Rcpp) Eigen or Armadillo for Cholesky?

I can't post this as a comment to @zdebruine's answer, but I'd like to figure out what's the best practice for Cholesky using RcppArmadillo/Eigen.

Note: this SO question is the top Google search result for "fastest Cholesky using Armadillo or Eigen". I think it's useful for people to evaluate pros and cons of both

Here's a first attempt at a benchmark (the CPU here is an AMD Ryzen 9 5950X). I'll edit and update based on suggestions.

sessionInfo() returns:

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2019.5.281/linux/mkl/lib/intel64_lin/libmkl_rt.so

Source:

arma_source <- '
//[[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

//[[Rcpp::export]]
arma::mat arma_chol(const arma::mat& X){
  return arma::chol(X, "lower");
}
'

eigen_source <- '
//[[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>

//[[Rcpp::export]]
Eigen::MatrixXd eigen_chol(const Eigen::MatrixXd& X){
  Eigen::LLT<Eigen::MatrixXd> Xllt(X); 
  return Xllt.matrixL();
}
'

Rcpp::sourceCpp(code=arma_source)
Rcpp::sourceCpp(code=eigen_source)


nr <- 2000
Gen <- matrix(rnorm(nr^2), ncol=nr)
X <- crossprod(Gen)

RhpcBLASctl::blas_set_num_threads(1)
RhpcBLASctl::omp_set_num_threads(1)

rbenchmark::benchmark(
  L_arma = arma_chol(X),
  L_eigen = eigen_chol(X),
  U_R = chol(X),
  replications=100)

Results:

     test replications elapsed relative user.self sys.self user.child sys.child
1  L_arma          100   4.770    1.056     4.635    0.136          0         0
2 L_eigen          100  12.908    2.856    12.768    0.141          0         0
3     U_R          100   4.519    1.000     4.403    0.116          0         0

I'm not familiar with ways to make the Eigen code more efficient. I find it odd that there's a difference at all. The disparity grows if I allow for multithreading, as Eigen only runs on 1 thread in my current config. The fact that R's chol() is the same as arma's is unsurprising, as both use MKL's BLAS

If we add #define EIGEN_USE_BLAS we get the following

     test replications elapsed relative user.self sys.self user.child sys.child
1  L_arma          100   5.556    1.039     4.958    0.597          0         0
2 L_eigen          100   6.257    1.170     5.621    0.631          0         0
3     U_R          100   5.347    1.000     4.555    0.790          0         0

which pretty much makes everything use the same BLAS functions

Is there a faster way to do this Cholesky factorization in Rcpp/c++?

Tags:

r

matrix

linear-algebra

rcpp

armadillo

Bayesric

2 Answers

Severin Pappadeux

mkln

Recent Activity

Donate For Us

Is there a faster way to do this Cholesky factorization in Rcpp/c++?

Tags:

r

matrix

linear-algebra

rcpp

armadillo

Bayesric

2 Answers

Severin Pappadeux

mkln

Related questions

Recent Activity

Donate For Us