Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to calculate multivariate normal distribution function in R

Here's what I tried, making use of the mvtnorm package

Sample Dataset

library(mvtnorm)

set.seed(2357)
df <- data.frame(
  x = rnorm(1000, mean=80, sd=20),
  y = rnorm(1000, mean=0, sd=5),
  z = rnorm(1000, mean=0, sd=5)
)

head(df)
      x      y       z
1 70.38  1.307  0.2005
2 59.76  5.781 -3.5095
3 54.14 -1.313 -1.9022
4 79.91  7.754 -6.2076
5 87.07  1.389  1.1065
6 75.89  1.684  6.2979

Fit multivariate normal dist and check P(x <= 80) ~ 0.5

# Get the dimension means and correlation matrix
means <- c(x=mean(df$x), y=mean(df$y), z=mean(df$z))
corr <- cor(df)

# Check P(x <= 80)
sum(df$x <= 80)/nrow(df)  # 0.498
pmvnorm(lower=-Inf, upper=c(80, Inf, Inf), mean=means, corr=corr)  # 0.8232

Why is the fitted result 0.82? Where did I go wrong?

like image 501
Ben Avatar asked Jan 17 '26 23:01

Ben


1 Answers

First, you don't need to simulate anything to study the pmvnorm function:

pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=c(80,0,0), corr=diag(rep(1,3)))

The result is 0.5, as you expected.

Your means vector is approximately (79, 0, 0), so let's try it:

pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=c(79,0,0), corr=diag(rep(1,3)))

The result now is 0.8413447. There's nothing the matter. By specifying only the correlation matrix, you told the software to assume that all variances were unity. In your simulation, the variances were 400, 25, and 25: very different from what you specified in the arguments!

The correct calculation uses the covariance matrix of the data, not its correlation matrix:

pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=means, sigma=cov(df))

The result is 0.5178412, quite in keeping with the data.

like image 198
whuber Avatar answered Jan 19 '26 18:01

whuber



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!