I implemented a my_rgeom function using the fact that if X ~ Geom(p) and Y ~ Exp(-log(1 − p)), then Z = ceiling(Y) has the same distribution as X. Using the inverse-transform method, I can define my function as:
my_rgeom <- function(N, p) {
  U <- runif(N)
  # inverse transform: Y ~ Exp(-log(1 - p))
  Y <- log(U) / log(1 - p)
  return(ceiling(Y))
}
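A quick sanity check that the distribution matches (a sketch; Z has support starting at 1, while R's dgeom counts failures from 0, hence the shift by one):
set.seed(42)
z <- my_rgeom(1e5, 0.3)
rbind(empirical = table(z)[1:4] / 1e5,  # relative frequencies of z = 1, ..., 4
      theoretical = dgeom(0:3, 0.3))    # P(X = 0), ..., P(X = 3)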
Why doesn't R's implementation use, for example, inverse-transform sampling, since it's much faster?
Here are the results of benchmarking this against the standard rgeom function:
library(microbenchmark)
library(ggplot2)
library(gridExtra)

N <- 10^6
prob <- c(0.1, 0.2, 0.3, 0.4, 0.5)

results_list <- list()
for (p in prob) {
  benchmark_results <- microbenchmark(
    my_rgeom(N, p),
    rgeom(N, p),
    times = 100
  )
  # Store the results in the list
  results_list[[as.character(p)]] <- benchmark_results
}

# Plot the results for each value of p
plots <- lapply(prob, function(p) {
  autoplot(results_list[[as.character(p)]], title = paste("p =", p))
})

# Display the plots
gridExtra::grid.arrange(grobs = plots)
[Plot: grid of benchmark timing distributions for my_rgeom(N, p) vs. rgeom(N, p) at each value of p]

And the output of results_list:
results_list
$`0.1`
Unit: milliseconds
           expr      min       lq      mean    median       uq      max neval
 my_rgeom(N, p)  63.2907  69.1584  76.95829  73.62385  78.8459 172.4273   100
    rgeom(N, p) 115.0211 123.7101 131.01782 127.05655 134.2484 191.6843   100

$`0.2`
Unit: milliseconds
           expr      min       lq      mean    median        uq      max neval
 my_rgeom(N, p)  63.4618  76.8853  87.02115  85.91125  94.29075 195.5412   100
    rgeom(N, p) 117.3709 130.1152 142.02507 135.24175 148.68975 274.5087   100

$`0.3`
Unit: milliseconds
           expr      min        lq      mean   median       uq      max neval
 my_rgeom(N, p)  66.2584  77.31255  85.50857  81.4676  88.8424 178.2553   100
    rgeom(N, p) 115.3894 129.59360 139.56611 135.2741 146.1852 187.7755   100

$`0.4`
Unit: milliseconds
           expr      min       lq      mean   median       uq      max neval
 my_rgeom(N, p)  68.1903  75.1489  83.53151  80.1580  86.8057 180.0446   100
    rgeom(N, p) 116.3780 121.7696 132.27215 126.9434 137.8039 213.7012   100

$`0.5`
Unit: milliseconds
           expr      min        lq      mean   median       uq      max neval
 my_rgeom(N, p)  70.0728  73.63675  82.20571  79.3234  84.6281 178.6583   100
    rgeom(N, p) 116.5077 119.13275 126.49830 124.5081 130.1617 187.6423   100
The R implementation of rgeom appears to treat the geometric distribution as a special case of the negative binomial distribution, which relies on Poisson random variates. I can see no reason for not using inverse transform sampling, as it will be faster and can be made just as numerically stable as rgeom. However, I would suggest a couple of modifications to my_rgeom.
First, -rexp(N) will be faster than log(runif(N)); the two are equivalent in distribution, since if U ~ Unif(0, 1), then -log(U) ~ Exp(1). Second, to better handle very small values of p, log1p(-p) is preferred over log(1 - p), as mentioned in the comments.
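To see the numerical point concretely, here is a minimal illustration (1e-20 is an arbitrary tiny value chosen for demonstration):
p <- 1e-20
log(1 - p)   # 1 - p rounds to exactly 1, so this underflows to 0
#> [1] 0
log1p(-p)    # computed to full precision
#> [1] -1e-20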
As a side note, rgeom has support starting at 0, so I will use floor instead of ceiling.
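A quick illustration of the support difference (a sketch; p = 0.5 is arbitrary):
set.seed(1)
min(rgeom(1e5, 0.5))                    # 0: rgeom counts failures before the first success
min(ceiling(-rexp(1e5) / log1p(-0.5)))  # 1: Y > 0, so ceiling(Y) >= 1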
The following code demonstrates the above.
First define some candidate functions:
set.seed(1845088186)

# original inverse-transform version, with floor for support starting at 0
rgeom1 <- function(N, p) {
  U <- runif(N)
  Y <- log(U) / log(1 - p)
  return(floor(Y))
}

# -rexp(N) in place of log(runif(N))
rgeom2 <- function(N, p) {
  floor(-rexp(N)/log(1 - p))
}

# log1p(-p) in place of log(1 - p)
rgeom3 <- function(N, p) {
  floor(-rexp(N)/log1p(-p))
}

rgeom_all <- list(rgeom, rgeom1, rgeom2, rgeom3)
Test that they all return the same distribution.
(p <- runif(1))
#> [1] 0.08180412
x <- lapply(rgeom_all, \(f) f(1e6, p))
# two-sample KS tests against the rgeom sample; warnings are suppressed
# because ks.test warns about ties with discrete data
suppressWarnings(sapply(x[-1], \(y) ks.test(y, x[[1]])$p.value))
#> [1] 0.7088201 0.7734314 0.7856763
mapply(summary, x)
#> [,1] [,2] [,3] [,4]
#> Min. 0.00000 0.00000 0.00000 0.00000
#> 1st Qu. 3.00000 3.00000 3.00000 3.00000
#> Median 8.00000 8.00000 8.00000 8.00000
#> Mean 11.23687 11.23102 11.21926 11.21025
#> 3rd Qu. 16.00000 16.00000 16.00000 16.00000
#> Max. 203.00000 187.00000 146.00000 171.00000
Compare timings:
p <- runif(1e6)
microbenchmark::microbenchmark(
  rgeom = rgeom(1e6, p),
  rgeom1 = rgeom1(1e6, p),
  rgeom2 = rgeom2(1e6, p),
  rgeom3 = rgeom3(1e6, p)
)
#> Unit: milliseconds
#>    expr     min       lq      mean    median       uq      max neval
#>   rgeom 82.3725 91.00560 100.98802 100.48860 108.7335 146.5782   100
#>  rgeom1 64.3914 74.54830  82.31284  80.25625  87.1611 123.3017   100
#>  rgeom2 55.6838 60.14365  69.41881  66.87925  75.9065 143.6629   100
#>  rgeom3 53.8316 58.36640  65.48733  64.77050  70.4816 102.4296   100
Note that log1p(-p) is slightly faster than log(1 - p).
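One way to isolate that claim is to time just the two log calls on the same vector (a minimal sketch):
p <- runif(1e6)
microbenchmark::microbenchmark(log(1 - p), log1p(-p))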
However, the behavior of rgeom differs from that of the other three implementations when the variates are in the integer range:
sapply(rgeom_all, \(f) typeof(f(1, 0.5)))
#> [1] "integer" "double" "double" "double"
sapply(rgeom_all, \(f) typeof(f(1, .Machine$double.xmin)))
#> [1] "double" "double" "double" "double"
They also differ as to how they handle some invalid inputs:
rgeom(1, -0.5)
#> Warning in rgeom(1, -0.5): NAs produced
#> [1] NA
rgeom1(1, -0.5)
#> [1] -2
rgeom2(1, -0.5)
#> [1] -2
rgeom3(1, -0.5)
#> [1] -1
rgeom(1, 1.1)
#> Warning in rgeom(1, 1.1): NAs produced
#> [1] NA
rgeom1(1, 1.1)
#> Warning in log(1 - p): NaNs produced
#> [1] NaN
rgeom2(1, 1.1)
#> Warning in log(1 - p): NaNs produced
#> [1] NaN
rgeom3(1, 1.1)
#> Warning in log1p(-p): NaNs produced
#> [1] NaN
rgeom(1, NA)
#> Warning in rgeom(1, NA): NAs produced
#> [1] NA
rgeom1(1, NA)
#> [1] NA
rgeom2(1, NA)
#> [1] NA
rgeom3(1, NA)
#> [1] NA
The three candidate functions can be modified to mirror* the behavior of rgeom with some impact on performance:

(*Closely enough: fully and faithfully mirroring the behavior of rgeom in all cases will eliminate or even reverse the performance difference without resorting to C code.)
rgeom1 <- function(N, p) {
  # scalar negative p: match rgeom by returning integer NAs
  if (length(p) == 1L && p[1] < 0) return(rep(NA_integer_, N))
  y <- log(runif(N))/log(1 - p)
  y[p < 0] <- NA_real_
  # return integers when all variates fit in the integer range, as rgeom does
  if (max(y, na.rm = 1) > .Machine$integer.max) floor(y) else as.integer(y)
}

rgeom2 <- function(N, p) {
  if (length(p) == 1L && p[1] < 0) return(rep(NA_integer_, N))
  y <- -rexp(N)/log(1 - p)
  y[p < 0] <- NA_real_
  if (max(y, na.rm = 1) > .Machine$integer.max) floor(y) else as.integer(y)
}

rgeom3 <- function(N, p) {
  if (length(p) == 1L && p[1] < 0) return(rep(NA_integer_, N))
  y <- -rexp(N)/log1p(-p)
  y[p < 0] <- NA_real_
  if (max(y, na.rm = 1) > .Machine$integer.max) floor(y) else as.integer(y)
}
Testing:
rgeom(1, -0.5)
#> Warning in rgeom(1, -0.5): NAs produced
#> [1] NA
rgeom1(1, -0.5)
#> Warning in max(y, na.rm = 1): no non-missing arguments to max; returning -Inf
#> [1] NA
rgeom2(1, -0.5)
#> Warning in max(y, na.rm = 1): no non-missing arguments to max; returning -Inf
#> [1] NA
rgeom3(1, -0.5)
#> Warning in max(y, na.rm = 1): no non-missing arguments to max; returning -Inf
#> [1] NA
rgeom(1, 1.1)
#> Warning in rgeom(1, 1.1): NAs produced
#> [1] NA
rgeom1(1, 1.1)
#> Warning in log(1 - p): NaNs produced
#> Warning in max(y, na.rm = 1): no non-missing arguments to max; returning -Inf
#> [1] NA
rgeom2(1, 1.1)
#> Warning in log(1 - p): NaNs produced
#> Warning in max(y, na.rm = 1): no non-missing arguments to max; returning -Inf
#> [1] NA
rgeom3(1, 1.1)
#> Warning in log1p(-p): NaNs produced
#> Warning in max(y, na.rm = 1): no non-missing arguments to max; returning -Inf
#> [1] NA
rgeom(1, NA)
#> Warning in rgeom(1, NA): NAs produced
#> [1] NA
rgeom1(1, NA)
#> Warning in max(y, na.rm = 1): no non-missing arguments to max; returning -Inf
#> [1] NA
rgeom2(1, NA)
#> Warning in max(y, na.rm = 1): no non-missing arguments to max; returning -Inf
#> [1] NA
rgeom3(1, NA)
#> Warning in max(y, na.rm = 1): no non-missing arguments to max; returning -Inf
#> [1] NA
Timing:
p <- runif(1e6, -0.1, 1.1)
p[sample(1e6, 1e3)] <- NA
microbenchmark::microbenchmark(
  rgeom = rgeom(1e6, p),
  rgeom1 = rgeom1(1e6, p),
  rgeom2 = rgeom2(1e6, p),
  rgeom3 = rgeom3(1e6, p)
)
#> Unit: milliseconds
#>    expr     min       lq     mean   median       uq      max neval
#>   rgeom 70.8633 72.26875 74.56052 73.36910 75.60830  88.3045   100
#>  rgeom1 65.1287 69.38690 74.28986 71.11060 73.09685 114.2515   100
#>  rgeom2 58.2231 62.40600 65.61609 64.12870 65.65905  99.9390   100
#>  rgeom3 58.9580 63.21480 68.88956 64.88055 66.62870 116.3178   100
However, for very small values of p, rgeom1 and rgeom2 are not well behaved:
sapply(rgeom_all, \(f) sum(is.finite(f(1e6, .Machine$double.xmin))))
#> Warning in f(1e+06, .Machine$double.xmin): NAs produced
#> [1] 981778 0 0 981709
rgeom3 is well behaved over all values of p, in addition to being the fastest of the four implementations.
@jblood94's answer is great; I'm going to try to address the "why?" part of this from a historical perspective.
The rgeom code is 26 years old (see git blame); in 1997, people were much less likely to be concerned specifically about the time taken to generate 1 million geometric deviates (even today, you need to be generating a lot of deviates before the difference between 75 and 150 milliseconds per million becomes important ...).

?rgeom refers to Devroye (1986), Non-Uniform Random Variate Generation, p. 480:

'rgeom' uses the derivation as an exponential mixture of Poissons, see Devroye ...

p. 480 seems as though it might be a typo (it's in the exercises on continuous univariate distributions): maybe this should have been p. 488 (which discusses the negative binomial as a mixture of Poissons; the geometric is a special case)?

It's a little weird, and presumably an oversight, because pages 499-500 specifically discuss faster generation of geometric deviates.
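For reference, that exponential-mixture-of-Poissons construction can be sketched in a couple of lines of R (rgeom_mixture is a hypothetical name for illustration; as far as I can tell, this is essentially what the C code behind rgeom does, modulo edge-case handling):
# if Lambda ~ Exp with mean (1 - p)/p and X | Lambda ~ Poisson(Lambda),
# then X ~ Geom(p) on {0, 1, 2, ...} (a negative binomial with size 1)
rgeom_mixture <- function(N, p) rpois(N, rexp(N) * (1 - p) / p)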
If you can make a reasonable case for the value of this improvement (i.e., is it worth R-core's time to pull this method into base R, with the potential backward compatibility issues?), it would be worth discussing on [email protected] ...