I have a data.frame as below:
> str(df)
'data.frame': 8219 obs. of 60 variables:
$ q01: int 3 3 3 1 4 3 1 5 2 5 ...
$ q02: int 3 3 3 2 4 5 4 4 3 5 ...
$ q03: int 4 2 1 2 4 4 2 3 2 2 ...
$ q04: int 3 4 2 3 2 2 2 4 4 5 ...
.
.
.
$ q60: int 3 3 5 2 1 2 4 2 1 2 ...
Each item is an integer from 1 to 5.
When I run corr.test(df, method = "kendall"), I get no output even after two hours.
For time management, if I could estimate the processing time, I could go have a cup of coffee if it will take 10 minutes, or start coding another project first if it will take two hours.
Is there a way to estimate how long an R script will take to run?
Also, my laptop is a dual-core 2.4 GHz machine with 8 GB of memory.
Here is a way to estimate timings based on the number of rows of data. I simulate a 60-column data frame, then use lapply() and system.time() to calculate timings for subsets of increasing size.
library(psych)

# simulate 9,000 rows of data with 60 columns of integer responses from 1-5
system.time(data <- as.data.frame(matrix(round(runif(9000 * 60, min = 1, max = 5)),
                                         nrow = 9000)))
# add an id column so we can subset by number of observations
id <- 1:9000
data <- cbind(id, data)

# time corr.test() on subsets of increasing size
observations <- c(100, 200, 500, 1000, 2000)
theTimings <- lapply(observations, function(x) {
  system.time(r <- corr.test(data[id <= x, 2:61], method = "kendall"))
})
theNames <- paste0("timings_", observations, "_obs")
names(theTimings) <- theNames
theTimings
...and the output:
> theTimings
$timings_100_obs
user system elapsed
0.435 0.023 0.457
$timings_200_obs
user system elapsed
1.154 0.019 1.174
$timings_500_obs
user system elapsed
5.969 0.026 5.996
$timings_1000_obs
user system elapsed
24.260 0.045 24.454
$timings_2000_obs
user system elapsed
106.465 0.109 106.603
We can take the data from our analysis thus far, fit a model, and predict the timings for larger data sets. First, we create a data frame with the timing information, then fit a linear model. We'll print the model summary to check the R^2 for goodness of fit.
# elapsed times (in seconds) from the runs above
time <- c(0.457, 1.174, 5.996, 24.454, 106.603)
timeData <- data.frame(observations, time)
fit <- lm(time ~ observations, data = timeData)
summary(fit)
The summary indicates that a linear model appears to be a good fit to the data, keeping in mind that we fit it to only a handful of timing observations.
> summary(fit)
Call:
lm(formula = time ~ observations, data = timeData)
Residuals:
1 2 3 4 5
9.808 4.906 -7.130 -16.769 9.186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.970240 8.866838 -1.688 0.18993
observations 0.056193 0.008612 6.525 0.00731 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.38 on 3 degrees of freedom
Multiple R-squared: 0.9342, Adjusted R-squared: 0.9122
F-statistic: 42.57 on 1 and 3 DF, p-value: 0.007315
Next, we'll build another data frame with additional numbers of observations, and use this to generate predicted timings.
predictions <- data.frame(observations = c(3000,4000,5000,6000,7000,8000,9000))
data.frame(observations = predictions,predicted = predict(fit,predictions))
Given our model, the 9,000 observation data frame should take about 8.2 minutes on my laptop.
> data.frame(observations = predictions,predicted = predict(fit,predictions))
observations predicted
1 3000 153.6102
2 4000 209.8037
3 5000 265.9971
4 6000 322.1906
5 7000 378.3841
6 8000 434.5776
7 9000 490.7710
> 490 / 60
[1] 8.166667
We'll need to run timings on larger numbers of observations to determine whether, somewhere between 2,000 and 9,000 observations, the algorithm's scaling degrades to worse than linear.
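One quick check we could run with the timings collected so far (a sketch, not part of the original analysis) is to fit a power-law model on the log-log scale; the estimated exponent suggests how the runtime scales, with a value near 1 indicating linear scaling and a value near 2 indicating quadratic scaling.
# sketch: estimate the scaling exponent from timeData built above
fitLogLog <- lm(log(time) ~ log(observations), data = timeData)
coef(fitLogLog)["log(observations)"]  # an exponent near 2 would suggest quadratic scaling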
Also, note that the timings will vary significantly based on the CPU speed, number of cores, and available RAM of the machine. These tests were conducted on a 2015-era MacBook Pro 15 (Intel Core i7-4870HQ at 2.5 GHz).
Given the back and forth in the comments on the original post and this answer, we can hypothesize that a nonlinear effect becomes prominent above 2,000 observations. We can test this by adding a quadratic term to the model and generating new predictions.
First, we'll collect the data for 3,000, 4,000, and 5,000 observations in order to increase the number of degrees of freedom in the model, as well as to provide more data from which we might detect a quadratic effect.
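One way to collect these timings is to re-run the earlier lapply() loop with the larger subset sizes (a sketch, assuming the simulated data and id objects from above are still in the workspace):
# re-run the timing loop for the larger subsets of the simulated data
observations <- c(3000, 4000, 5000)
theTimings <- lapply(observations, function(x) {
  system.time(r <- corr.test(data[id <= x, 2:61], method = "kendall"))
})
names(theTimings) <- paste0("timings_", observations, "_obs")
theTimings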
> theTimings
$timings_3000_obs
user system elapsed
259.444 0.329 260.149
$timings_4000_obs
user system elapsed
458.993 0.412 460.085
$timings_5000_obs
user system elapsed
730.178 0.839 731.915
Next, we'll run linear models with and without the quadratic effect, generate predictions, and compare the results. First, we'll run the models and print the summary for the quadratic model.
# timings now include runs of up to 5,000 observations
observations <- c(100, 200, 500, 1000, 2000, 3000, 4000, 5000)
obs_squared <- observations^2
time <- c(0.457, 1.174, 5.996, 24.454, 106.603, 260.149, 460.085, 731.915)
timeData <- data.frame(observations, obs_squared, time)

# fit models with and without the quadratic term
fitLinear <- lm(time ~ observations, data = timeData)
fitQuadratic <- lm(time ~ observations + obs_squared, data = timeData)
summary(fitQuadratic)
> summary(fitQuadratic)
Call:
lm(formula = time ~ observations + obs_squared, data = timeData)
Residuals:
1 2 3 4 5 6 7 8
-0.2651 0.2384 0.7455 -0.2363 -2.8974 4.5976 -2.7581 0.5753
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.121e+00 1.871e+00 0.599 0.5752
observations -7.051e-03 2.199e-03 -3.207 0.0238 *
obs_squared 3.062e-05 4.418e-07 69.307 1.18e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.764 on 5 degrees of freedom
Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
F-statistic: 3.341e+04 on 2 and 5 DF, p-value: 4.841e-11
Not only has the R^2 improved to 0.9999 with the quadratic model, but both the linear and quadratic terms are also significantly different from 0 at alpha = 0.05. Interestingly, with the quadratic term in the model, the linear effect is negative.
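As an aside not in the original analysis: since the two models are nested and fit to the same data, they could also be compared formally, for example with anova() or AIC().
# formal comparison of the nested fits (both fit to timeData above)
anova(fitLinear, fitQuadratic)  # F-test on the added quadratic term
AIC(fitLinear, fitQuadratic)    # lower AIC indicates the better-fitting model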
Finally, we'll generate predictions for both models, combine them into a data frame and print the results.
# the quadratic model needs obs_squared in the prediction data as well
predictions$obs_squared <- predictions$observations^2

predLinear <- predict(fitLinear, predictions)
predQuadratic <- predict(fitQuadratic, predictions)
data.frame(observations = predictions$observations,
           obs_squared = predictions$obs_squared,
           predLinear,
           predQuadratic)
...and the results:
observations obs_squared predLinear predQuadratic
1 3000 9.0e+06 342.6230 255.5514
2 4000 1.6e+07 482.8809 462.8431
3 5000 2.5e+07 623.1388 731.3757
4 6000 3.6e+07 763.3967 1061.1490
5 7000 4.9e+07 903.6546 1452.1632
6 8000 6.4e+07 1043.9125 1904.4181
7 9000 8.1e+07 1184.1704 2417.9139
First, as we added data, the linear prediction of processing time at 9,000 observations increased from 491 seconds to 1,184 seconds. As expected, adding data to the model helped improve its accuracy.
Second, the time prediction of the quadratic model was more than 2x that of the linear model, and its 2,417.9 second prediction was within 13.74 seconds of the actual runtime, less than a 0.6% error.
When I ran the full 9,000 observation data frame through the test, it took about 40 minutes to complete. That is almost 5x the initial linear prediction based on runs of up to 2,000 observations, and slightly more than 2x the updated linear prediction based on runs of up to 5,000 observations.
> # validate model
> system.time(r <- corr.test(data[,2:61],method = "kendall"))
user system elapsed
2398.572 2.990 2404.175
> 2404.175 / 60
[1] 40.06958
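For reference, a quick arithmetic check of the comparisons above, using the numbers already reported in this answer:
actual <- 2404.175                   # elapsed seconds for the full 9,000-row run
actual / 490.7710                    # ~4.9x the first linear prediction (runs up to 2,000 obs)
actual / 1184.1704                   # ~2.0x the updated linear prediction (runs up to 5,000 obs)
2417.9139 - actual                   # quadratic prediction error: ~13.7 seconds
(2417.9139 - actual) / actual * 100  # ~0.57% error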
Conclusion: while the linear version of the model is more accurate than the IHME COVID-19 fatality model that originally predicted 1,000,000 - 2,000,000 fatalities in the U.S., it is still too inaccurate to be useful for predicting how long one's machine will take to complete a 9,000 observation analysis across 60 variables with corr.test().
However, the quadratic model is very accurate, which illustrates the importance of developing multiple hypotheses before using any one model to make predictions.
A couple of comments on my answer assert that since the R corr.test() function uses a single thread to process the data, the number of cores on a CPU is not relevant to runtime performance.
My tests for this answer, as well as performance analyses I have done with R functions that support multithreading (e.g., Improving Performance of caret::train() with Random Forest), indicate that in practice, CPUs with similar clock speeds but fewer cores are slower than those with more cores.
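For reference, you can check how many physical and logical cores R sees on your own machine with the base parallel package:
library(parallel)
detectCores(logical = FALSE)  # physical cores
detectCores(logical = TRUE)   # logical cores (includes hyper-threading)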
In this specific situation, where we analyzed the performance of corr.test(), I ran a second series of tests on an HP Spectre x360 with an Intel Core i7-6500U CPU that also runs at 2.5 GHz but has only two cores. Its processing time degrades faster than that of the Intel Core i7-4870HQ (also 2.5 GHz): the i7-6500U was 22.5% slower at 100 observations, and the deficit grew as the number of observations included in the timing simulations increased to 4,000.