 

How can I estimate R script running time?

Tags:

r

I have a data.frame as below:

> str(df)
'data.frame':    8219 obs. of  60 variables:
 $ q01: int  3 3 3 1 4 3 1 5 2 5 ...
 $ q02: int  3 3 3 2 4 5 4 4 3 5 ...
 $ q03: int  4 2 1 2 4 4 2 3 2 2 ...
 $ q04: int  3 4 2 3 2 2 2 4 4 5 ...
 .
 .
 .
 $ q60: int  3 3 5 2 1 2 4 2 1 2 ...

Each item is an integer from 1 to 5.
When I run corr.test(df, method = "kendall"), I get no output even after two hours.

For time management, if I could estimate the processing time, I could have a cup of coffee if it's going to take 10 minutes, or start coding another project first if it's going to take two hours.

Is there a method to estimate the R script running time?

Plus, my laptop has a dual-core 2.4 GHz CPU with 8 GB of memory.

Asked by kittygirl


1 Answer

Here is a way to estimate timings based on the number of rows of data. I simulate a 60-column data frame, then use lapply() and system.time() to calculate timings for subsets of increasing size.

library(psych)
# create 9000 rows of data w/ 60 columns
system.time(data <- as.data.frame(matrix(round(runif(9000*60,min = 1, max = 5)),
                                         nrow = 9000)))
# add an id column so we can subset the data by number of observations
id <- 1:9000
data <- cbind(id,data)
# time corr.test() on subsets of increasing size
observations <- c(100,200,500,1000,2000)
theTimings <- lapply(observations,function(x){
     system.time(r <- corr.test(data[id <= x,2:61],method = "kendall"))
})
# label each timing with its number of observations
theNames <- paste0("timings_",observations,"_obs")
names(theTimings) <- theNames
theTimings

...and the output:

> theTimings
$timings_100_obs
   user  system elapsed
  0.435   0.023   0.457

$timings_200_obs
   user  system elapsed
  1.154   0.019   1.174

$timings_500_obs
   user  system elapsed
  5.969   0.026   5.996

$timings_1000_obs
   user  system elapsed
 24.260   0.045  24.454

$timings_2000_obs
   user  system elapsed
106.465   0.109 106.603

Generating predictions

We can take the data from our analysis thus far, fit a model, and predict the timings for larger data sets. First, we create a data frame with the timing information, then fit a linear model. We'll print the model summary to check the R^2 for goodness of fit.

# elapsed times (seconds) from the timing runs above
time <- c(0.457,1.174,5.996,24.454,106.603)
timeData <- data.frame(observations,time)
# model elapsed time as a linear function of the number of observations
fit <- lm(time ~ observations, data = timeData)
summary(fit)

The summary indicates that a linear model appears to be a good fit to the data, recognizing that we used a small number of observations as input to the model.

> summary(fit)

Call:
lm(formula = time ~ observations, data = timeData)

Residuals:
      1       2       3       4       5
  9.808   4.906  -7.130 -16.769   9.186

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -14.970240   8.866838  -1.688  0.18993
observations   0.056193   0.008612   6.525  0.00731 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.38 on 3 degrees of freedom
Multiple R-squared:  0.9342,    Adjusted R-squared:  0.9122
F-statistic: 42.57 on 1 and 3 DF,  p-value: 0.007315

Next, we'll build another data frame with additional numbers of observations, and use this to generate predicted timings.

# predict timings for larger numbers of observations
predictions <- data.frame(observations = c(3000,4000,5000,6000,7000,8000,9000))
data.frame(observations = predictions,predicted = predict(fit,predictions))

Given our model, the 9,000 observation data frame should take about 8.2 minutes on my laptop.

> data.frame(observations = predictions,predicted = predict(fit,predictions))
  observations predicted
1         3000  153.6102
2         4000  209.8037
3         5000  265.9971
4         6000  322.1906
5         7000  378.3841
6         8000  434.5776
7         9000  490.7710

> 490 / 60
[1] 8.166667

We'll need to run timings on larger numbers of observations to determine whether, somewhere between 2,000 and 9,000 observations, the algorithm degrades to worse-than-linear scaling.

Also, note that the timings vary significantly based on the CPU speed, number of cores, and available RAM of the machine. These tests were conducted on a 2015-era MacBook Pro 15 with the following configuration.

[Screenshot: MacBook Pro hardware configuration]
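
For reproducibility, the machine's configuration can also be captured from within R. A minimal sketch using base R and the parallel package; the benchmarkme calls are an optional assumption (that package is not used elsewhere in this answer):

# operating system, release, and architecture
Sys.info()[c("sysname","release","machine")]
# number of logical cores
parallel::detectCores(logical = TRUE)
# CPU model and installed RAM, if the benchmarkme package is installed
# benchmarkme::get_cpu()
# benchmarkme::get_ram()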

Improving the model

Given the back and forth in the comments on the original post and my answer, we can hypothesize that there is a nonlinear effect that becomes prominent above 2,000 observations. We can test this by adding a quadratic term to the model and generating new predictions.

First, we'll collect the data for 3,000, 4,000, and 5,000 observations in order to increase the number of degrees of freedom in the model, as well as to provide more data from which we might detect a quadratic effect.
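
The code for these additional runs isn't shown; here is a sketch of how they could be produced with the same lapply() / system.time() harness, reusing the data and id objects created in the first code block:

# time corr.test() on larger subsets of the simulated data
moreObservations <- c(3000,4000,5000)
theTimings <- lapply(moreObservations,function(x){
     system.time(r <- corr.test(data[id <= x,2:61],method = "kendall"))
})
names(theTimings) <- paste0("timings_",moreObservations,"_obs")
theTimings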

> theTimings
$timings_3000_obs
   user  system elapsed
259.444   0.329 260.149

$timings_4000_obs
   user  system elapsed
458.993   0.412 460.085

$timings_5000_obs
   user  system elapsed
730.178   0.839 731.915

Next, we'll run linear models with and without the quadratic effect, generate predictions, and compare the results. First, we'll run the models and print the summary for the quadratic model.

# timings from all runs so far, plus a squared term for the quadratic model
observations <- c(100,200,500,1000,2000,3000,4000,5000)
obs_squared <- observations^2
time <- c(0.457,1.174,5.996,24.454,106.603,260.149,460.085,731.951)
timeData <- data.frame(observations,obs_squared,time)
fitLinear <- lm(time ~ observations, data = timeData)
fitQuadratic <- lm(time ~ observations + obs_squared, data = timeData)
summary(fitQuadratic)

...and the output:

> summary(fitQuadratic)

Call:
lm(formula = time ~ observations + obs_squared, data = timeData)

Residuals:
      1       2       3       4       5       6       7       8
-0.2651  0.2384  0.7455 -0.2363 -2.8974  4.5976 -2.7581  0.5753

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.121e+00  1.871e+00   0.599   0.5752
observations -7.051e-03  2.199e-03  -3.207   0.0238 *
obs_squared   3.062e-05  4.418e-07  69.307 1.18e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.764 on 5 degrees of freedom
Multiple R-squared:  0.9999,    Adjusted R-squared:  0.9999
F-statistic: 3.341e+04 on 2 and 5 DF,  p-value: 4.841e-11

Not only has the R^2 improved to 0.9999 with the quadratic model, but both the linear and quadratic terms are also significantly different from 0 at alpha = 0.05. Interestingly, with a quadratic term in the model, the linear effect is negative.
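
One way to confirm that the quadratic term earns its place is a nested-model comparison. A short sketch using the fitLinear and fitQuadratic objects defined above:

# F test comparing the nested models; a significant result favors the quadratic term
anova(fitLinear,fitQuadratic)
# AIC tells the same story while penalizing the extra parameter
AIC(fitLinear,fitQuadratic)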

Finally, we'll generate predictions for both models, combine them into a data frame and print the results.

# add the squared term so predict() can use it with the quadratic model
predictions$obs_squared <- predictions$observations^2
predLinear <- predict(fitLinear,predictions)
predQuadratic <- predict(fitQuadratic,predictions)
data.frame(observations = predictions$observations,
           obs_squared = predictions$obs_squared,
           predLinear,
           predQuadratic)

...and the results:

  observations obs_squared predLinear predQuadratic
1         3000     9.0e+06   342.6230      255.5514
2         4000     1.6e+07   482.8809      462.8431
3         5000     2.5e+07   623.1388      731.3757
4         6000     3.6e+07   763.3967     1061.1490
5         7000     4.9e+07   903.6546     1452.1632
6         8000     6.4e+07  1043.9125     1904.4181
7         9000     8.1e+07  1184.1704     2417.9139

Conclusions

First, as we added data the linear prediction of processing time at 9,000 observations increased from 491 seconds to 1,184 seconds. As expected, adding data to the model helped improve its accuracy.

Second, the time prediction of the quadratic model was more than 2X that of the linear model, and the 2,417.9 second prediction was within 13.74 seconds of the actual runtime, less than a 0.6% error.
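
These error figures can be reproduced directly in R, using the quadratic prediction from the table above and the actual elapsed time reported in the appendix below:

actual        <- 2404.175    # elapsed seconds for the full 9,000 observation run
quadratic9000 <- 2417.9139   # quadratic model prediction at 9,000 observations
quadratic9000 - actual                    # absolute error: about 13.74 seconds
(quadratic9000 - actual) / actual * 100   # percentage error: about 0.57%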

Appendix

Question: How long did it really take to process all observations?

When I ran the 9,000 observation data frame through the test, it took 40 minutes to complete. This was almost 5x longer than the initial linear prediction based on runs of up to 2,000 observations, and slightly more than 2X the updated linear prediction based on runs of up to 5,000 observations.

> # validate model
> system.time(r <- corr.test(data[,2:61],method = "kendall"))
    user   system  elapsed
2398.572    2.990 2404.175
> 2404.175 / 60
[1] 40.06958
>
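
The multiples quoted above follow directly from these numbers:

2404.175 / 490.771    # ~4.9x the initial linear prediction (runs up to 2,000 observations)
2404.175 / 1184.1704  # ~2.0x the updated linear prediction (runs up to 5,000 observations)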

Conclusion: while the linear version of the model is more accurate than the IHME COVID-19 fatality model that originally predicted 1,000,000 - 2,000,000 fatalities in the U.S., it's still too inaccurate to be useful as a predictor of how long one's machine will take to complete a 9,000 observation analysis across 60 variables with corr.test().

However, the quadratic model is very accurate, which illustrates the importance of developing multiple hypotheses before using any one model to make predictions.

Question: isn't the number of cores irrelevant?

A couple of comments on my answer assert that since the R corr.test() function uses a single thread to process the data, the number of cores on a CPU is not relevant to the runtime performance.

My tests for this answer, as well as performance analyses I have done with R functions that support multithreading (e.g., Improving Performance of caret::train() with Random Forest), indicate that in practice, CPUs with similar speed, but fewer cores, are slower than those with more cores.

In this specific situation, where we analyzed the performance of corr.test(), I ran a second series of tests on an HP Spectre x360 with an Intel Core i7-6500U CPU that also runs at 2.5 GHz but has only two cores. Its processing time degrades faster than that of the Intel Core i7-4870HQ (also 2.5 GHz), as illustrated by the following table.

[Table: corr.test() elapsed times by number of observations, Core i7-4870HQ vs. Core i7-6500U]

As we can see from the table, the Core i7-6500U is 22.5% slower than the Core i7-4870HQ at 100 observations, and this deficit grows as the number of observations included in the timing simulations increases to 4,000.

Answered by Len Greski


