shift() in data.table v1.9.6 is slow for many groups

Tags: r, data.table

First of all, thanks for implementing shift() in data.table 1.9.6. When I have many different groups, however, shift() is unexpectedly slower than my old code:

library(data.table)
library(microbenchmark)
set.seed(1)
mg <- data.table(expand.grid(year = 2012:2016, id = 1:1000),
                 value = rnorm(5000))
microbenchmark(dt194 = mg[, l1 := c(value[-1], NA), by = .(id)],
               dt196 = mg[, l2 := shift(value, n = 1,
                                        type = "lead"), by = .(id)])
## Unit: milliseconds
##   expr      min        lq      mean    median       uq        max neval
##  dt194  4.93735  5.236034  5.718654  5.623736  5.74395   9.555922   100
##  dt196 83.92612 87.530404 91.700317 90.953947 91.43783 257.473242   100

A detailed script is here: https://github.com/nachti/datatable_test/blob/master/leadtest.R

Did I misapply shift()?

Edit: Avoiding := doesn't help (@MichaelChirico):

microbenchmark(dt194 = mg[, c(value[-1], NA), by = id],
               dt196 = mg[, shift(value, n = 1,
                                   type = "lead"), by = id])

## Unit: milliseconds
##   expr       min        lq     mean    median        uq       max neval
##  dt194  5.161973  5.429927  5.78047  5.698263  5.798132  10.42217   100
##  dt196 79.526981 87.914502 92.18144 91.240949 91.896799 266.04031   100

Besides, using := is part of the task.
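
Before comparing timings, it may be worth confirming that the two expressions really compute the same column. The following is a minimal sketch, assuming the same mg table as in the benchmarks above; same_result and na_per_group are names made up for illustration.

```r
library(data.table)

set.seed(1)
mg <- data.table(expand.grid(year = 2012:2016, id = 1:1000),
                 value = rnorm(5000))

# Old approach: drop the first value of each group and pad with NA
mg[, l1 := c(value[-1], NA), by = .(id)]
# shift() approach: lead by one row within each group
mg[, l2 := shift(value, n = 1, type = "lead"), by = .(id)]

# Both columns should hold the same values
same_result <- isTRUE(all.equal(mg$l1, mg$l2))
# Each id should end with exactly one NA (the last year has no lead)
na_per_group <- mg[, sum(is.na(l2)), by = id]$V1

print(same_result)
```

If the check passes, the difference between the two calls is purely one of speed, not of results.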

asked Feb 03 '16 14:02

nachti

1 Answer

This has been resolved in data.table version 1.14.3, where shift() by group is now faster than both of the earlier approaches:

library(data.table)
library(microbenchmark)
set.seed(1)
mg = data.table(expand.grid(year=2012:2016, id=1:1000),
                value=rnorm(5000))
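# Assumption: shift_no_opt is not defined in the original answer. A plain alias
# of shift() is used here as a stand-in for the unoptimised grouped call.
shift_no_opt = shift
microbenchmark(v1.9.4  = mg[, c(value[-1], NA), by=id],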
               v1.9.6  = mg[, shift_no_opt(value, n=1, type="lead"), by=id],
               v1.14.3 = mg[, shift(value, n=1, type="lead"), by=id],
               unit="ms")
# Unit: milliseconds
#     expr     min      lq    mean  median      uq    max neval
#   v1.9.4  3.6600  3.8250  4.4930  4.1720  4.9490 11.700   100
#   v1.9.6 18.5400 19.1800 21.5100 20.6900 23.4200 29.040   100
#  v1.14.3  0.4826  0.5586  0.6586  0.6329  0.7348  1.318   100
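
On versions where the grouped call is still slow, one possible workaround (not part of the answer above) is to call shift() once on the whole column and then blank out the rows whose next row belongs to a different id. This is only a rough sketch, assuming each id's rows are contiguous and already in the desired order, as they are in mg; the column names lead_by and lead_all are made up for illustration.

```r
library(data.table)

set.seed(1)
mg <- data.table(expand.grid(year = 2012:2016, id = 1:1000),
                 value = rnorm(5000))

# Reference result: lead computed separately within each id
mg[, lead_by := shift(value, n = 1, type = "lead"), by = id]

# Ungrouped alternative: a single shift() over the whole column ...
mg[, lead_all := shift(value, n = 1, type = "lead")]
# ... then set NA wherever the following row belongs to another id.
# The last row compares against NA, which data.table treats as FALSE in i,
# and that row is already NA from the shift() call above.
mg[id != shift(id, n = 1, type = "lead"), lead_all := NA_real_]

# The two columns should match exactly
stopifnot(identical(mg$lead_by, mg$lead_all))
```

If the table is not already ordered by group, it would have to be sorted first (for example with setorder(mg, id, year)), otherwise the boundary test does not line up with the groups.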

answered Oct 24 '22 18:10

Ben373
