Idiom for dropping a single column in a data.table

Tags: r, data.table

I need to drop one column from a data.frame containing a few hundred columns.

With a data.frame, I'd use subset to do this conveniently:

> dat <- data.table( data.frame(x=runif(10),y=rep(letters[1:5],2),z=runif(10)),key='y' )
> subset(dat,select=c(-z))
            x y
 1: 0.1969049 a
 2: 0.7916696 a
 3: 0.9095970 b
 4: 0.3529506 b
 5: 0.4923602 c
 6: 0.5993034 c
 7: 0.1559861 d
 8: 0.9929333 d
 9: 0.3980169 e
10: 0.1921226 e

This still works, but it does not seem like a very data.table-like idiom. I could manually construct a list of the column names I want to keep, which seems a little more data.table-like:

> dat[,list(x,y)]
            x y
 1: 0.1969049 a
 2: 0.7916696 a
 3: 0.9095970 b
 4: 0.3529506 b
 5: 0.4923602 c
 6: 0.5993034 c
 7: 0.1559861 d
 8: 0.9929333 d
 9: 0.3980169 e
10: 0.1921226 e

But then I have to construct such a list, which is clunky.

Is subset the proper way to conveniently drop a column or two, or does it cause a performance hit? If not, what's the better way?

Edit

Benchmarks:

> dat <- data.table( data.frame(x=runif(10^7),y=rep(letters[1:10],10^6),z=runif(10^7)),key='y' )
> microbenchmark( subset(dat,select=c(-z)), dat[,list(x,y)] )
Unit: milliseconds
                         expr       min        lq    median        uq      max
1           dat[, list(x, y)] 102.62826 167.86793 170.72847 199.89789 792.0207
2 subset(dat, select = c(-z))  33.26356  52.55311  53.53934  55.00347 180.8740

But where it may really matter is memory use, if subset copies the whole data.table.

858

asked May 10 '13 00:05

Ari B. Friedman

1 Answer

If you want to remove the column permanently, use := NULL:

dat[, z := NULL]

If the names of the columns to drop are stored in a character vector, wrap the variable in () so that it is evaluated as a vector of column names rather than treated as a column name itself:

toDrop <- c('z')

dat[, (toDrop) := NULL]
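As a minimal sketch of the same idea with more than one column, the example below drops two columns at once. The toy table and its column names (x, y, z, w) are made up for illustration and are not from the question.

```r
library(data.table)

# Toy table; the column names are illustrative only
dat <- data.table(x = 1:3, y = letters[1:3], z = c(0.1, 0.2, 0.3), w = 4:6)

# Names of the columns to drop, held in a character vector
toDrop <- c("z", "w")

# The parentheses make data.table use the contents of toDrop
# as column names; both columns are removed by reference
dat[, (toDrop) := NULL]

names(dat)
```

Without the parentheses, `toDrop := NULL` would be read as an attempt to remove a column literally named toDrop.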

If you want to limit which columns are available in .SD, you can pass the .SDcols argument:

dat[,lapply(.SD, somefunction) , .SDcols = setdiff(names(dat),'z')]
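To make the .SDcols call concrete, here is a small sketch in which `mean` stands in for the placeholder somefunction. The toy data are an assumption chosen so that every remaining column is numeric.

```r
library(data.table)

# Toy table with numeric columns only (illustrative values)
dat <- data.table(x = c(1, 2, 3), y = c(10, 20, 30), z = c(100, 200, 300))

# .SD contains every column except z, so mean() is never applied to z
res <- dat[, lapply(.SD, mean), .SDcols = setdiff(names(dat), "z")]

res
```

The original `dat` is left unchanged here; only the contents of .SD are restricted.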

However, data.table inspects the j expression and only retrieves the columns you use anyway. See FAQ 1.12:

When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses.

It does not try to load all the data for .SD (unless .SD appears in your call to j).

subset.data.table processes the call and eventually evaluates dat[, c('x','y'), with=FALSE].
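If you would rather write that selection yourself than go through subset, a sketch along these lines should give the same result without modifying `dat`. The toy data are illustrative, and the `!` form assumes a data.table version that supports negated column selection in j.

```r
library(data.table)

# Toy table; values are illustrative only
dat <- data.table(x = c(0.5, 0.7), y = c("a", "b"), z = c(0.1, 0.9))

# Build the names to keep, then select them; with = FALSE tells
# data.table to treat j as a vector of column names
keep <- setdiff(names(dat), "z")
kept <- dat[, keep, with = FALSE]

# Assumed-available shorthand: negate the names to drop instead
kept2 <- dat[, !"z", with = FALSE]

names(kept)
names(dat)   # z is still present in the original
```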

Using := NULL should be basically instantaneous; however, it does permanently delete the column.
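If the permanent deletion is a concern, one option is to take an explicit copy first and delete from the copy. This is only a sketch: the name `trimmed` and the toy data are made up for illustration, and copying gives up the memory advantage of := on a large table.

```r
library(data.table)

# Toy table; values are illustrative only
dat <- data.table(x = 1:3, y = letters[1:3], z = c(0.1, 0.2, 0.3))

# copy() makes a full, independent copy of the table
trimmed <- copy(dat)

# Deleting from the copy leaves the original untouched
trimmed[, z := NULL]

names(trimmed)
names(dat)
```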

190

answered Sep 30 '22 17:09

mnel
