I have what I think is a very simple question related to the use of data.table and the := function. I don't think I quite understand the behaviour of := and often I run into similar problems.
Here is some example data
    mat <- structure(list(
        col1 = c(NA, 0, -0.015038, 0.003817, -0.011407),
        col2 = c(0.003745, 0.007463, -0.007407, -0.003731, -0.007491)),
        .Names = c("col1", "col2"),
        row.names = c(NA, -5L),
        class = c("data.table", "data.frame"))

which gives
    > mat
            col1      col2
    1:        NA  0.003745
    2:  0.000000  0.007463
    3: -0.015038 -0.007407
    4:  0.003817 -0.003731
    5: -0.011407 -0.007491

I want to create a column called col3 which gives the sum of col1 and col2. If I use
    mat[,col3 := col1 + col2]
    #         col1      col2      col3
    # 1:        NA  0.003745        NA
    # 2:  0.000000  0.007463  0.007463
    # 3: -0.015038 -0.007407 -0.022445
    # 4:  0.003817 -0.003731  0.000086
    # 5: -0.011407 -0.007491 -0.018898

then I get an NA for the first row, but I want NAs to be ignored. So I tried instead
    mat[,col3 := sum(col1,col2,na.rm=TRUE)]
    #         col1      col2      col3
    # 1:        NA  0.003745 -0.030049
    # 2:  0.000000  0.007463 -0.030049
    # 3: -0.015038 -0.007407 -0.030049
    # 4:  0.003817 -0.003731 -0.030049
    # 5: -0.011407 -0.007491 -0.030049

which is not what I am after, since it is giving me the sum of all elements of col1 and col2. I think I don't quite get :=... How can I get the sum of the elements of col1 and col2 row by row, ignoring NA values?
Not sure this is relevant, but here is my sessionInfo
    > sessionInfo()
    R version 2.15.1 (2012-06-22)
    Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

    locale:
    [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] data.table_1.8.3
To find the sum of the non-missing values in an R data frame column, we can use the sum function with na.rm set to TRUE. For example, if a data frame called df contains a column x with some missing values, the sum of its non-missing values can be found by calling sum on df$x with na.rm set to TRUE.
We can calculate row-wise sums over multiple columns by using rowSums() together with c(): we simply pass the names of the columns.
To find the row sums when NA values exist in the data frame, we can use the rowSums function and set the na.rm argument to TRUE; NA values are then removed before the row sums are calculated.
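As a minimal sketch of the two base R calls described above: the data frame df and its columns x and y are hypothetical names chosen for illustration, not data from the question.

```r
# Hypothetical data frame; df, x and y are illustrative names only.
df <- data.frame(x = c(1, NA, 3), y = c(10, 20, NA))

# Sum of one column, skipping missing values.
col_total <- sum(df$x, na.rm = TRUE)

# Row-wise sums over the selected columns, skipping missing values.
row_totals <- rowSums(df[, c("x", "y")], na.rm = TRUE)
```

Here col_total is a single number, while row_totals has one entry per row of df.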
This is standard R behaviour, nothing really to do with data.table.
Adding anything to NA will return NA:

    NA + 1
    ## NA

sum will return a single number, which := then assigns to every row.
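To see why the two attempts in the question behave differently, here is a small sketch on plain vectors. The values are the first three rows of the question's data; the names x, y, elementwise and total are only illustrative.

```r
# First three rows of col1 and col2 from the question, as plain vectors.
x <- c(NA, 0, -0.015038)
y <- c(0.003745, 0.007463, -0.007407)

# `+` works element by element, so an NA operand gives NA in that position.
elementwise <- x + y

# sum() pools all of its arguments and collapses them into one number.
total <- sum(x, y, na.rm = TRUE)

length(elementwise)  # one value per row
length(total)        # a single value
```

When a length-one value like total is assigned with :=, it is recycled, which is why every row of col3 showed the same number.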
If you want 1 + NA to return 1, then you will have to run something like

    mat[,col3 := col1 + col2]
    mat[is.na(col1), col3 := col2]
    mat[is.na(col2), col3 := col1]

to deal with the cases where col1 or col2 is NA.
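The same three-step logic can also be written as one expression. This is only a sketch of an alternative using base R's ifelse on plain vectors; x, y and s are illustrative names, and the last element is a made-up case where both values are missing.

```r
# Plain vectors standing in for col1 and col2; the last pair is a
# hypothetical case in which both values are NA.
x <- c(NA, 0, -0.015038, NA)
y <- c(0.003745, 0.007463, -0.007407, NA)

# If x is missing take y, else if y is missing take x, else add them.
s <- ifelse(is.na(x), y, ifelse(is.na(y), x, x + y))
```

With this logic a row where both values are NA stays NA. Inside a data.table the same expression could be used on the right-hand side of :=, with col1 and col2 in place of x and y.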
You could also use rowSums, which has an na.rm argument:

    mat[ , col3 := rowSums(.SD, na.rm = TRUE), .SDcols = c("col1", "col2")]

rowSums is what you want (by definition, it gives the row sums of a matrix containing col1 and col2, with NA values removed).
(@JoshuaUlrich suggested this in a comment.)
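One thing worth checking with the rowSums approach is what happens when every value in a row is missing: with na.rm = TRUE the row sum is 0, not NA. The sketch below assumes the data.table package is installed; it rebuilds the question's data with data.table() and adds one hypothetical sixth row where both columns are NA. The names dt and all_na_sum are illustrative.

```r
library(data.table)

# The question's values plus one hypothetical row with both columns NA.
dt <- data.table(
  col1 = c(NA, 0, -0.015038, 0.003817, -0.011407, NA),
  col2 = c(0.003745, 0.007463, -0.007407, -0.003731, -0.007491, NA)
)

# Row-wise sum over the two columns, ignoring NA values.
dt[, col3 := rowSums(.SD, na.rm = TRUE), .SDcols = c("col1", "col2")]

# The all-NA row has been summed to 0 at this point.
all_na_sum <- dt[6, col3]

# Optional: put NA back where both inputs were missing.
dt[is.na(col1) & is.na(col2), col3 := NA_real_]
```

Whether 0 or NA is the right answer for such a row depends on the analysis, so the last step is a choice rather than a requirement.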