I just ran into an issue with a data.table column that turned out to be of class integer64. I was reading the data with fread from a URL and did not realise the column was being parsed as integer64, a class I am not familiar with. The issue is how this class behaves in a data.table when sum() is combined with by. It has come up in two other questions on here, but in the context of using the column as an ID value (Q1 and Q2).
Summing this integer64 column by group does not give the result I would expect from a numeric column when the column contains negative values. Why is this? Is it a bug?
library(data.table)
library(bit64)
z <- data.table(
  group   = c("A", "A", "A"),
  int64   = as.integer64(c(10, 20, -10)),
  numeric = c(10, 20, -10)
)
To start, it works fine without the by statement:
z[, sum(int64)] # 20
z[, sum(int64, na.rm = TRUE)] # 20
And outside of data.table, on the plain vector:
sum(z$int64)
sum(z$int64, na.rm = TRUE)
But when including the by statement, it gets fishy:
z[, sum(int64, na.rm=FALSE), by=group] #only the negative value
#group V1
#A -10
z[, sum(int64, na.rm=TRUE), by=group] #excluding the negative value
#group V1
#A 30
z[, sum(as.numeric(int64)), by=group] #expected answer
#group V1
#A 20
This is worrying to me, because on the surface there is no reason to suspect anything is wrong with the numbers in z$int64, and I only noticed because the table had very few rows.
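Since the class is easy to miss, here is a minimal sketch of how one might list the integer64 columns of a table before aggregating. The helper name int64_cols is hypothetical (it is not part of data.table or bit64); it only relies on base R's inherits() and the toy table from above.

```r
library(data.table)
library(bit64)

# Same toy table as in the question
z <- data.table(
  group   = c("A", "A", "A"),
  int64   = as.integer64(c(10, 20, -10)),
  numeric = c(10, 20, -10)
)

# Hypothetical helper: return the names of all columns of class integer64
int64_cols <- function(dt) {
  is_int64 <- vapply(dt, inherits, logical(1), what = "integer64")
  names(dt)[is_int64]
}

found <- int64_cols(z)
print(found)
```

Running this on freshly read data would at least flag which columns deserve a second look before they go into a grouped sum().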
This has since been fixed in data.table; see https://github.com/Rdatatable/data.table/issues/1647
With a current version, the same calls return the expected result:
z[, sum(int64, na.rm=FALSE), by=group]
# group V1
# <char> <i64>
#1: A 20
z[, sum(int64, na.rm=TRUE), by=group]
# group V1
# <char> <i64>
#1: A 20
z[, sum(as.numeric(int64)), by=group]
# group V1
# <char> <num>
#1: A 20
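If you are stuck on an older data.table, a possible workaround is to avoid integer64 in the aggregation altogether. The sketch below shows two options under stated assumptions: converting existing integer64 columns to double in place, and asking fread to read large integers as double via its integer64 argument. The inline CSV text and the object names (cols, res, dt) are made up for illustration. Note that doubles are only exact up to 2^53, so this is not suitable for very large values such as IDs.

```r
library(data.table)
library(bit64)

z <- data.table(
  group = c("A", "A", "A"),
  int64 = as.integer64(c(10, 20, -10))
)

# Option 1: convert every integer64 column to double by reference
cols <- names(z)[vapply(z, inherits, logical(1), what = "integer64")]
if (length(cols) > 0L) {
  z[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
}

# The grouped sum now runs on an ordinary numeric column
res <- z[, .(total = sum(int64)), by = group]
print(res)

# Option 2: have fread return double instead of integer64 when reading.
# The CSV below is illustrative; values beyond the 32-bit integer range
# would otherwise be read as integer64 by default.
csv <- "id,value\n1,3000000000\n2,-3000000000\n"
dt <- fread(text = csv, integer64 = "double")
print(class(dt$value))
```

The same behaviour can be set globally through options(datatable.integer64 = "double"), which is the option fread consults for its default.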