Given a data frame (or a data.table object), say data, I want to
find the columns which are constant in each of the groups defined by a
factor given as a column of data. For numeric columns, a "nearly
constant" one should be regarded as constant. A naive implementation
is in the constant_by function below
is_constant <- function(x, eps = 1e-3) {
if (any(is.na(x))) return(all(is.na(x)))
else if (is.numeric(x)) return(diff(range(x)) < eps)
else return(length(unique(x)) == 1)
}
constant_by <- function(data, by, eps = 1e-3) {
is_constant_by <- function(x) {
tapply(x, data[ , by], is_constant, eps = eps)
}
sapply(data, function(col) all(is_constant_by(col)))
}
The function is_constant could be made faster, but more importantly
the package data.table could make things much faster.
Here is an example involving seismic data(caution: about 12MO). The data contains variables related to the event (earthquake) and variables related to the seismic station.
library(data.table)
temp <- tempfile()
download.file("https://shake.mi.ingv.it/ita18-flatfile/ITA18_SA_flatfile.zip", temp)
Ita18 <- fread(unzip(temp, "ITA18_SA_flatfile.csv"))
unlink(temp)
## 'station_code' refers to a specific network, so make a proper 'id'
Ita18 <- Ita18[ , station_id := paste(network_code, station_code, sep = "_")]
## naive solution
st <- system.time({
res <- constant_by(as.data.frame(Ita18), by = "station_id")
})
## attempt with data.table. Could be transformed into a function
st2 <- system.time({
res2 <- list()
for (col in names(Ita18)) {
res2[[col]] <- all(Ita18[ , .(test = is_constant(get(col))),
by = "station_id"]$test)
}
})
all.equal(res, unlist(res2))
rbind(st, st2)
My data.table attempt is disappointingly slow. I believe that the
loop should be "inside" the data table by giving a list to the j index,
but I could not make it work.
An option that runs in ~0.1 seconds:
library(data.table)
temp <- tempfile()
download.file("https://shake.mi.ingv.it/ita18-flatfile/ITA18_SA_flatfile.zip", temp)
Ita18 <- fread(unzip(temp, "ITA18_SA_flatfile.csv"))
unlink(temp)
system.time({
eps <- 1e-3
bnum <- vapply(Ita18[1], is.numeric, FALSE, USE.NAMES = FALSE)
Ita18[ , station_id := paste(network_code, station_code, sep = "_")]
x <- Ita18[,lapply(.SD, max), station_id][,station_id := NULL]
y <- Ita18[,lapply(.SD, min, na.rm = TRUE), station_id][,station_id := NULL]
res <- logical(length(bnum)) # initialize the results vector
res[bnum] <- colSums( # numerical comparisons
x[,..bnum] - y[,..bnum] < eps | (is.na(x[,..bnum]) & is.na(y[,..bnum])),
na.rm = TRUE
) == nrow(x)
res[!bnum] <- colSums( # non-numerical comparisons
x[,!..bnum] == y[,!..bnum] | (is.na(x[,!..bnum]) & is.na(y[,!..bnum])),
na.rm = TRUE
) == nrow(x)
res <- c(res, TRUE) # add `station_id` column
names(res) <- names(Ita18)
})
#> user system elapsed
#> 0.03 0.00 0.08
Compare to the original solution:
is_constant <- function(x, eps = 1e-3) {
if (any(is.na(x))) return(all(is.na(x)))
else if (is.numeric(x)) return(diff(range(x)) < eps)
else return(length(unique(x)) == 1)
}
constant_by <- function(data, by, eps = 1e-3) {
is_constant_by <- function(x) {
tapply(x, data[ , by], is_constant, eps = eps)
}
sapply(data, function(col) all(is_constant_by(col)))
}
system.time({
Ita18[ , station_id := paste(network_code, station_code, sep = "_")]
res0 <- constant_by(as.data.frame(Ita18), by = "station_id")
})
#> user system elapsed
#> 4.29 0.03 4.41
# check that the answers are the same
identical(res, res0)
#> [1] TRUE
Minor modifications and your task runs in less than a second:
is_constant2 <- function(x, eps = 1e-3) {
if (anyNA(x)) all(is.na(x))
else if (is.numeric(x)) (max(x) - min(x)) < eps
else length(unique(x)) == 1
}
res3 <- Ita18[, lapply(.SD, is_constant2), by = station_id] |>
_[, lapply(.SD, all), .SDcols = !"station_id"]
identical(res2[names(res2) != "station_id"], as.list(res3))
# [1] TRUE
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With