Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find columns that are constant by groups

Given a data frame (or a data.table object), say data, I want to find the columns which are constant in each of the groups defined by a factor given as a column of data. For numeric columns, a "nearly constant" one should be regarded as constant. A naive implementation is in the constant_by function below

is_constant <- function(x, eps = 1e-3) {
    if (any(is.na(x))) return(all(is.na(x)))
    else if (is.numeric(x)) return(diff(range(x)) < eps) 
    else return(length(unique(x)) == 1) 
}
constant_by <- function(data, by, eps = 1e-3) {
    is_constant_by <- function(x) {
        tapply(x, data[ , by], is_constant, eps = eps)
    }
    sapply(data, function(col) all(is_constant_by(col)))
}

The function is_constant could be made faster, but more importantly the package data.table could make things much faster.

Here is an example involving seismic data(caution: about 12MO). The data contains variables related to the event (earthquake) and variables related to the seismic station.

library(data.table)
temp <- tempfile()
download.file("https://shake.mi.ingv.it/ita18-flatfile/ITA18_SA_flatfile.zip", temp)
Ita18 <- fread(unzip(temp, "ITA18_SA_flatfile.csv"))
unlink(temp)
## 'station_code' refers to a specific network, so make a proper 'id'
Ita18 <- Ita18[ , station_id := paste(network_code, station_code, sep = "_")]
## naive solution
st <- system.time({
    res <- constant_by(as.data.frame(Ita18), by = "station_id")
})
## attempt with data.table. Could be transformed into a function
st2 <- system.time({
    res2 <- list()
    for (col in names(Ita18)) {
        res2[[col]] <- all(Ita18[ , .(test = is_constant(get(col))),
                                 by = "station_id"]$test)
    }
})
all.equal(res, unlist(res2))
rbind(st, st2)

My data.table attempt is disappointingly slow. I believe that the loop should be "inside" the data table by giving a list to the j index, but I could not make it work.

like image 698
Yves Avatar asked Oct 24 '25 14:10

Yves


2 Answers

An option that runs in ~0.1 seconds:

library(data.table)
temp <- tempfile()
download.file("https://shake.mi.ingv.it/ita18-flatfile/ITA18_SA_flatfile.zip", temp)
Ita18 <- fread(unzip(temp, "ITA18_SA_flatfile.csv"))
unlink(temp)

system.time({
  eps <- 1e-3
  bnum <- vapply(Ita18[1], is.numeric, FALSE, USE.NAMES = FALSE)
  Ita18[ , station_id := paste(network_code, station_code, sep = "_")]
  x <- Ita18[,lapply(.SD, max), station_id][,station_id := NULL]
  y <- Ita18[,lapply(.SD, min, na.rm = TRUE), station_id][,station_id := NULL]
  res <- logical(length(bnum)) # initialize the results vector
  res[bnum] <- colSums( # numerical comparisons
    x[,..bnum] - y[,..bnum] < eps | (is.na(x[,..bnum]) & is.na(y[,..bnum])),
    na.rm = TRUE
  ) == nrow(x)
  res[!bnum] <- colSums( # non-numerical comparisons
    x[,!..bnum] == y[,!..bnum] | (is.na(x[,!..bnum]) & is.na(y[,!..bnum])),
    na.rm = TRUE
  ) == nrow(x)
  res <- c(res, TRUE) # add `station_id` column
  names(res) <- names(Ita18)
})
#>    user  system elapsed 
#>    0.03    0.00    0.08

Compare to the original solution:

is_constant <- function(x, eps = 1e-3) {
  if (any(is.na(x))) return(all(is.na(x)))
  else if (is.numeric(x)) return(diff(range(x)) < eps) 
  else return(length(unique(x)) == 1) 
}

constant_by <- function(data, by, eps = 1e-3) {
  is_constant_by <- function(x) {
    tapply(x, data[ , by], is_constant, eps = eps)
  }
  sapply(data, function(col) all(is_constant_by(col)))
}

system.time({
  Ita18[ , station_id := paste(network_code, station_code, sep = "_")]
  res0 <- constant_by(as.data.frame(Ita18), by = "station_id")
})
#>    user  system elapsed 
#>    4.29    0.03    4.41

# check that the answers are the same
identical(res, res0)
#> [1] TRUE
like image 144
jblood94 Avatar answered Oct 27 '25 02:10

jblood94


Minor modifications and your task runs in less than a second:

is_constant2 <- function(x, eps = 1e-3) {
  if (anyNA(x))           all(is.na(x))
  else if (is.numeric(x)) (max(x) - min(x)) < eps
  else                    length(unique(x)) == 1
}

res3 <- Ita18[, lapply(.SD, is_constant2), by = station_id] |>
  _[, lapply(.SD, all), .SDcols = !"station_id"]

identical(res2[names(res2) != "station_id"], as.list(res3))
# [1] TRUE
like image 22
sindri_baldur Avatar answered Oct 27 '25 02:10

sindri_baldur



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!