Filter a list of data frames according to the mean of one column in R

I am trying to filter a list of data frames depending on the mean value of one of their columns. Take the following example:

# as_tibble() comes from the tibble package
library(tibble)

# creating df1
df1 <- as_tibble(mtcars)

# creating df2
df2 <- as_tibble(iris)

# creating list of df (df_list)
df_list <- list(df1, df2)

# Checking the structure of the list
str(df_list)
List of 2
 $ : tibble [32 × 11] (S3: tbl_df/tbl/data.frame)
  ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
  ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
  ..$ disp: num [1:32] 160 160 108 258 360 ...
  ..$ hp  : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
  ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
  ..$ wt  : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
  ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
  ..$ vs  : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
  ..$ am  : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
  ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
  ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
 $ : tibble [150 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
  ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
  ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
  ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
  ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

I would like to obtain the mean of the 3rd column of each df (disp and Petal.Length in this example), and then keep only the df for which this mean is > 10.

I have tried the following approach:

  1. I created a function that returns a logical value depending on the calculated mean:

    mean_logical <- function(column) {
      # TRUE when the mean of the column exceeds 10, FALSE otherwise
      mean(column) > 10
    }
    
  2. Then, I wanted to use keep from {purrr} and apply my function (mean_logical) to keep only the df whose third-column mean is > 10. However, I am struggling with how to tell it to check the third column of each df in my list.

Of note, the only way I found to "access" the third column of each df in a list is by using the following:

lapply(df_list, "[", 3)

Any suggestions? Thanks in advance!

Afeb asked Oct 15 '25 04:10


2 Answers

You can use Filter from base R:

Filter(\(x) mean(x[[3]]) > 10, df_list)

or keep from purrr:

purrr::keep(df_list, \(x) mean(x[[3]]) > 10)

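Both take an anonymous predicate function that returns TRUE for the elements to keep. x[[3]] extracts the third column as a vector, so mean() works on it directly.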
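As a minimal, self-contained sketch of the same idea, the snippet below uses the plain mtcars and iris data frames that ship with R in place of the tibbles from the question. The element names and the helper variable third_col_means are assumptions added only to make the result easier to inspect; the list in the question is unnamed.

```r
# Assumption: plain data frames stand in for the tibbles in the question,
# and the list elements are named for readability.
df_list <- list(mtcars = mtcars, iris = iris)

# Mean of the third column of each element (disp and Petal.Length)
third_col_means <- sapply(df_list, function(x) mean(x[[3]]))

# Keep only the elements whose third-column mean exceeds 10
kept <- Filter(function(x) mean(x[[3]]) > 10, df_list)

names(kept)
```

With these two datasets only mtcars should remain, since the mean of disp is well above 10 while the mean of Petal.Length is below it.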

Darren Tsai answered Oct 17 '25 17:10


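An approach using subset or indexing with [:

subset(df_list, sapply(df_list, function(x) mean(x[[3]]) > 10))
df_list[sapply(df_list, function(x) mean(x[[3]]) > 10)]

x[[3]] works for both data frames and tibbles. x[,3] would only work for plain data frames: on a tibble it returns a one-column tibble, for which mean() gives NA with a warning.

Since R 4.1.0 you can shorten function(x) to \(x).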
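If the list might contain data frames whose chosen column is not numeric, a slightly more defensive variant can help. The sketch below is only an illustration: keep_by_col_mean, col and threshold are hypothetical names that do not appear in the question, and treating a non-numeric column as "drop this data frame" is an assumption about the desired behaviour.

```r
# Hypothetical helper: keep the data frames whose column `col`
# is numeric and has a mean above `threshold`.
keep_by_col_mean <- function(dfs, col = 3, threshold = 10) {
  passes <- vapply(dfs, function(x) {
    v <- x[[col]]  # [[ returns a vector for data frames and tibbles alike
    # Non-numeric columns, or means that come out as NA/NaN, count as FALSE
    is.numeric(v) && isTRUE(mean(v, na.rm = TRUE) > threshold)
  }, logical(1))
  dfs[passes]
}

# Same filter as in the question, on plain data frames
res_default <- keep_by_col_mean(list(mtcars, iris))

# Column 5 of iris is a factor (Species), so iris is dropped rather than
# triggering a warning from mean()
res_col5 <- keep_by_col_mean(list(mtcars, iris), col = 5)
```

vapply with logical(1) is used instead of sapply so the result is guaranteed to be a logical vector, even when the input list is empty.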

Andre Wildberg answered Oct 17 '25 19:10


