Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I remove outliers (numbers 3 standard deviations away from the mean) in each column of a data frame

Tags:

r

I have a dataset with participant IDS and 17 different measures for each participant.

I need to remove outliers- numbers that are 3 standard deviations away from the mean both sides. This needs to happen for each column individually.

So far, by using the code below I have managed to add NA to an outlier column for each column, but it doesn't help me much, since I need to be able to either add NA to the column with the rest of the numbers or simply remove the outlier number

Ideally I want to get to a file that looks like this:

ID measure1 measure2 ....measure17
1  10897                  64436
2  184658    1739473
3            75758
4  746483    4327349      3612638
5  6444      36363        46447

Code I have used so far:

phenotypes <- colnames(imaging_data_kept[,2:ncol(imaging_data_kept)])

 for (i in phenotypes){
  Min <- mean(imaging_data_kept[[i]]) - (3*sd(imaging_data_kept[[i]]))
  Max <- mean(imaging_data_kept[[i]]) + (3*sd(imaging_data_kept[[i]]))  
  imaging_data_kept[[paste0(i,"_outliers")]] <- imaging_data_kept[[i]] < 
  Min | imaging_data_kept[[i]] > Max
 }

Sample data:

SubjID M1 M2 M3 M4 M5 
1000496 14898.1 9172 4902 5921.9 1428.2 
1001121 5420.7 2855.5 4144 732.1 4960.2 
1001468 7478.8 3401.4 5143.6 1106.5 4355.5 
1004960 11316.4 8460.1 3953.4 5682.2 1717 
1005040 15052.7 6362.8 3145.2 4593 1214.5  
1005677 17883.3 6705.1 3943.5 4993.1 1373.1 
1006128 6260.8 4274.6 5865 2002.3 4727.1 
1006694 9292.8 3389.9 5141.6 1246.6 4135.7 
1009080 10391.3 8372.1 2921.8 4008.6 860.4 
1010482 9381.5 2743.4 4526.5 1160.4 3655.1 
1011508 15598.5 7365.7 4279.4 6274.1 1757.1 
like image 998
Stelu Avatar asked Jan 01 '26 07:01

Stelu


2 Answers

This will replace values more than 3 SD from the mean with NA:

dd[,-1] <- lapply(dd[,-1],
      function(x) replace(x,abs(scale(x))>3,NA))

(The scale() function computes (x-mean(x))/sd(x); abs(scale(x))>3 should be reasonably self-explanatory; replace() replaces a specified set of indices with the indicated value.)

You can then use na.omit(dd) if you want to drop all rows that contain outliers in any column.

The sample data you gave us doesn't appear to have any outliers (according to your definition) -- I added some.


dd <- read.table(header=TRUE,
                 colClasses=c("character",rep("numeric",5)),
                 text="
SubjID M1 M2 M3 M4 M5 
1000496 14898.1 9172 4902 5921.9 1428.2 
1001121 5420.7 2855.5 4144 732.1 100000
1001468 7478.8 3401.4 5143.6 1106.5 4355.5 
1004960 11316.4 8460.1 3953.4 5682.2 1717 
1005040 15052.7 6362.8 3145.2 4593 1214.5  
1005677 17883.3 6705.1 100000 4993.1 1373.1 
1006128 6260.8 4274.6 5865 2002.3 4727.1 
1006694 9292.8 3389.9 5141.6 1246.6 4135.7 
1009080 10391.3 8372.1 2921.8 4008.6 860.4 
1010482 9381.5 2743.4 4526.5 1000000 3655.1 
1011508 15598.5 7365.7 4279.4 6274.1 1757.1
")
like image 109
Ben Bolker Avatar answered Jan 02 '26 21:01

Ben Bolker


I recommend using the boxplot()- function, which calculates outliers. You can acces them in your boxplot-object via boxplot$out or get the quantiles via boxplot$stats. Which is what I'm doing next.

But beware that boxplot does not calculate outliers in terms of 3 standard deviations but with Q1 - 1.5*IQR and Q3 + 1.5*IQR respectively.


library(dplyr) # for the pipe operators

#creating sample data 
df <- data.frame("var1" = c(-20.32, -15.29, rnorm(5,1,1), 11.23, 20.45),
                 "var2" = c(-12.43, -3.12, rnorm(5, 1,1), 10.75, 18.11))

#looks like that
> df
         var1        var2
1 -20.3200000 -12.4300000
2 -15.2900000  -3.1200000
3   0.9950276   1.2645415
4   1.7022687   0.8313770
5   1.8828154  -0.7459769
6   1.2299670   0.5053378
7   0.2749259   2.0239793
8  11.2300000  10.7500000
9  20.4500000  18.1100000

#remove outliers
nooutliers <- lapply(df, function(x) boxplot(df, plot = FALSE)) %>%
                lapply(`[`, "stats") %>% 
                  lapply(range) %>%
                    mapply(function (x,y) !between(x, y[1], y[2]), df, .) %>%
                      as.data.frame %>%
                        mapply(function(x,y) {y[x] <- NA; y},  
                               y = df, x = .)

#looks like this now
> nooutliers
           var1       var2
 [1,]        NA         NA
 [2,]        NA -3.1200000
 [3,] 0.9950276  1.2645415
 [4,] 1.7022687  0.8313770
 [5,] 1.8828154 -0.7459769
 [6,] 1.2299670  0.5053378
 [7,] 0.2749259  2.0239793
 [8,]        NA         NA
 [9,]        NA         NA

This code calculates the range within the whiskers for each column, assigns NA to all values outside of this range and returns a matrix.

I suppose this is what you're looking for.

UPDATE: With 3 standard deviations:

df <- data.frame("var1" = c(-210.32, rnorm(20,1,1), 234.45),
                 "var2" = c(-230.43, rnorm(20, 1,1), 213.11))


phenotypes <- colnames(df)

for (i in phenotypes){
  Min <- mean(df[[i]]) - (3*sd(df[[i]]))
  Max <- mean(df[[i]]) + (3*sd(df[[i]]))  
  df[[i]][df[[i]] < Min | df[[i]] > Max] <- NA}

This adopts your outlier definition.

like image 30
Humpelstielzchen Avatar answered Jan 02 '26 19:01

Humpelstielzchen



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!