Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Box-plot R calculating outliers

> summary(mydata)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0      93     107     110     125     197 
> range=1.5*(125-93)
> upper_whisker=125+range
> lower_whisker=93-range
> upper_whisker
[1] 173
> lower_whisker
[1] 45
> boxplot(mydata)$stats
  [,1]
[1,]   56  #Lower whisker by boxplot
[2,]   93   
[3,]  107
[4,]  125
[5,]  173

I tried looking up the formula for calculating after and before what values are the points to be considered outliers It was

 Above =>3rd Qu +(3rd Qu - 1st Qu)*1.5
 Below =>1st Qu -(3rd Qu - 1st Qu)*1.5

For some reason they don't seem to match with the stats returned by boxplot function in R I have a feeling it's something silly here

Are they calculated differently? Or am I reading the wrong answer from boxplot?

Edit:

I've used https://www.kaggle.com/uciml/pima-indians-diabetes-database and ran

mydata=raw$Glucose[raw$Outcome==0]

EDIT2:The Box-plot

I suppose if

#max(min(x), Q1 - (IQR(x)*1.5)) #lower whisker

is returning min(x), there shouldn't be any outliers and min(mydata) is 0

Edit 3: Clearer view of Quantile

quantile(mydata)
0%  25%  50%  75% 100% 
0   93  107  125  197 

Edit 4: Added vector as requested

c(85L, 89L, 116L, 115L, 110L, 139L, 103L, 126L, 99L, 97L, 145L, 
117L, 109L, 88L, 92L, 122L, 103L, 138L, 180L, 133L, 106L, 159L, 
146L, 71L, 105L, 103L, 101L, 88L, 150L, 73L, 100L, 146L, 105L, 
84L, 44L, 141L, 99L, 109L, 95L, 146L, 139L, 129L, 79L, 0L, 62L, 
95L, 112L, 113L, 74L, 83L, 101L, 110L, 106L, 100L, 107L, 80L, 
123L, 81L, 142L, 144L, 92L, 71L, 93L, 151L, 125L, 81L, 85L, 126L, 
96L, 144L, 83L, 89L, 76L, 78L, 97L, 99L, 111L, 107L, 132L, 120L, 
118L, 84L, 96L, 125L, 100L, 93L, 129L, 105L, 128L, 106L, 108L, 
154L, 102L, 57L, 106L, 147L, 90L, 136L, 114L, 153L, 99L, 109L, 
88L, 151L, 102L, 114L, 100L, 148L, 120L, 110L, 111L, 87L, 79L, 
75L, 85L, 143L, 87L, 119L, 0L, 73L, 141L, 111L, 123L, 85L, 105L, 
113L, 138L, 108L, 99L, 103L, 111L, 96L, 81L, 147L, 179L, 125L, 
119L, 142L, 100L, 87L, 101L, 197L, 117L, 79L, 122L, 74L, 104L, 
91L, 91L, 146L, 122L, 165L, 124L, 111L, 106L, 129L, 90L, 86L, 
111L, 114L, 193L, 191L, 95L, 142L, 96L, 128L, 102L, 108L, 122L, 
71L, 106L, 100L, 104L, 114L, 108L, 129L, 133L, 136L, 155L, 96L, 
108L, 78L, 161L, 151L, 126L, 112L, 77L, 150L, 120L, 137L, 80L, 
106L, 113L, 112L, 99L, 115L, 129L, 112L, 157L, 179L, 105L, 118L, 
87L, 106L, 95L, 165L, 117L, 130L, 95L, 0L, 122L, 95L, 126L, 139L, 
116L, 99L, 92L, 137L, 61L, 90L, 90L, 88L, 158L, 103L, 147L, 99L, 
101L, 81L, 118L, 84L, 105L, 122L, 98L, 87L, 93L, 107L, 105L, 
109L, 90L, 125L, 119L, 100L, 100L, 131L, 116L, 127L, 96L, 82L, 
137L, 72L, 123L, 101L, 102L, 112L, 143L, 143L, 97L, 83L, 119L, 
94L, 102L, 115L, 94L, 135L, 99L, 89L, 80L, 139L, 90L, 140L, 147L, 
97L, 107L, 83L, 117L, 100L, 95L, 120L, 82L, 91L, 119L, 100L, 
135L, 86L, 134L, 120L, 71L, 74L, 88L, 115L, 124L, 74L, 97L, 154L, 
144L, 137L, 119L, 136L, 114L, 137L, 114L, 126L, 132L, 123L, 85L, 
84L, 139L, 173L, 99L, 194L, 83L, 89L, 99L, 80L, 166L, 110L, 81L, 
154L, 117L, 84L, 94L, 96L, 75L, 130L, 84L, 120L, 139L, 91L, 91L, 
99L, 125L, 76L, 129L, 68L, 124L, 114L, 125L, 87L, 97L, 116L, 
117L, 111L, 122L, 107L, 86L, 91L, 77L, 105L, 57L, 127L, 84L, 
88L, 131L, 164L, 189L, 116L, 84L, 114L, 88L, 84L, 124L, 97L, 
110L, 103L, 85L, 87L, 99L, 91L, 95L, 99L, 92L, 154L, 78L, 130L, 
111L, 98L, 143L, 119L, 108L, 133L, 109L, 121L, 100L, 93L, 103L, 
73L, 112L, 82L, 123L, 67L, 89L, 109L, 108L, 96L, 124L, 124L, 
92L, 152L, 111L, 106L, 105L, 106L, 117L, 68L, 112L, 92L, 183L, 
94L, 108L, 90L, 125L, 132L, 128L, 94L, 102L, 111L, 128L, 92L, 
104L, 94L, 100L, 102L, 128L, 90L, 103L, 157L, 107L, 91L, 117L, 
123L, 120L, 106L, 101L, 120L, 127L, 162L, 112L, 98L, 154L, 165L, 
99L, 68L, 123L, 91L, 93L, 101L, 56L, 95L, 136L, 129L, 130L, 107L, 
140L, 107L, 121L, 90L, 99L, 127L, 118L, 122L, 129L, 110L, 80L, 
127L, 158L, 126L, 134L, 102L, 94L, 108L, 83L, 114L, 117L, 111L, 
112L, 116L, 141L, 175L, 92L, 106L, 105L, 95L, 126L, 65L, 99L, 
102L, 109L, 153L, 100L, 81L, 121L, 108L, 137L, 106L, 88L, 89L, 
101L, 122L, 121L, 93L)
like image 826
Jai Avatar asked Oct 29 '25 06:10

Jai


2 Answers

Your calculation was almost right, R uses this:

#max(min(x), Q1 - (IQR(x)*1.5)) #lower whisker
#min(max(x), Q3 + (IQR(x)*1.5)) #upper whisker

That's why, it picks the max/min between the min(x)/max(x), and the standard formula.

Here an example:

my_data <- mtcars$mpg

bp <- boxplot(my_data)
bp$stats
# [1,] 10.40 # lower whisker
# [2,] 15.35
# [3,] 19.20 # == median(my_data)
# [4,] 22.80
# [5,] 33.90 # upper whisker


max(min(my_data,na.rm=T), as.numeric(quantile(my_data, 0.25)) - (IQR(my_data)*1.5))
#[1] 10.4 #lower whisker
min(max(my_data,na.rm=T), as.numeric(quantile(my_data, 0.75)) + (IQR(my_data)*1.5))
#[1] 33.9 # upper whisker
like image 130
RLave Avatar answered Oct 30 '25 20:10

RLave


I think there are few things to clarified. The first thing is that you should always provide a reproducible example for helping people to help you. An outlier is defined as a data point that is located outside the whiskers of the boxplot (e.g: outside 1.5 times the interquartile range above the upper quartile and bellow the lower quartile). The correct way to figure out how this work is simulating some Student's T data under a pre-specified a random number generator state.

set.seed(1)
mydata <- rt(100, df = 3)
boxplot(mydata)
summary(mydata)

Boxplot

Then we can calculate the interquartile range and the lower and upper bounds for outliers according to the rule in the text above

t <- as.vector(summary(mydata))
iqr.range <- t[5]-t[2]
upper_outliers <- t[5]+iqr.range*1.5
lower_outliers <- t[2]-iqr.range*1.5

Let's check the data which are defined as outliers, while the boxplot whiskers are the data points immediately before/after the lower/upper boundaries.

 mydata[mydata<lower_outliers]
 [1] -3.527006 -2.959327 -2.754192
 mydata[mydata>upper_outliers]
 [1] 3.080302 3.527205
like image 39
paoloeusebi Avatar answered Oct 30 '25 19:10

paoloeusebi



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!