I have a simple dataset. When I generate boxplot for the data by base R and ggplot separately, they do not match. In fact the base R boxplot is consistent with the summary function.
library(tidyverse)
library(ggplotify)
library(patchwork)
df <- read.csv("test_boxplot_data.csv")
summary(df)
p1 <- as.ggplot(~boxplot(df$y, outline=FALSE))
p2 <- ggplot(df, aes(y=y)) + geom_boxplot(outlier.shape = NA) + ylim(0,100)
p1 + p2 + plot_layout(ncol = 2)
Generated plot kept here.
Any clue what is happening? It is also surprising that ggplot throws warning that "Removed 845 rows containing non-finite values (stat_boxplot)" but there is no NA in the data.
From: "Removed 845 rows containing non-finite values (stat_boxplot)". It just so happens that the data contains 845 points > 100. These points are being deleted in the calculation of the box plot.
From the first line of help for ylim():
"This is a shortcut for supplying the limits argument to the individual scales. By default, any values outside the limits specified are replaced with NA. Be warned that this will remove data outside the limits and this can produce unintended results. For changing x or y axis limits without dropping data observations, see coord_cartesian()."
This should provide the desired graph:
ggplot(df, aes(y=y)) + geom_boxplot(outlier.shape = NA) +
coord_cartesian(ylim=c(0,100))

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With