
Why does stat_summary plot multiple/single lines depending on the variable?

I asked a related question a little while ago, and the solution given there seems to work only some of the time. Here is an example using the mpg data set.

My goal is to place a vertical line at the median of my data in each facet using stat_summary. Note that when I use the solution from that earlier question on the displ column, it works as desired. But when I use it on the cty column, multiple lines are drawn. Why is this?

Shown below is a reprex of my problem.

library(tidyverse)

mpg %>% 
  ggplot(aes(x=displ, group=cyl))+
  geom_histogram()+
  facet_grid(~cyl)+
  stat_summary(aes(xintercept=stat(x), y=0), fun = median, geom = 'vline')
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mpg %>% 
  ggplot(aes(x=cty, group=cyl))+
  geom_histogram()+
  facet_grid(~cyl)+
  stat_summary(aes(xintercept=stat(x), y=0), fun = median, geom = 'vline')
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2020-04-01 by the reprex package (v0.3.0)

Demetri Pananos asked Oct 22 '25 06:10


2 Answers

Demetri, here is the R code that will give you what you need:

library(tidyverse)

g <- mpg %>%
  ggplot(aes(x = cty)) +
  geom_histogram() +
  # x is held constant so that a single median of y (= cty) is computed per panel
  stat_summary(aes(x = 0, xintercept = after_stat(y), y = cty),
               fun = median, geom = "vline", colour = "red") +
  facet_grid(~ cyl)

g

By default, the stat_summary() function computes a summary (in this case, the median) of the variable mapped to its y aesthetic, separately for each distinct x value. In contrast, the geom_histogram() function bins the variable mapped to its x aesthetic. So you have to be careful with what you map to y in stat_summary(): in the code above, cty is mapped to y and x is held constant at 0, so a single median is computed per panel.
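As a hedged aside on the point above: assuming ggplot2 3.3.0 or later, stat_summary() also accepts an orientation argument, and setting it to "y" asks for the summary to be computed on x within each distinct y value. A minimal sketch under that assumption, with y held constant so that each panel forms a single group (the name p is only illustrative):

```r
library(ggplot2)  # also provides the mpg data set

p <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(bins = 30) +
  facet_grid(~ cyl) +
  # orientation = "y" summarises x within each distinct y value; y is constant
  # here, so the median of cty is taken once per panel
  stat_summary(aes(xintercept = after_stat(x), y = 0),
               fun = median, geom = "vline", colour = "red",
               orientation = "y")

p
```

This is an alternative to the code above rather than a replacement for it; which form reads better is a matter of taste.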

Note that you don't need to use group = cyl in your ggplot() call if you are using facet_grid() or facet_wrap() to produce multiple graphical panels. Grouping and faceting are different plotting operations: grouping shows different data groups in the same panel, whereas faceting shows different data groups in different panels.

Addendum 1

To check that the summary statistics were computed correctly for each panel, the command below will come in handy:

ggplot_build(g)$data

Scroll to the bottom of the output produced by this command to find the xintercept values used by R - these should be the medians plotted in the various panels. Alternatively, extract these values directly with:

ggplot_build(g)$data[[2]]

The xintercept values can be compared with independently computed median values of cty for each cyl level to ensure agreement.
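One way to get those independent medians is sketched below, using base R's aggregate() on the mpg data. The name med_by_cyl is only illustrative.

```r
library(ggplot2)  # only needed here for the mpg data set

# Median of cty within each cyl level, computed without any plotting code
med_by_cyl <- aggregate(cty ~ cyl, data = mpg, FUN = median)
med_by_cyl
```

The cty column of this result can then be set side by side with the xintercept column of ggplot_build(g)$data[[2]], assuming that layer holds one row per panel in the same cyl order.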

Addendum 2

The default choice of binwidth for geom_histogram() needs some attention. You can try something like this to allow variable binwidth choice across your different panels:

theme_set(theme_bw())

g <- mpg %>%
  ggplot(aes(x = cty)) +
  # binwidth is given as a function of the data, so it can differ between panels
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)),
                 fill = "lightblue3", colour = "white") +
  stat_summary(aes(x = 0, xintercept = after_stat(y), y = cty),
               fun = median, geom = "vline", colour = "red2") +
  facet_wrap(~ cyl, scales = "free_x")

g

See this link for other possibilities of binwidth choice: https://github.com/tidyverse/ggplot2/issues/2312.
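The binwidth function used above is the Freedman-Diaconis rule, 2 * IQR(x) / n^(1/3). To see roughly what widths it would give for each panel, it can be evaluated on cty split by cyl, as in this small sketch (fd_binwidth and widths are illustrative names, not part of ggplot2):

```r
library(ggplot2)  # only needed here for the mpg data set

# Freedman-Diaconis bin width: 2 * IQR / n^(1/3)
fd_binwidth <- function(x) 2 * IQR(x) / length(x)^(1/3)

# Evaluate the rule separately on the cty values of each cyl level
widths <- sapply(split(mpg$cty, mpg$cyl), fd_binwidth)
widths
```

Note that a group whose IQR is zero would get a width of zero from this rule, so it is worth looking at these values before relying on them.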

Isabella Ghement answered Oct 23 '25 21:10


We can pre-compute the median using group_by and mutate, which I often find more reliable and easy to understand in its behavior, and then just use geom_vline. Can't answer on the stat_summary side, but interested to know the answer.

mpg %>%
  group_by(cyl) %>%
  mutate(cty_med = median(cty)) %>%
  ggplot(aes(x=cty))+
  geom_histogram()+
  facet_grid(~cyl)+
  geom_vline(aes(xintercept=cty_med))


If you want to generalize this, you can just create a wrapper function that does your calculation and faceting together.

f <- function(df, fct, var) {
  df %>%
    group_by({{fct}}) %>%
    mutate(med = median({{var}})) %>%
    ggplot(aes(x={{var}}))+
    geom_histogram() +
    facet_grid(cols = vars({{fct}})) +
    geom_vline(aes(xintercept=med))
}

f(mpg, cyl, cty)
f(mpg, cyl, displ)
caldwellst answered Oct 23 '25 19:10