Why is ggplot ignoring my factor levels when I subset my data?

I am using some code I got from an answer to a previous question, but I ran into a funny problem and I'd like some expert insight into what is going on. I am trying to plot monthly deviations from an annual mean using bar charts. Specifically, I am coloring the bars differently depending on whether the monthly mean is above or below the annual mean. I am using the txhousing dataset, which is included with the ggplot2 package.

I thought I could use a factor to denote whether or not this is the case. The months are correctly ordered when I only plot a subset of the data (the "lower" values), but when I add another layer, ggplot rearranges all of the months to be alphabetical. Does anyone know why this happens, and what a workaround would be?

Thank you so much for any input! Criticism of my code is welcome :)

Reproducible Examples

1. Using just one plot

library(tidyverse)

# subset txhousing to just the year 2014, and calculate nested means and dates
housing_df <- filter(txhousing, year == 2014) %>%
  group_by(year, month) %>%
  summarise(monthly_mean = mean(sales, na.rm = TRUE),
            date = first(date)) %>%
  mutate(month = factor(month.abb[month], levels = month.abb, ordered = TRUE),
         salesdiff = monthly_mean - mean(monthly_mean), # monthly deviation
         higherlower = case_when(salesdiff >= 0 ~ "higher",                                   
                                 salesdiff < 0 ~ "lower"))

ggplot(data = housing_df, aes(x = month, y = salesdiff, higherlower)) +
  geom_col(data = filter(housing_df, higherlower == "higher"), aes(y = salesdiff, fill = higherlower)) +
  scale_fill_manual(values = c("higher" = "blue", "lower" = "red")) +
  theme_bw() +
  theme(legend.position = "none") # remove legend


2. Using two layers with all of the data

ggplot(data = housing_df, aes(x = month, y = salesdiff, higherlower)) +
  geom_col(data = filter(housing_df, higherlower == "higher"), aes(y = salesdiff, fill = higherlower)) +
  geom_col(data = filter(housing_df, higherlower == "lower"), aes(y = salesdiff, fill = higherlower)) +
  scale_fill_manual(values = c("higher" = "blue", "lower" = "red")) +
  theme_bw() +
  theme(legend.position = "none") # remove legend


caker1012 asked Sep 19 '25 15:09


1 Answer

There are multiple ways to do this, but I find it takes a bit of trial and error. You are already doing the most common fix, which is to convert month into a factor, and that's why the first plot works. Why it does not work in the second case is a bit of a mystery, but try adding + scale_x_discrete(limits = housing_df$month) to override the x-axis order and see if that works.

I agree with the other comments that the best way would be to not even use the extra layer, as it's not needed in this specific case, but the above solution works even when there are multiple layers.
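For reference, here is a minimal sketch of both suggestions, assuming the same txhousing data as in the question. Two things in it are my own substitutions and not part of the original code: it passes month.abb (a plain character vector in calendar order) as the limits, where the answer suggests housing_df$month, and it uses if_else and .groups = "drop" to keep the summary short. It does not attempt to explain why the two-layer version reorders the axis.

```r
library(ggplot2)
library(dplyr)

# Same summary as in the question: 2014 only, one row per month
housing_df <- txhousing %>%
  filter(year == 2014) %>%
  group_by(year, month) %>%
  summarise(monthly_mean = mean(sales, na.rm = TRUE), .groups = "drop") %>%
  mutate(month = factor(month.abb[month], levels = month.abb),
         salesdiff = monthly_mean - mean(monthly_mean),  # monthly deviation
         higherlower = if_else(salesdiff >= 0, "higher", "lower"))

# Option 1: a single layer, with fill mapped to higherlower
p_single <- ggplot(housing_df, aes(x = month, y = salesdiff, fill = higherlower)) +
  geom_col() +
  scale_fill_manual(values = c("higher" = "blue", "lower" = "red")) +
  theme_bw() +
  theme(legend.position = "none")

# Option 2: keep the two filtered layers, but pin the x-axis order explicitly
p_layers <- ggplot(housing_df, aes(x = month, y = salesdiff, fill = higherlower)) +
  geom_col(data = filter(housing_df, higherlower == "higher")) +
  geom_col(data = filter(housing_df, higherlower == "lower")) +
  scale_x_discrete(limits = month.abb) +
  scale_fill_manual(values = c("higher" = "blue", "lower" = "red")) +
  theme_bw() +
  theme(legend.position = "none")
```

Printing p_single or p_layers draws the chart. Option 1 avoids the problem entirely because only one dataset is passed to the plot; Option 2 is the one to reach for when the separate layers are actually needed.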

Rohit Das answered Sep 22 '25 07:09