ggplot2, histogram: why do y = ..density.. and stat = "density" differ?

Say I have this data frame df:

df <- structure(list(max.diff = c(6.02, 7.56, 7.79, 7.43, 7.21, 7.65, 
8.1, 7.35, 7.57, 9.09, 6.21, 8.2, 6.82, 7.18, 7.78, 8.27, 6.85, 
6.72, 6.67, 6.99, 7.32, 6.59, 6.86, 6.02, 8.5, 7.25, 5.18, 8.85, 
5.44, 6.44, 7.85, 6.25, 9.06, 8.19, 5.08, 6.26, 8.92, 6.83, 6.5, 
7.55, 7.31, 5.83, 5.55, 4.29, 8.29, 8.72, 9.5)), class = "data.frame", row.names = c(NA, 
-47L), .Names = "max.diff")

I want to plot this as a density plot using ggplot2:

library(ggplot2)

p <- ggplot(df, aes(x = max.diff))
p <- p + geom_histogram(stat = "density")
print(p)

which gives,

[Plot: output of geom_histogram(stat = "density")]

Now, a naive question: why doesn't this give the same result?

p <- ggplot(df, aes(x = max.diff)) 
p <- p + geom_histogram(aes(y = ..density..))
print(p)

[Plot: output of geom_histogram(aes(y = ..density..))]

Is this because of the chosen binwidth or number of bins or some other parameter? So far, I haven't been able to tweak those parameters to make them the same. Or am I plotting something quite different?

Lyngbakr asked Oct 15 '25 03:10


1 Answer

The second example rescales the histogram counts so that the bar areas sum to 1, but is otherwise the same as the standard ggplot2 histogram. You can adjust the number of bars with the bins or binwidth arguments.
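To make the rescaling concrete, here is a minimal base R sketch of the same idea. It assumes 30 equal-width bins spanning the data range, which matches the bins=30 default in number only; ggplot2 may place its bin edges differently, so the bar heights will not necessarily match the plot exactly. The names breaks, binwidth and manual_density are just illustrative.

```r
# max.diff values from the question
x <- c(6.02, 7.56, 7.79, 7.43, 7.21, 7.65, 8.1, 7.35, 7.57, 9.09,
       6.21, 8.2, 6.82, 7.18, 7.78, 8.27, 6.85, 6.72, 6.67, 6.99,
       7.32, 6.59, 6.86, 6.02, 8.5, 7.25, 5.18, 8.85, 5.44, 6.44,
       7.85, 6.25, 9.06, 8.19, 5.08, 6.26, 8.92, 6.83, 6.5, 7.55,
       7.31, 5.83, 5.55, 4.29, 8.29, 8.72, 9.5)

# Assumption: 30 equal-width bins over the range of the data
# (31 break points). ggplot2's own bin placement may differ.
breaks <- seq(min(x), max(x), length.out = 31)
h <- hist(x, breaks = breaks, plot = FALSE)

# Density scaling: bar height = count / (total count * bin width)
binwidth <- diff(h$breaks)
manual_density <- h$counts / (sum(h$counts) * binwidth)

# The manual heights agree with the density that hist() reports
print(all.equal(manual_density, h$density))

# Total bar area (height * width, summed over all bins)
print(sum(h$density * binwidth))
```

Only the y-axis scale changes under this rescaling; the shape of the histogram is the same as for raw counts.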

The first example calculates a kernel density estimate and plots the output (the estimated density at each x-value) as a histogram. You can change the amount of smoothing of the density estimate with the adjust argument, and the number of points at which the density is calculated with the n argument.

The default for geom_histogram is bins=30. The default for stat="density" is adjust=1 and n=512 (stat="density" is using the density function to generate the values). The stat="density" output is much smoother than the histogram output due to the way density chooses the bandwidth for the density estimate. Reducing the adjust argument reduces the amount of smoothing.
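As a rough illustration of what adjust and n do, the sketch below calls the base R density function directly on the same data. It assumes that stat="density" passes these arguments through to density more or less unchanged, as described above; the object names d_default and d_rough are only for illustration.

```r
# max.diff values from the question
x <- c(6.02, 7.56, 7.79, 7.43, 7.21, 7.65, 8.1, 7.35, 7.57, 9.09,
       6.21, 8.2, 6.82, 7.18, 7.78, 8.27, 6.85, 6.72, 6.67, 6.99,
       7.32, 6.59, 6.86, 6.02, 8.5, 7.25, 5.18, 8.85, 5.44, 6.44,
       7.85, 6.25, 9.06, 8.19, 5.08, 6.26, 8.92, 6.83, 6.5, 7.55,
       7.31, 5.83, 5.55, 4.29, 8.29, 8.72, 9.5)

# Defaults: adjust = 1, n = 512, bandwidth picked by bw.nrd0()
d_default <- density(x)

# Much less smoothing, evaluated at far fewer points
d_rough <- density(x, adjust = 0.1, n = 2^5)

# Number of x-values at which each estimate is evaluated
print(length(d_default$x))
print(length(d_rough$x))

# adjust multiplies the default bandwidth, so this ratio is the adjust value
print(d_rough$bw / d_default$bw)
```

A smaller bandwidth lets the estimate follow individual data points more closely, which is why a low adjust value looks more like a fine-grained histogram.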

The first two examples below are your plots. The second two adjust the respective parameters to get two plots that are roughly similar, though not exactly the same, because the kernel density estimate is still smoothing the output. This is just for illustration. The kernel density estimate and the histogram are two different, though related, things.

ggplot(df, aes(x = max.diff)) +
  geom_histogram(stat = "density") +
  ggtitle("stat='density'; default parameters")

ggplot(df, aes(x = max.diff)) +
  geom_histogram(aes(y = ..density..), colour="white") +
  ggtitle("geom_histogram; default parameters")

ggplot(df, aes(x = max.diff)) +
  geom_histogram(stat = "density", n=2^5, adjust=0.1) +
  ggtitle("stat='density'; n=2^5; adjust=0.1")

ggplot(df, aes(x = max.diff)) +
  geom_histogram(aes(y = ..density..), bins=2^5, colour="white") +
  ggtitle("geom_histogram; bins=2^5")

[Plot: the four plots produced by the code above]

eipi10 answered Oct 17 '25 19:10