According to the Prometheus documentation, to get the 95th percentile from a histogram metric I can use the following query:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Source: https://prometheus.io/docs/practices/histograms/#quantiles
Since each bucket of a histogram is a counter, we can calculate the rate of each bucket, which the documentation defines as:
"the per-second average rate of increase of the time series in the range vector."
See: https://prometheus.io/docs/prometheus/latest/querying/functions/#rate
So, for instance, if bucket value[t-5m] = 100 and bucket value[t] = 200, then bucket rate[t] = (200-100)/(5*60) ≈ 0.333
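As a quick check of that arithmetic, here is a minimal sketch in Python. The helper name `simple_rate` is hypothetical and it only looks at the first and last sample of a 5-minute window; the real rate() also handles counter resets and extrapolates to the window boundaries, which this ignores.

```python
def simple_rate(value_start: float, value_end: float, window_seconds: float) -> float:
    """Average per-second increase of a counter between two samples."""
    return (value_end - value_start) / window_seconds

# The bucket counter went from 100 to 200 over a 5-minute (300 s) window.
rate = simple_rate(100, 200, 5 * 60)
print(round(rate, 3))  # 0.333
```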
And finally, the most confusing part: how can the histogram_quantile function find the 95th percentile for a given metric knowing only the bucket rates?
Is there any code or algorithm I can look at to better understand it?
A solid example will explain histogram_quantile well.
Assumptions:
- Buckets defined for http_request_duration_seconds: 10ms, 50ms, 100ms, 200ms, 300ms, 500ms, 1s, 2s, 3s, 5s
- http_request_duration_seconds is a histogram, so each of its buckets is a metric of type COUNTER

| time | value | delta | rate (quantity of items) |
|---|---|---|---|
| t-10m | 50 | N/A | N/A |
| t-5m | 100 | 50 | 50 / (5*60) |
| t | 200 | 100 | 100 / (5*60) |
| ... | ... | ... | ... |

Use rate() to calculate the quantity for each bucket:
rate_xxx(t) = (value_xxx[t] - value_xxx[t-5m]) / (5*60) is the quantity of items for [t-5m, t].
Only two samples (value(t) and value(t-5m)) are considered here.
Roughly 10000 HTTP request durations (items) were recorded, that is, 10000 ≈ rate_10ms(t) + rate_50ms(t) + rate_100ms(t) + ... + rate_5s(t).

| bucket(le) | 10ms | 50ms | 100ms | 200ms | 300ms | 500ms | 1s | 2s | 3s | 5s | +Inf |
|---|---|---|---|---|---|---|---|---|---|---|---|
| range | ~10ms | 10~50ms | 50~100ms | 100~200ms | 200~300ms | 300~500ms | 500ms~1s | 1~2s | 2s~3s | 3~5s | 5s~ |
| rate_xxx(t) | 3000 | 3000 | 1500 | 1000 | 800 | 400 | 200 | 40 | 30 | 5 | 5 |

Buckets are the essence of a histogram. We only need the numbers in the rate_xxx(t) row to do the quantile calculation.
Let's take a close look at this expression (aggregation like sum() is omitted for simplicity):
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
We are actually looking for the item at the 95th percentile in rate_xxx(t), counting from bucket=10ms to bucket=+Inf. With 10000 items in total, that is the 9500th item (10000 * 0.95).
From the table above, there are 9300 = 3000+3000+1500+1000+800 items in total before bucket=500ms.
So the 9500th item is the 200th item (9500-9300) in bucket=500ms (range=300~500ms), which holds 400 items.
Prometheus assumes that items within a bucket are spread evenly, in a linear pattern.
The metric value for the 200th item in bucket=500ms is therefore 400ms = 300+(500-300)*(200/400).
That is, the 95th percentile is 400ms.
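The steps above can be retraced with a small Python sketch. It takes the per-range counts from the table and the rank of 9500 from the text as given; the helper name `locate` is made up for illustration, and this is only the arithmetic of the example, not Prometheus's own code.

```python
from itertools import accumulate

# Upper bounds (ms) and per-range item counts copied from the table above.
upper_bounds_ms = [10, 50, 100, 200, 300, 500, 1000, 2000, 3000, 5000]
items_per_range = [3000, 3000, 1500, 1000, 800, 400, 200, 40, 30, 5]

def locate(rank):
    """Return (lower_ms, upper_ms, items_before, items_in_bucket) for the bucket holding `rank`."""
    cumulative = list(accumulate(items_per_range))
    for i, total_so_far in enumerate(cumulative):
        if total_so_far >= rank:
            lower = upper_bounds_ms[i - 1] if i > 0 else 0
            before = cumulative[i - 1] if i > 0 else 0
            return lower, upper_bounds_ms[i], before, items_per_range[i]
    raise ValueError("rank falls beyond the last finite bucket")

rank = 9500  # the 9500th item, as stated in the text
lower, upper, before, in_bucket = locate(rank)

# Linear interpolation: how far into the bucket the ranked item sits.
estimate_ms = lower + (upper - lower) * (rank - before) / in_bucket

print(lower, upper, before, in_bucket)  # 300 500 9300 400
print(estimate_ms)                      # 400.0
```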
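There are a few things to bear in mind:
- The metric should be a COUNTER in nature for the histogram metric type.
- Series used for the quantile calculation must have the label le defined.
- Items in a bucket are assumed to be spread evenly; Prometheus makes this assumption at least.
- The result of histogram_quantile is an approximation.

P.S.: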
The metric value is not always accurate, because items (data) in a specific bucket are assumed to be spread evenly in a linear pattern.
Say the maximum duration in reality (e.g. from the nginx access log) in bucket=500ms (range=300~500ms) is 310ms; we will still get 400ms from histogram_quantile with the above setup, which can be quite confusing.
The smaller the bucket distance, the more accurate the approximation.
So set up bucket distances that fit your needs.
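To make that concrete, here is a rough Python sketch comparing two bucket layouts on the same made-up data. The observations and both layouts are hypothetical, and `estimate_quantile` is a simplified stand-in for the linear interpolation described above, not Prometheus's implementation.

```python
from bisect import bisect_left, bisect_right

def cumulative_counts(observations, upper_bounds):
    """Number of observations <= each upper bound (cumulative, like `le` buckets)."""
    ordered = sorted(observations)
    return [bisect_right(ordered, ub) for ub in upper_bounds]

def estimate_quantile(q, upper_bounds, counts):
    """Interpolate linearly inside the first bucket whose cumulative count reaches the rank.

    Simplified: assumes 0 < q <= 1 and at least one observation.
    """
    rank = q * counts[-1]
    i = bisect_left(counts, rank)
    lower = upper_bounds[i - 1] if i > 0 else 0.0
    before = counts[i - 1] if i > 0 else 0
    return lower + (upper_bounds[i] - lower) * (rank - before) / (counts[i] - before)

# Hypothetical data: 100 requests, all between 301ms and 310ms.
observations_ms = [301 + (i % 10) for i in range(100)]

coarse_bounds = [300, 500]           # one wide 300~500ms bucket
fine_bounds = [300, 305, 310, 500]   # same range, split more finely
coarse_counts = cumulative_counts(observations_ms, coarse_bounds)
fine_counts = cumulative_counts(observations_ms, fine_bounds)

coarse = estimate_quantile(0.95, coarse_bounds, coarse_counts)
fine = estimate_quantile(0.95, fine_bounds, fine_counts)
print(round(coarse, 1))  # 490.0
print(round(fine, 1))    # 309.5
```

No request in this data took longer than 310ms, yet the wide bucket reports 490ms for the 95th percentile, while the finer layout lands at 309.5ms.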
I believe this is the code for it in Prometheus.
The general idea is that you use the data in the buckets to interpolate / approximate the quantiles.
Elasticsearch also does something similar (yet different and much simpler) in its rollup capabilities.
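For the Prometheus side, the general idea can be sketched in Python as below. This is a simplified reading of how classic (bucketed) histograms are handled, as I understand it: buckets are cumulative `le` counters, the last one must be +Inf, and the value is interpolated linearly inside the bucket that contains the rank. It assumes counts never decrease from one bucket to the next and ignores native histograms. The `example` buckets are a condensed, hypothetical form of the worked example (9300 items at or below 300ms, 9700 at or below 500ms, 10000 in total).

```python
import math
from bisect import bisect_left

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs."""
    if math.isnan(q):
        return math.nan
    if q < 0:
        return -math.inf
    if q > 1:
        return math.inf
    buckets = sorted(buckets)
    # A +Inf bucket holding the total count is required.
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    counts = [count for _, count in buckets]
    b = bisect_left(counts, rank)  # first bucket whose count reaches the rank
    if b == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    if b == 0 and buckets[0][0] <= 0:
        return buckets[0][0]
    start = 0.0 if b == 0 else buckets[b - 1][0]
    below = 0.0 if b == 0 else buckets[b - 1][1]
    in_bucket = counts[b] - below
    if in_bucket == 0:
        return math.nan  # 0/0; only reachable when the rank is 0
    return start + (buckets[b][0] - start) * (rank - below) / in_bucket

example = [(0.3, 9300), (0.5, 9700), (math.inf, 10000)]
print(round(histogram_quantile(0.95, example), 3))  # 0.4
print(histogram_quantile(0.99, example))            # 0.5
```

The second call shows a consequence of this scheme: once the rank lands in the +Inf bucket, the estimate is capped at the largest finite bucket bound.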
You can refer to my reply here
Actually, the rate() function is just used to specify the time window; the denominator has no effect on the computation of the percentile value.
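One way to see this, assuming linear interpolation inside the bucket: write $T$ for the window length in seconds, $N$ for the increase of the +Inf bucket over the window, $C_{b-1}$ and $C_b$ for the increases of the cumulative buckets just below and at the target bucket, and $[s, e]$ for that bucket's bounds (all of these symbols are introduced here for illustration). The estimate computed from the rates is then

$$
s + (e - s)\,\frac{q\,\frac{N}{T} - \frac{C_{b-1}}{T}}{\frac{C_b}{T} - \frac{C_{b-1}}{T}}
= s + (e - s)\,\frac{qN - C_{b-1}}{C_b - C_{b-1}}
$$

so $T$ cancels. The same bucket is selected either way, because dividing every cumulative count by the same positive $T$ does not change which count first reaches the rank.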