Diffstat (limited to 'docs/reference/aggregations/metrics/percentile-aggregation.asciidoc')
-rw-r--r-- | docs/reference/aggregations/metrics/percentile-aggregation.asciidoc | 192
1 file changed, 192 insertions, 0 deletions
diff --git a/docs/reference/aggregations/metrics/percentile-aggregation.asciidoc b/docs/reference/aggregations/metrics/percentile-aggregation.asciidoc
new file mode 100644
index 0000000000..6bd1011007
--- /dev/null
+++ b/docs/reference/aggregations/metrics/percentile-aggregation.asciidoc
@@ -0,0 +1,192 @@
[[search-aggregations-metrics-percentile-aggregation]]
=== Percentiles Aggregation

A `multi-value` metrics aggregation that calculates one or more percentiles
over numeric values extracted from the aggregated documents. These values can
be extracted either from specific numeric fields in the documents, or be
generated by a provided script.

Percentiles show the point at which a certain percentage of observed values
occur. For example, the 95th percentile is the value which is greater than
95% of the observed values.

Percentiles are often used to find outliers. In normal distributions, the
0.13th and 99.87th percentiles represent three standard deviations from the
mean. Any data which falls outside three standard deviations is often
considered an anomaly.

When a range of percentiles is retrieved, it can be used to estimate the data
distribution and determine if the data is skewed, bimodal, etc.

Assume your data consists of website load times. The average and median load
times are not overly useful to an administrator. The max may be interesting,
but it can be easily skewed by a single slow response.

Let's look at a range of percentiles representing load time:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time" <1>
            }
        }
    }
}
--------------------------------------------------
<1> The field `load_time` must be a numeric field

By default, the `percentile` metric will generate a range of percentiles:
`[ 1, 5, 25, 50, 75, 95, 99 ]`. The response will look like this:

[source,js]
--------------------------------------------------
{
    ...

    "aggregations": {
        "load_time_outlier": {
            "values" : {
                "1.0": 15,
                "5.0": 20,
                "25.0": 23,
                "50.0": 25,
                "75.0": 29,
                "95.0": 60,
                "99.0": 150
            }
        }
    }
}
--------------------------------------------------

As you can see, the aggregation returns a calculated value for each percentile
in the default range. If we assume response times are in milliseconds, it is
immediately obvious that the webpage normally loads in 15-30ms, but
occasionally spikes to 60-150ms.

Often, administrators are only interested in outliers -- the extreme
percentiles. We can specify just the percents we are interested in (each
requested percentile must be a value between 0 and 100 inclusive):

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time",
                "percents" : [95, 99, 99.9] <1>
            }
        }
    }
}
--------------------------------------------------
<1> Use the `percents` parameter to specify particular percentiles to calculate
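Only the requested percentiles are returned in the response. As a sketch of
its shape (the `95.0` and `99.0` values are taken from the earlier example;
the `99.9` value here is purely illustrative):

[source,js]
--------------------------------------------------
{
    ...

    "aggregations": {
        "load_time_outlier": {
            "values" : {
                "95.0": 60,
                "99.0": 150,
                "99.9": 200
            }
        }
    }
}
--------------------------------------------------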
==== Script

The percentile metric supports scripting. For example, if our load times are
in milliseconds but we want percentiles calculated in seconds, we could use a
script to convert them on the fly:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "script" : "doc['load_time'].value / timeUnit", <1>
                "params" : {
                    "timeUnit" : 1000 <2>
                }
            }
        }
    }
}
--------------------------------------------------
<1> The `field` parameter is replaced with a `script` parameter, which
generates the values on which the percentiles are calculated
<2> Scripting supports parameterized input just like any other script

TIP: The `script` parameter expects an inline script. Use `script_id` for
indexed scripts and `script_file` for scripts in the `config/scripts/`
directory.

[[search-aggregations-metrics-percentile-aggregation-approximation]]
==== Percentiles are (usually) approximate

There are many different algorithms to calculate percentiles. The naive
implementation simply stores all the values in a sorted array. To find the
50th percentile, you simply find the value that is at
`my_array[count(my_array) * 0.5]`.
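As a concrete illustration, here is a minimal sketch of that naive approach
(plain JavaScript, outside Elasticsearch; the function name and the
nearest-rank rounding are illustrative choices):

[source,js]
--------------------------------------------------
// Naive percentile: keep every value in memory, sort, then index into the
// array. Storage grows linearly with the number of values collected.
function naivePercentile(values, q) { // q is a percent, e.g. 50 for the median
    var sorted = values.slice().sort(function (a, b) { return a - b; });
    var index = Math.min(sorted.length - 1, Math.floor(sorted.length * q / 100));
    return sorted[index];
}

naivePercentile([15, 20, 23, 25, 29, 60, 150], 50); // => 25
--------------------------------------------------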
Clearly, the naive implementation does not scale -- the sorted array grows
linearly with the number of values in your dataset. To calculate percentiles
across potentially billions of values in an Elasticsearch cluster,
_approximate_ percentiles are calculated instead.

The algorithm used by the `percentile` metric is called TDigest (introduced
by Ted Dunning in
https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf[Computing Accurate Quantiles using T-Digests]).

When using this metric, there are a few guidelines to keep in mind:

- The error is proportional to `q(1-q)`, so extreme percentiles (e.g. the
99th) are more accurate than less extreme percentiles, such as the median
- For small sets of values, percentiles are highly accurate (and potentially
100% accurate if the data is small enough)
- As the quantity of values in a bucket grows, the algorithm begins to
approximate the percentiles. It is effectively trading accuracy for memory
savings. The exact level of inaccuracy is difficult to generalize, since it
depends on your data distribution and the volume of data being aggregated

The following chart shows the relative error on a uniform distribution,
depending on the number of collected values and the requested percentile:

image:images/percentiles_error.png[]

It shows how precision is better for extreme percentiles. The reason the
error diminishes for a large number of values is that the law of large
numbers makes the distribution of values more and more uniform, so the
t-digest tree can do a better job of summarizing it. This would not be the
case on more skewed distributions.

[[search-aggregations-metrics-percentile-aggregation-compression]]
==== Compression

experimental[The `compression` parameter is specific to the current internal implementation of percentiles, and may change in the future]

Approximate algorithms must balance memory utilization with estimation
accuracy. This balance can be controlled using a `compression` parameter:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time",
                "compression" : 200 <1>
            }
        }
    }
}
--------------------------------------------------
<1> Compression controls memory usage and approximation error

The TDigest algorithm uses a number of "nodes" to approximate percentiles --
the more nodes available, the higher the accuracy (and the larger the memory
footprint), proportional to the volume of data. The `compression` parameter
limits the maximum number of nodes to `20 * compression`.

Therefore, by increasing the compression value, you can increase the accuracy
of your percentiles at the cost of more memory. Larger compression values
also make the algorithm slower, since the underlying tree data structure
grows in size, resulting in more expensive operations. The default
compression value is `100`.

A "node" uses roughly 32 bytes of memory, so under worst-case scenarios (a
large amount of data which arrives sorted and in-order) the default settings
will produce a TDigest roughly 64KB in size. In practice data tends to be
more random and the TDigest will use less memory.
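As a back-of-the-envelope check on those numbers, a short sketch of the
worst-case sizing arithmetic (plain JavaScript; the `20 * compression` node
limit and the rough 32-byte node size are the figures quoted above):

[source,js]
--------------------------------------------------
// Worst-case TDigest footprint: at most 20 * compression nodes, ~32 bytes each.
function worstCaseDigestBytes(compression) {
    var maxNodes = 20 * compression;
    return maxNodes * 32;
}

worstCaseDigestBytes(100); // => 64000, roughly 64KB at the default compression
worstCaseDigestBytes(200); // => 128000, for the compression used in the example above
--------------------------------------------------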