diff options
Diffstat (limited to 'docs/reference/aggregations/bucket/histogram-aggregation.asciidoc')
-rw-r--r-- | docs/reference/aggregations/bucket/histogram-aggregation.asciidoc | 319 |
1 files changed, 319 insertions, 0 deletions
diff --git a/docs/reference/aggregations/bucket/histogram-aggregation.asciidoc b/docs/reference/aggregations/bucket/histogram-aggregation.asciidoc new file mode 100644 index 0000000000..cd1fd06dda --- /dev/null +++ b/docs/reference/aggregations/bucket/histogram-aggregation.asciidoc @@ -0,0 +1,319 @@ +[[search-aggregations-bucket-histogram-aggregation]] +=== Histogram Aggregation + +A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents. +It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field +that holds a price (numeric), we can configure this aggregation to dynamically build buckets with interval `5` +(in case of price it may represent $5). When the aggregation executes, the price field of every document will be +evaluated and will be rounded down to its closest bucket - for example, if the price is `32` and the bucket size is `5` +then the rounding will yield `30` and thus the document will "fall" into the bucket that is associated with the key `30`. +To make this more formal, here is the rounding function that is used: + +[source,java] +-------------------------------------------------- +rem = value % interval +if (rem < 0) { + rem += interval +} +bucket_key = value - rem +-------------------------------------------------- + +The following snippet "buckets" the products based on their `price` by interval of `50`: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "prices" : { + "histogram" : { + "field" : "price", + "interval" : 50 + } + } + } +} +-------------------------------------------------- + +And the following may be the response: + +[source,js] +-------------------------------------------------- +{ + "aggregations": { + "prices" : { + "buckets": [ + { + "key": 0, + "doc_count": 2 + }, + { + "key": 50, + "doc_count": 4 + }, + { + "key": 100, + "doc_count": 0 + }, + { + "key": 150, + "doc_count": 3 + } + ] + } + } +} +-------------------------------------------------- + +==== Minimum document count + +The response above show that no documents has a price that falls within the range of `[100 - 150)`. By default the +response will fill gaps in the histogram with empty buckets. It is possible change that and request buckets with +a higher minimum count thanks to the `min_doc_count` setting: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "prices" : { + "histogram" : { + "field" : "price", + "interval" : 50, + "min_doc_count" : 1 + } + } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + "aggregations": { + "prices" : { + "buckets": [ + { + "key": 0, + "doc_count": 2 + }, + { + "key": 50, + "doc_count": 4 + }, + { + "key": 150, + "doc_count": 3 + } + ] + } + } +} +-------------------------------------------------- + +[[search-aggregations-bucket-histogram-aggregation-extended-bounds]] +By default the date_/histogram returns all the buckets within the range of the data itself, that is, the documents with +the smallest values (on which with histogram) will determine the min bucket (the bucket with the smallest key) and the +documents with the highest values will determine the max bucket (the bucket with the highest key). Often, when when +requesting empty buckets, this causes a confusion, specifically, when the data is also filtered. + +To understand why, let's look at an example: + +Lets say the you're filtering your request to get all docs with values between `0` and `500`, in addition you'd like +to slice the data per price using a histogram with an interval of `50`. You also specify `"min_doc_count" : 0` as you'd +like to get all buckets even the empty ones. If it happens that all products (documents) have prices higher than `100`, +the first bucket you'll get will be the one with `100` as its key. This is confusing, as many times, you'd also like +to get those buckets between `0 - 100`. + +With `extended_bounds` setting, you now can "force" the histogram aggregation to start building buckets on a specific +`min` values and also keep on building buckets up to a `max` value (even if there are no documents anymore). Using +`extended_bounds` only makes sense when `min_doc_count` is 0 (the empty buckets will never be returned if `min_doc_count` +is greater than 0). + +Note that (as the name suggest) `extended_bounds` is **not** filtering buckets. Meaning, if the `extended_bounds.min` is higher +than the values extracted from the documents, the documents will still dictate what the first bucket will be (and the +same goes for the `extended_bounds.max` and the last bucket). For filtering buckets, one should nest the histogram aggregation +under a range `filter` aggregation with the appropriate `from`/`to` settings. + +Example: + +[source,js] +-------------------------------------------------- +{ + "query" : { + "filtered" : { "filter": { "range" : { "price" : { "to" : "500" } } } } + }, + "aggs" : { + "prices" : { + "histogram" : { + "field" : "price", + "interval" : 50, + "extended_bounds" : { + "min" : 0, + "max" : 500 + } + } + } + } +} +-------------------------------------------------- + +==== Order + +By default the returned buckets are sorted by their `key` ascending, though the order behaviour can be controled +using the `order` setting. + +Ordering the buckets by their key - descending: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "prices" : { + "histogram" : { + "field" : "price", + "interval" : 50, + "order" : { "_key" : "desc" } + } + } + } +} +-------------------------------------------------- + +Ordering the buckets by their `doc_count` - ascending: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "prices" : { + "histogram" : { + "field" : "price", + "interval" : 50, + "order" : { "_count" : "asc" } + } + } + } +} +-------------------------------------------------- + +If the histogram aggregation has a direct metrics sub-aggregation, the latter can determine the order of the buckets: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "prices" : { + "histogram" : { + "field" : "price", + "interval" : 50, + "order" : { "price_stats.min" : "asc" } <1> + }, + "aggs" : { + "price_stats" : { "stats" : {} } <2> + } + } + } +} +-------------------------------------------------- + +<1> The `{ "price_stats.min" : asc" }` will sort the buckets based on `min` value of their `price_stats` sub-aggregation. + +<2> There is no need to configure the `price` field for the `price_stats` aggregation as it will inherit it by default from its parent histogram aggregation. + +It is also possible to order the buckets based on a "deeper" aggregation in the hierarchy. This is supported as long +as the aggregations path are of a single-bucket type, where the last aggregation in the path may either by a single-bucket +one or a metrics one. If it's a single-bucket type, the order will be defined by the number of docs in the bucket (i.e. `doc_count`), +in case it's a metrics one, the same rules as above apply (where the path must indicate the metric name to sort by in case of +a multi-value metrics aggregation, and in case of a single-value metrics aggregation the sort will be applied on that value). + +The path must be defined in the following form: + +-------------------------------------------------- +AGG_SEPARATOR := '>' +METRIC_SEPARATOR := '.' +AGG_NAME := <the name of the aggregation> +METRIC := <the name of the metric (in case of multi-value metrics aggregation)> +PATH := <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR><METRIC>] +-------------------------------------------------- + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "prices" : { + "histogram" : { + "field" : "price", + "interval" : 50, + "order" : { "promoted_products>rating_stats.avg" : "desc" } <1> + }, + "aggs" : { + "promoted_products" : { + "filter" : { "term" : { "promoted" : true }}, + "aggs" : { + "rating_stats" : { "stats" : { "field" : "rating" }} + } + } + } + } + } +} +-------------------------------------------------- + +The above will sort the buckets based on the avg rating among the promoted products + + +==== Offset + +By default the bucket keys start with 0 and then continue in even spaced steps of `interval`, e.g. if the interval is 10 the first buckets +(assuming there is data inside them) will be [0 - 9], [10-19], [20-29]. The bucket boundaries can be shifted by using the `offset` option. + +This can be best illustrated with an example. If there are 10 documents with values ranging from 5 to 14, using interval `10` will result in +two buckets with 5 documents each. If an additional offset `5` is used, there will be only one single bucket [5-14] containing all the 10 +documents. + +==== Response Format + +By default, the buckets are returned as an ordered array. It is also possible to request the response as a hash +instead keyed by the buckets keys: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "prices" : { + "histogram" : { + "field" : "price", + "interval" : 50, + "keyed" : true + } + } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + "aggregations": { + "prices": { + "buckets": { + "0": { + "key": 0, + "doc_count": 2 + }, + "50": { + "key": 50, + "doc_count": 4 + }, + "150": { + "key": 150, + "doc_count": 3 + } + } + } + } +} +-------------------------------------------------- |