summaryrefslogtreecommitdiff
path: root/docs/reference/aggregations/bucket/histogram-aggregation.asciidoc
diff options
context:
space:
mode:
Diffstat (limited to 'docs/reference/aggregations/bucket/histogram-aggregation.asciidoc')
-rw-r--r--docs/reference/aggregations/bucket/histogram-aggregation.asciidoc319
1 files changed, 319 insertions, 0 deletions
diff --git a/docs/reference/aggregations/bucket/histogram-aggregation.asciidoc b/docs/reference/aggregations/bucket/histogram-aggregation.asciidoc
new file mode 100644
index 0000000000..cd1fd06dda
--- /dev/null
+++ b/docs/reference/aggregations/bucket/histogram-aggregation.asciidoc
@@ -0,0 +1,319 @@
+[[search-aggregations-bucket-histogram-aggregation]]
+=== Histogram Aggregation
+
+A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents.
+It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field
+that holds a price (numeric), we can configure this aggregation to dynamically build buckets with interval `5`
+(in case of price it may represent $5). When the aggregation executes, the price field of every document will be
+evaluated and will be rounded down to its closest bucket - for example, if the price is `32` and the bucket size is `5`
+then the rounding will yield `30` and thus the document will "fall" into the bucket that is associated with the key `30`.
+To make this more formal, here is the rounding function that is used:
+
+[source,java]
+--------------------------------------------------
+rem = value % interval
+if (rem < 0) {
+ rem += interval
+}
+bucket_key = value - rem
+--------------------------------------------------
+
+The following snippet "buckets" the products based on their `price` by interval of `50`:
+
+[source,js]
+--------------------------------------------------
+{
+ "aggs" : {
+ "prices" : {
+ "histogram" : {
+ "field" : "price",
+ "interval" : 50
+ }
+ }
+ }
+}
+--------------------------------------------------
+
+And the following may be the response:
+
+[source,js]
+--------------------------------------------------
+{
+ "aggregations": {
+ "prices" : {
+ "buckets": [
+ {
+ "key": 0,
+ "doc_count": 2
+ },
+ {
+ "key": 50,
+ "doc_count": 4
+ },
+ {
+ "key": 100,
+ "doc_count": 0
+ },
+ {
+ "key": 150,
+ "doc_count": 3
+ }
+ ]
+ }
+ }
+}
+--------------------------------------------------
+
+==== Minimum document count
+
+The response above show that no documents has a price that falls within the range of `[100 - 150)`. By default the
+response will fill gaps in the histogram with empty buckets. It is possible change that and request buckets with
+a higher minimum count thanks to the `min_doc_count` setting:
+
+[source,js]
+--------------------------------------------------
+{
+ "aggs" : {
+ "prices" : {
+ "histogram" : {
+ "field" : "price",
+ "interval" : 50,
+ "min_doc_count" : 1
+ }
+ }
+ }
+}
+--------------------------------------------------
+
+Response:
+
+[source,js]
+--------------------------------------------------
+{
+ "aggregations": {
+ "prices" : {
+ "buckets": [
+ {
+ "key": 0,
+ "doc_count": 2
+ },
+ {
+ "key": 50,
+ "doc_count": 4
+ },
+ {
+ "key": 150,
+ "doc_count": 3
+ }
+ ]
+ }
+ }
+}
+--------------------------------------------------
+
+[[search-aggregations-bucket-histogram-aggregation-extended-bounds]]
+By default the date_/histogram returns all the buckets within the range of the data itself, that is, the documents with
+the smallest values (on which with histogram) will determine the min bucket (the bucket with the smallest key) and the
+documents with the highest values will determine the max bucket (the bucket with the highest key). Often, when when
+requesting empty buckets, this causes a confusion, specifically, when the data is also filtered.
+
+To understand why, let's look at an example:
+
+Lets say the you're filtering your request to get all docs with values between `0` and `500`, in addition you'd like
+to slice the data per price using a histogram with an interval of `50`. You also specify `"min_doc_count" : 0` as you'd
+like to get all buckets even the empty ones. If it happens that all products (documents) have prices higher than `100`,
+the first bucket you'll get will be the one with `100` as its key. This is confusing, as many times, you'd also like
+to get those buckets between `0 - 100`.
+
+With `extended_bounds` setting, you now can "force" the histogram aggregation to start building buckets on a specific
+`min` values and also keep on building buckets up to a `max` value (even if there are no documents anymore). Using
+`extended_bounds` only makes sense when `min_doc_count` is 0 (the empty buckets will never be returned if `min_doc_count`
+is greater than 0).
+
+Note that (as the name suggest) `extended_bounds` is **not** filtering buckets. Meaning, if the `extended_bounds.min` is higher
+than the values extracted from the documents, the documents will still dictate what the first bucket will be (and the
+same goes for the `extended_bounds.max` and the last bucket). For filtering buckets, one should nest the histogram aggregation
+under a range `filter` aggregation with the appropriate `from`/`to` settings.
+
+Example:
+
+[source,js]
+--------------------------------------------------
+{
+ "query" : {
+ "filtered" : { "filter": { "range" : { "price" : { "to" : "500" } } } }
+ },
+ "aggs" : {
+ "prices" : {
+ "histogram" : {
+ "field" : "price",
+ "interval" : 50,
+ "extended_bounds" : {
+ "min" : 0,
+ "max" : 500
+ }
+ }
+ }
+ }
+}
+--------------------------------------------------
+
+==== Order
+
+By default the returned buckets are sorted by their `key` ascending, though the order behaviour can be controled
+using the `order` setting.
+
+Ordering the buckets by their key - descending:
+
+[source,js]
+--------------------------------------------------
+{
+ "aggs" : {
+ "prices" : {
+ "histogram" : {
+ "field" : "price",
+ "interval" : 50,
+ "order" : { "_key" : "desc" }
+ }
+ }
+ }
+}
+--------------------------------------------------
+
+Ordering the buckets by their `doc_count` - ascending:
+
+[source,js]
+--------------------------------------------------
+{
+ "aggs" : {
+ "prices" : {
+ "histogram" : {
+ "field" : "price",
+ "interval" : 50,
+ "order" : { "_count" : "asc" }
+ }
+ }
+ }
+}
+--------------------------------------------------
+
+If the histogram aggregation has a direct metrics sub-aggregation, the latter can determine the order of the buckets:
+
+[source,js]
+--------------------------------------------------
+{
+ "aggs" : {
+ "prices" : {
+ "histogram" : {
+ "field" : "price",
+ "interval" : 50,
+ "order" : { "price_stats.min" : "asc" } <1>
+ },
+ "aggs" : {
+ "price_stats" : { "stats" : {} } <2>
+ }
+ }
+ }
+}
+--------------------------------------------------
+
+<1> The `{ "price_stats.min" : asc" }` will sort the buckets based on `min` value of their `price_stats` sub-aggregation.
+
+<2> There is no need to configure the `price` field for the `price_stats` aggregation as it will inherit it by default from its parent histogram aggregation.
+
+It is also possible to order the buckets based on a "deeper" aggregation in the hierarchy. This is supported as long
+as the aggregations path are of a single-bucket type, where the last aggregation in the path may either by a single-bucket
+one or a metrics one. If it's a single-bucket type, the order will be defined by the number of docs in the bucket (i.e. `doc_count`),
+in case it's a metrics one, the same rules as above apply (where the path must indicate the metric name to sort by in case of
+a multi-value metrics aggregation, and in case of a single-value metrics aggregation the sort will be applied on that value).
+
+The path must be defined in the following form:
+
+--------------------------------------------------
+AGG_SEPARATOR := '>'
+METRIC_SEPARATOR := '.'
+AGG_NAME := <the name of the aggregation>
+METRIC := <the name of the metric (in case of multi-value metrics aggregation)>
+PATH := <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR><METRIC>]
+--------------------------------------------------
+
+[source,js]
+--------------------------------------------------
+{
+ "aggs" : {
+ "prices" : {
+ "histogram" : {
+ "field" : "price",
+ "interval" : 50,
+ "order" : { "promoted_products>rating_stats.avg" : "desc" } <1>
+ },
+ "aggs" : {
+ "promoted_products" : {
+ "filter" : { "term" : { "promoted" : true }},
+ "aggs" : {
+ "rating_stats" : { "stats" : { "field" : "rating" }}
+ }
+ }
+ }
+ }
+ }
+}
+--------------------------------------------------
+
+The above will sort the buckets based on the avg rating among the promoted products
+
+
+==== Offset
+
+By default the bucket keys start with 0 and then continue in even spaced steps of `interval`, e.g. if the interval is 10 the first buckets
+(assuming there is data inside them) will be [0 - 9], [10-19], [20-29]. The bucket boundaries can be shifted by using the `offset` option.
+
+This can be best illustrated with an example. If there are 10 documents with values ranging from 5 to 14, using interval `10` will result in
+two buckets with 5 documents each. If an additional offset `5` is used, there will be only one single bucket [5-14] containing all the 10
+documents.
+
+==== Response Format
+
+By default, the buckets are returned as an ordered array. It is also possible to request the response as a hash
+instead keyed by the buckets keys:
+
+[source,js]
+--------------------------------------------------
+{
+ "aggs" : {
+ "prices" : {
+ "histogram" : {
+ "field" : "price",
+ "interval" : 50,
+ "keyed" : true
+ }
+ }
+ }
+}
+--------------------------------------------------
+
+Response:
+
+[source,js]
+--------------------------------------------------
+{
+ "aggregations": {
+ "prices": {
+ "buckets": {
+ "0": {
+ "key": 0,
+ "doc_count": 2
+ },
+ "50": {
+ "key": 50,
+ "doc_count": 4
+ },
+ "150": {
+ "key": 150,
+ "doc_count": 3
+ }
+ }
+ }
+ }
+}
+--------------------------------------------------