From e3ae1df6f0896bf3b835042525e1531c143cba93 Mon Sep 17 00:00:00 2001 From: Zachary Tong Date: Fri, 1 May 2015 16:04:55 -0400 Subject: [DOCS] Restructure Aggs documentation --- docs/reference/aggregations.asciidoc | 105 ++++ docs/reference/aggregations/bucket.asciidoc | 49 ++ .../bucket/children-aggregation.asciidoc | 344 +++++++++++ .../bucket/datehistogram-aggregation.asciidoc | 125 ++++ .../bucket/daterange-aggregation.asciidoc | 113 ++++ .../bucket/filter-aggregation.asciidoc | 38 ++ .../bucket/filters-aggregation.asciidoc | 128 ++++ .../bucket/geodistance-aggregation.asciidoc | 106 ++++ .../bucket/geohashgrid-aggregation.asciidoc | 131 ++++ .../bucket/global-aggregation.asciidoc | 51 ++ .../bucket/histogram-aggregation.asciidoc | 319 ++++++++++ .../bucket/iprange-aggregation.asciidoc | 98 +++ .../bucket/missing-aggregation.asciidoc | 34 ++ .../bucket/nested-aggregation.asciidoc | 67 +++ .../aggregations/bucket/range-aggregation.asciidoc | 277 +++++++++ .../bucket/reverse-nested-aggregation.asciidoc | 118 ++++ .../bucket/sampler-aggregation.asciidoc | 154 +++++ .../bucket/significantterms-aggregation.asciidoc | 524 ++++++++++++++++ .../aggregations/bucket/terms-aggregation.asciidoc | 657 +++++++++++++++++++++ docs/reference/aggregations/metrics.asciidoc | 48 ++ .../aggregations/metrics/avg-aggregation.asciidoc | 75 +++ .../metrics/cardinality-aggregation.asciidoc | 157 +++++ .../metrics/extendedstats-aggregation.asciidoc | 119 ++++ .../metrics/geobounds-aggregation.asciidoc | 53 ++ .../aggregations/metrics/max-aggregation.asciidoc | 69 +++ .../aggregations/metrics/min-aggregation.asciidoc | 68 +++ .../metrics/percentile-aggregation.asciidoc | 192 ++++++ .../metrics/percentile-rank-aggregation.asciidoc | 88 +++ .../metrics/scripted-metric-aggregation.asciidoc | 237 ++++++++ .../metrics/stats-aggregation.asciidoc | 81 +++ .../aggregations/metrics/sum-aggregation.asciidoc | 79 +++ .../metrics/tophits-aggregation.asciidoc | 275 +++++++++ 
.../metrics/valuecount-aggregation.asciidoc | 51 ++ docs/reference/aggregations/misc.asciidoc | 76 +++ docs/reference/aggregations/reducer.asciidoc | 160 +++++ .../reducer/derivative-aggregation.asciidoc | 196 ++++++ .../reducer/max-bucket-aggregation.asciidoc | 101 ++++ .../reducer/min-bucket-aggregation.asciidoc | 102 ++++ .../reducer/movavg-aggregation.asciidoc | 274 +++++++++ docs/reference/index.asciidoc | 2 + docs/reference/search.asciidoc | 2 - docs/reference/search/aggregations.asciidoc | 234 -------- docs/reference/search/aggregations/bucket.asciidoc | 33 -- .../bucket/children-aggregation.asciidoc | 344 ----------- .../bucket/datehistogram-aggregation.asciidoc | 125 ---- .../bucket/daterange-aggregation.asciidoc | 113 ---- .../bucket/filter-aggregation.asciidoc | 38 -- .../bucket/filters-aggregation.asciidoc | 128 ---- .../bucket/geodistance-aggregation.asciidoc | 106 ---- .../bucket/geohashgrid-aggregation.asciidoc | 131 ---- .../bucket/global-aggregation.asciidoc | 51 -- .../bucket/histogram-aggregation.asciidoc | 319 ---------- .../bucket/iprange-aggregation.asciidoc | 98 --- .../bucket/missing-aggregation.asciidoc | 34 -- .../bucket/nested-aggregation.asciidoc | 67 --- .../aggregations/bucket/range-aggregation.asciidoc | 277 --------- .../bucket/reverse-nested-aggregation.asciidoc | 118 ---- .../bucket/sampler-aggregation.asciidoc | 154 ----- .../bucket/significantterms-aggregation.asciidoc | 524 ---------------- .../aggregations/bucket/terms-aggregation.asciidoc | 657 --------------------- .../reference/search/aggregations/metrics.asciidoc | 27 - .../aggregations/metrics/avg-aggregation.asciidoc | 75 --- .../metrics/cardinality-aggregation.asciidoc | 157 ----- .../metrics/extendedstats-aggregation.asciidoc | 119 ---- .../metrics/geobounds-aggregation.asciidoc | 53 -- .../aggregations/metrics/max-aggregation.asciidoc | 69 --- .../aggregations/metrics/min-aggregation.asciidoc | 68 --- .../metrics/percentile-aggregation.asciidoc | 192 ------ 
.../metrics/percentile-rank-aggregation.asciidoc | 88 --- .../metrics/scripted-metric-aggregation.asciidoc | 237 -------- .../metrics/stats-aggregation.asciidoc | 81 --- .../aggregations/metrics/sum-aggregation.asciidoc | 79 --- .../metrics/tophits-aggregation.asciidoc | 275 --------- .../metrics/valuecount-aggregation.asciidoc | 51 -- .../reference/search/aggregations/reducer.asciidoc | 6 - .../reducer/derivative-aggregation.asciidoc | 194 ------ .../reducer/max-bucket-aggregation.asciidoc | 82 --- .../reducer/min-bucket-aggregation.asciidoc | 82 --- .../reducer/movavg-aggregation.asciidoc | 294 --------- 79 files changed, 5941 insertions(+), 5782 deletions(-) create mode 100644 docs/reference/aggregations.asciidoc create mode 100644 docs/reference/aggregations/bucket.asciidoc create mode 100644 docs/reference/aggregations/bucket/children-aggregation.asciidoc create mode 100644 docs/reference/aggregations/bucket/datehistogram-aggregation.asciidoc create mode 100644 docs/reference/aggregations/bucket/daterange-aggregation.asciidoc create mode 100644 docs/reference/aggregations/bucket/filter-aggregation.asciidoc create mode 100644 docs/reference/aggregations/bucket/filters-aggregation.asciidoc create mode 100644 docs/reference/aggregations/bucket/geodistance-aggregation.asciidoc create mode 100644 docs/reference/aggregations/bucket/geohashgrid-aggregation.asciidoc create mode 100644 docs/reference/aggregations/bucket/global-aggregation.asciidoc create mode 100644 docs/reference/aggregations/bucket/histogram-aggregation.asciidoc create mode 100644 docs/reference/aggregations/bucket/iprange-aggregation.asciidoc create mode 100644 docs/reference/aggregations/bucket/missing-aggregation.asciidoc create mode 100644 docs/reference/aggregations/bucket/nested-aggregation.asciidoc create mode 100644 docs/reference/aggregations/bucket/range-aggregation.asciidoc create mode 100644 docs/reference/aggregations/bucket/reverse-nested-aggregation.asciidoc create mode 100644 
docs/reference/aggregations/bucket/sampler-aggregation.asciidoc create mode 100644 docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc create mode 100644 docs/reference/aggregations/bucket/terms-aggregation.asciidoc create mode 100644 docs/reference/aggregations/metrics.asciidoc create mode 100644 docs/reference/aggregations/metrics/avg-aggregation.asciidoc create mode 100644 docs/reference/aggregations/metrics/cardinality-aggregation.asciidoc create mode 100644 docs/reference/aggregations/metrics/extendedstats-aggregation.asciidoc create mode 100644 docs/reference/aggregations/metrics/geobounds-aggregation.asciidoc create mode 100644 docs/reference/aggregations/metrics/max-aggregation.asciidoc create mode 100644 docs/reference/aggregations/metrics/min-aggregation.asciidoc create mode 100644 docs/reference/aggregations/metrics/percentile-aggregation.asciidoc create mode 100644 docs/reference/aggregations/metrics/percentile-rank-aggregation.asciidoc create mode 100644 docs/reference/aggregations/metrics/scripted-metric-aggregation.asciidoc create mode 100644 docs/reference/aggregations/metrics/stats-aggregation.asciidoc create mode 100644 docs/reference/aggregations/metrics/sum-aggregation.asciidoc create mode 100644 docs/reference/aggregations/metrics/tophits-aggregation.asciidoc create mode 100644 docs/reference/aggregations/metrics/valuecount-aggregation.asciidoc create mode 100644 docs/reference/aggregations/misc.asciidoc create mode 100644 docs/reference/aggregations/reducer.asciidoc create mode 100644 docs/reference/aggregations/reducer/derivative-aggregation.asciidoc create mode 100644 docs/reference/aggregations/reducer/max-bucket-aggregation.asciidoc create mode 100644 docs/reference/aggregations/reducer/min-bucket-aggregation.asciidoc create mode 100644 docs/reference/aggregations/reducer/movavg-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations.asciidoc delete mode 100644 
docs/reference/search/aggregations/bucket.asciidoc delete mode 100644 docs/reference/search/aggregations/bucket/children-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/bucket/datehistogram-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/bucket/daterange-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/bucket/filter-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/bucket/filters-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/bucket/geodistance-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/bucket/geohashgrid-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/bucket/global-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/bucket/histogram-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/bucket/iprange-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/bucket/missing-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/bucket/nested-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/bucket/range-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/bucket/reverse-nested-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/bucket/sampler-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/bucket/significantterms-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/bucket/terms-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/metrics.asciidoc delete mode 100644 docs/reference/search/aggregations/metrics/avg-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/metrics/cardinality-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/metrics/extendedstats-aggregation.asciidoc delete mode 100644 
docs/reference/search/aggregations/metrics/geobounds-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/metrics/max-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/metrics/min-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/metrics/percentile-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/metrics/percentile-rank-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/metrics/scripted-metric-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/metrics/stats-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/metrics/sum-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/metrics/tophits-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/metrics/valuecount-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/reducer.asciidoc delete mode 100644 docs/reference/search/aggregations/reducer/derivative-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/reducer/max-bucket-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/reducer/min-bucket-aggregation.asciidoc delete mode 100644 docs/reference/search/aggregations/reducer/movavg-aggregation.asciidoc diff --git a/docs/reference/aggregations.asciidoc b/docs/reference/aggregations.asciidoc new file mode 100644 index 0000000000..c6fb674834 --- /dev/null +++ b/docs/reference/aggregations.asciidoc @@ -0,0 +1,105 @@ +[[search-aggregations]] += Aggregations + +[partintro] +-- +The aggregations framework helps provide aggregated data based on a search query. It is based on simple building blocks +called aggregations, that can be composed in order to build complex summaries of the data. + +An aggregation can be seen as a _unit-of-work_ that builds analytic information over a set of documents. 
The context of +the execution defines what this document set is (e.g. a top-level aggregation executes within the context of the executed +query/filters of the search request). + +There are many different types of aggregations, each with its own purpose and output. To better understand these types, +it is often easier to break them into three main families: + +<>:: + A family of aggregations that build buckets, where each bucket is associated with a _key_ and a document + criterion. When the aggregation is executed, all the bucket criteria are evaluated on every document in + the context and when a criterion matches, the document is considered to "fall in" the relevant bucket. + By the end of the aggregation process, we'll end up with a list of buckets - each one with a set of + documents that "belong" to it. + +<>:: + Aggregations that keep track of and compute metrics over a set of documents. + +<>:: + Aggregations that aggregate the output of other aggregations and their associated metrics. + +The interesting part comes next. Since each bucket effectively defines a document set (all documents belonging to +the bucket), one can potentially associate aggregations on the bucket level, and those will execute within the context +of that bucket. This is where the real power of aggregations kicks in: *aggregations can be nested!* + +NOTE: Bucketing aggregations can have sub-aggregations (bucketing or metric). The sub-aggregations will be computed for + the buckets which their parent aggregation generates. There is no hard limit on the level/depth of nested + aggregations (one can nest an aggregation under a "parent" aggregation, which is itself a sub-aggregation of + another higher-level aggregation). + +[float] +== Structuring Aggregations + +The following snippet captures the basic structure of aggregations: + +[source,js] +-------------------------------------------------- +"aggregations" : { + "" : { + "" : { + + } + [,"meta" : { [] } ]? + [,"aggregations" : { []+ } ]? 
+ } + [,"" : { ... } ]* +} +-------------------------------------------------- + +The `aggregations` object (the key `aggs` can also be used) in the JSON holds the aggregations to be computed. Each aggregation +is associated with a logical name that the user defines (e.g. if the aggregation computes the average price, then it would +make sense to name it `avg_price`). These logical names will also be used to uniquely identify the aggregations in the +response. Each aggregation has a specific type (`` in the above snippet) and is typically the first +key within the named aggregation body. Each type of aggregation defines its own body, depending on the nature of the +aggregation (e.g. an `avg` aggregation on a specific field will define the field on which the average will be calculated). +At the same level of the aggregation type definition, one can optionally define a set of additional aggregations, +though this only makes sense if the aggregation you defined is of a bucketing nature. In this scenario, the +sub-aggregations you define on the bucketing aggregation level will be computed for all the buckets built by the +bucketing aggregation. For example, if you define a set of aggregations under the `range` aggregation, the +sub-aggregations will be computed for the range buckets that are defined. + +[float] +=== Values Source + +Some aggregations work on values extracted from the aggregated documents. Typically, the values will be extracted from +a specific document field which is set using the `field` key for the aggregations. It is also possible to define a +<> which will generate the values (per document). + +TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. + +When both `field` and `script` settings are configured for the aggregation, the script will be treated as a +`value script`. While normal scripts are evaluated on a document level (i.e. 
the script has access to all the data +associated with the document), value scripts are evaluated on the *value* level. In this mode, the values are extracted +from the configured `field` and the `script` is used to apply a "transformation" over these values. + +["NOTE",id="aggs-script-note"] +=============================== +When working with scripts, the `lang` and `params` settings can also be defined. The former defines the scripting +language which is used (assuming the proper language is available in Elasticsearch, either by default or as a plugin). The latter +enables defining all the "dynamic" expressions in the script as parameters, which enables the script to keep itself static +between calls (this will ensure the use of the cached compiled scripts in Elasticsearch). +=============================== + +Scripts can generate a single value or multiple values per document. When generating multiple values, one can use the +`script_values_sorted` setting to indicate whether these values are sorted or not. Internally, Elasticsearch can +perform optimizations when dealing with sorted values (for example, with the `min` aggregation, knowing the values are +sorted, Elasticsearch will skip the iterations over all the values and rely on the first value in the list to be the +minimum value among all other values associated with the same document). + +-- + +include::aggregations/metrics.asciidoc[] + +include::aggregations/bucket.asciidoc[] + +include::aggregations/reducer.asciidoc[] + +include::aggregations/misc.asciidoc[] diff --git a/docs/reference/aggregations/bucket.asciidoc b/docs/reference/aggregations/bucket.asciidoc new file mode 100644 index 0000000000..2d185dd49a --- /dev/null +++ b/docs/reference/aggregations/bucket.asciidoc @@ -0,0 +1,49 @@ +[[search-aggregations-bucket]] +== Bucket Aggregations + +Bucket aggregations don't calculate metrics over fields like the metrics aggregations do, but instead, they create +buckets of documents. 
Each bucket is associated with a criterion (depending on the aggregation type) which determines +whether or not a document in the current context "falls" into it. In other words, the buckets effectively define document +sets. In addition to the buckets themselves, the `bucket` aggregations also compute and return the number of documents +that "fell in" to each bucket. + +Bucket aggregations, as opposed to `metrics` aggregations, can hold sub-aggregations. These sub-aggregations will be +aggregated for the buckets created by their "parent" bucket aggregation. + +There are different bucket aggregators, each with a different "bucketing" strategy. Some define a single bucket, some +define a fixed number of multiple buckets, and others dynamically create the buckets during the aggregation process. + +include::bucket/children-aggregation.asciidoc[] + +include::bucket/datehistogram-aggregation.asciidoc[] + +include::bucket/daterange-aggregation.asciidoc[] + +include::bucket/filter-aggregation.asciidoc[] + +include::bucket/filters-aggregation.asciidoc[] + +include::bucket/geodistance-aggregation.asciidoc[] + +include::bucket/geohashgrid-aggregation.asciidoc[] + +include::bucket/global-aggregation.asciidoc[] + +include::bucket/histogram-aggregation.asciidoc[] + +include::bucket/iprange-aggregation.asciidoc[] + +include::bucket/missing-aggregation.asciidoc[] + +include::bucket/nested-aggregation.asciidoc[] + +include::bucket/range-aggregation.asciidoc[] + +include::bucket/reverse-nested-aggregation.asciidoc[] + +include::bucket/sampler-aggregation.asciidoc[] + +include::bucket/significantterms-aggregation.asciidoc[] + +include::bucket/terms-aggregation.asciidoc[] + diff --git a/docs/reference/aggregations/bucket/children-aggregation.asciidoc b/docs/reference/aggregations/bucket/children-aggregation.asciidoc new file mode 100644 index 0000000000..e69877d97f --- /dev/null +++ b/docs/reference/aggregations/bucket/children-aggregation.asciidoc @@ -0,0 +1,344 @@ 
+[[search-aggregations-bucket-children-aggregation]] +=== Children Aggregation + +A special single bucket aggregation that enables aggregating from buckets on parent document types to buckets on child documents. + +This aggregation relies on the <> in the mapping. This aggregation has a single option: + +* `type` - The child type that the buckets in the parent space should be mapped to. + +For example, let's say we have an index of questions and answers. The answer type has the following `_parent` field in the mapping: +[source,js] +-------------------------------------------------- +{ + "answer" : { + "_parent" : { + "type" : "question" + } + } +} +-------------------------------------------------- + +The question typed documents contain a `tags` field and the answer typed documents contain an `owner` field. With the `children` +aggregation the tag buckets can be mapped to the owner buckets in a single request even though the two fields exist in +two different kinds of documents. + +An example of a question typed document: +[source,js] +-------------------------------------------------- +{ + "body": "

I have Windows 2003 server and i bought a new Windows 2008 server...", + "title": "Whats the best way to file transfer my site from server to a newer one?", + "tags": [ + "windows-server-2003", + "windows-server-2008", + "file-transfer" + ] +} +-------------------------------------------------- + +An example of an answer typed document: +[source,js] +-------------------------------------------------- +{ + "owner": { + "location": "Norfolk, United Kingdom", + "display_name": "Sam", + "id": 48 + }, + "body": "

Unfortunately your pretty much limited to FTP...", + "creation_date": "2009-05-04T13:45:37.030" +} +-------------------------------------------------- + +The following request can be built that connects the two together: + +[source,js] +-------------------------------------------------- +{ + "aggs": { + "top-tags": { + "terms": { + "field": "tags", + "size": 10 + }, + "aggs": { + "to-answers": { + "children": { + "type" : "answer" <1> + }, + "aggs": { + "top-names": { + "terms": { + "field": "owner.display_name", + "size": 10 + } + } + } + } + } + } + } +} +-------------------------------------------------- + +<1> The `type` points to type / mapping with the name `answer`. + +The above example returns the top question tags and per tag the top answer owners. + +Possible response: + +[source,js] +-------------------------------------------------- +{ + "aggregations": { + "top-tags": { + "buckets": [ + { + "key": "windows-server-2003", + "doc_count": 25365, <1> + "to-answers": { + "doc_count": 36004, <2> + "top-names": { + "buckets": [ + { + "key": "Sam", + "doc_count": 274 + }, + { + "key": "chris", + "doc_count": 19 + }, + { + "key": "david", + "doc_count": 14 + }, + ... + ] + } + } + }, + { + "key": "linux", + "doc_count": 18342, + "to-answers": { + "doc_count": 6655, + "top-names": { + "buckets": [ + { + "key": "abrams", + "doc_count": 25 + }, + { + "key": "ignacio", + "doc_count": 25 + }, + { + "key": "vazquez", + "doc_count": 25 + }, + ... + ] + } + } + }, + { + "key": "windows", + "doc_count": 18119, + "to-answers": { + "doc_count": 24051, + "top-names": { + "buckets": [ + { + "key": "molly7244", + "doc_count": 265 + }, + { + "key": "david", + "doc_count": 27 + }, + { + "key": "chris", + "doc_count": 26 + }, + ... 
+ ] + } + } + }, + { + "key": "osx", + "doc_count": 10971, + "to-answers": { + "doc_count": 5902, + "top-names": { + "buckets": [ + { + "key": "diago", + "doc_count": 4 + }, + { + "key": "albert", + "doc_count": 3 + }, + { + "key": "asmus", + "doc_count": 3 + }, + ... + ] + } + } + }, + { + "key": "ubuntu", + "doc_count": 8743, + "to-answers": { + "doc_count": 8784, + "top-names": { + "buckets": [ + { + "key": "ignacio", + "doc_count": 9 + }, + { + "key": "abrams", + "doc_count": 8 + }, + { + "key": "molly7244", + "doc_count": 8 + }, + ... + ] + } + } + }, + { + "key": "windows-xp", + "doc_count": 7517, + "to-answers": { + "doc_count": 13610, + "top-names": { + "buckets": [ + { + "key": "molly7244", + "doc_count": 232 + }, + { + "key": "chris", + "doc_count": 9 + }, + { + "key": "john", + "doc_count": 9 + }, + ... + ] + } + } + }, + { + "key": "networking", + "doc_count": 6739, + "to-answers": { + "doc_count": 2076, + "top-names": { + "buckets": [ + { + "key": "molly7244", + "doc_count": 6 + }, + { + "key": "alnitak", + "doc_count": 5 + }, + { + "key": "chris", + "doc_count": 3 + }, + ... + ] + } + } + }, + { + "key": "mac", + "doc_count": 5590, + "to-answers": { + "doc_count": 999, + "top-names": { + "buckets": [ + { + "key": "abrams", + "doc_count": 2 + }, + { + "key": "ignacio", + "doc_count": 2 + }, + { + "key": "vazquez", + "doc_count": 2 + }, + ... + ] + } + } + }, + { + "key": "wireless-networking", + "doc_count": 4409, + "to-answers": { + "doc_count": 6497, + "top-names": { + "buckets": [ + { + "key": "molly7244", + "doc_count": 61 + }, + { + "key": "chris", + "doc_count": 5 + }, + { + "key": "mike", + "doc_count": 5 + }, + ... + ] + } + } + }, + { + "key": "windows-8", + "doc_count": 3601, + "to-answers": { + "doc_count": 4263, + "top-names": { + "buckets": [ + { + "key": "molly7244", + "doc_count": 3 + }, + { + "key": "msft", + "doc_count": 2 + }, + { + "key": "user172132", + "doc_count": 2 + }, + ... 
+ ] + } + } + } + ] + } + } +} +-------------------------------------------------- + +<1> The number of question documents with the tag `windows-server-2003`. +<2> The number of answer documents that are related to question documents with the tag `windows-server-2003`. diff --git a/docs/reference/aggregations/bucket/datehistogram-aggregation.asciidoc b/docs/reference/aggregations/bucket/datehistogram-aggregation.asciidoc new file mode 100644 index 0000000000..256ef62d76 --- /dev/null +++ b/docs/reference/aggregations/bucket/datehistogram-aggregation.asciidoc @@ -0,0 +1,125 @@ +[[search-aggregations-bucket-datehistogram-aggregation]] +=== Date Histogram Aggregation + +A multi-bucket aggregation similar to the <> except it can +only be applied on date values. Since dates are represented in Elasticsearch internally as long values, it is possible +to use the normal `histogram` on dates as well, though accuracy will be compromised. The reason for this is the fact +that time based intervals are not fixed (think of leap years and of the varying number of days in a month). For this reason, +we need special support for time based data. From a functionality perspective, this histogram supports the same features +as the normal <>. The main difference is that the interval can be specified by date/time expressions. + +Requesting bucket intervals of a month. + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "articles_over_time" : { + "date_histogram" : { + "field" : "date", + "interval" : "month" + } + } + } +} +-------------------------------------------------- + +Available expressions for interval: `year`, `quarter`, `month`, `week`, `day`, `hour`, `minute`, `second` + + +Fractional values are allowed for seconds, minutes, hours, days and weeks. 
For example 1.5 hours: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "articles_over_time" : { + "date_histogram" : { + "field" : "date", + "interval" : "1.5h" + } + } + } +} +-------------------------------------------------- + +See <> for accepted abbreviations. + +==== Time Zone + +By default, times are stored as UTC milliseconds since the epoch. Thus, all computation and "bucketing" / "rounding" is +done on UTC. It is possible to provide a time zone value, which will cause all bucket +computations to take place in the specified zone. The time returned for each bucket/entry is milliseconds since the +epoch in UTC. The parameter is called `time_zone`. It accepts a numeric value for the hours offset, for example: +`"time_zone" : -2`. It also accepts a format of hours and minutes, like `"time_zone" : "-02:30"`. +Another option is to provide a time zone accepted as one of the values listed here. + +Let's take an example. Consider `2012-04-01T04:15:30Z` (UTC) with a `time_zone` of `"-08:00"`. For a day interval, applying +the time zone and rounding puts the time under `2012-03-31`, so the returned value will be (in millis) that of +`2012-03-31T08:00:00Z` (UTC). For an hour interval, internally applying the time zone results in `2012-03-31T20:15:30`, so rounding it +in the time zone results in `2012-03-31T20:00:00`, but we return that rounded value converted back to UTC, to be consistent, as +`2012-04-01T04:00:00Z` (UTC). + +==== Offset + +The `offset` option can be provided to shift the date bucket interval boundaries after any other shifts due to +time zones are applied. This for example makes it possible that daily buckets go from 6AM to 6AM the next day instead of starting at 12AM +or that monthly buckets go from the 10th of the month to the 10th of the next month instead of the 1st. + +The `offset` option accepts positive or negative time durations like "1h" for an hour or "1M" for a month. 
See <> for more +possible time duration options. + +==== Keys + +Since internally, dates are represented as 64-bit numbers, these numbers are returned as the bucket keys (each key +representing a date - milliseconds since the epoch). It is also possible to define a date format, which will result in +returning the dates as formatted strings next to the numeric key values: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "articles_over_time" : { + "date_histogram" : { + "field" : "date", + "interval" : "1M", + "format" : "yyyy-MM-dd" <1> + } + } + } +} +-------------------------------------------------- + +<1> Supports expressive date <> + +Response: + +[source,js] +-------------------------------------------------- +{ + "aggregations": { + "articles_over_time": { + "buckets": [ + { + "key_as_string": "2012-02-02", + "key": 1328140800000, + "doc_count": 1 + }, + { + "key_as_string": "2012-03-02", + "key": 1330646400000, + "doc_count": 2 + }, + ... + ] + } + } +} +-------------------------------------------------- + +Like with the normal <>, both document level scripts and +value level scripts are supported. It is also possible to control the order of the returned buckets using the `order` +setting and filter the returned buckets based on a `min_doc_count` setting (by default all buckets between the first +bucket that matches documents and the last one are returned). This histogram also supports the `extended_bounds` +setting, which enables extending the bounds of the histogram beyond the data itself (to read more on why you'd want to +do that please refer to the explanation <>). 
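The rounding walked through in the Time Zone section of the date histogram page can be reproduced with a few lines of arithmetic. The following is a minimal sketch in Python, not Elasticsearch's implementation: the function names `bucket_key_millis` and `millis_to_iso` are hypothetical, only fixed hour/minute offsets are handled, and only the `day` and `hour` intervals are covered.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_key_millis(instant_utc, utc_offset, interval):
    """Hypothetical helper: shift the instant into a fixed-offset zone,
    round it down to the start of its bucket there, and return that
    bucket start as UTC milliseconds since the epoch."""
    local = instant_utc.astimezone(timezone(utc_offset))
    if interval == "day":
        floored = local.replace(hour=0, minute=0, second=0, microsecond=0)
    elif interval == "hour":
        floored = local.replace(minute=0, second=0, microsecond=0)
    else:
        raise ValueError("this sketch only handles 'day' and 'hour'")
    # Subtracting two aware datetimes compares them as instants (in UTC).
    return (floored - EPOCH) // timedelta(milliseconds=1)


def millis_to_iso(millis):
    """Format epoch milliseconds as a UTC timestamp string."""
    moment = EPOCH + timedelta(milliseconds=millis)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


# The instant and time zone used in the Time Zone section.
instant = datetime(2012, 4, 1, 4, 15, 30, tzinfo=timezone.utc)
offset = timedelta(hours=-8)

print(millis_to_iso(bucket_key_millis(instant, offset, "day")))
print(millis_to_iso(bucket_key_millis(instant, offset, "hour")))
```

Under these assumptions the two printed values should match the UTC timestamps given in the Time Zone section for the day and hour intervals. Named zones with daylight saving transitions, and the `offset` option, are deliberately left out of the sketch.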
diff --git a/docs/reference/aggregations/bucket/daterange-aggregation.asciidoc b/docs/reference/aggregations/bucket/daterange-aggregation.asciidoc new file mode 100644 index 0000000000..7c5d6cc86f --- /dev/null +++ b/docs/reference/aggregations/bucket/daterange-aggregation.asciidoc @@ -0,0 +1,113 @@ +[[search-aggregations-bucket-daterange-aggregation]] +=== Date Range Aggregation + +A range aggregation that is dedicated to date values. The main difference between this aggregation and the normal <> aggregation is that the `from` and `to` values can be expressed in <> expressions, and it is also possible to specify a date format by which the `from` and `to` response fields will be returned. +Note that this aggregation includes the `from` value and excludes the `to` value for each range. + +Example: + +[source,js] +-------------------------------------------------- +{ + "aggs": { + "range": { + "date_range": { + "field": "date", + "format": "MM-yyy", + "ranges": [ + { "to": "now-10M/M" }, <1> + { "from": "now-10M/M" } <2> + ] + } + } + } +} +-------------------------------------------------- +<1> < now minus 10 months, rounded down to the start of the month. +<2> >= now minus 10 months, rounded down to the start of the month. + +In the example above, we created two range buckets, the first will "bucket" all documents dated prior to 10 months ago and +the second will "bucket" all documents dated since 10 months ago. + +Response: + +[source,js] +-------------------------------------------------- +{ + ... 
+ + "aggregations": { + "range": { + "buckets": [ + { + "to": 1.3437792E+12, + "to_as_string": "08-2012", + "doc_count": 7 + }, + { + "from": 1.3437792E+12, + "from_as_string": "08-2012", + "doc_count": 2 + } + ] + } + } +} +-------------------------------------------------- + +[[date-format-pattern]] +==== Date Format/Pattern + +NOTE: this information was copied from http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html[JodaDate] + +All ASCII letters are reserved as format pattern letters, which are defined as follows: + +[options="header"] +|======= +|Symbol |Meaning |Presentation |Examples +|G |era |text |AD +|C |century of era (>=0) |number |20 +|Y |year of era (>=0) |year |1996 + +|x |weekyear |year |1996 +|w |week of weekyear |number |27 +|e |day of week |number |2 +|E |day of week |text |Tuesday; Tue + +|y |year |year |1996 +|D |day of year |number |189 +|M |month of year |month |July; Jul; 07 +|d |day of month |number |10 + +|a |halfday of day |text |PM +|K |hour of halfday (0~11) |number |0 +|h |clockhour of halfday (1~12) |number |12 + +|H |hour of day (0~23) |number |0 +|k |clockhour of day (1~24) |number |24 +|m |minute of hour |number |30 +|s |second of minute |number |55 +|S |fraction of second |number |978 + +|z |time zone |text |Pacific Standard Time; PST +|Z |time zone offset/id |zone |-0800; -08:00; America/Los_Angeles + +|' |escape for text |delimiter +|'' |single quote |literal |' +|======= + +The count of pattern letters determine the format. + +Text:: If the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used if available. + +Number:: The minimum number of digits. Shorter numbers are zero-padded to this amount. + +Year:: Numeric presentation for year and weekyear fields are handled specially. For example, if the count of 'y' is 2, the year will be displayed as the zero-based year of the century, which is two digits. 
+ +Month:: 3 or over, use text, otherwise use number. + +Zone:: 'Z' outputs offset without a colon, 'ZZ' outputs the offset with a colon, 'ZZZ' or more outputs the zone id. + +Zone names:: Time zone names ('z') cannot be parsed. + +Any characters in the pattern that are not in the ranges of ['a'..'z'] and ['A'..'Z'] will be treated as quoted text. For instance, characters like ':', '.', ' ', '#' and '?' will appear in the resulting time text even if they are not enclosed in single quotes. diff --git a/docs/reference/aggregations/bucket/filter-aggregation.asciidoc b/docs/reference/aggregations/bucket/filter-aggregation.asciidoc new file mode 100644 index 0000000000..cc2e104354 --- /dev/null +++ b/docs/reference/aggregations/bucket/filter-aggregation.asciidoc @@ -0,0 +1,38 @@ +[[search-aggregations-bucket-filter-aggregation]] +=== Filter Aggregation + +Defines a single bucket of all the documents in the current document set context that match a specified filter. Often this will be used to narrow down the current aggregation context to a specific set of documents. + +Example: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "in_stock_products" : { + "filter" : { "range" : { "stock" : { "gt" : 0 } } }, + "aggs" : { + "avg_price" : { "avg" : { "field" : "price" } } + } + } + } +} +-------------------------------------------------- + +In the above example, we calculate the average price of all the products that are currently in stock. + +Response: + +[source,js] +-------------------------------------------------- +{ + ...
+ + "aggregations" : { + "in_stock_products" : { + "doc_count" : 100, + "avg_price" : { "value" : 56.3 } + } + } +} +-------------------------------------------------- \ No newline at end of file diff --git a/docs/reference/aggregations/bucket/filters-aggregation.asciidoc b/docs/reference/aggregations/bucket/filters-aggregation.asciidoc new file mode 100644 index 0000000000..2553758d77 --- /dev/null +++ b/docs/reference/aggregations/bucket/filters-aggregation.asciidoc @@ -0,0 +1,128 @@ +[[search-aggregations-bucket-filters-aggregation]] +=== Filters Aggregation + +Defines a multi-bucket aggregation where each bucket is associated with a +filter. Each bucket will collect all documents that match its associated +filter. + +Example: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "messages" : { + "filters" : { + "filters" : { + "errors" : { "term" : { "body" : "error" }}, + "warnings" : { "term" : { "body" : "warning" }} + } + }, + "aggs" : { + "monthly" : { + "histogram" : { + "field" : "timestamp", + "interval" : "1M" + } + } + } + } + } +} +-------------------------------------------------- + +In the above example, we analyze log messages. The aggregation will build two +collections (buckets) of log messages - one for all those containing an error, +and another for all those containing a warning. Each of these buckets +is then broken down by month. + +Response: + +[source,js] +-------------------------------------------------- +... + "aggregations" : { + "messages" : { + "buckets" : { + "errors" : { + "doc_count" : 34, + "monthly" : { + "buckets" : [ + ... // the histogram monthly breakdown + ] + } + }, + "warnings" : { + "doc_count" : 439, + "monthly" : { + "buckets" : [ + ... // the histogram monthly breakdown + ] + } + } + } + } + } +...
+-------------------------------------------------- + +==== Anonymous filters + +The filters field can also be provided as an array of filters, as in the +following request: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "messages" : { + "filters" : { + "filters" : [ + { "term" : { "body" : "error" }}, + { "term" : { "body" : "warning" }} + ] + }, + "aggs" : { + "monthly" : { + "histogram" : { + "field" : "timestamp", + "interval" : "1M" + } + } + } + } + } +} +-------------------------------------------------- + +The filtered buckets are returned in the same order as provided in the +request. The response for this example would be: + +[source,js] +-------------------------------------------------- +... + "aggregations" : { + "messages" : { + "buckets" : [ + { + "doc_count" : 34, + "monthly" : { + "buckets" : [ + ... // the histogram monthly breakdown + ] + } + }, + { + "doc_count" : 439, + "monthly" : { + "buckets" : [ + ... // the histogram monthly breakdown + ] + } + } + ] + } + } +... +-------------------------------------------------- \ No newline at end of file diff --git a/docs/reference/aggregations/bucket/geodistance-aggregation.asciidoc b/docs/reference/aggregations/bucket/geodistance-aggregation.asciidoc new file mode 100644 index 0000000000..2120c0bec9 --- /dev/null +++ b/docs/reference/aggregations/bucket/geodistance-aggregation.asciidoc @@ -0,0 +1,106 @@ +[[search-aggregations-bucket-geodistance-aggregation]] +=== Geo Distance Aggregation + +A multi-bucket aggregation that works on `geo_point` fields and conceptually works much like the <> aggregation. The user can define a point of origin and a set of distance range buckets. The aggregation evaluates the distance of each document value from the origin point and determines the buckets it belongs to based on the ranges (a document belongs to a bucket if the distance between the document and the origin falls within the distance range of the bucket).
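The bucketing rule described above (measure the distance from the origin, then find the range that contains it) can be sketched in Python. This is a rough model under stated assumptions, not Elasticsearch's implementation: it uses the haversine formula with a mean Earth radius of 6371 km, works in kilometres, and assumes each range includes `from` and excludes `to`, as the range aggregation does. The names `haversine_km` and `ring_for` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def ring_for(distance, ranges):
    """Return the key of the first range containing `distance`, or None.

    Assumption: `from` is inclusive and `to` is exclusive.
    """
    for r in ranges:
        lo, hi = r.get("from"), r.get("to")
        if (lo is None or distance >= lo) and (hi is None or distance < hi):
            lo_label = "*" if lo is None else str(float(lo))
            hi_label = "*" if hi is None else str(float(hi))
            return f"{lo_label}-{hi_label}"
    return None


RANGES = [{"to": 100}, {"from": 100, "to": 300}, {"from": 300}]
AMSTERDAM = (52.3760, 4.894)

if __name__ == "__main__":
    # Hypothetical document locations used only for this illustration.
    for name, (lat, lon) in {"Brussels": (50.8503, 4.3517),
                             "Paris": (48.8566, 2.3522)}.items():
        d = haversine_km(*AMSTERDAM, lat, lon)
        print(name, round(d), ring_for(d, RANGES))
```

The key strings follow the `*-100.0`, `100.0-300.0`, `300.0-*` pattern used for the bucket keys in this section's response example.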
+ +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "rings_around_amsterdam" : { + "geo_distance" : { + "field" : "location", + "origin" : "52.3760, 4.894", + "ranges" : [ + { "to" : 100 }, + { "from" : 100, "to" : 300 }, + { "from" : 300 } + ] + } + } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + "aggregations": { + "rings_around_amsterdam" : { + "buckets": [ + { + "key": "*-100.0", + "from": 0, + "to": 100.0, + "doc_count": 3 + }, + { + "key": "100.0-300.0", + "from": 100.0, + "to": 300.0, + "doc_count": 1 + }, + { + "key": "300.0-*", + "from": 300.0, + "doc_count": 7 + } + ] + } + } +} +-------------------------------------------------- + +The specified field must be of type `geo_point` (which can only be set explicitly in the mappings). It can also hold an array of `geo_point` fields, in which case all will be taken into account during aggregation. The origin point can accept all formats supported by the `geo_point` <>: + +* Object format: `{ "lat" : 52.3760, "lon" : 4.894 }` - this is the safest format as it is the most explicit about the `lat` & `lon` values +* String format: `"52.3760, 4.894"` - where the first number is the `lat` and the second is the `lon` +* Array format: `[4.894, 52.3760]` - which is based on the `GeoJson` standard and where the first number is the `lon` and the second one is the `lat` + +By default, the distance unit is `m` (meters) but it can also accept: `mi` (miles), `in` (inches), `yd` (yards), `km` (kilometers), `cm` (centimeters), `mm` (millimeters).
+ +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "rings" : { + "geo_distance" : { + "field" : "location", + "origin" : "52.3760, 4.894", + "unit" : "mi", <1> + "ranges" : [ + { "to" : 100 }, + { "from" : 100, "to" : 300 }, + { "from" : 300 } + ] + } + } + } +} +-------------------------------------------------- + +<1> The distances will be computed in miles + +There are three distance calculation modes: `sloppy_arc` (the default), `arc` (most accurate) and `plane` (fastest). The `arc` calculation is the most accurate one but also the most expensive one in terms of performance. The `sloppy_arc` is faster but less accurate. The `plane` is the fastest but least accurate distance function. Consider using `plane` when your search context is "narrow" and spans smaller geographical areas (like cities or even countries). `plane` may return higher error margins for searches across very large areas (e.g. cross continent search). The distance calculation type can be set using the `distance_type` parameter: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "rings" : { + "geo_distance" : { + "field" : "location", + "origin" : "52.3760, 4.894", + "distance_type" : "plane", + "ranges" : [ + { "to" : 100 }, + { "from" : 100, "to" : 300 }, + { "from" : 300 } + ] + } + } + } +} +-------------------------------------------------- diff --git a/docs/reference/aggregations/bucket/geohashgrid-aggregation.asciidoc b/docs/reference/aggregations/bucket/geohashgrid-aggregation.asciidoc new file mode 100644 index 0000000000..e74e3e96d1 --- /dev/null +++ b/docs/reference/aggregations/bucket/geohashgrid-aggregation.asciidoc @@ -0,0 +1,131 @@ +[[search-aggregations-bucket-geohashgrid-aggregation]] +=== GeoHash grid Aggregation + +A multi-bucket aggregation that works on `geo_point` fields and groups points into buckets that represent cells in a grid.
+The resulting grid can be sparse and only contains cells that have matching data. Each cell is labeled using a http://en.wikipedia.org/wiki/Geohash[geohash] which is of user-definable precision. + +* High precision geohashes have a long string length and represent cells that cover only a small area. +* Low precision geohashes have a short string length and represent cells that each cover a large area. + +Geohashes used in this aggregation can have a choice of precision between 1 and 12. + +WARNING: The highest-precision geohash of length 12 produces cells that cover less than a square metre of land and so high-precision requests can be very costly in terms of RAM and result sizes. +Please see the example below on how to first filter the aggregation to a smaller geographic area before requesting high-levels of detail. + +The specified field must be of type `geo_point` (which can only be set explicitly in the mappings) and it can also hold an array of `geo_point` fields, in which case all points will be taken into account during aggregation. + + +==== Simple low-precision request + +[source,js] +-------------------------------------------------- +{ + "aggregations" : { + "myLarge-GrainGeoHashGrid" : { + "geohash_grid" : { + "field" : "location", + "precision" : 3 + } + } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + "aggregations": { + "myLarge-GrainGeoHashGrid": { + "buckets": [ + { + "key": "svz", + "doc_count": 10964 + }, + { + "key": "sv8", + "doc_count": 3198 + } + ] + } + } +} +-------------------------------------------------- + + + +==== High-precision requests + +When requesting detailed buckets (typically for displaying a "zoomed in" map) a filter like <> should be applied to narrow the subject area otherwise potentially millions of buckets will be created and returned. 
+ +[source,js] +-------------------------------------------------- +{ + "aggregations" : { + "zoomedInView" : { + "filter" : { + "geo_bounding_box" : { + "location" : { + "top_left" : "51.73, 0.9", + "bottom_right" : "51.55, 1.1" + } + } + }, + "aggregations":{ + "zoom1":{ + "geohash_grid" : { + "field":"location", + "precision":8 + } + } + } + } + } + } +-------------------------------------------------- + +==== Cell dimensions at the equator +The table below shows the metric dimensions for cells covered by various string lengths of geohash. +Cell dimensions vary with latitude and so the table is for the worst-case scenario at the equator. + +[horizontal] +*GeoHash length*:: *Area width x height* +1:: 5,009.4km x 4,992.6km +2:: 1,252.3km x 624.1km +3:: 156.5km x 156km +4:: 39.1km x 19.5km +5:: 4.9km x 4.9km +6:: 1.2km x 609.4m +7:: 152.9m x 152.4m +8:: 38.2m x 19m +9:: 4.8m x 4.8m +10:: 1.2m x 59.5cm +11:: 14.9cm x 14.9cm +12:: 3.7cm x 1.9cm + + + +==== Options + +[horizontal] +field:: Mandatory. The name of the field indexed with GeoPoints. + +precision:: Optional. The string length of the geohashes used to define + cells/buckets in the results. Defaults to 5. + +size:: Optional. The maximum number of geohash buckets to return + (defaults to 10,000). When results are trimmed, buckets are + prioritised based on the volumes of documents they contain. + A value of `0` will return all buckets that + contain a hit; use with caution as this could use a lot of CPU + and network bandwidth if there are many buckets. + +shard_size:: Optional. To allow for more accurate counting of the top cells + returned in the final result the aggregation defaults to + returning `max(10,(size x number-of-shards))` buckets from each + shard. If this heuristic is undesirable, the number considered + from each shard can be overridden using this parameter. + A value of `0` makes the shard size unlimited.
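To make the link between `precision` and cell size more concrete, here is a small Python sketch of the standard public geohash encoding (interleaved longitude/latitude bisection, five bits per base-32 character). It is an illustration only; `geohash_encode` is a hypothetical name and Elasticsearch's internal implementation may differ in detail.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=5):
    """Encode a point as a geohash string of `precision` characters.

    Each character adds five bits, alternating between halving the
    longitude range and halving the latitude range, so longer hashes
    describe smaller cells.
    """
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # the first bit always refines longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


if __name__ == "__main__":
    for precision in (1, 3, 5, 8):
        print(precision, geohash_encode(52.3760, 4.894, precision))
```

Because every extra character only subdivides the previous cell, a low-precision hash is always a prefix of the higher-precision hash for the same point, which is why zooming in multiplies the number of buckets.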
+ + diff --git a/docs/reference/aggregations/bucket/global-aggregation.asciidoc b/docs/reference/aggregations/bucket/global-aggregation.asciidoc new file mode 100644 index 0000000000..fa500e1ff8 --- /dev/null +++ b/docs/reference/aggregations/bucket/global-aggregation.asciidoc @@ -0,0 +1,51 @@ +[[search-aggregations-bucket-global-aggregation]] +=== Global Aggregation + +Defines a single bucket of all the documents within the search execution context. This context is defined by the indices and the document types you're searching on, but is *not* influenced by the search query itself. + +NOTE: Global aggregators can only be placed as top level aggregators (it makes no sense to embed a global aggregator + within another bucket aggregator) + +Example: + +[source,js] +-------------------------------------------------- +{ + "query" : { + "match" : { "title" : "shirt" } + }, + "aggs" : { + "all_products" : { + "global" : {}, <1> + "aggs" : { <2> + "avg_price" : { "avg" : { "field" : "price" } } + } + } + } +} +-------------------------------------------------- + +<1> The `global` aggregation has an empty body +<2> The sub-aggregations that are registered for this `global` aggregation + +The above aggregation demonstrates how one would compute aggregations (`avg_price` in this example) on all the documents in the search context, regardless of the query (in our example, it will compute the average price over all products in our catalog, not just on the "shirts"). + +The response for the above aggregation: + +[source,js] +-------------------------------------------------- +{ + ...
+ + "aggregations" : { + "all_products" : { + "doc_count" : 100, <1> + "avg_price" : { + "value" : 56.3 + } + } + } +} +-------------------------------------------------- + +<1> The number of documents that were aggregated (in our case, all documents within the search context) diff --git a/docs/reference/aggregations/bucket/histogram-aggregation.asciidoc b/docs/reference/aggregations/bucket/histogram-aggregation.asciidoc new file mode 100644 index 0000000000..cd1fd06dda --- /dev/null +++ b/docs/reference/aggregations/bucket/histogram-aggregation.asciidoc @@ -0,0 +1,319 @@ +[[search-aggregations-bucket-histogram-aggregation]] +=== Histogram Aggregation + +A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents. +It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field +that holds a price (numeric), we can configure this aggregation to dynamically build buckets with interval `5` +(in case of price it may represent $5). When the aggregation executes, the price field of every document will be +evaluated and will be rounded down to its closest bucket - for example, if the price is `32` and the bucket size is `5` +then the rounding will yield `30` and thus the document will "fall" into the bucket that is associated with the key `30`. 
+To make this more formal, here is the rounding function that is used: + +[source,java] +-------------------------------------------------- +rem = value % interval +if (rem < 0) { + rem += interval +} +bucket_key = value - rem +-------------------------------------------------- + +The following snippet "buckets" the products based on their `price` by an interval of `50`: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "prices" : { + "histogram" : { + "field" : "price", + "interval" : 50 + } + } + } +} +-------------------------------------------------- + +And the following may be the response: + +[source,js] +-------------------------------------------------- +{ + "aggregations": { + "prices" : { + "buckets": [ + { + "key": 0, + "doc_count": 2 + }, + { + "key": 50, + "doc_count": 4 + }, + { + "key": 100, + "doc_count": 0 + }, + { + "key": 150, + "doc_count": 3 + } + ] + } + } +} +-------------------------------------------------- + +==== Minimum document count + +The response above shows that no document has a price that falls within the range of `[100 - 150)`. By default the +response will fill gaps in the histogram with empty buckets.
It is possible to change that and request buckets with +a higher minimum count thanks to the `min_doc_count` setting: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "prices" : { + "histogram" : { + "field" : "price", + "interval" : 50, + "min_doc_count" : 1 + } + } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + "aggregations": { + "prices" : { + "buckets": [ + { + "key": 0, + "doc_count": 2 + }, + { + "key": 50, + "doc_count": 4 + }, + { + "key": 150, + "doc_count": 3 + } + ] + } + } +} +-------------------------------------------------- + +[[search-aggregations-bucket-histogram-aggregation-extended-bounds]] +By default the date_/histogram returns all the buckets within the range of the data itself, that is, the documents with +the smallest values (on which the histogram is built) will determine the min bucket (the bucket with the smallest key) and the +documents with the highest values will determine the max bucket (the bucket with the highest key). Often, when +requesting empty buckets, this causes confusion, specifically when the data is also filtered. + +To understand why, let's look at an example: + +Let's say you're filtering your request to get all docs with values between `0` and `500`, and in addition you'd like +to slice the data per price using a histogram with an interval of `50`. You also specify `"min_doc_count" : 0` as you'd +like to get all buckets, even the empty ones. If it happens that all products (documents) have prices higher than `100`, +the first bucket you'll get will be the one with `100` as its key. This is confusing, as you would often also like +to get those buckets between `0 - 100`.
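The rounding, gap-filling and `min_doc_count` behaviour described so far can be modelled in a few lines of Python. This is a simplified sketch for intuition, not Elasticsearch code: `bucket_key` and `histogram` are hypothetical names, and integer values with a positive integer interval are assumed.

```python
from collections import Counter


def bucket_key(value, interval):
    """Round `value` down to the key of the bucket it falls into."""
    rem = value % interval
    if rem < 0:
        # Mirrors the rounding function shown earlier. Python's % never
        # returns a negative remainder for a positive interval, so this
        # branch only matters in languages where it can.
        rem += interval
    return value - rem


def histogram(values, interval, min_doc_count=0):
    """Build buckets between the smallest and largest key found in the data.

    Gaps are filled with empty buckets, which are then dropped again if
    they hold fewer than `min_doc_count` documents.
    """
    counts = Counter(bucket_key(v, interval) for v in values)
    if not counts:
        return []
    first, last = min(counts), max(counts)
    return [
        {"key": key, "doc_count": counts.get(key, 0)}
        for key in range(first, last + interval, interval)
        if counts.get(key, 0) >= min_doc_count
    ]


if __name__ == "__main__":
    # Hypothetical prices, all above 100: no bucket is produced below key 100.
    for bucket in histogram([120, 130, 260], 50):
        print(bucket)
```

Note how the first bucket in the sketch is decided purely by the smallest value in the data, which is exactly the situation the paragraph above describes as confusing.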
+ +With the `extended_bounds` setting, you can "force" the histogram aggregation to start building buckets on a specific +`min` value and also keep on building buckets up to a `max` value (even if there are no documents anymore). Using +`extended_bounds` only makes sense when `min_doc_count` is 0 (the empty buckets will never be returned if `min_doc_count` +is greater than 0). + +Note that (as the name suggests) `extended_bounds` is **not** filtering buckets. Meaning, if the `extended_bounds.min` is higher +than the values extracted from the documents, the documents will still dictate what the first bucket will be (and the +same goes for the `extended_bounds.max` and the last bucket). For filtering buckets, one should nest the histogram aggregation +under a range `filter` aggregation with the appropriate `from`/`to` settings. + +Example: + +[source,js] +-------------------------------------------------- +{ + "query" : { + "filtered" : { "filter": { "range" : { "price" : { "to" : "500" } } } } + }, + "aggs" : { + "prices" : { + "histogram" : { + "field" : "price", + "interval" : 50, + "extended_bounds" : { + "min" : 0, + "max" : 500 + } + } + } + } +} +-------------------------------------------------- + +==== Order + +By default the returned buckets are sorted by their `key` ascending, though the order behaviour can be controlled +using the `order` setting.
+ +Ordering the buckets by their key - descending: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "prices" : { + "histogram" : { + "field" : "price", + "interval" : 50, + "order" : { "_key" : "desc" } + } + } + } +} +-------------------------------------------------- + +Ordering the buckets by their `doc_count` - ascending: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "prices" : { + "histogram" : { + "field" : "price", + "interval" : 50, + "order" : { "_count" : "asc" } + } + } + } +} +-------------------------------------------------- + +If the histogram aggregation has a direct metrics sub-aggregation, the latter can determine the order of the buckets: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "prices" : { + "histogram" : { + "field" : "price", + "interval" : 50, + "order" : { "price_stats.min" : "asc" } <1> + }, + "aggs" : { + "price_stats" : { "stats" : {} } <2> + } + } + } +} +-------------------------------------------------- + +<1> The `{ "price_stats.min" : "asc" }` will sort the buckets based on the `min` value of their `price_stats` sub-aggregation. + +<2> There is no need to configure the `price` field for the `price_stats` aggregation as it will inherit it by default from its parent histogram aggregation. + +It is also possible to order the buckets based on a "deeper" aggregation in the hierarchy. This is supported as long +as the aggregations in the path are of a single-bucket type, where the last aggregation in the path may either be a single-bucket +one or a metrics one. If it's a single-bucket type, the order will be defined by the number of docs in the bucket (i.e. `doc_count`); +in case it's a metrics one, the same rules as above apply (where the path must indicate the metric name to sort by in case of +a multi-value metrics aggregation, and in case of a single-value metrics aggregation the sort will be applied on that value).
+ +The path must be defined in the following form: + +-------------------------------------------------- +AGG_SEPARATOR := '>' +METRIC_SEPARATOR := '.' +AGG_NAME := +METRIC := +PATH := []*[] +-------------------------------------------------- + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "prices" : { + "histogram" : { + "field" : "price", + "interval" : 50, + "order" : { "promoted_products>rating_stats.avg" : "desc" } <1> + }, + "aggs" : { + "promoted_products" : { + "filter" : { "term" : { "promoted" : true }}, + "aggs" : { + "rating_stats" : { "stats" : { "field" : "rating" }} + } + } + } + } + } +} +-------------------------------------------------- + +<1> The above will sort the buckets based on the average rating among the promoted products. + + +==== Offset + +By default the bucket keys start with 0 and then continue in evenly spaced steps of `interval`, e.g. if the interval is 10 the first buckets +(assuming there is data inside them) will be [0-9], [10-19], [20-29]. The bucket boundaries can be shifted by using the `offset` option. + +This can be best illustrated with an example. If there are 10 documents with values ranging from 5 to 14, using interval `10` will result in +two buckets with 5 documents each. If an additional offset `5` is used, there will be only one single bucket [5-14] containing all the 10 +documents. + +==== Response Format + +By default, the buckets are returned as an ordered array.
It is also possible to request the response as a hash +keyed by the bucket keys instead: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "prices" : { + "histogram" : { + "field" : "price", + "interval" : 50, + "keyed" : true + } + } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + "aggregations": { + "prices": { + "buckets": { + "0": { + "key": 0, + "doc_count": 2 + }, + "50": { + "key": 50, + "doc_count": 4 + }, + "150": { + "key": 150, + "doc_count": 3 + } + } + } + } +} +-------------------------------------------------- diff --git a/docs/reference/aggregations/bucket/iprange-aggregation.asciidoc b/docs/reference/aggregations/bucket/iprange-aggregation.asciidoc new file mode 100644 index 0000000000..6d06743644 --- /dev/null +++ b/docs/reference/aggregations/bucket/iprange-aggregation.asciidoc @@ -0,0 +1,98 @@ +[[search-aggregations-bucket-iprange-aggregation]] +=== IPv4 Range Aggregation + +Just like the dedicated <> range aggregation, there is also a dedicated range aggregation for IPv4 typed fields: + +Example: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "ip_ranges" : { + "ip_range" : { + "field" : "ip", + "ranges" : [ + { "to" : "10.0.0.5" }, + { "from" : "10.0.0.5" } + ] + } + } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + ...
+ + "aggregations": { + "ip_ranges": { + "buckets" : [ + { + "to": 167772165, + "to_as_string": "10.0.0.5", + "doc_count": 4 + }, + { + "from": 167772165, + "from_as_string": "10.0.0.5", + "doc_count": 6 + } + ] + } + } +} +-------------------------------------------------- + +IP ranges can also be defined as CIDR masks: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "ip_ranges" : { + "ip_range" : { + "field" : "ip", + "ranges" : [ + { "mask" : "10.0.0.0/25" }, + { "mask" : "10.0.0.127/25" } + ] + } + } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + "aggregations": { + "ip_ranges": { + "buckets": [ + { + "key": "10.0.0.0/25", + "from": 1.6777216E+8, + "from_as_string": "10.0.0.0", + "to": 167772287, + "to_as_string": "10.0.0.127", + "doc_count": 127 + }, + { + "key": "10.0.0.127/25", + "from": 1.6777216E+8, + "from_as_string": "10.0.0.0", + "to": 167772287, + "to_as_string": "10.0.0.127", + "doc_count": 127 + } + ] + } + } +} +-------------------------------------------------- \ No newline at end of file diff --git a/docs/reference/aggregations/bucket/missing-aggregation.asciidoc b/docs/reference/aggregations/bucket/missing-aggregation.asciidoc new file mode 100644 index 0000000000..f0b8fb4ac3 --- /dev/null +++ b/docs/reference/aggregations/bucket/missing-aggregation.asciidoc @@ -0,0 +1,34 @@ +[[search-aggregations-bucket-missing-aggregation]] +=== Missing Aggregation + +A field data based single bucket aggregation, that creates a bucket of all documents in the current document set context that are missing a field value (effectively, missing a field or having the configured NULL value set). This aggregator will often be used in conjunction with other field data bucket aggregators (such as ranges) to return information for all the documents that could not be placed in any of the other buckets due to missing field data values. 
+ +Example: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "products_without_a_price" : { + "missing" : { "field" : "price" } + } + } +} +-------------------------------------------------- + +In the above example, we get the total number of products that do not have a price. + +Response: + +[source,js] +-------------------------------------------------- +{ + ... + + "aggregations" : { + "products_without_a_price" : { + "doc_count" : 10 + } + } +} +-------------------------------------------------- diff --git a/docs/reference/aggregations/bucket/nested-aggregation.asciidoc b/docs/reference/aggregations/bucket/nested-aggregation.asciidoc new file mode 100644 index 0000000000..f5872bdc5d --- /dev/null +++ b/docs/reference/aggregations/bucket/nested-aggregation.asciidoc @@ -0,0 +1,67 @@ +[[search-aggregations-bucket-nested-aggregation]] +=== Nested Aggregation + +A special single bucket aggregation that enables aggregating nested documents. + +For example, let's say we have an index of products, and each product holds the list of resellers - each having its own +price for the product. The mapping could look like: + +[source,js] +-------------------------------------------------- +{ + ... + + "product" : { + "properties" : { + "resellers" : { <1> + "type" : "nested", + "properties" : { + "name" : { "type" : "string" }, + "price" : { "type" : "double" } + } + } + } + } +} +-------------------------------------------------- + +<1> The `resellers` is an array that holds nested documents under the `product` object.
+ +The following aggregation will return the minimum price at which products can be purchased: + +[source,js] +-------------------------------------------------- +{ + "query" : { + "match" : { "name" : "led tv" } + }, + "aggs" : { + "resellers" : { + "nested" : { + "path" : "resellers" + }, + "aggs" : { + "min_price" : { "min" : { "field" : "resellers.price" } } + } + } + } +} +-------------------------------------------------- + +As you can see above, the nested aggregation requires the `path` of the nested documents within the top level documents. +Then one can define any type of aggregation over these nested documents. + +Response: + +[source,js] +-------------------------------------------------- +{ + "aggregations": { + "resellers": { + "min_price": { + "value" : 350 + } + } + } +} +-------------------------------------------------- diff --git a/docs/reference/aggregations/bucket/range-aggregation.asciidoc b/docs/reference/aggregations/bucket/range-aggregation.asciidoc new file mode 100644 index 0000000000..f7bfcab064 --- /dev/null +++ b/docs/reference/aggregations/bucket/range-aggregation.asciidoc @@ -0,0 +1,277 @@ +[[search-aggregations-bucket-range-aggregation]] +=== Range Aggregation + +A multi-bucket value source based aggregation that enables the user to define a set of ranges - each representing a bucket. During the aggregation process, the values extracted from each document will be checked against each bucket range and "bucket" the relevant/matching document. +Note that this aggregation includes the `from` value and excludes the `to` value for each range. + +Example: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "price_ranges" : { + "range" : { + "field" : "price", + "ranges" : [ + { "to" : 50 }, + { "from" : 50, "to" : 100 }, + { "from" : 100 } + ] + } + } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + ...
+ + "aggregations": { + "price_ranges" : { + "buckets": [ + { + "to": 50, + "doc_count": 2 + }, + { + "from": 50, + "to": 100, + "doc_count": 4 + }, + { + "from": 100, + "doc_count": 4 + } + ] + } + } +} +-------------------------------------------------- + +==== Keyed Response + +Setting the `keyed` flag to `true` will associate a unique string key with each bucket and return the ranges as a hash rather than an array: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "price_ranges" : { + "range" : { + "field" : "price", + "keyed" : true, + "ranges" : [ + { "to" : 50 }, + { "from" : 50, "to" : 100 }, + { "from" : 100 } + ] + } + } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + ... + + "aggregations": { + "price_ranges" : { + "buckets": { + "*-50.0": { + "to": 50, + "doc_count": 2 + }, + "50.0-100.0": { + "from": 50, + "to": 100, + "doc_count": 4 + }, + "100.0-*": { + "from": 100, + "doc_count": 4 + } + } + } + } +} +-------------------------------------------------- + +It is also possible to customize the key for each range: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "price_ranges" : { + "range" : { + "field" : "price", + "keyed" : true, + "ranges" : [ + { "key" : "cheap", "to" : 50 }, + { "key" : "average", "from" : 50, "to" : 100 }, + { "key" : "expensive", "from" : 100 } + ] + } + } + } +} +-------------------------------------------------- + +==== Script + +TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. 
+ +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "price_ranges" : { + "range" : { + "script" : "doc['price'].value", + "ranges" : [ + { "to" : 50 }, + { "from" : 50, "to" : 100 }, + { "from" : 100 } + ] + } + } + } +} +-------------------------------------------------- + +==== Value Script + +Lets say the product prices are in USD but we would like to get the price ranges in EURO. We can use value script to convert the prices prior the aggregation (assuming conversion rate of 0.8) + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "price_ranges" : { + "range" : { + "field" : "price", + "script" : "_value * conversion_rate", + "params" : { + "conversion_rate" : 0.8 + }, + "ranges" : [ + { "to" : 35 }, + { "from" : 35, "to" : 70 }, + { "from" : 70 } + ] + } + } + } +} +-------------------------------------------------- + +==== Sub Aggregations + +The following example, not only "bucket" the documents to the different buckets but also computes statistics over the prices in each price range + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "price_ranges" : { + "range" : { + "field" : "price", + "ranges" : [ + { "to" : 50 }, + { "from" : 50, "to" : 100 }, + { "from" : 100 } + ] + }, + "aggs" : { + "price_stats" : { + "stats" : { "field" : "price" } + } + } + } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + "aggregations": { + "price_ranges" : { + "buckets": [ + { + "to": 50, + "doc_count": 2, + "price_stats": { + "count": 2, + "min": 20, + "max": 47, + "avg": 33.5, + "sum": 67 + } + }, + { + "from": 50, + "to": 100, + "doc_count": 4, + "price_stats": { + "count": 4, + "min": 60, + "max": 98, + "avg": 82.5, + "sum": 330 + } + }, + { + "from": 100, + "doc_count": 4, + "price_stats": { + "count": 4, + "min": 134, + "max": 367, + "avg": 216, + "sum": 864 + } + } + ] + } + } 
+} +-------------------------------------------------- + +If a sub aggregation is also based on the same value source as the range aggregation (like the `stats` aggregation in the example above) it is possible to leave out the value source definition for it. The following will return the same response as above: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "price_ranges" : { + "range" : { + "field" : "price", + "ranges" : [ + { "to" : 50 }, + { "from" : 50, "to" : 100 }, + { "from" : 100 } + ] + }, + "aggs" : { + "price_stats" : { + "stats" : {} <1> + } + } + } + } +} +-------------------------------------------------- + +<1> We don't need to specify the `price` as we "inherit" it by default from the parent `range` aggregation diff --git a/docs/reference/aggregations/bucket/reverse-nested-aggregation.asciidoc b/docs/reference/aggregations/bucket/reverse-nested-aggregation.asciidoc new file mode 100644 index 0000000000..a25fc83733 --- /dev/null +++ b/docs/reference/aggregations/bucket/reverse-nested-aggregation.asciidoc @@ -0,0 +1,118 @@ +[[search-aggregations-bucket-reverse-nested-aggregation]] +=== Reverse nested Aggregation + +A special single bucket aggregation that enables aggregating on parent docs from nested documents. Effectively this +aggregation can break out of the nested block structure and link to other nested structures or the root document, +which allows nesting other aggregations that aren't part of the nested object in a nested aggregation. + +The `reverse_nested` aggregation must be defined inside a `nested` aggregation. + +.Options: +* `path` - Which defines to what nested object field should be joined back. The default is empty, +which means that it joins back to the root / main document level. The path cannot contain a reference to +a nested object field that falls outside the `nested` aggregation's nested structure a `reverse_nested` is in. 
+ +For example, let's say we have an index for a ticket system with issues and comments. The comments are inlined into +the issue documents as nested documents. The mapping could look like: + +[source,js] +-------------------------------------------------- +{ + ... + + "issue" : { + "properties" : { + "tags" : { "type" : "string" }, + "comments" : { <1> + "type" : "nested", + "properties" : { + "username" : { "type" : "string", "index" : "not_analyzed" }, + "comment" : { "type" : "string" } + } + } + } + } +} +-------------------------------------------------- + +<1> `comments` is an array that holds nested documents under the `issue` object. + +The following aggregations will return the usernames of the top commenters and, for each top commenter, the top +tags of the issues the user has commented on: + +[source,js] +-------------------------------------------------- +{ + "query": { + "match": { + "name": "led tv" + } + }, + "aggs": { + "comments": { + "nested": { + "path": "comments" + }, + "aggs": { + "top_usernames": { + "terms": { + "field": "comments.username" + }, + "aggs": { + "comment_to_issue": { + "reverse_nested": {}, <1> + "aggs": { + "top_tags_per_comment": { + "terms": { + "field": "tags" + } + } + } + } + } + } + } + } + } +} +-------------------------------------------------- + +As you can see above, the `reverse_nested` aggregation is placed inside a `nested` aggregation, as this is the only place +in the DSL where the `reverse_nested` aggregation can be used. Its sole purpose is to join back to a parent doc higher +up in the nested structure. + +<1> A `reverse_nested` aggregation that joins back to the root / main document level, because no `path` has been defined. 
+Via the `path` option the `reverse_nested` aggregation can join back to a different level, if multiple layered nested +object types have been defined in the mapping + +Possible response snippet: + +[source,js] +-------------------------------------------------- +{ + "aggregations": { + "comments": { + "top_usernames": { + "buckets": [ + { + "key": "username_1", + "doc_count": 12, + "comment_to_issue": { + "top_tags_per_comment": { + "buckets": [ + { + "key": "tag1", + "doc_count": 9 + }, + ... + ] + } + } + }, + ... + ] + } + } + } +} +-------------------------------------------------- diff --git a/docs/reference/aggregations/bucket/sampler-aggregation.asciidoc b/docs/reference/aggregations/bucket/sampler-aggregation.asciidoc new file mode 100644 index 0000000000..5ad9dbc019 --- /dev/null +++ b/docs/reference/aggregations/bucket/sampler-aggregation.asciidoc @@ -0,0 +1,154 @@ +[[search-aggregations-bucket-sampler-aggregation]] +=== Sampler Aggregation + +experimental[] + +A filtering aggregation used to limit any sub aggregations' processing to a sample of the top-scoring documents. +Optionally, diversity settings can be used to limit the number of matches that share a common value such as an "author". + +.Example use cases: +* Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches +* Removing bias from analytics by ensuring fair representation of content from different sources +* Reducing the running cost of aggregations that can produce useful results using only samples e.g. 
`significant_terms` + + +Example: + +[source,js] +-------------------------------------------------- +{ + "query": { + "match": { + "text": "iphone" + } + }, + "aggs": { + "sample": { + "sampler": { + "shard_size": 200, + "field" : "user.id" + }, + "aggs": { + "keywords": { + "significant_terms": { + "field": "text" + } + } + } + } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + ... + "aggregations": { + "sample": { + "doc_count": 1000,<1> + "keywords": {<2> + "doc_count": 1000, + "buckets": [ + ... + { + "key": "bend", + "doc_count": 58, + "score": 37.982536582524276, + "bg_count": 103 + }, + .... +} +-------------------------------------------------- + +<1> 1000 documents were sampled in total because we asked for a maximum of 200 from an index with 5 shards. The cost of performing the nested significant_terms aggregation was therefore limited rather than unbounded. +<2> The results of the significant_terms aggregation are not skewed by any single over-active Twitter user because we asked for a maximum of one tweet from any one user in our sample. + + +==== shard_size + +The `shard_size` parameter limits how many top-scoring documents are collected in the sample processed on each shard. +The default value is 100. + +=== Controlling diversity +Optionally, you can use the `field` or `script` and `max_docs_per_value` settings to control the maximum number of documents collected on any one shard which share a common value. +The choice of value (e.g. `author`) is loaded from a regular `field` or derived dynamically by a `script`. + +The aggregation will throw an error if the choice of field or script produces multiple values for a document. +It is currently not possible to offer this form of de-duplication using many values, primarily due to concerns over efficiency. 
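The interaction between `shard_size` and `max_docs_per_value` described above can be sketched in a few lines of Python. This is an illustration only, not Elasticsearch's implementation: the function name `diversified_sample` and the `(score, value)` input shape are assumptions made for the example, and the defaults (100 and 1) are the ones this page states.

```python
from collections import Counter

def diversified_sample(hits, shard_size=100, max_docs_per_value=1):
    """Hypothetical per-shard sampler.

    hits: iterable of (score, value) pairs, where value is the single
    diversity value (e.g. an author) taken from a field or a script.
    """
    sample = []
    seen = Counter()  # how many sampled docs already share each value
    # Walk the shard's hits from the highest relevance score downwards.
    for score, value in sorted(hits, key=lambda h: h[0], reverse=True):
        if len(sample) >= shard_size:
            break  # the shard-local sample is full
        if seen[value] >= max_docs_per_value:
            continue  # this value is already represented often enough
        seen[value] += 1
        sample.append((score, value))
    return sample

hits = [(9.1, "alice"), (8.7, "alice"), (8.2, "bob"), (7.9, "alice"), (7.5, "carol")]
one_each = diversified_sample(hits, shard_size=3, max_docs_per_value=1)
two_each = diversified_sample(hits, shard_size=3, max_docs_per_value=2)
print(one_each)
print(two_each)
```

With one document per value the over-active author contributes a single hit and lower-scoring authors fill the sample; raising the limit to two lets the same author take two of the three slots. Because the limit is enforced per shard, the same value can still appear once per shard in the combined result.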
+ +NOTE: Any good market researcher will tell you that when working with samples of data it is important +that the sample represents a healthy variety of opinions rather than being skewed by any single voice. +The same is true with aggregations and sampling with these diversify settings can offer a way to remove the bias in your content (an over-populated geography, a large spike in a timeline or an over-active forum spammer). + +==== Field + +Controlling diversity using a field: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "sample" : { + "sampler" : { + "field" : "author", + "max_docs_per_value" : 3 + } + } + } +} +-------------------------------------------------- + +Note that the `max_docs_per_value` setting applies on a per-shard basis only for the purposes of shard-local sampling. +It is not intended as a way of providing a global de-duplication feature on search results. + + + +==== Script + +Controlling diversity using a script: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "sample" : { + "sampler" : { + "script" : "doc['author'].value + '/' + doc['genre'].value" + } + } + } +} +-------------------------------------------------- +Note in the above example we chose to use the default `max_docs_per_value` setting of 1 and combine author and genre fields to ensure +each shard sample has, at most, one match for an author/genre pair. + + +==== execution_hint + +When using the settings to control diversity, the optional `execution_hint` setting can influence the management of the values used for de-duplication. 
+Each option will hold up to `shard_size` values in memory while performing de-duplication but the type of value held can be controlled as follows: + + - hold field values directly (`map`) + - hold ordinals of the field as determined by the Lucene index (`global_ordinals`) + - hold hashes of the field values - with potential for hash collisions (`bytes_hash`) + +The default setting is to use `global_ordinals` if this information is available from the Lucene index and reverting to `map` if not. +The `bytes_hash` setting may prove faster in some cases but introduces the possibility of false positives in de-duplication logic due to the possibility of hash collisions. +Please note that Elasticsearch will ignore the choice of execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints. + +=== Limitations + +==== Cannot be nested under `breadth_first` aggregations +Being a quality-based filter the sampler aggregation needs access to the relevance score produced for each document. +It therefore cannot be nested under a `terms` aggregation which has the `collect_mode` switched from the default `depth_first` mode to `breadth_first` as this discards scores. +In this situation an error will be thrown. + +==== Limited de-dup logic. +The de-duplication logic in the diversify settings applies only at a shard level so will not apply across shards. + +==== No specialized syntax for geo/date fields +Currently the syntax for defining the diversifying values is defined by a choice of `field` or `script` - there is no added syntactical sugar for expressing geo or date units such as "1w" (1 week). +This support may be added in a later release and users will currently have to create these sorts of values using a script. 
\ No newline at end of file diff --git a/docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc b/docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc new file mode 100644 index 0000000000..1e329db1df --- /dev/null +++ b/docs/reference/aggregations/bucket/significantterms-aggregation.asciidoc @@ -0,0 +1,524 @@ +[[search-aggregations-bucket-significantterms-aggregation]] +=== Significant Terms Aggregation + +An aggregation that returns interesting or unusual occurrences of terms in a set. + +experimental[The `significant_terms` aggregation can be very heavy when run on large indices. Work is in progress to provide more lightweight sampling techniques. As a result, the API for this feature may change in non-backwards compatible ways] + +.Example use cases: +* Suggesting "H5N1" when users search for "bird flu" in text +* Identifying the merchant that is the "common point of compromise" from the transaction history of credit card owners reporting loss +* Suggesting keywords relating to stock symbol $ATI for an automated news classifier +* Spotting the fraudulent doctor who is diagnosing more than his fair share of whiplash injuries +* Spotting the tire manufacturer who has a disproportionate number of blow-outs + +In all these cases the terms being selected are not simply the most popular terms in a set. +They are the terms that have undergone a significant change in popularity measured between a _foreground_ and _background_ set. +If the term "H5N1" only exists in 5 documents in a 10 million document index and yet is found in 4 of the 100 documents that make up a user's search results +that is significant and probably very relevant to their search. 5/10,000,000 vs 4/100 is a big swing in frequency. 
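The frequency comparison in the paragraph above can be worked through numerically. The Python sketch below is illustrative only: `jlh_score` is a hypothetical helper that multiplies the absolute and relative change in popularity, as the "JLH score" section of this page describes in words. It is not taken from the Elasticsearch source, and returning zero for terms that are no more popular in the foreground is an assumption of this example.

```python
def jlh_score(subset_freq, subset_size, superset_freq, superset_size):
    """Hypothetical sketch: absolute change times relative change.

    Assumes superset_freq > 0, i.e. the term occurs in the background.
    """
    fg = subset_freq / subset_size      # foreground frequency
    bg = superset_freq / superset_size  # background frequency
    if fg <= bg:
        return 0.0  # assumption: no gain in popularity, no significance
    return (fg - bg) * (fg / bg)

# "H5N1": 4 of 100 search results vs 5 of 10,000,000 indexed documents.
fg_pct = 4 / 100
bg_pct = 5 / 10_000_000
swing = fg_pct / bg_pct
rare_but_focused = jlh_score(4, 100, 5, 10_000_000)

# A term found in half of the results and half of the index does not stand out.
common_everywhere = jlh_score(50, 100, 5_000_000, 10_000_000)

print(swing, rare_but_focused, common_everywhere)
```

The term is roughly 80,000 times more frequent in the search results than in the index as a whole, which is the "big swing" the text refers to, whereas a term that is equally common in both sets scores zero.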
+ +==== Single-set analysis + +In the simplest case, the _foreground_ set of interest is the search results matched by a query and the _background_ +set used for statistical comparisons is the index or indices from which the results were gathered. + +Example: + +[source,js] +-------------------------------------------------- +{ + "query" : { + "terms" : {"force" : [ "British Transport Police" ]} + }, + "aggregations" : { + "significantCrimeTypes" : { + "significant_terms" : { "field" : "crime_type" } + } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + ... + + "aggregations" : { + "significantCrimeTypes" : { + "doc_count": 47347, + "buckets" : [ + { + "key": "Bicycle theft", + "doc_count": 3640, + "score": 0.371235374214817, + "bg_count": 66799 + } + ... + ] + } + } +} +-------------------------------------------------- + +When querying an index of all crimes from all police forces, what these results show is that the British Transport Police force +stands out as a force dealing with a disproportionately large number of bicycle thefts. Ordinarily, bicycle thefts represent only 1% of crimes (66799/5064554) +but for the British Transport Police, who handle crime on railways and stations, 7% of crimes (3640/47347) are +bike thefts. This is a significant seven-fold increase in frequency and so this anomaly was highlighted as the top crime type. + +The problem with using a query to spot anomalies is it only gives us one subset to use for comparisons. +To discover all the other police forces' anomalies we would have to repeat the query for each of the different forces. + +This can be a tedious way to look for unusual patterns in an index. + + + +==== Multi-set analysis +A simpler way to perform analysis across multiple categories is to use a parent-level aggregation to segment the data ready for analysis. 
+ + +Example using a parent aggregation for segmentation: + +[source,js] +-------------------------------------------------- +{ + "aggregations": { + "forces": { + "terms": {"field": "force"}, + "aggregations": { + "significantCrimeTypes": { + "significant_terms": {"field": "crime_type"} + } + } + } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + ... + + "aggregations": { + "forces": { + "buckets": [ + { + "key": "Metropolitan Police Service", + "doc_count": 894038, + "significantCrimeTypes": { + "doc_count": 894038, + "buckets": [ + { + "key": "Robbery", + "doc_count": 27617, + "score": 0.0599, + "bg_count": 53182 + }, + ... + ] + } + }, + { + "key": "British Transport Police", + "doc_count": 47347, + "significantCrimeTypes": { + "doc_count": 47347, + "buckets": [ + { + "key": "Bicycle theft", + "doc_count": 3640, + "score": 0.371, + "bg_count": 66799 + }, + ... + ] + } + } + ] + } +} + +-------------------------------------------------- + +Now we have anomaly detection for each of the police forces using a single request. + +We can use other forms of top-level aggregations to segment our data, for example segmenting by geographic +area to identify unusual hot-spots of a particular crime type: + +[source,js] +-------------------------------------------------- +{ + "aggs": { + "hotspots": { + "geohash_grid" : { + "field":"location", + "precision":5, + }, + "aggs": { + "significantCrimeTypes": { + "significant_terms": {"field": "crime_type"} + } + } + } + } +} +-------------------------------------------------- + +This example uses the `geohash_grid` aggregation to create result buckets that represent geographic areas, and inside each +bucket we can identify anomalous levels of a crime type in these tightly-focused areas e.g. 
+ +* Airports exhibit unusual numbers of weapon confiscations +* Universities show uplifts of bicycle thefts + +At a higher geohash_grid zoom-level with larger coverage areas we would start to see where an entire police-force may be +tackling an unusual volume of a particular crime type. + + +Obviously a time-based top-level segmentation would help identify current trends for each point in time +where a simple `terms` aggregation would typically show the very popular "constants" that persist across all time slots. + + + +.How are the scores calculated? +********************************** +The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than something easily understood by end users. The scores are derived from the doc frequencies in _foreground_ and _background_ sets. In brief, a term is considered significant if there is a noticeable difference in the frequency in which a term appears in the subset and in the background. The way the terms are ranked can be configured, see "Parameters" section. + +********************************** + + +==== Use on free-text fields + +The significant_terms aggregation can be used effectively on tokenized free-text fields to suggest: + +* keywords for refining end-user searches +* keywords for use in percolator queries + +WARNING: Picking a free-text field as the subject of a significant terms analysis can be expensive! It will attempt +to load every unique word into RAM. It is recommended to only use this on smaller indices. + +.Use the _"like this but not this"_ pattern +********************************** +You can spot mis-categorized content by first searching a structured field e.g. `category:adultMovie` and use significant_terms on the +free-text "movie_description" field. Take the suggested words (I'll leave them to your imagination) and then search for all movies NOT marked as category:adultMovie but containing these keywords. 
+You now have a ranked list of badly-categorized movies that you should reclassify or at least remove from the "familyFriendly" category. + +The significance score from each term can also provide a useful `boost` setting to sort matches. +Using the `minimum_should_match` setting of the `terms` query with the keywords will help control the balance of precision/recall in the result set i.e +a high setting would have a small number of relevant results packed full of keywords and a setting of "1" would produce a more exhaustive results set with all documents containing _any_ keyword. + +********************************** + +[TIP] +============ +.Show significant_terms in context + +Free-text significant_terms are much more easily understood when viewed in context. Take the results of `significant_terms` suggestions from a +free-text field and use them in a `terms` query on the same field with a `highlight` clause to present users with example snippets of documents. When the terms +are presented unstemmed, highlighted, with the right case, in the right order and with some context, their significance/meaning is more readily apparent. +============ + +==== Custom background sets + +Ordinarily, the foreground set of documents is "diffed" against a background set of all the documents in your index. +However, sometimes it may prove useful to use a narrower background set as the basis for comparisons. +For example, a query on documents relating to "Madrid" in an index with content from all over the world might reveal that "Spanish" +was a significant term. This may be true but if you want some more focused terms you could use a `background_filter` +on the term 'spain' to establish a narrower set of documents as context. With this as a background "Spanish" would now +be seen as commonplace and therefore not as significant as words like "capital" that relate more strongly with Madrid. 
+Note that using a background filter will slow things down - each term's background frequency must now be derived on-the-fly from filtering posting lists rather than reading the index's pre-computed count for a term. + +==== Limitations + +===== Significant terms must be indexed values +Unlike the terms aggregation it is currently not possible to use script-generated terms for counting purposes. +Because of the way the significant_terms aggregation must consider both _foreground_ and _background_ frequencies +it would be prohibitively expensive to use a script on the entire index to obtain background frequencies for comparisons. +Also DocValues are not supported as sources of term data for similar reasons. + +===== No analysis of floating point fields +Floating point fields are currently not supported as the subject of significant_terms analysis. +While integer or long fields can be used to represent concepts like bank account numbers or category numbers which +can be interesting to track, floating point fields are usually used to represent quantities of something. +As such, individual floating point terms are not useful for this form of frequency analysis. + +===== Use as a parent aggregation +If there is the equivalent of a `match_all` query or no query criteria providing a subset of the index the significant_terms aggregation should not be used as the +top-most aggregation - in this scenario the _foreground_ set is exactly the same as the _background_ set and +so there is no difference in document frequencies to observe and from which to make sensible suggestions. + +Another consideration is that the significant_terms aggregation produces many candidate results at shard level +that are only later pruned on the reducing node once all statistics from all shards are merged. As a result, +it can be inefficient and costly in terms of RAM to embed large child aggregations under a significant_terms +aggregation that later discards many candidate terms. 
It is advisable in these cases to perform two searches - the first to provide a rationalized list of +significant_terms and then add this shortlist of terms to a second query to go back and fetch the required child aggregations. + +===== Approximate counts +The counts of how many documents contain a term provided in results are based on summing the samples returned from each shard and +as such may be: + +* low if certain shards did not provide figures for a given term in their top sample +* high when considering the background frequency as it may count occurrences found in deleted documents + +Like most design decisions, this is the basis of a trade-off in which we have chosen to provide fast performance at the cost of some (typically small) inaccuracies. +However, the `size` and `shard size` settings covered in the next section provide tools to help control the accuracy levels. + +==== Parameters + +===== JLH score + +The scores are derived from the doc frequencies in _foreground_ and _background_ sets. The _absolute_ change in popularity (foregroundPercent - backgroundPercent) would favor common terms whereas the _relative_ change in popularity (foregroundPercent/ backgroundPercent) would favor rare terms. Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall. + +===== mutual information +Mutual information as described in "Information Retrieval", Manning et al., Chapter 13.5.1 can be used as significance score by adding the parameter + +[source,js] +-------------------------------------------------- + + "mutual_information": { + "include_negatives": true + } +-------------------------------------------------- + +Mutual information does not differentiate between terms that are descriptive for the subset or for documents outside the subset. 
The significant terms therefore can contain terms that appear more or less frequent in the subset than outside the subset. To filter out the terms that appear less often in the subset than in documents outside the subset, `include_negatives` can be set to `false`. + +Per default, the assumption is that the documents in the bucket are also contained in the background. If instead you defined a custom background filter that represents a different set of documents that you want to compare to, set + +[source,js] +-------------------------------------------------- + +"background_is_superset": false +-------------------------------------------------- + + +===== Chi square +Chi square as described in "Information Retrieval", Manning et al., Chapter 13.5.2 can be used as significance score by adding the parameter + +[source,js] +-------------------------------------------------- + + "chi_square": { + } +-------------------------------------------------- + +Chi square behaves like mutual information and can be configured with the same parameters `include_negatives` and `background_is_superset`. + + +===== google normalized distance +Google normalized distance as described in "The Google Similarity Distance", Cilibrasi and Vitanyi, 2007 (http://arxiv.org/pdf/cs/0412098v3.pdf) can be used as significance score by adding the parameter + +[source,js] +-------------------------------------------------- + + "gnd": { + } +-------------------------------------------------- + +`gnd` also accepts the `background_is_superset` parameter. + + +===== Percentage +A simple calculation of the number of documents in the foreground sample with a term divided by the number of documents in the background with the term. +By default this produces a score greater than zero and less than one. + +The benefit of this heuristic is that the scoring logic is simple to explain to anyone familiar with a "per capita" statistic. 
However, for fields with high cardinality there is a tendency for this heuristic to select the rarest terms such as typos that occur only once because they score 1/1 = 100%. + +It would be hard for a seasoned boxer to win a championship if the prize was awarded purely on the basis of percentage of fights won - by these rules a newcomer with only one fight under his belt would be impossible to beat. +Multiple observations are typically required to reinforce a view so it is recommended in these cases to set both `min_doc_count` and `shard_min_doc_count` to a higher value such as 10 in order to filter out the low-frequency terms that otherwise take precedence. + +[source,js] +-------------------------------------------------- + + "percentage": { + } +-------------------------------------------------- + + +===== Which one is best? + + +Roughly, `mutual_information` prefers high frequent terms even if they occur also frequently in the background. For example, in an analysis of natural language text this might lead to selection of stop words. `mutual_information` is unlikely to select very rare terms like misspellings. `gnd` prefers terms with a high co-occurrence and avoids selection of stopwords. It might be better suited for synonym detection. However, `gnd` has a tendency to select very rare terms that are, for example, a result of misspelling. `chi_square` and `jlh` are somewhat in-between. + +It is hard to say which one of the different heuristics will be the best choice as it depends on what the significant terms are used for (see for example [Yang and Pedersen, "A Comparative Study on Feature Selection in Text Categorization", 1997](http://courses.ischool.berkeley.edu/i256/f06/papers/yang97comparative.pdf) for a study on using significant terms for feature selection for text classification). 
+ +If none of the above measures suits your use case then another option is to implement a custom significance measure: + +===== scripted +Customized scores can be implemented via a script: + +[source,js] +-------------------------------------------------- + + "script_heuristic": { + "script": "_subset_freq/(_superset_freq - _subset_freq + 1)" + } +-------------------------------------------------- + +Scripts can be inline (as in the above example), indexed or stored on disk. For details on the options, see <>. +Parameters need to be set as follows: + +[horizontal] +`script`:: Inline script, name of script file or name of indexed script. Mandatory. +`script_type`:: One of "inline" (default), "indexed" or "file". +`lang`:: Script language (default "groovy") +`params`:: Script parameters (default empty). + +Available parameters in the script are + +[horizontal] +`_subset_freq`:: Number of documents in the subset that contain the term. +`_superset_freq`:: Number of documents in the superset that contain the term. +`_subset_size`:: Number of documents in the subset. +`_superset_size`:: Number of documents in the superset. + +===== Size & Shard Size + +The `size` parameter can be set to define how many term buckets should be returned out of the overall terms list. By +default, the node coordinating the search process will request each shard to provide its own top term buckets +and once all shards respond, it will reduce the results to the final list that will then be returned to the client. +If the number of unique terms is greater than `size`, the returned list can be slightly off and not accurate +(it could be that the term counts are slightly off and it could even be that a term that should have been in the top +size buckets was not returned). + +If set to `0`, the `size` will be set to `Integer.MAX_VALUE`. + +To ensure better accuracy a multiple of the final `size` is used as the number of terms to request from each shard +using a heuristic based on the number of shards. 
To take manual control of this setting the `shard_size` parameter +can be used to control the volumes of candidate terms produced by each shard. + +Low-frequency terms can turn out to be the most interesting ones once all results are combined so the +significant_terms aggregation can produce higher-quality results when the `shard_size` parameter is set to +values significantly higher than the `size` setting. This ensures that a bigger volume of promising candidate terms is given +a consolidated review by the reducing node before the final selection. Obviously large candidate term lists +will cause extra network traffic and RAM usage so this is a quality/cost trade-off that needs to be balanced. If `shard_size` is set to -1 (the default) then `shard_size` will be automatically estimated based on the number of shards and the `size` parameter. + + +If set to `0`, the `shard_size` will be set to `Integer.MAX_VALUE`. + + +NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, elasticsearch will + override it and reset it to be equal to `size`. + +===== Minimum document count + +It is possible to only return terms that match more than a configured number of hits using the `min_doc_count` option: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "tags" : { + "significant_terms" : { + "field" : "tag", + "min_doc_count": 10 + } + } + } +} +-------------------------------------------------- + +The above aggregation would only return tags which have been found in 10 hits or more. Default value is `3`. + + + + +Terms that score highly will be collected on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global term frequencies available. The decision whether a term is added to a candidate list depends only on the score computed on the shard using local shard frequencies, not the global frequencies of the word.
The `min_doc_count` criterion is only applied after merging local terms statistics of all shards. In a way the decision to add the term as a candidate is made without being very _certain_ about if the term will actually reach the required `min_doc_count`. This might cause many (globally) high frequent terms to be missing in the final result if low frequent but high scoring terms populated the candidate lists. To avoid this, the `shard_size` parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic. + +`shard_min_doc_count` parameter + +The parameter `shard_min_doc_count` regulates the _certainty_ a shard has if the term should actually be added to the candidate list or not with respect to the `min_doc_count`. Terms will only be considered if their local shard frequency within the set is higher than the `shard_min_doc_count`. If your dictionary contains many low frequent words and you are not interested in these (for example misspellings), then you can set the `shard_min_doc_count` parameter to filter out candidate terms on a shard level that will with a reasonable certainty not reach the required `min_doc_count` even after merging the local frequencies. `shard_min_doc_count` is set to `1` per default and has no effect unless you explicitly set it. + + + + +WARNING: Setting `min_doc_count` to `1` is generally not advised as it tends to return terms that + are typos or other bizarre curiosities. Finding more than one instance of a term helps + reinforce that, while still rare, the term was not the result of a one-off accident. The + default value of 3 is used to provide a minimum weight-of-evidence. + Setting `shard_min_doc_count` too high will cause significant candidate terms to be filtered out on a shard level. This value should be set much lower than `min_doc_count/#shards`. 
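To see why `shard_min_doc_count` should stay well below `min_doc_count/#shards`, consider a rough sketch in Python. It is not Elasticsearch code: the function name `surviving_doc_count`, the five-shard layout and the per-shard counts are invented for illustration, and the `>=` comparison at the threshold is an assumption.

```python
def surviving_doc_count(local_counts, shard_min_doc_count):
    """Sum the local counts of the shards that keep the term as a candidate.

    Assumption: a shard keeps a term when its local count is at least
    shard_min_doc_count (the exact boundary behaviour is not modelled here).
    """
    return sum(c for c in local_counts if c >= shard_min_doc_count)


# Hypothetical term spread evenly over 5 shards, 4 documents on each.
local_counts = [4, 4, 4, 4, 4]
min_doc_count = 10

# min_doc_count / #shards = 10 / 5 = 2; a threshold below that keeps the term everywhere.
low = surviving_doc_count(local_counts, shard_min_doc_count=1)
# A threshold above the per-shard count drops the term on every shard.
high = surviving_doc_count(local_counts, shard_min_doc_count=5)

print(low, low >= min_doc_count)    # 20 True
print(high, high >= min_doc_count)  # 0 False
```

In this made-up case the term occurs in 20 documents overall, yet a shard-level threshold of 5 removes it before the global `min_doc_count` check is ever applied.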
+ + + +===== Custom background context + +The default source of statistical information for background term frequencies is the entire index and this +scope can be narrowed through the use of a `background_filter` to focus in on significant terms within a narrower +context: + +[source,js] +-------------------------------------------------- +{ + "query" : { + "match" : "madrid" + }, + "aggs" : { + "tags" : { + "significant_terms" : { + "field" : "tag", + "background_filter": { + "term" : { "text" : "spain"} + } + } + } + } +} +-------------------------------------------------- + +The above filter would help focus in on terms that were peculiar to the city of Madrid rather than revealing +terms like "Spanish" that are unusual in the full index's worldwide context but commonplace in the subset of documents containing the +word "Spain". + +WARNING: Use of background filters will slow the query as each term's postings must be filtered to determine a frequency + + +===== Filtering Values + +It is possible (although rarely required) to filter the values for which buckets will be created. This can be done using the `include` and +`exclude` parameters which are based on a regular expression string or arrays of exact terms. This functionality mirrors the features +described in the <> documentation. + + +===== Execution hint + + +There are different mechanisms by which terms aggregations can be executed: + + - by using field values directly in order to aggregate data per-bucket (`map`) + - by using ordinals of the field and preemptively allocating one bucket per ordinal value (`global_ordinals`) + - by using ordinals of the field and dynamically allocating one bucket per ordinal value (`global_ordinals_hash`) + +Elasticsearch tries to have sensible defaults so this is something that generally doesn't need to be configured. + +`map` should only be considered when very few documents match a query. Otherwise the ordinals-based execution modes +are significantly faster. 
By default, `map` is only used when running an aggregation on scripts, since they don't have +ordinals. + +`global_ordinals` is the second fastest option, but the fact that it preemptively allocates buckets can be memory-intensive, +especially if you have one or more sub aggregations. It is used by default on top-level terms aggregations. + +`global_ordinals_hash` on the contrary to `global_ordinals` and `global_ordinals_low_cardinality` allocates buckets dynamically +so memory usage is linear to the number of values of the documents that are part of the aggregation scope. It is used by default +in inner aggregations. + + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "tags" : { + "significant_terms" : { + "field" : "tags", + "execution_hint": "map" <1> + } + } + } +} +-------------------------------------------------- + +<1> the possible values are `map`, `global_ordinals` and `global_ordinals_hash` + +Please note that Elasticsearch will ignore this execution hint if it is not applicable. + diff --git a/docs/reference/aggregations/bucket/terms-aggregation.asciidoc b/docs/reference/aggregations/bucket/terms-aggregation.asciidoc new file mode 100644 index 0000000000..58a6ca2449 --- /dev/null +++ b/docs/reference/aggregations/bucket/terms-aggregation.asciidoc @@ -0,0 +1,657 @@ +[[search-aggregations-bucket-terms-aggregation]] +=== Terms Aggregation + +A multi-bucket value source based aggregation where buckets are dynamically built - one per unique value. + +Example: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "genders" : { + "terms" : { "field" : "gender" } + } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + ... 
+ + "aggregations" : { + "genders" : { + "doc_count_error_upper_bound": 0, <1> + "sum_other_doc_count": 0, <2> + "buckets" : [ <3> + { + "key" : "male", + "doc_count" : 10 + }, + { + "key" : "female", + "doc_count" : 10 + } + ] + } + } +} +-------------------------------------------------- + +<1> an upper bound of the error on the document counts for each term, see <> +<2> when there are lots of unique terms, elasticsearch only returns the top terms; this number is the sum of the document counts for all buckets that are not part of the response +<3> the list of the top buckets, the meaning of `top` being defined by the <> + +By default, the `terms` aggregation will return the buckets for the top ten terms ordered by the `doc_count`. One can +change this default behaviour by setting the `size` parameter. + +==== Size + +The `size` parameter can be set to define how many term buckets should be returned out of the overall terms list. By +default, the node coordinating the search process will request each shard to provide its own top `size` term buckets +and once all shards respond, it will reduce the results to the final list that will then be returned to the client. +This means that if the number of unique terms is greater than `size`, the returned list can be slightly off and not accurate +(it could be that the term counts are slightly off and it could even be that a term that should have been in the top +size buckets was not returned). If set to `0`, the `size` will be set to `Integer.MAX_VALUE`. + +[[search-aggregations-bucket-terms-aggregation-approximate-counts]] +==== Document counts are approximate + +As described above, the document counts (and the results of any sub aggregations) in the terms aggregation are not always +accurate. This is because each shard provides its own view of what the ordered list of terms should be and these are +combined to give a final view.
Consider the following scenario: + +A request is made to obtain the top 5 terms in the field product, ordered by descending document count from an index with +3 shards. In this case each shard is asked to give its top 5 terms. + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "products" : { + "terms" : { + "field" : "product", + "size" : 5 + } + } + } +} +-------------------------------------------------- + +The terms for each of the three shards are shown below with their +respective document counts in brackets: + +[width="100%",cols="^2,^2,^2,^2",options="header"] +|========================================================= +| | Shard A | Shard B | Shard C + +| 1 | Product A (25) | Product A (30) | Product A (45) +| 2 | Product B (18) | Product B (25) | Product C (44) +| 3 | Product C (6) | Product F (17) | Product Z (36) +| 4 | Product D (3) | Product Z (16) | Product G (30) +| 5 | Product E (2) | Product G (15) | Product E (29) +| 6 | Product F (2) | Product H (14) | Product H (28) +| 7 | Product G (2) | Product I (10) | Product Q (2) +| 8 | Product H (2) | Product Q (6) | Product D (1) +| 9 | Product I (1) | Product J (8) | +| 10 | Product J (1) | Product C (4) | + +|========================================================= + +The shards will return their top 5 terms so the results from the shards will be: + + +[width="100%",cols="^2,^2,^2,^2",options="header"] +|========================================================= +| | Shard A | Shard B | Shard C + +| 1 | Product A (25) | Product A (30) | Product A (45) +| 2 | Product B (18) | Product B (25) | Product C (44) +| 3 | Product C (6) | Product F (17) | Product Z (36) +| 4 | Product D (3) | Product Z (16) | Product G (30) +| 5 | Product E (2) | Product G (15) | Product E (29) + +|========================================================= + +Taking the top 5 results from each of the shards (as requested) and combining them to make a final top 5 list produces +the following: + 
+[width="40%",cols="^2,^2"] +|========================================================= + +| 1 | Product A (100) +| 2 | Product Z (52) +| 3 | Product C (50) +| 4 | Product G (45) +| 5 | Product B (43) + +|========================================================= + +Because Product A was returned from all shards we know that its document count value is accurate. Product C was only +returned by shards A and C so its document count is shown as 50 but this is not an accurate count. Product C exists on +shard B, but its count of 4 was not high enough to put Product C into the top 5 list for that shard. Product Z was also +returned only by 2 shards but the third shard does not contain the term. There is no way of knowing, at the point of +combining the results to produce the final list of terms, that there is an error in the document count for Product C and +not for Product Z. Product H has a document count of 44 across all 3 shards but was not included in the final list of +terms because it did not make it into the top five terms on any of the shards. + +==== Shard Size + +The higher the requested `size` is, the more accurate the results will be, but also, the more expensive it will be to +compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data +transfers between the nodes and the client). + +The `shard_size` parameter can be used to minimize the extra work that comes with bigger requested `size`. When defined, +it will determine how many terms the coordinating node will request from each shard. Once all the shards responded, the +coordinating node will then reduce them to a final result which will be based on the `size` parameter - this way, +one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to +the client. If set to `0`, the `shard_size` will be set to `Integer.MAX_VALUE`. 
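The effect of `shard_size` on the example above can be reproduced with a small simulation. The sketch below is plain Python rather than Elasticsearch code; the helper names `top_terms` and `merge` are made up for illustration, product names are shortened to single letters, and ties on a shard are broken alphabetically as an assumption.

```python
from collections import Counter

# Per-shard document counts copied from the example table above.
SHARDS = {
    "A": {"A": 25, "B": 18, "C": 6, "D": 3, "E": 2, "F": 2, "G": 2, "H": 2, "I": 1, "J": 1},
    "B": {"A": 30, "B": 25, "F": 17, "Z": 16, "G": 15, "H": 14, "I": 10, "Q": 6, "J": 8, "C": 4},
    "C": {"A": 45, "C": 44, "Z": 36, "G": 30, "E": 29, "H": 28, "Q": 2, "D": 1},
}


def top_terms(counts, n):
    """Return the n highest-count (term, count) pairs; ties broken by term name."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def merge(shards, size, shard_size):
    """Ask each shard for shard_size terms, sum the counts, keep the top size."""
    merged = Counter()
    for counts in shards.values():
        for term, count in top_terms(counts, shard_size):
            merged[term] += count
    return top_terms(merged, size)


approx = merge(SHARDS, size=5, shard_size=5)
exact = merge(SHARDS, size=5, shard_size=10)
print(approx)  # [('A', 100), ('Z', 52), ('C', 50), ('G', 45), ('B', 43)]
print(exact)   # [('A', 100), ('C', 54), ('Z', 52), ('G', 47), ('H', 44)]
```

With `shard_size` raised to 10 every shard in this example returns all of its terms, so the merged counts are exact and Product H takes the place of Product B in the top five, while the client still receives only five buckets.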
+ + +NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, elasticsearch will + override it and reset it to be equal to `size`. + +It is possible to not limit the number of terms that are returned by setting `size` to `0`. Don't use this +on high-cardinality fields as this will strain both your CPU, since terms need to be returned sorted, and your network. + +The default `shard_size` is a multiple of the `size` parameter which is dependent on the number of shards. + +==== Calculating Document Count Error + +There are two error values which can be shown on the terms aggregation. The first gives a value for the aggregation as +a whole which represents the maximum potential document count for a term which did not make it into the final list of +terms. This is calculated as the sum of the document counts of the last term returned from each shard. For the example +given above the value would be 46 (2 + 15 + 29). This means that in the worst case scenario a term which was not returned +could have the 4th highest document count. + +[source,js] +-------------------------------------------------- +{ + ... + + "aggregations" : { + "products" : { + "doc_count_error_upper_bound" : 46, + "buckets" : [ + { + "key" : "Product A", + "doc_count" : 100 + }, + { + "key" : "Product Z", + "doc_count" : 52 + }, + ... + ] + } + } +} +-------------------------------------------------- + +==== Per bucket document count error + +experimental[] + +The second error value can be enabled by setting the `show_term_doc_count_error` parameter to true. This shows an error value +for each term returned by the aggregation which represents the 'worst case' error in the document count and can be useful when +deciding on a value for the `shard_size` parameter. This is calculated by summing the document counts for the last term returned +by all shards which did not return the term.
In the example above the error in the document count for Product C would be 15 as +Shard B was the only shard not to return the term and the document count of the last term it did return was 15. The actual document +count of Product C was 54 so the document count was only actually off by 4 even though the worst case was that it would be off by +15. Product A, however, has an error of 0 for its document count: since every shard returned it, we can be confident that the count +returned is accurate. + +[source,js] +-------------------------------------------------- +{ + ... + + "aggregations" : { + "products" : { + "doc_count_error_upper_bound" : 46, + "buckets" : [ + { + "key" : "Product A", + "doc_count" : 100, + "doc_count_error_upper_bound" : 0 + }, + { + "key" : "Product Z", + "doc_count" : 52, + "doc_count_error_upper_bound" : 2 + }, + ... + ] + } + } +} +-------------------------------------------------- + +These errors can only be calculated in this way when the terms are ordered by descending document count. When the aggregation is +ordered by the terms values themselves (either ascending or descending) there is no error in the document count since if a shard +does not return a particular term which appears in the results from another shard, it must not have that term in its index. When the +aggregation is either sorted by a sub aggregation or in order of ascending document count, the error in the document counts cannot be +determined and is given a value of -1 to indicate this. + +[[search-aggregations-bucket-terms-aggregation-order]] +==== Order + +The order of the buckets can be customized by setting the `order` parameter. By default, the buckets are ordered by +their `doc_count` descending.
It is also possible to change this behaviour as follows: + +Ordering the buckets by their `doc_count` in an ascending manner: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "genders" : { + "terms" : { + "field" : "gender", + "order" : { "_count" : "asc" } + } + } + } +} +-------------------------------------------------- + +Ordering the buckets alphabetically by their terms in an ascending manner: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "genders" : { + "terms" : { + "field" : "gender", + "order" : { "_term" : "asc" } + } + } + } +} +-------------------------------------------------- + + +Ordering the buckets by single value metrics sub-aggregation (identified by the aggregation name): + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "genders" : { + "terms" : { + "field" : "gender", + "order" : { "avg_height" : "desc" } + }, + "aggs" : { + "avg_height" : { "avg" : { "field" : "height" } } + } + } + } +} +-------------------------------------------------- + +Ordering the buckets by multi value metrics sub-aggregation (identified by the aggregation name): + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "genders" : { + "terms" : { + "field" : "gender", + "order" : { "height_stats.avg" : "desc" } + }, + "aggs" : { + "height_stats" : { "stats" : { "field" : "height" } } + } + } + } +} +-------------------------------------------------- + +It is also possible to order the buckets based on a "deeper" aggregation in the hierarchy. This is supported as long +as the aggregations path are of a single-bucket type, where the last aggregation in the path may either be a single-bucket +one or a metrics one. If it's a single-bucket type, the order will be defined by the number of docs in the bucket (i.e. 
`doc_count`), +in case it's a metrics one, the same rules as above apply (where the path must indicate the metric name to sort by in case of +a multi-value metrics aggregation, and in case of a single-value metrics aggregation the sort will be applied on that value). + +The path must be defined in the following form: + +-------------------------------------------------- +AGG_SEPARATOR := '>' +METRIC_SEPARATOR := '.' +AGG_NAME := <the name of the aggregation> +METRIC := <the name of the metric (in case of multi-value metrics aggregation)> +PATH := <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR><METRIC>] +-------------------------------------------------- + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "countries" : { + "terms" : { + "field" : "address.country", + "order" : { "females>height_stats.avg" : "desc" } + }, + "aggs" : { + "females" : { + "filter" : { "term" : { "gender" : "female" }}, + "aggs" : { + "height_stats" : { "stats" : { "field" : "height" }} + } + } + } + } + } +} +-------------------------------------------------- + +The above will sort the countries buckets based on the average height among the female population. + +Multiple criteria can be used to order the buckets by providing an array of order criteria such as the following: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "countries" : { + "terms" : { + "field" : "address.country", + "order" : [ { "females>height_stats.avg" : "desc" }, { "_count" : "desc" } ] + }, + "aggs" : { + "females" : { + "filter" : { "term" : { "gender" : "female" }}, + "aggs" : { + "height_stats" : { "stats" : { "field" : "height" }} + } + } + } + } + } +} +-------------------------------------------------- + +The above will sort the countries buckets based on the average height among the female population and then by +their `doc_count` in descending order. + +NOTE: In the event that two buckets share the same values for all order criteria the bucket's term value is used as a +tie-breaker in ascending alphabetical order to prevent non-deterministic ordering of buckets.
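The multi-criteria ordering and the tie-breaker described above can be pictured as a compound sort key. The following Python sketch is only an illustration: the bucket values and the `avg_height` field are invented, and it does not claim to reflect how Elasticsearch implements the comparison internally.

```python
# Hypothetical buckets: term, doc_count and average female height per country.
buckets = [
    {"key": "spain", "doc_count": 40, "avg_height": 1.63},
    {"key": "france", "doc_count": 40, "avg_height": 1.63},
    {"key": "italy", "doc_count": 55, "avg_height": 1.63},
    {"key": "norway", "doc_count": 12, "avg_height": 1.68},
]

# Order by avg_height desc, then doc_count desc, then term asc as the tie-breaker.
ordered = sorted(
    buckets,
    key=lambda b: (-b["avg_height"], -b["doc_count"], b["key"]),
)

print([b["key"] for b in ordered])  # ['norway', 'italy', 'france', 'spain']
```

Here `france` and `spain` agree on both order criteria, so only the term value decides their relative position, which keeps the result deterministic.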
+ +==== Minimum document count + +It is possible to only return terms that match more than a configured number of hits using the `min_doc_count` option: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "tags" : { + "terms" : { + "field" : "tags", + "min_doc_count": 10 + } + } + } +} +-------------------------------------------------- + +The above aggregation would only return tags which have been found in 10 hits or more. Default value is `1`. + + +Terms are collected and ordered on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global document count available. The decision if a term is added to a candidate list depends only on the order computed on the shard using local shard frequencies. The `min_doc_count` criterion is only applied after merging local terms statistics of all shards. In a way the decision to add the term as a candidate is made without being very _certain_ about if the term will actually reach the required `min_doc_count`. This might cause many (globally) high frequent terms to be missing in the final result if low frequent terms populated the candidate lists. To avoid this, the `shard_size` parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic. + +`shard_min_doc_count` parameter + +The parameter `shard_min_doc_count` regulates the _certainty_ a shard has if the term should actually be added to the candidate list or not with respect to the `min_doc_count`. Terms will only be considered if their local shard frequency within the set is higher than the `shard_min_doc_count`. 
If your dictionary contains many low-frequency terms and you are not interested in those (for example misspellings), then you can set the `shard_min_doc_count` parameter to filter out candidate terms on a shard level that will with a reasonable certainty not reach the required `min_doc_count` even after merging the local counts. `shard_min_doc_count` is set to `0` by default and has no effect unless you explicitly set it. + + + +NOTE: Setting `min_doc_count`=`0` will also return buckets for terms that didn't match any hit. However, some of + the returned terms which have a document count of zero might only belong to deleted documents or documents + from other types, so there is no guarantee that a `match_all` query would find a positive document count for + those terms. + +WARNING: When NOT sorting on `doc_count` descending, high values of `min_doc_count` may return a number of buckets + which is less than `size` because not enough data was gathered from the shards. Missing buckets can be + brought back by increasing `shard_size`. + Setting `shard_min_doc_count` too high will cause terms to be filtered out on a shard level. This value should be set much lower than `min_doc_count/#shards`. + +[[search-aggregations-bucket-terms-aggregation-script]] +==== Script + +Generating the terms using a script: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "genders" : { + "terms" : { + "script" : "doc['gender'].value" + } + } + } +} +-------------------------------------------------- + +TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory.
+ + +==== Value Script + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "genders" : { + "terms" : { + "field" : "gender", + "script" : "'Gender: ' +_value" + } + } + } +} +-------------------------------------------------- + + +==== Filtering Values + +It is possible to filter the values for which buckets will be created. This can be done using the `include` and +`exclude` parameters which are based on regular expression strings or arrays of exact values. + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "tags" : { + "terms" : { + "field" : "tags", + "include" : ".*sport.*", + "exclude" : "water_.*" + } + } + } +} +-------------------------------------------------- + +In the above example, buckets will be created for all the tags that have the word `sport` in them, except those starting +with `water_` (so the tag `water_sports` will not be aggregated). The `include` regular expression will determine what +values are "allowed" to be aggregated, while the `exclude` determines the values that should not be aggregated. When +both are defined, the `exclude` has precedence, meaning, the `include` is evaluated first and only then the `exclude`. + +The syntax is the same as <>. + +For matching based on exact values the `include` and `exclude` parameters can simply take an array of +strings that represent the terms as they are found in the index: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "JapaneseCars" : { + "terms" : { + "field" : "make", + "include" : ["mazda", "honda"] + } + }, + "ActiveCarManufacturers" : { + "terms" : { + "field" : "make", + "exclude" : ["rover", "jensen"] + } + } + } +} +-------------------------------------------------- + +==== Multi-field terms aggregation + +The `terms` aggregation does not support collecting terms from multiple fields +in the same document.
The reason is that the `terms` agg doesn't collect the +string term values themselves, but rather uses +<> +to produce a list of all of the unique values in the field. Global ordinals +results in an important performance boost which would not be possible across +multiple fields. + +There are two approaches that you can use to perform a `terms` agg across +multiple fields: + +<>:: + +Use a script to retrieve terms from multiple fields. This disables the global +ordinals optimization and will be slower than collecting terms from a single +field, but it gives you the flexibility to implement this option at search +time. + +<>:: + +If you know ahead of time that you want to collect the terms from two or more +fields, then use `copy_to` in your mapping to create a new dedicated field at +index time which contains the values from both fields. You can aggregate on +this single field, which will benefit from the global ordinals optimization. + +==== Collect mode + +Deferring calculation of child aggregations + +For fields with many unique terms and a small number of required results it can be more efficient to delay the calculation +of child aggregations until the top parent-level aggs have been pruned. Ordinarily, all branches of the aggregation tree +are expanded in one depth-first pass and only then any pruning occurs. In some rare scenarios this can be very wasteful and can hit memory constraints. 
+An example problem scenario is querying a movie database for the 10 most popular actors and their 5 most common co-stars: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "actors" : { + "terms" : { + "field" : "actors", + "size" : 10 + }, + "aggs" : { + "costars" : { + "terms" : { + "field" : "actors", + "size" : 5 + } + } + } + } + } +} +-------------------------------------------------- + +Even though the number of movies may be comparatively small and we want only 50 result buckets there is a combinatorial explosion of buckets +during calculation - a single movie will produce n² buckets where n is the number of actors. The sane option would be to first determine +the 10 most popular actors and only then examine the top co-stars for these 10 actors. This alternative strategy is what we call the `breadth_first` collection +mode as opposed to the default `depth_first` mode: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "actors" : { + "terms" : { + "field" : "actors", + "size" : 10, + "collect_mode" : "breadth_first" + }, + "aggs" : { + "costars" : { + "terms" : { + "field" : "actors", + "size" : 5 + } + } + } + } + } +} +-------------------------------------------------- + + +When using `breadth_first` mode the set of documents that fall into the uppermost buckets are +cached for subsequent replay so there is a memory overhead in doing this which is linear with the number of matching documents. +In most requests the volume of buckets generated is smaller than the number of documents that fall into them so the default `depth_first` +collection mode is normally the best bet but occasionally the `breadth_first` strategy can be significantly more efficient. Currently +elasticsearch will always use the `depth_first` collect_mode unless explicitly instructed to use `breadth_first` as in the above example. 
+Note that the `order` parameter can still be used to refer to data from a child aggregation when using the `breadth_first` setting - the parent +aggregation understands that this child aggregation will need to be called first before any of the other child aggregations. + +WARNING: It is not possible to nest aggregations such as `top_hits` which require access to match score information under an aggregation that uses +the `breadth_first` collection mode. This is because this would require a RAM buffer to hold the float score value for every document and +this would typically be too costly in terms of RAM. + +[[search-aggregations-bucket-terms-aggregation-execution-hint]] +==== Execution hint + +experimental[The automated execution optimization is experimental, so this parameter is provided temporarily as a way to override the default behaviour] + +There are different mechanisms by which terms aggregations can be executed: + + - by using field values directly in order to aggregate data per-bucket (`map`) + - by using ordinals of the field and preemptively allocating one bucket per ordinal value (`global_ordinals`) + - by using ordinals of the field and dynamically allocating one bucket per ordinal value (`global_ordinals_hash`) + - by using per-segment ordinals to compute counts and remap these counts to global counts using global ordinals (`global_ordinals_low_cardinality`) + +Elasticsearch tries to have sensible defaults so this is something that generally doesn't need to be configured. + +`map` should only be considered when very few documents match a query. Otherwise the ordinals-based execution modes +are significantly faster. By default, `map` is only used when running an aggregation on scripts, since they don't have +ordinals. + +`global_ordinals_low_cardinality` only works for leaf terms aggregations but is usually the fastest execution mode. 
Memory +usage is linear with the number of unique values in the field, so it is only enabled by default on low-cardinality fields. + +`global_ordinals` is the second fastest option, but the fact that it preemptively allocates buckets can be memory-intensive, +especially if you have one or more sub aggregations. It is used by default on top-level terms aggregations. + +`global_ordinals_hash` on the contrary to `global_ordinals` and `global_ordinals_low_cardinality` allocates buckets dynamically +so memory usage is linear to the number of values of the documents that are part of the aggregation scope. It is used by default +in inner aggregations. + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "tags" : { + "terms" : { + "field" : "tags", + "execution_hint": "map" <1> + } + } + } +} +-------------------------------------------------- + +<1> experimental[] the possible values are `map`, `global_ordinals`, `global_ordinals_hash` and `global_ordinals_low_cardinality` + +Please note that Elasticsearch will ignore this execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints. diff --git a/docs/reference/aggregations/metrics.asciidoc b/docs/reference/aggregations/metrics.asciidoc new file mode 100644 index 0000000000..f80c36f2eb --- /dev/null +++ b/docs/reference/aggregations/metrics.asciidoc @@ -0,0 +1,48 @@ +[[search-aggregations-metrics]] +== Metrics Aggregations + +The aggregations in this family compute metrics based on values extracted in one way or another from the documents that +are being aggregated. The values are typically extracted from the fields of the document (using the field data), but +can also be generated using scripts. + +Numeric metrics aggregations are a special type of metrics aggregation which output numeric values. Some aggregations output +a single numeric metric (e.g. 
`avg`) and are called `single-value numeric metrics aggregation`, others generate multiple +metrics (e.g. `stats`) and are called `multi-value numeric metrics aggregation`. The distinction between single-value and +multi-value numeric metrics aggregations plays a role when these aggregations serve as direct sub-aggregations of some +bucket aggregations (some bucket aggregations enable you to sort the returned buckets based on the numeric metrics in each bucket). + +include::metrics/avg-aggregation.asciidoc[] + +include::metrics/cardinality-aggregation.asciidoc[] + +include::metrics/extendedstats-aggregation.asciidoc[] + +include::metrics/geobounds-aggregation.asciidoc[] + +include::metrics/max-aggregation.asciidoc[] + +include::metrics/min-aggregation.asciidoc[] + +include::metrics/percentile-aggregation.asciidoc[] + +include::metrics/percentile-rank-aggregation.asciidoc[] + +include::metrics/scripted-metric-aggregation.asciidoc[] + +include::metrics/stats-aggregation.asciidoc[] + +include::metrics/sum-aggregation.asciidoc[] + +include::metrics/tophits-aggregation.asciidoc[] + +include::metrics/valuecount-aggregation.asciidoc[] + + + + + + + + + + diff --git a/docs/reference/aggregations/metrics/avg-aggregation.asciidoc b/docs/reference/aggregations/metrics/avg-aggregation.asciidoc new file mode 100644 index 0000000000..3f029984ba --- /dev/null +++ b/docs/reference/aggregations/metrics/avg-aggregation.asciidoc @@ -0,0 +1,75 @@ +[[search-aggregations-metrics-avg-aggregation]] +=== Avg Aggregation + +A `single-value` metrics aggregation that computes the average of numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script. 
+ +Assuming the data consists of documents representing exam grades (between 0 and 100) of students: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "avg_grade" : { "avg" : { "field" : "grade" } } + } +} +-------------------------------------------------- + +The above aggregation computes the average grade over all documents. The aggregation type is `avg` and the `field` setting defines the numeric field of the documents the average will be computed on. The above will return the following: + + +[source,js] +-------------------------------------------------- +{ + ... + + "aggregations": { + "avg_grade": { + "value": 75 + } + } +} +-------------------------------------------------- + +The name of the aggregation (`avg_grade` above) also serves as the key by which the aggregation result can be retrieved from the returned response. + +==== Script + +Computing the average grade based on a script: + +[source,js] +-------------------------------------------------- +{ + ..., + + "aggs" : { + "avg_grade" : { "avg" : { "script" : "doc['grade'].value" } } + } +} +-------------------------------------------------- + +TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. + +===== Value Script + +It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use a value script to get the new average: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + ...
+ + "aggs" : { + "avg_corrected_grade" : { + "avg" : { + "field" : "grade", + "script" : "_value * correction", + "params" : { + "correction" : 1.2 + } + } + } + } + } +} +-------------------------------------------------- \ No newline at end of file diff --git a/docs/reference/aggregations/metrics/cardinality-aggregation.asciidoc b/docs/reference/aggregations/metrics/cardinality-aggregation.asciidoc new file mode 100644 index 0000000000..07943a06c2 --- /dev/null +++ b/docs/reference/aggregations/metrics/cardinality-aggregation.asciidoc @@ -0,0 +1,157 @@ +[[search-aggregations-metrics-cardinality-aggregation]] +=== Cardinality Aggregation + +A `single-value` metrics aggregation that calculates an approximate count of +distinct values. Values can be extracted either from specific fields in the +document or generated by a script. + +Assume you are indexing books and would like to count the unique authors that +match a query: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "author_count" : { + "cardinality" : { + "field" : "author" + } + } + } +} +-------------------------------------------------- + +==== Precision control + +This aggregation also supports the `precision_threshold` and `rehash` options: + +experimental[The `precision_threshold` and `rehash` options are specific to the current internal implementation of the `cardinality` agg, which may change in the future] + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "author_count" : { + "cardinality" : { + "field" : "author_hash", + "precision_threshold": 100, <1> + "rehash": false <2> + } + } + } +} +-------------------------------------------------- + +<1> The `precision_threshold` options allows to trade memory for accuracy, and +defines a unique count below which counts are expected to be close to +accurate. Above this value, counts might become a bit more fuzzy. 
The maximum +supported value is 40000; thresholds above this number will have the same +effect as a threshold of 40000. +The default value depends on the number of parent aggregations that create +multiple buckets (such as terms or histograms). +<2> If you computed a hash on the client side, stored it in your documents and want +Elasticsearch to compute counts using this hash without +rehashing values, it is possible to specify `rehash: false`. The default value is +`true`. Please note that the hash must be indexed as a long when `rehash` is +false. + +==== Counts are approximate + +Computing exact counts requires loading values into a hash set and returning its +size. This doesn't scale when working on high-cardinality sets and/or large +values as the required memory usage and the need to communicate those +per-shard sets between nodes would utilize too many resources of the cluster. + +This `cardinality` aggregation is based on the +http://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++] +algorithm, which counts based on the hashes of the values with some interesting +properties: + + * configurable precision, which decides on how to trade memory for accuracy, + * excellent accuracy on low-cardinality sets, + * fixed memory usage: no matter if there are tens or billions of unique values, + memory usage only depends on the configured precision. + +For a precision threshold of `c`, the implementation that we are using requires +about `c * 8` bytes. + +The following chart shows how the error varies before and after the threshold: + +image:images/cardinality_error.png[] + +For all 3 thresholds, counts have been accurate up to the configured threshold +(although not guaranteed, this is likely to be the case). Please also note that +even with a threshold as low as 100, the error remains under 5%, even when +counting millions of items.
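As a rough illustration of the memory figures above, the following Python sketch estimates the memory needed for a given precision threshold. The function and constant names are hypothetical; only the `c * 8` estimate and the 40000 cap come from this section, and the result is an approximation rather than a measurement.

```python
# Hypothetical helper names; the figures are the ones quoted in this section.
MAX_PRECISION_THRESHOLD = 40000  # larger thresholds behave like 40000
BYTES_PER_THRESHOLD_UNIT = 8     # "about c * 8 bytes"


def estimated_cardinality_memory_bytes(precision_threshold):
    """Approximate memory used by one cardinality aggregation instance."""
    if precision_threshold < 0:
        raise ValueError("precision_threshold must be non-negative")
    # Thresholds above the maximum have the same effect as the maximum.
    effective = min(precision_threshold, MAX_PRECISION_THRESHOLD)
    return effective * BYTES_PER_THRESHOLD_UNIT


for c in (100, 1000, 40000, 100000):
    print(c, estimated_cardinality_memory_bytes(c))
```

Because the estimate applies to each bucket that holds its own `cardinality` aggregation, it may be worth multiplying it by the expected number of parent buckets when sizing a request.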
+ +==== Pre-computed hashes + +If you don't want Elasticsearch to re-compute hashes on every run of this +aggregation, it is possible to use pre-computed hashes, either by computing a +hash on client-side, indexing it and specifying `rehash: false`, or by using +the special `murmur3` field mapper, typically in the context of a `multi-field` +in the mapping: + +[source,js] +-------------------------------------------------- +{ + "author": { + "type": "string", + "fields": { + "hash": { + "type": "murmur3" + } + } + } +} +-------------------------------------------------- + +With such a mapping, Elasticsearch is going to compute hashes of the `author` +field at indexing time and store them in the `author.hash` field. This +way, unique counts can be computed using the cardinality aggregation by only +loading the hashes into memory, not the values of the `author` field, and +without computing hashes on the fly: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "author_count" : { + "cardinality" : { + "field" : "author.hash" + } + } + } +} +-------------------------------------------------- + +NOTE: `rehash` is automatically set to `false` when computing unique counts on +a `murmur3` field. + +NOTE: Pre-computing hashes is usually only useful on very large and/or +high-cardinality fields as it saves CPU and memory. However, on numeric +fields, hashing is very fast and storing the original values requires as much +or less memory than storing the hashes. This is also true on low-cardinality +string fields, especially given that those have an optimization in order to +make sure that hashes are computed at most once per unique value per segment. + +==== Script + +The `cardinality` metric supports scripting, with a noticeable performance hit +however since hashes need to be computed on the fly. 
+ +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "author_count" : { + "cardinality" : { + "script": "doc['author.first_name'].value + ' ' + doc['author.last_name'].value" + } + } + } +} +-------------------------------------------------- + +TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. + diff --git a/docs/reference/aggregations/metrics/extendedstats-aggregation.asciidoc b/docs/reference/aggregations/metrics/extendedstats-aggregation.asciidoc new file mode 100644 index 0000000000..07d25fac65 --- /dev/null +++ b/docs/reference/aggregations/metrics/extendedstats-aggregation.asciidoc @@ -0,0 +1,119 @@ +[[search-aggregations-metrics-extendedstats-aggregation]] +=== Extended Stats Aggregation + +A `multi-value` metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script. + +The `extended_stats` aggregation is an extended version of the <> aggregation, where additional metrics are added such as `sum_of_squares`, `variance`, `std_deviation` and `std_deviation_bounds`. + +Assuming the data consists of documents representing exam grades (between 0 and 100) of students: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "grades_stats" : { "extended_stats" : { "field" : "grade" } } + } +} +-------------------------------------------------- + +The above aggregation computes the grades statistics over all documents. The aggregation type is `extended_stats` and the `field` setting defines the numeric field of the documents the stats will be computed on. The above will return the following: + + +[source,js] +-------------------------------------------------- +{ + ...
+ + "aggregations": { + "grades_stats": { + "count": 9, + "min": 72, + "max": 99, + "avg": 86, + "sum": 774, + "sum_of_squares": 67028, + "variance": 51.55555555555556, + "std_deviation": 7.180219742846005, + "std_deviation_bounds": { + "upper": 100.36043948569201, + "lower": 71.63956051430799 + } + } + } +} +-------------------------------------------------- + +The name of the aggregation (`grades_stats` above) also serves as the key by which the aggregation result can be retrieved from the returned response. + +==== Standard Deviation Bounds +By default, the `extended_stats` metric will return an object called `std_deviation_bounds`, which provides an interval of plus/minus two standard +deviations from the mean. This can be a useful way to visualize the variance of your data. If you want a different boundary, for example +three standard deviations, you can set `sigma` in the request: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "grades_stats" : { + "extended_stats" : { + "field" : "grade", + "sigma" : 3 <1> + } + } + } +} +-------------------------------------------------- +<1> `sigma` controls how many standard deviations +/- from the mean should be displayed + +`sigma` can be any non-negative double, meaning you can request non-integer values such as `1.5`. A value of `0` is valid, but will simply +return the average for both `upper` and `lower` bounds. + +.Standard Deviation and Bounds require normality +[NOTE] +===== +The standard deviation and its bounds are displayed by default, but they are not always applicable to all data-sets. Your data must +be normally distributed for the metrics to make sense. The statistics behind standard deviations assume normally distributed data, so +if your data is skewed heavily left or right, the value returned will be misleading.
+===== + +==== Script + +Computing the grades stats based on a script: + +[source,js] +-------------------------------------------------- +{ + ..., + + "aggs" : { + "grades_stats" : { "extended_stats" : { "script" : "doc['grade'].value" } } + } +} +-------------------------------------------------- + +TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. + +===== Value Script + +It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use value script to get the new stats: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + ... + + "aggs" : { + "grades_stats" : { + "extended_stats" : { + "field" : "grade", + "script" : "_value * correction", + "params" : { + "correction" : 1.2 + } + } + } + } + } +} +-------------------------------------------------- \ No newline at end of file diff --git a/docs/reference/aggregations/metrics/geobounds-aggregation.asciidoc b/docs/reference/aggregations/metrics/geobounds-aggregation.asciidoc new file mode 100644 index 0000000000..ade59477ee --- /dev/null +++ b/docs/reference/aggregations/metrics/geobounds-aggregation.asciidoc @@ -0,0 +1,53 @@ +[[search-aggregations-metrics-geobounds-aggregation]] +=== Geo Bounds Aggregation + +A metric aggregation that computes the bounding box containing all geo_point values for a field. 
+ + +Example: + +[source,js] +-------------------------------------------------- +{ + "query" : { + "match" : { "business_type" : "shop" } + }, + "aggs" : { + "viewport" : { + "geo_bounds" : { + "field" : "location", <1> + "wrap_longitude" : true <2> + } + } + } +} +-------------------------------------------------- + +<1> The `geo_bounds` aggregation specifies the field to use to obtain the bounds +<2> `wrap_longitude` is an optional parameter which specifies whether the bounding box should be allowed to overlap the international date line. The default value is `true` + +The above aggregation demonstrates how one would compute the bounding box of the location field for all documents with a business type of shop + +The response for the above aggregation: + +[source,js] +-------------------------------------------------- +{ + ... + + "aggregations": { + "viewport": { + "bounds": { + "top_left": { + "lat": 80.45, + "lon": -160.22 + }, + "bottom_right": { + "lat": 40.65, + "lon": 42.57 + } + } + } + } +} +-------------------------------------------------- diff --git a/docs/reference/aggregations/metrics/max-aggregation.asciidoc b/docs/reference/aggregations/metrics/max-aggregation.asciidoc new file mode 100644 index 0000000000..facefc1201 --- /dev/null +++ b/docs/reference/aggregations/metrics/max-aggregation.asciidoc @@ -0,0 +1,69 @@ +[[search-aggregations-metrics-max-aggregation]] +=== Max Aggregation + +A `single-value` metrics aggregation that keeps track and returns the maximum value among the numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script. 
+ +Computing the max price value across all documents: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "max_price" : { "max" : { "field" : "price" } } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + ... + + "aggregations": { + "max_price": { + "value": 35 + } + } +} +-------------------------------------------------- + +As can be seen, the name of the aggregation (`max_price` above) also serves as the key by which the aggregation result can be retrieved from the returned response. + +==== Script + +Computing the max price value across all documents, this time using a script: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "max_price" : { "max" : { "script" : "doc['price'].value" } } + } +} +-------------------------------------------------- + +TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. + +==== Value Script + +Let's say that the prices of the documents in our index are in USD, but we would like to compute the max in EURO (and for the sake of this example, let's say the conversion rate is 1.2).
We can use a value script to apply the conversion rate to every value before it is aggregated: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "max_price_in_euros" : { + "max" : { + "field" : "price", + "script" : "_value * conversion_rate", + "params" : { + "conversion_rate" : 1.2 + } + } + } + } +} +-------------------------------------------------- + diff --git a/docs/reference/aggregations/metrics/min-aggregation.asciidoc b/docs/reference/aggregations/metrics/min-aggregation.asciidoc new file mode 100644 index 0000000000..1383cc0832 --- /dev/null +++ b/docs/reference/aggregations/metrics/min-aggregation.asciidoc @@ -0,0 +1,68 @@ +[[search-aggregations-metrics-min-aggregation]] +=== Min Aggregation + +A `single-value` metrics aggregation that keeps track and returns the minimum value among numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script. + +Computing the min price value across all documents: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "min_price" : { "min" : { "field" : "price" } } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + ... + + "aggregations": { + "min_price": { + "value": 10 + } + } +} +-------------------------------------------------- + +As can be seen, the name of the aggregation (`min_price` above) also serves as the key by which the aggregation result can be retrieved from the returned response. + +==== Script + +Computing the min price value across all document, this time using a script: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "min_price" : { "min" : { "script" : "doc['price'].value" } } + } +} +-------------------------------------------------- + +TIP: The `script` parameter expects an inline script. 
Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. + +==== Value Script + +Let's say that the prices of the documents in our index are in USD, but we would like to compute the min in EURO (and for the sake of this example, let's say the conversion rate is 1.2). We can use a value script to apply the conversion rate to every value before it is aggregated: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "min_price_in_euros" : { + "min" : { + "field" : "price", + "script" : "_value * conversion_rate", + "params" : { + "conversion_rate" : 1.2 + } + } + } + } +} +-------------------------------------------------- diff --git a/docs/reference/aggregations/metrics/percentile-aggregation.asciidoc b/docs/reference/aggregations/metrics/percentile-aggregation.asciidoc new file mode 100644 index 0000000000..6bd1011007 --- /dev/null +++ b/docs/reference/aggregations/metrics/percentile-aggregation.asciidoc @@ -0,0 +1,192 @@ +[[search-aggregations-metrics-percentile-aggregation]] +=== Percentiles Aggregation + +A `multi-value` metrics aggregation that calculates one or more percentiles +over numeric values extracted from the aggregated documents. These values +can be extracted either from specific numeric fields in the documents, or +be generated by a provided script. + +Percentiles show the point at which a certain percentage of observed values +occur. For example, the 95th percentile is the value which is greater than 95% +of the observed values. + +Percentiles are often used to find outliers. In normal distributions, the +0.13th and 99.87th percentiles represent three standard deviations from the +mean. Any data which falls outside three standard deviations is often considered +an anomaly. + +When a range of percentiles is retrieved, they can be used to estimate the +data distribution and determine if the data is skewed, bimodal, etc. + +Assume your data consists of website load times.
The average and median +load times are not overly useful to an administrator. The max may be interesting, +but it can be easily skewed by a single slow response. + +Let's look at a range of percentiles representing load time: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "load_time_outlier" : { + "percentiles" : { + "field" : "load_time" <1> + } + } + } +} +-------------------------------------------------- +<1> The field `load_time` must be a numeric field + +By default, the `percentile` metric will generate a range of +percentiles: `[ 1, 5, 25, 50, 75, 95, 99 ]`. The response will look like this: + +[source,js] +-------------------------------------------------- +{ + ... + + "aggregations": { + "load_time_outlier": { + "values" : { + "1.0": 15, + "5.0": 20, + "25.0": 23, + "50.0": 25, + "75.0": 29, + "95.0": 60, + "99.0": 150 + } + } + } +} +-------------------------------------------------- + +As you can see, the aggregation will return a calculated value for each percentile +in the default range. If we assume response times are in milliseconds, it is +immediately obvious that the webpage normally loads in 15-30ms, but occasionally +spikes to 60-150ms. + +Often, administrators are only interested in outliers -- the extreme percentiles. +We can specify just the percents we are interested in (requested percentiles +must be a value between 0-100 inclusive): + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "load_time_outlier" : { + "percentiles" : { + "field" : "load_time", + "percents" : [95, 99, 99.9] <1> + } + } + } +} +-------------------------------------------------- +<1> Use the `percents` parameter to specify particular percentiles to calculate + + + +==== Script + +The percentile metric supports scripting. 
For example, if our load times +are in milliseconds but we want percentiles calculated in seconds, we could use +a script to convert them on-the-fly: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "load_time_outlier" : { + "percentiles" : { + "script" : "doc['load_time'].value / timeUnit", <1> + "params" : { + "timeUnit" : 1000 <2> + } + } + } + } +} +-------------------------------------------------- +<1> The `field` parameter is replaced with a `script` parameter, which uses the +script to generate values which percentiles are calculated on +<2> Scripting supports parameterized input just like any other script + +TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. + +[[search-aggregations-metrics-percentile-aggregation-approximation]] +==== Percentiles are (usually) approximate + +There are many different algorithms to calculate percentiles. The naive +implementation simply stores all the values in a sorted array. To find the 50th +percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`. + +Clearly, the naive implementation does not scale -- the sorted array grows +linearly with the number of values in your dataset. To calculate percentiles +across potentially billions of values in an Elasticsearch cluster, _approximate_ +percentiles are calculated. + +The algorithm used by the `percentile` metric is called TDigest (introduced by +Ted Dunning in +https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf[Computing Accurate Quantiles using T-Digests]). + +When using this metric, there are a few guidelines to keep in mind: + +- Accuracy is proportional to `q(1-q)`. This means that extreme percentiles (e.g. 
99%) +are more accurate than less extreme percentiles, such as the median +- For small sets of values, percentiles are highly accurate (and potentially +100% accurate if the data is small enough). +- As the quantity of values in a bucket grows, the algorithm begins to approximate +the percentiles. It is effectively trading accuracy for memory savings. The +exact level of inaccuracy is difficult to generalize, since it depends on your +data distribution and volume of data being aggregated + +The following chart shows the relative error on a uniform distribution depending +on the number of collected values and the requested percentile: + +image:images/percentiles_error.png[] + +It shows how precision is better for extreme percentiles. The reason why error diminishes +for large number of values is that the law of large numbers makes the distribution of +values more and more uniform and the t-digest tree can do a better job at summarizing +it. It would not be the case on more skewed distributions. + +[[search-aggregations-metrics-percentile-aggregation-compression]] +==== Compression + +experimental[The `compression` parameter is specific to the current internal implementation of percentiles, and may change in the future] + +Approximate algorithms must balance memory utilization with estimation accuracy. +This balance can be controlled using a `compression` parameter: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "load_time_outlier" : { + "percentiles" : { + "field" : "load_time", + "compression" : 200 <1> + } + } + } +} +-------------------------------------------------- +<1> Compression controls memory usage and approximation error + +The TDigest algorithm uses a number of "nodes" to approximate percentiles -- the +more nodes available, the higher the accuracy (and large memory footprint) proportional +to the volume of data. The `compression` parameter limits the maximum number of +nodes to `20 * compression`. 
+ +Therefore, by increasing the compression value, you can increase the accuracy of +your percentiles at the cost of more memory. Larger compression values also +make the algorithm slower since the underlying tree data structure grows in size, +resulting in more expensive operations. The default compression value is +`100`. + +A "node" uses roughly 32 bytes of memory, so under worst-case scenarios (large amount +of data which arrives sorted and in-order) the default settings will produce a +TDigest roughly 64KB in size. In practice data tends to be more random and +the TDigest will use less memory. diff --git a/docs/reference/aggregations/metrics/percentile-rank-aggregation.asciidoc b/docs/reference/aggregations/metrics/percentile-rank-aggregation.asciidoc new file mode 100644 index 0000000000..d327fc6630 --- /dev/null +++ b/docs/reference/aggregations/metrics/percentile-rank-aggregation.asciidoc @@ -0,0 +1,88 @@ +[[search-aggregations-metrics-percentile-rank-aggregation]] +=== Percentile Ranks Aggregation + +A `multi-value` metrics aggregation that calculates one or more percentile ranks +over numeric values extracted from the aggregated documents. These values +can be extracted either from specific numeric fields in the documents, or +be generated by a provided script. + +[NOTE] +================================================== +Please see <> +and <> for advice +regarding approximation and memory use of the percentile ranks aggregation. +================================================== + +Percentile ranks show the percentage of observed values which are below a certain +value. For example, if a value is greater than or equal to 95% of the observed values +it is said to be at the 95th percentile rank. + +Assume your data consists of website load times. You may have a service agreement that +95% of page loads complete within 15ms and 99% of page loads complete within 30ms.
+ +Let's look at a range of percentiles representing load time: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "load_time_outlier" : { + "percentile_ranks" : { + "field" : "load_time", <1> + "values" : [15, 30] + } + } + } +} +-------------------------------------------------- +<1> The field `load_time` must be a numeric field + +The response will look like this: + +[source,js] +-------------------------------------------------- +{ + ... + + "aggregations": { + "load_time_outlier": { + "values" : { + "15": 92, + "30": 100 + } + } + } +} +-------------------------------------------------- + +From this information you can determine you are hitting the 99% load time target but not quite +hitting the 95% load time target + + +==== Script + +The percentile rank metric supports scripting. For example, if our load times +are in milliseconds but we want to specify values in seconds, we could use +a script to convert them on-the-fly: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "load_time_outlier" : { + "percentile_ranks" : { + "values" : [3, 5], + "script" : "doc['load_time'].value / timeUnit", <1> + "params" : { + "timeUnit" : 1000 <2> + } + } + } + } +} +-------------------------------------------------- +<1> The `field` parameter is replaced with a `script` parameter, which uses the +script to generate values which percentile ranks are calculated on +<2> Scripting supports parameterized input just like any other script + +TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. 
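To make the definition of a percentile rank concrete, here is a minimal Python sketch that computes exact percentile ranks over a small in-memory list. It is not how Elasticsearch computes them (the aggregation relies on the approximate approach described for percentiles). The function name `percentile_rank` and the sample load times are made up for illustration.

```python
def percentile_rank(values, threshold):
    """Return the percentage of observed values that are <= threshold."""
    if not values:
        raise ValueError("values must not be empty")
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)


# Hypothetical page load times in milliseconds.
load_times_ms = [5, 8, 9, 10, 11, 12, 12, 13, 14, 14, 16, 29]

for limit in (15, 30):
    print(limit, percentile_rank(load_times_ms, limit))
```

With such data, comparing each rank against a service agreement (for example 95% within 15ms) is a simple threshold check on the returned percentage.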
diff --git a/docs/reference/aggregations/metrics/scripted-metric-aggregation.asciidoc b/docs/reference/aggregations/metrics/scripted-metric-aggregation.asciidoc new file mode 100644 index 0000000000..a775d54540 --- /dev/null +++ b/docs/reference/aggregations/metrics/scripted-metric-aggregation.asciidoc @@ -0,0 +1,237 @@ +[[search-aggregations-metrics-scripted-metric-aggregation]] +=== Scripted Metric Aggregation + +experimental[] + +A metric aggregation that executes using scripts to provide a metric output. + +Example: + +[source,js] +-------------------------------------------------- +{ + "query" : { + "match_all" : {} + }, + "aggs": { + "profit": { + "scripted_metric": { + "init_script" : "_agg['transactions'] = []", + "map_script" : "if (doc['type'].value == \"sale\") { _agg.transactions.add(doc['amount'].value) } else { _agg.transactions.add(-1 * doc['amount'].value) }", <1> + "combine_script" : "profit = 0; for (t in _agg.transactions) { profit += t }; return profit", + "reduce_script" : "profit = 0; for (a in _aggs) { profit += a }; return profit" + } + } + } +} +-------------------------------------------------- + +<1> `map_script` is the only required parameter + +The above aggregation demonstrates how one would use the scripted metric aggregation to compute the total profit from sale and cost transactions. + +The response for the above aggregation: + +[source,js] +-------------------------------------------------- +{ + ... + + "aggregations": { + "profit": { + "value": 170 + } + } +} +-------------------------------------------------- + +==== Scope of scripts + +The scripted metric aggregation uses scripts at 4 stages of its execution: + +init_script:: Executed prior to any collection of documents. Allows the aggregation to set up any initial state. ++ +In the above example, the `init_script` creates an array `transactions` in the `_agg` object. + +map_script:: Executed once per document collected. This is the only required script.
If no combine_script is specified, the resulting state + needs to be stored in an object named `_agg`. ++ +In the above example, the `map_script` checks the value of the type field. If the value is 'sale' the value of the amount field +is added to the transactions array. If the value of the type field is not 'sale' the negated value of the amount field is added +to transactions. + +combine_script:: Executed once on each shard after document collection is complete. Allows the aggregation to consolidate the state returned from + each shard. If a combine_script is not provided the combine phase will return the aggregation variable. ++ +In the above example, the `combine_script` iterates through all the stored transactions, summing the values in the `profit` variable +and finally returns `profit`. + +reduce_script:: Executed once on the coordinating node after all shards have returned their results. The script is provided with access to a + variable `_aggs` which is an array of the result of the combine_script on each shard. If a reduce_script is not provided + the reduce phase will return the `_aggs` variable. ++ +In the above example, the `reduce_script` iterates through the `profit` returned by each shard summing the values before returning the +final combined profit which will be returned in the response of the aggregation.
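The four stages described above can be sketched outside Elasticsearch as plain Python. This is only an illustrative simulation, not how Elasticsearch executes scripts: the `shards` dictionary, the document values, and every function name are hypothetical stand-ins, and the per-shard `agg` dictionary plays the role of the `_agg` object.

```python
# Hypothetical in-memory stand-in for an index with two shards (not Elasticsearch code).
shards = {
    "A": [{"type": "sale", "amount": 80}, {"type": "cost", "amount": 30}],
    "B": [{"type": "cost", "amount": 10}, {"type": "sale", "amount": 130}],
}


def init_script(agg):
    # Runs once per shard before any document is collected.
    agg["transactions"] = []


def map_script(agg, doc):
    # Runs once per collected document: sales count positively, everything else negatively.
    sign = 1 if doc["type"] == "sale" else -1
    agg["transactions"].append(sign * doc["amount"])


def combine_script(agg):
    # Runs once per shard after collection: collapse the shard state to one number.
    return sum(agg["transactions"])


def reduce_script(aggs):
    # Runs once on the coordinating node over the list of per-shard results.
    return sum(aggs)


def run_scripted_metric(shards):
    shard_results = []
    for docs in shards.values():
        agg = {}  # plays the role of the default `_agg` object
        init_script(agg)
        for doc in docs:
            map_script(agg, doc)
        shard_results.append(combine_script(agg))
    return reduce_script(shard_results)


profit = run_scripted_metric(shards)
print(profit)
```

Keeping the per-shard state in a list and only summing it in the combine step mirrors the example scripts; a real aggregation could equally keep a running total in the map step to reduce the state held on each shard.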
+ +==== Worked Example + +Imagine a situation where you index the following documents into an index with 2 shards: + +[source,js] +-------------------------------------------------- +$ curl -XPUT 'http://localhost:9200/transactions/stock/1' -d ' +{ + "type": "sale", + "amount": 80 +} +' + +$ curl -XPUT 'http://localhost:9200/transactions/stock/2' -d ' +{ + "type": "cost", + "amount": 10 +} +' + +$ curl -XPUT 'http://localhost:9200/transactions/stock/3' -d ' +{ + "type": "cost", + "amount": 30 +} +' + +$ curl -XPUT 'http://localhost:9200/transactions/stock/4' -d ' +{ + "type": "sale", + "amount": 130 +} +' +-------------------------------------------------- + +Let's say that documents 1 and 3 end up on shard A and documents 2 and 4 end up on shard B. The following is a breakdown of what the aggregation result is +at each stage of the example above. + +===== Before init_script + +No params object was specified so the default params object is used: + +[source,js] +-------------------------------------------------- +"params" : { + "_agg" : {} +} +-------------------------------------------------- + +===== After init_script + +This is run once on each shard before any document collection is performed, and so we will have a copy on each shard: + +Shard A:: ++ +[source,js] +-------------------------------------------------- +"params" : { + "_agg" : { + "transactions" : [] + } +} +-------------------------------------------------- + +Shard B:: ++ +[source,js] +-------------------------------------------------- +"params" : { + "_agg" : { + "transactions" : [] + } +} +-------------------------------------------------- + +===== After map_script + +Each shard collects its documents and runs the map_script on each document that is collected: + +Shard A:: ++ +[source,js] +-------------------------------------------------- +"params" : { + "_agg" : { + "transactions" : [ 80, -30 ] + } +} +-------------------------------------------------- + +Shard B:: ++ +[source,js]
+-------------------------------------------------- +"params" : { + "_agg" : { + "transactions" : [ -10, 130 ] + } +} +-------------------------------------------------- + +===== After combine_script + +The combine_script is executed on each shard after document collection is complete and reduces all the transactions down to a single profit figure for each +shard (by summing the values in the transactions array) which is passed back to the coordinating node: + +Shard A:: 50 +Shard B:: 120 + +===== After reduce_script + +The reduce_script receives an `_aggs` array containing the result of the combine script for each shard: + +[source,js] +-------------------------------------------------- +"_aggs" : [ + 50, + 120 +] +-------------------------------------------------- + +It reduces the responses for the shards down to a final overall profit figure (by summing the values) and returns this as the result of the aggregation to +produce the response: + +[source,js] +-------------------------------------------------- +{ + ... + + "aggregations": { + "profit": { + "value": 170 + } + } +} +-------------------------------------------------- + +==== Other Parameters + +[horizontal] +params:: Optional. An object whose contents will be passed as variables to the `init_script`, `map_script` and `combine_script`. This can be + useful to allow the user to control the behavior of the aggregation and for storing state between the scripts. If this is not specified, + the default is the equivalent of providing: ++ +[source,js] +-------------------------------------------------- +"params" : { + "_agg" : {} +} +-------------------------------------------------- +reduce_params:: Optional. An object whose contents will be passed as variables to the `reduce_script`. This can be useful to allow the user to control + the behavior of the reduce phase. If this is not specified the variable will be undefined in the reduce_script execution. +lang:: Optional. 
The script language used for the scripts. If this is not specified the default scripting language is used. +init_script_file:: Optional. Can be used in place of the `init_script` parameter to provide the script using a file. +init_script_id:: Optional. Can be used in place of the `init_script` parameter to provide the script using an indexed script. +map_script_file:: Optional. Can be used in place of the `map_script` parameter to provide the script using a file. +map_script_id:: Optional. Can be used in place of the `map_script` parameter to provide the script using an indexed script. +combine_script_file:: Optional. Can be used in place of the `combine_script` parameter to provide the script using a file. +combine_script_id:: Optional. Can be used in place of the `combine_script` parameter to provide the script using an indexed script. +reduce_script_file:: Optional. Can be used in place of the `reduce_script` parameter to provide the script using a file. +reduce_script_id:: Optional. Can be used in place of the `reduce_script` parameter to provide the script using an indexed script. + diff --git a/docs/reference/aggregations/metrics/stats-aggregation.asciidoc b/docs/reference/aggregations/metrics/stats-aggregation.asciidoc new file mode 100644 index 0000000000..7fbdecd601 --- /dev/null +++ b/docs/reference/aggregations/metrics/stats-aggregation.asciidoc @@ -0,0 +1,81 @@ +[[search-aggregations-metrics-stats-aggregation]] +=== Stats Aggregation + +A `multi-value` metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script. + +The stats that are returned consist of: `min`, `max`, `sum`, `count` and `avg`.
+ +Assuming the data consists of documents representing exams grades (between 0 and 100) of students + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "grades_stats" : { "stats" : { "field" : "grade" } } + } +} +-------------------------------------------------- + +The above aggregation computes the grades statistics over all documents. The aggregation type is `stats` and the `field` setting defines the numeric field of the documents the stats will be computed on. The above will return the following: + + +[source,js] +-------------------------------------------------- +{ + ... + + "aggregations": { + "grades_stats": { + "count": 6, + "min": 60, + "max": 98, + "avg": 78.5, + "sum": 471 + } + } +} +-------------------------------------------------- + +The name of the aggregation (`grades_stats` above) also serves as the key by which the aggregation result can be retrieved from the returned response. + +==== Script + +Computing the grades stats based on a script: + +[source,js] +-------------------------------------------------- +{ + ..., + + "aggs" : { + "grades_stats" : { "stats" : { "script" : "doc['grade'].value" } } + } +} +-------------------------------------------------- + +TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. + +===== Value Script + +It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use a value script to get the new stats: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + ... 
+ + "aggs" : { + "grades_stats" : { + "stats" : { + "field" : "grade", + "script" : "_value * correction", + "params" : { + "correction" : 1.2 + } + } + } + } + } +} +-------------------------------------------------- \ No newline at end of file diff --git a/docs/reference/aggregations/metrics/sum-aggregation.asciidoc b/docs/reference/aggregations/metrics/sum-aggregation.asciidoc new file mode 100644 index 0000000000..8857ff306e --- /dev/null +++ b/docs/reference/aggregations/metrics/sum-aggregation.asciidoc @@ -0,0 +1,79 @@ +[[search-aggregations-metrics-sum-aggregation]] +=== Sum Aggregation + +A `single-value` metrics aggregation that sums up numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script. + +Assuming the data consists of documents representing stock ticks, where each tick holds the change in the stock price from the previous tick. + +[source,js] +-------------------------------------------------- +{ + "query" : { + "filtered" : { + "query" : { "match_all" : {}}, + "filter" : { + "range" : { "timestamp" : { "from" : "now/1d+9.5h", "to" : "now/1d+16h" }} + } + } + }, + "aggs" : { + "intraday_return" : { "sum" : { "field" : "change" } } + } +} +-------------------------------------------------- + +The above aggregation sums up all changes in the today's trading stock ticks which accounts for the intraday return. The aggregation type is `sum` and the `field` setting defines the numeric field of the documents of which values will be summed up. The above will return the following: + + +[source,js] +-------------------------------------------------- +{ + ... 
+ + "aggregations": { + "intraday_return": { + "value": 2.18 + } + } +} +-------------------------------------------------- + +The name of the aggregation (`intraday_return` above) also serves as the key by which the aggregation result can be retrieved from the returned response. + +==== Script + +Computing the intraday return based on a script: + +[source,js] +-------------------------------------------------- +{ + ..., + + "aggs" : { + "intraday_return" : { "sum" : { "script" : "doc['change'].value" } } + } +} +-------------------------------------------------- + +TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. + +===== Value Script + +Computing the sum of squares over all stock tick changes: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + ... + + "aggs" : { + "daytime_return" : { + "sum" : { + "field" : "change", + "script" : "_value * _value" } + } + } + } +} +-------------------------------------------------- diff --git a/docs/reference/aggregations/metrics/tophits-aggregation.asciidoc b/docs/reference/aggregations/metrics/tophits-aggregation.asciidoc new file mode 100644 index 0000000000..b6e9c2caba --- /dev/null +++ b/docs/reference/aggregations/metrics/tophits-aggregation.asciidoc @@ -0,0 +1,275 @@ +[[search-aggregations-metrics-top-hits-aggregation]] +=== Top hits Aggregation + +A `top_hits` metric aggregator keeps track of the most relevant document being aggregated. This aggregator is intended +to be used as a sub aggregator, so that the top matching documents can be aggregated per bucket. + +The `top_hits` aggregator can effectively be used to group result sets by certain fields via a bucket aggregator. +One or more bucket aggregators determines by which properties a result set get sliced into. + +==== Options + +* `from` - The offset from the first result you want to fetch. 
+* `size` - The maximum number of top matching hits to return per bucket. By default the top three matching hits are returned. +* `sort` - How the top matching hits should be sorted. By default the hits are sorted by the score of the main query. + +==== Supported per hit features + +The top_hits aggregation returns regular search hits, because of this many per hit features can be supported: + +* <> +* <> +* <> +* <> +* <> +* <> +* <> + +==== Example + +In the following example we group the questions by tag and per tag we show the last active question. For each question +only the title field is being included in the source. + +[source,js] +-------------------------------------------------- +{ + "aggs": { + "top-tags": { + "terms": { + "field": "tags", + "size": 3 + }, + "aggs": { + "top_tag_hits": { + "top_hits": { + "sort": [ + { + "last_activity_date": { + "order": "desc" + } + } + ], + "_source": { + "include": [ + "title" + ] + }, + "size" : 1 + } + } + } + } + } +} +-------------------------------------------------- + +Possible response snippet: + +[source,js] +-------------------------------------------------- +"aggregations": { + "top-tags": { + "buckets": [ + { + "key": "windows-7", + "doc_count": 25365, + "top_tags_hits": { + "hits": { + "total": 25365, + "max_score": 1, + "hits": [ + { + "_index": "stack", + "_type": "question", + "_id": "602679", + "_score": 1, + "_source": { + "title": "Windows port opening" + }, + "sort": [ + 1370143231177 + ] + } + ] + } + } + }, + { + "key": "linux", + "doc_count": 18342, + "top_tags_hits": { + "hits": { + "total": 18342, + "max_score": 1, + "hits": [ + { + "_index": "stack", + "_type": "question", + "_id": "602672", + "_score": 1, + "_source": { + "title": "Ubuntu RFID Screensaver lock-unlock" + }, + "sort": [ + 1370143379747 + ] + } + ] + } + } + }, + { + "key": "windows", + "doc_count": 18119, + "top_tags_hits": { + "hits": { + "total": 18119, + "max_score": 1, + "hits": [ + { + "_index": "stack", + "_type": 
"question", + "_id": "602678", + "_score": 1, + "_source": { + "title": "If I change my computers date / time, what could be affected?" + }, + "sort": [ + 1370142868283 + ] + } + ] + } + } + } + ] + } +} +-------------------------------------------------- + +==== Field collapse example + +Field collapsing or result grouping is a feature that logically groups a result set into groups and per group returns +top documents. The ordering of the groups is determined by the relevancy of the first document in a group. In +Elasticsearch this can be implemented via a bucket aggregator that wraps a `top_hits` aggregator as sub-aggregator. + +In the example below we search across crawled webpages. For each webpage we store the body and the domain the webpage +belongs to. By defining a `terms` aggregator on the `domain` field we group the result set of webpages by domain. The +`top_hits` aggregator is then defined as sub-aggregator, so that the top matching hits are collected per bucket. + +Also a `max` aggregator is defined which is used by the `terms` aggregator's order feature to return the buckets by +relevancy order of the most relevant document in a bucket. + +[source,js] +-------------------------------------------------- +{ + "query": { + "match": { + "body": "elections" + } + }, + "aggs": { + "top-sites": { + "terms": { + "field": "domain", + "order": { + "top_hit": "desc" + } + }, + "aggs": { + "top_tags_hits": { + "top_hits": {} + }, + "top_hit" : { + "max": { + "script": "_score" + } + } + } + } + } +} +-------------------------------------------------- + +At the moment the `max` (or `min`) aggregator is needed to make sure the buckets from the `terms` aggregator are +ordered according to the score of the most relevant webpage per domain. The `top_hits` aggregator isn't a metric aggregator +and therefore can't be used in the `order` option of the `terms` aggregator.
+ +==== top_hits support in a nested or reverse_nested aggregator + +If the `top_hits` aggregator is wrapped in a `nested` or `reverse_nested` aggregator then nested hits are returned. +Nested hits are in a sense hidden mini documents that are part of a regular document where in the mapping a nested field type +has been configured. The `top_hits` aggregator has the ability to un-hide these documents if it is wrapped in a `nested` +or `reverse_nested` aggregator. Read more about nested in the <>. + +If a nested type has been configured a single document is actually indexed as multiple Lucene documents and they share +the same id. In order to determine the identity of a nested hit there is more needed than just the id, so that is why +nested hits also include their nested identity. The nested identity is kept under the `_nested` field in the search hit +and includes the array field and the offset in the array field the nested hit belongs to. The offset is zero based. + +Top hits response snippet with a nested hit, which resides in the third slot of array field `nested_field1` in document with id `1`: + +[source,js] +-------------------------------------------------- +... +"hits": { + "total": 25365, + "max_score": 1, + "hits": [ + { + "_index": "a", + "_type": "b", + "_id": "1", + "_score": 1, + "_nested" : { + "field" : "nested_field1", + "offset" : 2 + }, + "_source": ... + }, + ... + ] +} +... +-------------------------------------------------- + +If `_source` is requested then just the part of the source of the nested object is returned, not the entire source of the document. +Also stored fields on the *nested* inner object level are accessible via `top_hits` aggregator residing in a `nested` or `reverse_nested` aggregator. + +Only nested hits will have a `_nested` field in the hit; non-nested (regular) hits will not have a `_nested` field. + +The information in `_nested` can also be used to parse the original source somewhere else if `_source` isn't enabled.
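As an illustration of that last point, the following Python sketch uses a `_nested` identity to pick the matching nested object out of a copy of the original document held elsewhere. It is a minimal sketch under assumptions: `resolve_nested` and the sample documents are hypothetical, each `field` is assumed to be a plain key relative to the enclosing object, and an inner `_nested` entry (used when nested types are several levels deep) is followed in the same way.

```python
def resolve_nested(source, nested):
    """Follow a `_nested` identity (field name + zero-based offset) into `source`."""
    current = source
    while nested is not None:
        # Step into the array field, then pick the slot given by the offset.
        current = current[nested["field"]][nested["offset"]]
        # A deeper level of nesting is expressed as an inner `_nested` entry.
        nested = nested.get("_nested")
    return current


# Hypothetical original document, kept outside of Elasticsearch.
original = {"nested_field1": [{"n": "first"}, {"n": "second"}, {"n": "third"}]}
hit_identity = {"field": "nested_field1", "offset": 2}
print(resolve_nested(original, hit_identity))

# Hypothetical two-level case: a grandchild inside the second child.
deep_original = {
    "nested_child_field": [
        {"nested_grand_child_field": [{"v": "a"}]},
        {"nested_grand_child_field": [{"v": "b"}, {"v": "c"}]},
    ]
}
deep_identity = {
    "field": "nested_child_field",
    "offset": 1,
    "_nested": {"field": "nested_grand_child_field", "offset": 0},
}
print(resolve_nested(deep_original, deep_identity))
```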
+ +If there are multiple levels of nested object types defined in mappings then the `_nested` information can also be hierarchical +in order to express the identity of nested hits that are two layers deep or more. + +In the example below a nested hit resides in the first slot of the field `nested_grand_child_field` which then resides in +the second slot of the `nested_child_field` field: + +[source,js] +-------------------------------------------------- +... +"hits": { + "total": 2565, + "max_score": 1, + "hits": [ + { + "_index": "a", + "_type": "b", + "_id": "1", + "_score": 1, + "_nested" : { + "field" : "nested_child_field", + "offset" : 1, + "_nested" : { + "field" : "nested_grand_child_field", + "offset" : 0 + } + }, + "_source": ... + }, + ... + ] +} +... +-------------------------------------------------- \ No newline at end of file diff --git a/docs/reference/aggregations/metrics/valuecount-aggregation.asciidoc b/docs/reference/aggregations/metrics/valuecount-aggregation.asciidoc new file mode 100644 index 0000000000..ed5e23ee33 --- /dev/null +++ b/docs/reference/aggregations/metrics/valuecount-aggregation.asciidoc @@ -0,0 +1,51 @@ +[[search-aggregations-metrics-valuecount-aggregation]] +=== Value Count Aggregation + +A `single-value` metrics aggregation that counts the number of values that are extracted from the aggregated documents. +These values can be extracted either from specific fields in the documents, or be generated by a provided script. Typically, +this aggregator will be used in conjunction with other single-value aggregations. For example, when computing the `avg` +one might be interested in the number of values the average is computed over. + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "grades_count" : { "value_count" : { "field" : "grade" } } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + ...
+ + "aggregations": { + "grades_count": { + "value": 10 + } + } +} +-------------------------------------------------- + +The name of the aggregation (`grades_count` above) also serves as the key by which the aggregation result can be +retrieved from the returned response. + +==== Script + +Counting the values generated by a script: + +[source,js] +-------------------------------------------------- +{ + ..., + + "aggs" : { + "grades_count" : { "value_count" : { "script" : "doc['grade'].value" } } + } +} +-------------------------------------------------- + +TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. diff --git a/docs/reference/aggregations/misc.asciidoc b/docs/reference/aggregations/misc.asciidoc new file mode 100644 index 0000000000..f494d5291c --- /dev/null +++ b/docs/reference/aggregations/misc.asciidoc @@ -0,0 +1,76 @@ + +[[caching-heavy-aggregations]] +== Caching heavy aggregations + +Frequently used aggregations (e.g. for display on the home page of a website) +can be cached for faster responses. These cached results are the same results +that would be returned by an uncached aggregation -- you will never get stale +results. + +See <> for more details. + +[[returning-only-agg-results]] +== Returning only aggregation results + +There are many occasions when aggregations are required but search hits are not. For these cases the hits can be ignored by +setting `size=0`. For example: + +[source,js] +-------------------------------------------------- +$ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{ + "size": 0, + "aggregations": { + "my_agg": { + "terms": { + "field": "text" + } + } + } +} +' +-------------------------------------------------- + +Setting `size` to `0` avoids executing the fetch phase of the search making the request more efficient. 
+ +[[agg-metadata]] +== Aggregation Metadata + +You can associate a piece of metadata with individual aggregations at request time that will be returned in place +at response time. + +Consider this example where we want to associate the color blue with our `terms` aggregation. + +[source,js] +-------------------------------------------------- +{ + ... + "aggs": { + "titles": { + "terms": { + "field": "title" + }, + "meta": { + "color": "blue" + } + } + } +} +-------------------------------------------------- + +Then that piece of metadata will be returned in place for our `titles` terms aggregation: + +[source,js] +-------------------------------------------------- +{ + ... + "aggregations": { + "titles": { + "meta": { + "color" : "blue" + }, + "buckets": [ + ] + } + } +} +-------------------------------------------------- \ No newline at end of file diff --git a/docs/reference/aggregations/reducer.asciidoc b/docs/reference/aggregations/reducer.asciidoc new file mode 100644 index 0000000000..2ce379cd58 --- /dev/null +++ b/docs/reference/aggregations/reducer.asciidoc @@ -0,0 +1,160 @@ +[[search-aggregations-reducer]] + +== Reducer Aggregations + +coming[2.0.0] + +experimental[] + +Reducer aggregations work on the outputs produced from other aggregations rather than from document sets, adding +information to the output tree. There are many different types of reducer, each computing different information from +other aggregations, but these types can be broken down into two families: + +_Parent_:: + A family of reducer aggregations that is provided with the output of its parent aggregation and is able + to compute new buckets or new aggregations to add to existing buckets. + +_Sibling_:: + Reducer aggregations that are provided with the output of a sibling aggregation and are able to compute a + new aggregation which will be at the same level as the sibling aggregation.
+ +Reducer aggregations can reference the aggregations they need to perform their computation by using the `buckets_path` +parameter to indicate the paths to the required metrics. The syntax for defining these paths can be found in the +<<bucket-path-syntax>> section below. + +Reducer aggregations cannot have sub-aggregations but, depending on the type, they can reference another reducer in the `buckets_path` +allowing reducers to be chained. For example, you can chain together two derivatives to calculate the second derivative +(i.e. a derivative of a derivative). + +NOTE: Because reducer aggregations only add to the output, when chaining reducer aggregations the output of each reducer will be +included in the final output. + +[[bucket-path-syntax]] +[float] +=== `buckets_path` Syntax + +Most reducers require another aggregation as their input. The input aggregation is defined via the `buckets_path` +parameter, which follows a specific format: + +-------------------------------------------------- +AGG_SEPARATOR := '>' +METRIC_SEPARATOR := '.' +AGG_NAME := <the name of the aggregation> +METRIC := <the name of the metric (in case of multi-value metrics aggregation)> +PATH := <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR><METRIC>] +-------------------------------------------------- + +For example, the path `"my_bucket>my_stats.avg"` points to the `avg` value in the `"my_stats"` metric, which is +contained in the `"my_bucket"` bucket aggregation. + +Paths are relative from the position of the reducer; they are not absolute paths, and the path cannot go back "up" the +aggregation tree.
For example, this moving average is embedded inside a date_histogram and refers to a "sibling" +metric `"the_sum"`: + +[source,js] +-------------------------------------------------- +{ + "my_date_histo":{ + "date_histogram":{ + "field":"timestamp", + "interval":"day" + }, + "aggs":{ + "the_sum":{ + "sum":{ "field": "lemmings" } <1> + }, + "the_movavg":{ + "moving_avg":{ "buckets_path": "the_sum" } <2> + } + } + } +} +-------------------------------------------------- +<1> The metric is called `"the_sum"` +<2> The `buckets_path` refers to the metric via a relative path `"the_sum"` + +`buckets_path` is also used for Sibling reducer aggregations, where the aggregation is "next" to a series of buckets +instead of embedded "inside" them. For example, the `max_bucket` aggregation uses the `buckets_path` to specify +a metric embedded inside a sibling aggregation: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "sales_per_month" : { + "date_histogram" : { + "field" : "date", + "interval" : "month" + }, + "aggs": { + "sales": { + "sum": { + "field": "price" + } + } + } + }, + "max_monthly_sales": { + "max_bucket": { + "buckets_path": "sales_per_month>sales" <1> + } + } + } +} +-------------------------------------------------- +<1> `buckets_path` instructs this max_bucket aggregation that we want the maximum value of the `sales` aggregation in the +`sales_per_month` date histogram. + +[float] +==== Special Paths + +Instead of pathing to a metric, `buckets_path` can use a special `"_count"` path. This instructs +the reducer to use the document count as its input.
For example, a moving average can be calculated on the document +count of each bucket, instead of a specific metric: + +[source,js] +-------------------------------------------------- +{ + "my_date_histo":{ + "date_histogram":{ + "field":"timestamp", + "interval":"day" + }, + "aggs":{ + "the_movavg":{ + "moving_avg":{ "buckets_path": "_count" } <1> + } + } + } +} +-------------------------------------------------- +<1> By using `_count` instead of a metric name, we can calculate the moving average of document counts in the histogram + + +[float] +=== Dealing with gaps in the data + +There are a couple of reasons why the data output by the enclosing histogram may have gaps: + +* There are no documents matching the query for some buckets +* The data for a metric is missing in all of the documents falling into a bucket (this is most likely with either a small interval +on the enclosing histogram or with a query matching only a small number of documents) + +Where there is no data available in a bucket for a given metric it presents a problem for calculating the derivative value for both +the current bucket and the next bucket. The derivative reducer aggregation has a `gap policy` parameter to define what the behavior +should be when a gap in the data is found. There are currently two options for controlling the gap policy: + +_ignore_:: + This option will not produce a derivative value for any buckets where the value in the current or previous bucket is + missing + +_insert_zeros_:: + This option will assume the missing value is `0` and calculate the derivative with the value `0`.
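The effect of the two options can be sketched in Python. This is an illustrative model of the behavior described above, not the Elasticsearch implementation: the `derivative` function, its `gap_policy` argument name, and the use of `None` to mark a bucket without data are all assumptions made for the example.

```python
def derivative(values, gap_policy="ignore"):
    """Difference between consecutive bucket values; None marks a gap in the data."""
    if gap_policy == "insert_zeros":
        # Treat every missing value as 0 before differencing.
        values = [0 if v is None else v for v in values]
    elif gap_policy != "ignore":
        raise ValueError("unknown gap policy: %r" % gap_policy)

    result = [None]  # the first bucket has no previous bucket to compare with
    for previous, current in zip(values, values[1:]):
        if previous is None or current is None:
            # "ignore": no derivative when the current or previous value is missing.
            result.append(None)
        else:
            result.append(current - previous)
    return result


monthly_sales = [550, None, 375]  # hypothetical values, second bucket has no data
print(derivative(monthly_sales, "ignore"))
print(derivative(monthly_sales, "insert_zeros"))
```

With the ignore option a single gap removes the derivative for two buckets (the gap itself and the bucket after it), whereas inserting zeros keeps a value in every bucket at the cost of an artificial drop and recovery around the gap.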
+ + + + +include::reducer/derivative-aggregation.asciidoc[] +include::reducer/max-bucket-aggregation.asciidoc[] +include::reducer/min-bucket-aggregation.asciidoc[] +include::reducer/movavg-aggregation.asciidoc[] diff --git a/docs/reference/aggregations/reducer/derivative-aggregation.asciidoc b/docs/reference/aggregations/reducer/derivative-aggregation.asciidoc new file mode 100644 index 0000000000..1780105541 --- /dev/null +++ b/docs/reference/aggregations/reducer/derivative-aggregation.asciidoc @@ -0,0 +1,196 @@ +[[search-aggregations-reducer-derivative-aggregation]] +=== Derivative Aggregation + +A parent reducer aggregation which calculates the derivative of a specified metric in a parent histogram (or date_histogram) +aggregation. The specified metric must be numeric and the enclosing histogram must have `min_doc_count` set to `0` (default +for `histogram` aggregations). + +==== Syntax + +A `derivative` aggregation looks like this in isolation: + +[source,js] +-------------------------------------------------- +{ + "derivative": { + "buckets_path": "the_sum" + } +} +-------------------------------------------------- + +.`derivative` Parameters +|=== +|Parameter Name |Description |Required |Default Value +|`buckets_path` |Path to the metric of interest (see <<bucket-path-syntax>> for more details) |Required | +|=== + + +==== First Order Derivative + +The following snippet calculates the derivative of the total monthly `sales`: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "sales_per_month" : { + "date_histogram" : { + "field" : "date", + "interval" : "month" + }, + "aggs": { + "sales": { + "sum": { + "field": "price" + } + }, + "sales_deriv": { + "derivative": { + "buckets_path": "sales" <1> + } + } + } + } + } +} +-------------------------------------------------- + +<1> `buckets_path` instructs this derivative aggregation to use the output of the `sales` aggregation for the derivative + +And the following may be the response: + +[source,js]
+-------------------------------------------------- +{ + "aggregations": { + "sales_per_month": { + "buckets": [ + { + "key_as_string": "2015/01/01 00:00:00", + "key": 1420070400000, + "doc_count": 3, + "sales": { + "value": 550 + } <1> + }, + { + "key_as_string": "2015/02/01 00:00:00", + "key": 1422748800000, + "doc_count": 2, + "sales": { + "value": 60 + }, + "sales_deriv": { + "value": -490 <2> + } + }, + { + "key_as_string": "2015/03/01 00:00:00", + "key": 1425168000000, + "doc_count": 2, <3> + "sales": { + "value": 375 + }, + "sales_deriv": { + "value": 315 + } + } + ] + } + } +} +-------------------------------------------------- + +<1> No derivative for the first bucket since we need at least 2 data points to calculate the derivative +<2> Derivative value units are implicitly defined by the `sales` aggregation and the parent histogram so in this case the units +would be $/month assuming the `price` field has units of $. +<3> The number of documents in the bucket are represented by the `doc_count` f + +==== Second Order Derivative + +A second order derivative can be calculated by chaining the derivative reducer aggregation onto the result of another derivative +reducer aggregation as in the following example which will calculate both the first and the second order derivative of the total +monthly sales: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "sales_per_month" : { + "date_histogram" : { + "field" : "date", + "interval" : "month" + }, + "aggs": { + "sales": { + "sum": { + "field": "price" + } + }, + "sales_deriv": { + "derivative": { + "buckets_path": "sales" + } + }, + "sales_2nd_deriv": { + "derivative": { + "buckets_path": "sales_deriv" <1> + } + } + } + } + } +} +-------------------------------------------------- + +<1> `buckets_path` for the second derivative points to the name of the first derivative + +And the following may be the response: + +[source,js] +--------------------------------------------------
+{ + "aggregations": { + "sales_per_month": { + "buckets": [ + { + "key_as_string": "2015/01/01 00:00:00", + "key": 1420070400000, + "doc_count": 3, + "sales": { + "value": 550 + } <1> + }, + { + "key_as_string": "2015/02/01 00:00:00", + "key": 1422748800000, + "doc_count": 2, + "sales": { + "value": 60 + }, + "sales_deriv": { + "value": -490 + } <1> + }, + { + "key_as_string": "2015/03/01 00:00:00", + "key": 1425168000000, + "doc_count": 2, + "sales": { + "value": 375 + }, + "sales_deriv": { + "value": 315 + }, + "sales_2nd_deriv": { + "value": 805 + } + } + ] + } + } +} +-------------------------------------------------- +<1> No second derivative for the first two buckets since we need at least 2 data points from the first derivative to calculate the +second derivative + diff --git a/docs/reference/aggregations/reducer/max-bucket-aggregation.asciidoc b/docs/reference/aggregations/reducer/max-bucket-aggregation.asciidoc new file mode 100644 index 0000000000..e1a5e9aa38 --- /dev/null +++ b/docs/reference/aggregations/reducer/max-bucket-aggregation.asciidoc @@ -0,0 +1,101 @@ +[[search-aggregations-reducer-max-bucket-aggregation]] +=== Max Bucket Aggregation + +A sibling reducer aggregation which identifies the bucket(s) with the maximum value of a specified metric in a sibling aggregation +and outputs both the value and the key(s) of the bucket(s). The specified metric must be numeric and the sibling aggregation must +be a multi-bucket aggregation.
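The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the semantics, not Elasticsearch's implementation: the function name `max_bucket`, the plain-dict bucket layout, and the behavior for an empty input are all assumptions made for the example.

```python
def max_bucket(buckets, metric):
    """Return the key(s) and value of the bucket(s) with the largest metric.

    Hypothetical helper: `buckets` is a list of dicts shaped like the
    entries of a date_histogram response, `metric` is the name of a
    numeric sub-aggregation inside each bucket.
    """
    # Ignore buckets that do not carry the metric at all.
    candidates = [b for b in buckets if metric in b]
    if not candidates:
        # Assumption: with nothing to compare we report no keys and no value.
        return {"keys": [], "value": None}
    best = max(b[metric]["value"] for b in candidates)
    # Several buckets may share the maximum, so keys is always a list.
    keys = [b["key_as_string"] for b in candidates if b[metric]["value"] == best]
    return {"keys": keys, "value": best}


buckets = [
    {"key_as_string": "2015/01/01 00:00:00", "sales": {"value": 550}},
    {"key_as_string": "2015/02/01 00:00:00", "sales": {"value": 60}},
    {"key_as_string": "2015/03/01 00:00:00", "sales": {"value": 375}},
]
result = max_bucket(buckets, "sales")
```

With the monthly sales figures used throughout this page, `result` holds the January key and the value 550. Returning a list of keys rather than a single key is what lets ties be reported.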
+ +==== Syntax + +A `max_bucket` aggregation looks like this in isolation: + +[source,js] +-------------------------------------------------- +{ + "max_bucket": { + "buckets_path": "the_sum" + } +} +-------------------------------------------------- + +.`max_bucket` Parameters +|=== +|Parameter Name |Description |Required |Default Value +|`buckets_path` |The path to the buckets we wish to find the maximum for (see <> for more + details) |Required | +|=== + +The following snippet calculates the maximum of the total monthly `sales`: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "sales_per_month" : { + "date_histogram" : { + "field" : "date", + "interval" : "month" + }, + "aggs": { + "sales": { + "sum": { + "field": "price" + } + } + } + }, + "max_monthly_sales": { + "max_bucket": { + "buckets_path": "sales_per_month>sales" <1> + } + } + } +} +-------------------------------------------------- +<1> `buckets_path` instructs this `max_bucket` aggregation that we want the maximum value of the `sales` aggregation in the +`sales_per_month` date histogram.
+ +And the following may be the response: + +[source,js] +-------------------------------------------------- +{ + "aggregations": { + "sales_per_month": { + "buckets": [ + { + "key_as_string": "2015/01/01 00:00:00", + "key": 1420070400000, + "doc_count": 3, + "sales": { + "value": 550 + } + }, + { + "key_as_string": "2015/02/01 00:00:00", + "key": 1422748800000, + "doc_count": 2, + "sales": { + "value": 60 + } + }, + { + "key_as_string": "2015/03/01 00:00:00", + "key": 1425168000000, + "doc_count": 2, + "sales": { + "value": 375 + } + } + ] + }, + "max_monthly_sales": { + "keys": ["2015/01/01 00:00:00"], <1> + "value": 550 + } + } +} +-------------------------------------------------- + +<1> `keys` is an array of strings since the maximum value may be present in multiple buckets + diff --git a/docs/reference/aggregations/reducer/min-bucket-aggregation.asciidoc b/docs/reference/aggregations/reducer/min-bucket-aggregation.asciidoc new file mode 100644 index 0000000000..1ea26c17a2 --- /dev/null +++ b/docs/reference/aggregations/reducer/min-bucket-aggregation.asciidoc @@ -0,0 +1,102 @@ +[[search-aggregations-reducer-min-bucket-aggregation]] +=== Min Bucket Aggregation + +A sibling reducer aggregation which identifies the bucket(s) with the minimum value of a specified metric in a sibling aggregation +and outputs both the value and the key(s) of the bucket(s). The specified metric must be numeric and the sibling aggregation must +be a multi-bucket aggregation. 
+ +==== Syntax + +A `min_bucket` aggregation looks like this in isolation: + +[source,js] +-------------------------------------------------- +{ + "min_bucket": { + "buckets_path": "the_sum" + } +} +-------------------------------------------------- + +.`min_bucket` Parameters +|=== +|Parameter Name |Description |Required |Default Value +|`buckets_path` |Path to the metric of interest (see <> for more details) |Required | +|=== + + +The following snippet calculates the minimum of the total monthly `sales`: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "sales_per_month" : { + "date_histogram" : { + "field" : "date", + "interval" : "month" + }, + "aggs": { + "sales": { + "sum": { + "field": "price" + } + } + } + }, + "min_monthly_sales": { + "min_bucket": { + "buckets_path": "sales_per_month>sales" <1> + } + } + } +} +-------------------------------------------------- + +<1> `buckets_path` instructs this `min_bucket` aggregation that we want the minimum value of the `sales` aggregation in the +`sales_per_month` date histogram.
+ +And the following may be the response: + +[source,js] +-------------------------------------------------- +{ + "aggregations": { + "sales_per_month": { + "buckets": [ + { + "key_as_string": "2015/01/01 00:00:00", + "key": 1420070400000, + "doc_count": 3, + "sales": { + "value": 550 + } + }, + { + "key_as_string": "2015/02/01 00:00:00", + "key": 1422748800000, + "doc_count": 2, + "sales": { + "value": 60 + } + }, + { + "key_as_string": "2015/03/01 00:00:00", + "key": 1425168000000, + "doc_count": 2, + "sales": { + "value": 375 + } + } + ] + }, + "min_monthly_sales": { + "keys": ["2015/02/01 00:00:00"], <1> + "value": 60 + } + } +} +-------------------------------------------------- + +<1> `keys` is an array of strings since the minimum value may be present in multiple buckets + diff --git a/docs/reference/aggregations/reducer/movavg-aggregation.asciidoc b/docs/reference/aggregations/reducer/movavg-aggregation.asciidoc new file mode 100644 index 0000000000..18cf98d263 --- /dev/null +++ b/docs/reference/aggregations/reducer/movavg-aggregation.asciidoc @@ -0,0 +1,274 @@ +[[search-aggregations-reducers-movavg-reducer]] +=== Moving Average Aggregation + +Given an ordered series of data, the Moving Average aggregation will slide a window across the data and emit the average +value of that window. For example, given the data `[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]`, we can calculate a simple moving +average with a window size of `5` as follows: + +- (1 + 2 + 3 + 4 + 5) / 5 = 3 +- (2 + 3 + 4 + 5 + 6) / 5 = 4 +- (3 + 4 + 5 + 6 + 7) / 5 = 5 +- etc + +Moving averages are a simple method to smooth sequential data. Moving averages are typically applied to time-based data, +such as stock prices or server metrics. The smoothing can be used to eliminate high frequency fluctuations or random noise, +which allows lower frequency trends, such as seasonality, to be more easily visualized.
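The sliding-window arithmetic above can be reproduced with a short Python sketch. It is a plain illustration of a simple moving average, not the aggregation's actual code: the function name `simple_moving_average` is hypothetical, and the sketch assumes that only full windows produce a value, which may differ from how the aggregation treats the start of a series.

```python
def simple_moving_average(values, window):
    """Average every full window of `window` consecutive values."""
    if window <= 0:
        raise ValueError("window must be a positive integer")
    # One output per position where a complete window fits in the series.
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
averages = simple_moving_average(data, 5)
```

For the ten-value series used in the text, `averages` begins with 3, 4 and 5, matching the three windows worked out by hand, and continues until the window reaches the end of the data.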
+ +==== Syntax + +A `moving_avg` aggregation looks like this in isolation: + +[source,js] +-------------------------------------------------- +{ + "moving_avg": { + "buckets_path": "the_sum", + "model": "double_exp", + "window": 5, + "gap_policy": "insert_zero", + "settings": { + "alpha": 0.8 + } + } +} +-------------------------------------------------- + +.`moving_avg` Parameters +|=== +|Parameter Name |Description |Required |Default Value +|`buckets_path` |Path to the metric of interest (see <> for more details) |Required | +|`model` |The moving average weighting model that we wish to use |Optional |`simple` +|`gap_policy` |Determines what should happen when a gap in the data is encountered. |Optional |`insert_zero` +|`window` |The size of the window to "slide" across the histogram. |Optional |`5` +|`settings` |Model-specific settings, the contents of which differ depending on the model specified. |Optional | +|=== + +`moving_avg` aggregations must be embedded inside of a `histogram` or `date_histogram` aggregation. They can be +embedded like any other metric aggregation: + +[source,js] +-------------------------------------------------- +{ + "my_date_histo":{ <1> + "date_histogram":{ + "field":"timestamp", + "interval":"day" + }, + "aggs":{ + "the_sum":{ + "sum":{ "field": "lemmings" } <2> + }, + "the_movavg":{ + "moving_avg":{ "buckets_path": "the_sum" } <3> + } + } + } +} +-------------------------------------------------- +<1> A `date_histogram` named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals +<2> A `sum` metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc) +<3> Finally, we specify a `moving_avg` aggregation which uses "the_sum" metric as its input. + +Moving averages are built by first specifying a `histogram` or `date_histogram` over a field. You can then optionally +add normal metrics, such as a `sum`, inside of that histogram. Finally, the `moving_avg` is embedded inside the histogram.
+The `buckets_path` parameter is then used to "point" at one of the sibling metrics inside of the histogram (see +<> for a description of the syntax for `buckets_path`). + + +==== Models + +The `moving_avg` aggregation includes four different moving average "models". The main difference is how the values in the +window are weighted. As data-points become "older" in the window, they may be weighted differently. This will +affect the final average for that window. + +Models are specified using the `model` parameter. Some models may have optional configurations which are specified inside +the `settings` parameter. + +===== Simple + +The `simple` model calculates the sum of all values in the window, then divides by the size of the window. It is effectively +a simple arithmetic mean of the window. The simple model does not perform any time-dependent weighting, which means +the values from a `simple` moving average tend to "lag" behind the real data. + +[source,js] +-------------------------------------------------- +{ + "the_movavg":{ + "moving_avg":{ + "buckets_path": "the_sum", + "model" : "simple" + } + } +} +-------------------------------------------------- + +A `simple` model has no special settings to configure. + +The window size can change the behavior of the moving average. For example, a small window (`"window": 10`) will closely +track the data and only smooth out small scale fluctuations: + +[[movavg_10window]] +.Moving average with window of size 10 +image::images/reducers_movavg/movavg_10window.png[] + +In contrast, a `simple` moving average with a larger window (`"window": 100`) will smooth out all higher-frequency fluctuations, +leaving only low-frequency, long term trends.
It also tends to "lag" behind the actual data by a substantial amount: + +[[movavg_100window]] +.Moving average with window of size 100 +image::images/reducers_movavg/movavg_100window.png[] + + +===== Linear + +The `linear` model assigns a linear weighting to points in the series, such that "older" datapoints (e.g. those at +the beginning of the window) contribute linearly less to the total average. The linear weighting helps reduce +the "lag" behind the data's mean, since older points have less influence. + +[source,js] +-------------------------------------------------- +{ + "the_movavg":{ + "moving_avg":{ + "buckets_path": "the_sum", + "model" : "linear" + } + } +} +-------------------------------------------------- + +A `linear` model has no special settings to configure. + +Like the `simple` model, window size can change the behavior of the moving average. For example, a small window (`"window": 10`) +will closely track the data and only smooth out small scale fluctuations: + +[[linear_10window]] +.Linear moving average with window of size 10 +image::images/reducers_movavg/linear_10window.png[] + +In contrast, a `linear` moving average with a larger window (`"window": 100`) will smooth out all higher-frequency fluctuations, +leaving only low-frequency, long term trends. It also tends to "lag" behind the actual data by a substantial amount, +although typically less than the `simple` model: + +[[linear_100window]] +.Linear moving average with window of size 100 +image::images/reducers_movavg/linear_100window.png[] + +===== Single Exponential + +The `single_exp` model is similar to the `linear` model, except older data-points become exponentially less important, +rather than linearly less important. The speed at which the importance decays can be controlled with an `alpha` +setting. Small values make the weight decay slowly, which provides greater smoothing and takes into account a larger +portion of the window.
Larger values make the weight decay quickly, which reduces the impact of older values on the +moving average. This tends to make the moving average track the data more closely but with less smoothing. + +The default value of `alpha` is `0.5`, and the setting accepts any float from 0-1 inclusive. + +[source,js] +-------------------------------------------------- +{ + "the_movavg":{ + "moving_avg":{ + "buckets_path": "the_sum", + "model" : "single_exp", + "settings" : { + "alpha" : 0.5 + } + } + } +} +-------------------------------------------------- + + + +[[single_0.2alpha]] +.Single Exponential moving average with window of size 10, alpha = 0.2 +image::images/reducers_movavg/single_0.2alpha.png[] + +[[single_0.7alpha]] +.Single Exponential moving average with window of size 10, alpha = 0.7 +image::images/reducers_movavg/single_0.7alpha.png[] + +===== Double Exponential + +The `double_exp` model, sometimes called the "Holt's Linear Trend" model, incorporates a second exponential term which +tracks the data's trend. Single exponential does not perform well when the data has an underlying linear trend. The +double exponential model calculates two values internally: a "level" and a "trend". + +The level calculation is similar to `single_exp`, and is an exponentially weighted view of the data. The difference is +that the previously smoothed value is used instead of the raw value, which allows it to stay close to the original series. +The trend calculation looks at the difference between the current and last value (i.e. the slope, or trend, of the +smoothed data). The trend value is also exponentially weighted. + +Values are produced by multiplying the level and trend components. + +The default value of `alpha` and `beta` is `0.5`, and the settings accept any float from 0-1 inclusive.
+ +[source,js] +-------------------------------------------------- +{ + "the_movavg":{ + "moving_avg":{ + "buckets_path": "the_sum", + "model" : "double_exp", + "settings" : { + "alpha" : 0.5, + "beta" : 0.5 + } + } + } +} +-------------------------------------------------- + +In practice, the `alpha` value behaves in `double_exp` much as it does in `single_exp`: small values produce more smoothing +and more lag, while larger values produce closer tracking and less lag. The effect of `beta` is often difficult +to see. Small values emphasize long-term trends (such as a constant linear trend in the whole series), while larger +values emphasize short-term trends. This will become more apparent when you are predicting values. + +[[double_0.2beta]] +.Double Exponential moving average with window of size 100, alpha = 0.5, beta = 0.2 +image::images/reducers_movavg/double_0.2beta.png[] + +[[double_0.7beta]] +.Double Exponential moving average with window of size 100, alpha = 0.5, beta = 0.7 +image::images/reducers_movavg/double_0.7beta.png[] + +==== Prediction + +All the moving average models support a "prediction" mode, which will attempt to extrapolate into the future given the +current smoothed, moving average. Depending on the model and parameters, these predictions may or may not be accurate. + +Predictions are enabled by adding a `predict` parameter to any moving average aggregation, specifying the number of +predictions you would like appended to the end of the series.
These predictions will be spaced out at the same interval +as your buckets: + +[source,js] +-------------------------------------------------- +{ + "the_movavg":{ + "moving_avg":{ + "buckets_path": "the_sum", + "model" : "simple", + "predict" : 10 + } + } +} +-------------------------------------------------- + +The `simple`, `linear` and `single_exp` models all produce "flat" predictions: they essentially converge on the mean +of the last value in the series, producing a flat line: + +[[simple_prediction]] +.Simple moving average with window of size 10, predict = 50 +image::images/reducers_movavg/simple_prediction.png[] + +In contrast, the `double_exp` model can extrapolate based on local or global constant trends. If we set a high `beta` +value, we can extrapolate based on local constant trends (in this case the predictions head down, because the data at the end +of the series was heading in a downward direction): + +[[double_prediction_local]] +.Double Exponential moving average with window of size 100, predict = 20, alpha = 0.5, beta = 0.8 +image::images/reducers_movavg/double_prediction_local.png[] + +In contrast, if we choose a small `beta`, the predictions are based on the global constant trend.
In this series, the +global trend is slightly positive, so the prediction makes a sharp u-turn and begins a positive slope: + +[[double_prediction_global]] +.Double Exponential moving average with window of size 100, predict = 20, alpha = 0.5, beta = 0.1 +image::images/reducers_movavg/double_prediction_global.png[] diff --git a/docs/reference/index.asciidoc b/docs/reference/index.asciidoc index 1e63d18a4d..696fbaa3bc 100644 --- a/docs/reference/index.asciidoc +++ b/docs/reference/index.asciidoc @@ -18,6 +18,8 @@ include::docs.asciidoc[] include::search.asciidoc[] +include::aggregations.asciidoc[] + include::indices.asciidoc[] include::cat.asciidoc[] diff --git a/docs/reference/search.asciidoc b/docs/reference/search.asciidoc index 79d3c7a93f..b71a0dfe46 100644 --- a/docs/reference/search.asciidoc +++ b/docs/reference/search.asciidoc @@ -85,8 +85,6 @@ include::search/search-template.asciidoc[] include::search/search-shards.asciidoc[] -include::search/aggregations.asciidoc[] - include::search/facets.asciidoc[] include::search/suggesters.asciidoc[] diff --git a/docs/reference/search/aggregations.asciidoc b/docs/reference/search/aggregations.asciidoc deleted file mode 100644 index cf4b4348ed..0000000000 --- a/docs/reference/search/aggregations.asciidoc +++ /dev/null @@ -1,234 +0,0 @@ -[[search-aggregations]] -== Aggregations - -The aggregations framework helps provide aggregated data based on a search query. It is based on simple building blocks -called aggregations, that can be composed in order to build complex summaries of the data. - -An aggregation can be seen as a _unit-of-work_ that builds analytic information over a set of documents. The context of -the execution defines what this document set is (e.g. a top-level aggregation executes within the context of the executed -query/filters of the search request). - -There are many different types of aggregations, each with its own purpose and output. 
To better understand these types, -it is often easier to break them into two main families: - -_Bucketing_:: - A family of aggregations that build buckets, where each bucket is associated with a _key_ and a document - criterion. When the aggregation is executed, all the buckets criteria are evaluated on every document in - the context and when a criterion matches, the document is considered to "fall in" the relevant bucket. - By the end of the aggregation process, we'll end up with a list of buckets - each one with a set of - documents that "belong" to it. - -_Metric_:: - Aggregations that keep track and compute metrics over a set of documents. - -The interesting part comes next. Since each bucket effectively defines a document set (all documents belonging to -the bucket), one can potentially associate aggregations on the bucket level, and those will execute within the context -of that bucket. This is where the real power of aggregations kicks in: *aggregations can be nested!* - -NOTE: Bucketing aggregations can have sub-aggregations (bucketing or metric). The sub-aggregations will be computed for - the buckets which their parent aggregation generates. There is no hard limit on the level/depth of nested - aggregations (one can nest an aggregation under a "parent" aggregation, which is itself a sub-aggregation of - another higher-level aggregation). - -[float] -=== Structuring Aggregations - -The following snippet captures the basic structure of aggregations: - -[source,js] --------------------------------------------------- -"aggregations" : { - "" : { - "" : { - - } - [,"meta" : { [] } ]? - [,"aggregations" : { []+ } ]? - } - [,"" : { ... } ]* -} --------------------------------------------------- - -The `aggregations` object (the key `aggs` can also be used) in the JSON holds the aggregations to be computed. Each aggregation -is associated with a logical name that the user defines (e.g. 
if the aggregation computes the average price, then it would -make sense to name it `avg_price`). These logical names will also be used to uniquely identify the aggregations in the -response. Each aggregation has a specific type (`` in the above snippet) and is typically the first -key within the named aggregation body. Each type of aggregation defines its own body, depending on the nature of the -aggregation (e.g. an `avg` aggregation on a specific field will define the field on which the average will be calculated). -At the same level of the aggregation type definition, one can optionally define a set of additional aggregations, -though this only makes sense if the aggregation you defined is of a bucketing nature. In this scenario, the -sub-aggregations you define on the bucketing aggregation level will be computed for all the buckets built by the -bucketing aggregation. For example, if you define a set of aggregations under the `range` aggregation, the -sub-aggregations will be computed for the range buckets that are defined. - -[float] -==== Values Source - -Some aggregations work on values extracted from the aggregated documents. Typically, the values will be extracted from -a specific document field which is set using the `field` key for the aggregations. It is also possible to define a -<> which will generate the values (per document). - -TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. - -When both `field` and `script` settings are configured for the aggregation, the script will be treated as a -`value script`. While normal scripts are evaluated on a document level (i.e. the script has access to all the data -associated with the document), value scripts are evaluated on the *value* level. In this mode, the values are extracted -from the configured `field` and the `script` is used to apply a "transformation" over these value/s. 
- -["NOTE",id="aggs-script-note"] -=============================== -When working with scripts, the `lang` and `params` settings can also be defined. The former defines the scripting -language which is used (assuming the proper language is available in Elasticsearch, either by default or as a plugin). The latter -enables defining all the "dynamic" expressions in the script as parameters, which enables the script to keep itself static -between calls (this will ensure the use of the cached compiled scripts in Elasticsearch). -=============================== - -Scripts can generate a single value or multiple values per document. When generating multiple values, one can use the -`script_values_sorted` settings to indicate whether these values are sorted or not. Internally, Elasticsearch can -perform optimizations when dealing with sorted values (for example, with the `min` aggregations, knowing the values are -sorted, Elasticsearch will skip the iterations over all the values and rely on the first value in the list to be the -minimum value among all other values associated with the same document). - -[float] -=== Metrics Aggregations - -The aggregations in this family compute metrics based on values extracted in one way or another from the documents that -are being aggregated. The values are typically extracted from the fields of the document (using the field data), but -can also be generated using scripts. - -Numeric metrics aggregations are a special type of metrics aggregation which output numeric values. Some aggregations output -a single numeric metric (e.g. `avg`) and are called `single-value numeric metrics aggregation`, others generate multiple -metrics (e.g. `stats`) and are called `multi-value numeric metrics aggregation`. 
The distinction between single-value and -multi-value numeric metrics aggregations plays a role when these aggregations serve as direct sub-aggregations of some -bucket aggregations (some bucket aggregations enable you to sort the returned buckets based on the numeric metrics in each bucket). - - -[float] -=== Bucket Aggregations - -Bucket aggregations don't calculate metrics over fields like the metrics aggregations do, but instead, they create -buckets of documents. Each bucket is associated with a criterion (depending on the aggregation type) which determines -whether or not a document in the current context "falls" into it. In other words, the buckets effectively define document -sets. In addition to the buckets themselves, the `bucket` aggregations also compute and return the number of documents -that "fell in" to each bucket. - -Bucket aggregations, as opposed to `metrics` aggregations, can hold sub-aggregations. These sub-aggregations will be -aggregated for the buckets created by their "parent" bucket aggregation. - -There are different bucket aggregators, each with a different "bucketing" strategy. Some define a single bucket, some -define fixed number of multiple buckets, and others dynamically create the buckets during the aggregation process. - -[float] -=== Reducer Aggregations - -coming[2.0.0] - -experimental[] - -Reducer aggregations work on the outputs produced from other aggregations rather than from document sets, adding -information to the output tree. There are many different types of reducer, each computing different information from -other aggregations, but these types can broken down into two families: - -_Parent_:: - A family of reducer aggregations that is provided with the output of its parent aggregation and is able - to compute new buckets or new aggregations to add to existing buckets. 
- -_Sibling_:: - Reducer aggregations that are provided with the output of a sibling aggregation and are able to compute a - new aggregation which will be at the same level as the sibling aggregation. - -Reducer aggregations can reference the aggregations they need to perform their computation by using the `buckets_paths` -parameter to indicate the paths to the required metrics. The syntax for defining these paths can be found in the -<> section. - -?????? SHOULD THE SECTION ABOUT DEFINING AGGREGATION PATHS -BE IN THIS PAGE AND REFERENCED FROM THE TERMS AGGREGATION DOCUMENTATION ??????? - -Reducer aggregations cannot have sub-aggregations but depending on the type it can reference another reducer in the `buckets_path` -allowing reducers to be chained. - -NOTE: Because reducer aggregations only add to the output, when chaining reducer aggregations the output of each reducer will be -included in the final output. - -[float] -=== Caching heavy aggregations - -Frequently used aggregations (e.g. for display on the home page of a website) -can be cached for faster responses. These cached results are the same results -that would be returned by an uncached aggregation -- you will never get stale -results. - -See <> for more details. - -[float] -=== Returning only aggregation results - -There are many occasions when aggregations are required but search hits are not. For these cases the hits can be ignored by -setting `size=0`. For example: - -[source,js] --------------------------------------------------- -$ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{ - "size": 0, - "aggregations": { - "my_agg": { - "terms": { - "field": "text" - } - } - } -} -' --------------------------------------------------- - -Setting `size` to `0` avoids executing the fetch phase of the search making the request more efficient. 
- -[float] -=== Metadata - -You can associate a piece of metadata with individual aggregations at request time that will be returned in place -at response time. - -Consider this example where we want to associate the color blue with our `terms` aggregation. - -[source,js] --------------------------------------------------- -{ - ... - aggs": { - "titles": { - "terms": { - "field": "title" - }, - "meta": { - "color": "blue" - }, - } - } -} --------------------------------------------------- - -Then that piece of metadata will be returned in place for our `titles` terms aggregation - -[source,js] --------------------------------------------------- -{ - ... - "aggregations": { - "titles": { - "meta": { - "color" : "blue" - }, - "buckets": [ - ] - } - } -} --------------------------------------------------- - -include::aggregations/metrics.asciidoc[] - -include::aggregations/bucket.asciidoc[] - -include::aggregations/reducer.asciidoc[] - diff --git a/docs/reference/search/aggregations/bucket.asciidoc b/docs/reference/search/aggregations/bucket.asciidoc deleted file mode 100644 index 7d7848fa1a..0000000000 --- a/docs/reference/search/aggregations/bucket.asciidoc +++ /dev/null @@ -1,33 +0,0 @@ -[[search-aggregations-bucket]] - -include::bucket/global-aggregation.asciidoc[] - -include::bucket/filter-aggregation.asciidoc[] - -include::bucket/filters-aggregation.asciidoc[] - -include::bucket/missing-aggregation.asciidoc[] - -include::bucket/nested-aggregation.asciidoc[] - -include::bucket/reverse-nested-aggregation.asciidoc[] - -include::bucket/children-aggregation.asciidoc[] - -include::bucket/terms-aggregation.asciidoc[] - -include::bucket/significantterms-aggregation.asciidoc[] - -include::bucket/range-aggregation.asciidoc[] - -include::bucket/daterange-aggregation.asciidoc[] - -include::bucket/iprange-aggregation.asciidoc[] - -include::bucket/histogram-aggregation.asciidoc[] - -include::bucket/datehistogram-aggregation.asciidoc[] - 
-include::bucket/geodistance-aggregation.asciidoc[] - -include::bucket/geohashgrid-aggregation.asciidoc[] diff --git a/docs/reference/search/aggregations/bucket/children-aggregation.asciidoc b/docs/reference/search/aggregations/bucket/children-aggregation.asciidoc deleted file mode 100644 index e69877d97f..0000000000 --- a/docs/reference/search/aggregations/bucket/children-aggregation.asciidoc +++ /dev/null @@ -1,344 +0,0 @@ -[[search-aggregations-bucket-children-aggregation]] -=== Children Aggregation - -A special single bucket aggregation that enables aggregating from buckets on parent document types to buckets on child documents. - -This aggregation relies on the <> in the mapping. This aggregation has a single option: - -* `type` - The what child type the buckets in the parent space should be mapped to. - -For example, let's say we have an index of questions and answers. The answer type has the following `_parent` field in the mapping: -[source,js] --------------------------------------------------- -{ - "answer" : { - "_parent" : { - "type" : "question" - } - } -} --------------------------------------------------- - -The question typed document contain a tag field and the answer typed documents contain an owner field. With the `children` -aggregation the tag buckets can be mapped to the owner buckets in a single request even though the two fields exist in -two different kinds of documents. - -An example of a question typed document: -[source,js] --------------------------------------------------- -{ - "body": "

I have Windows 2003 server and i bought a new Windows 2008 server...", - "title": "Whats the best way to file transfer my site from server to a newer one?", - "tags": [ - "windows-server-2003", - "windows-server-2008", - "file-transfer" - ], -} --------------------------------------------------- - -An example of an answer typed document: -[source,js] --------------------------------------------------- -{ - "owner": { - "location": "Norfolk, United Kingdom", - "display_name": "Sam", - "id": 48 - }, - "body": "

Unfortunately your pretty much limited to FTP...", - "creation_date": "2009-05-04T13:45:37.030" -} --------------------------------------------------- - -The following request can be built that connects the two together: - -[source,js] --------------------------------------------------- -{ - "aggs": { - "top-tags": { - "terms": { - "field": "tags", - "size": 10 - }, - "aggs": { - "to-answers": { - "children": { - "type" : "answer" <1> - }, - "aggs": { - "top-names": { - "terms": { - "field": "owner.display_name", - "size": 10 - } - } - } - } - } - } - } -} --------------------------------------------------- - -<1> The `type` points to type / mapping with the name `answer`. - -The above example returns the top question tags and per tag the top answer owners. - -Possible response: - -[source,js] --------------------------------------------------- -{ - "aggregations": { - "top-tags": { - "buckets": [ - { - "key": "windows-server-2003", - "doc_count": 25365, <1> - "to-answers": { - "doc_count": 36004, <2> - "top-names": { - "buckets": [ - { - "key": "Sam", - "doc_count": 274 - }, - { - "key": "chris", - "doc_count": 19 - }, - { - "key": "david", - "doc_count": 14 - }, - ... - ] - } - } - }, - { - "key": "linux", - "doc_count": 18342, - "to-answers": { - "doc_count": 6655, - "top-names": { - "buckets": [ - { - "key": "abrams", - "doc_count": 25 - }, - { - "key": "ignacio", - "doc_count": 25 - }, - { - "key": "vazquez", - "doc_count": 25 - }, - ... - ] - } - } - }, - { - "key": "windows", - "doc_count": 18119, - "to-answers": { - "doc_count": 24051, - "top-names": { - "buckets": [ - { - "key": "molly7244", - "doc_count": 265 - }, - { - "key": "david", - "doc_count": 27 - }, - { - "key": "chris", - "doc_count": 26 - }, - ... 
- ] - } - } - }, - { - "key": "osx", - "doc_count": 10971, - "to-answers": { - "doc_count": 5902, - "top-names": { - "buckets": [ - { - "key": "diago", - "doc_count": 4 - }, - { - "key": "albert", - "doc_count": 3 - }, - { - "key": "asmus", - "doc_count": 3 - }, - ... - ] - } - } - }, - { - "key": "ubuntu", - "doc_count": 8743, - "to-answers": { - "doc_count": 8784, - "top-names": { - "buckets": [ - { - "key": "ignacio", - "doc_count": 9 - }, - { - "key": "abrams", - "doc_count": 8 - }, - { - "key": "molly7244", - "doc_count": 8 - }, - ... - ] - } - } - }, - { - "key": "windows-xp", - "doc_count": 7517, - "to-answers": { - "doc_count": 13610, - "top-names": { - "buckets": [ - { - "key": "molly7244", - "doc_count": 232 - }, - { - "key": "chris", - "doc_count": 9 - }, - { - "key": "john", - "doc_count": 9 - }, - ... - ] - } - } - }, - { - "key": "networking", - "doc_count": 6739, - "to-answers": { - "doc_count": 2076, - "top-names": { - "buckets": [ - { - "key": "molly7244", - "doc_count": 6 - }, - { - "key": "alnitak", - "doc_count": 5 - }, - { - "key": "chris", - "doc_count": 3 - }, - ... - ] - } - } - }, - { - "key": "mac", - "doc_count": 5590, - "to-answers": { - "doc_count": 999, - "top-names": { - "buckets": [ - { - "key": "abrams", - "doc_count": 2 - }, - { - "key": "ignacio", - "doc_count": 2 - }, - { - "key": "vazquez", - "doc_count": 2 - }, - ... - ] - } - } - }, - { - "key": "wireless-networking", - "doc_count": 4409, - "to-answers": { - "doc_count": 6497, - "top-names": { - "buckets": [ - { - "key": "molly7244", - "doc_count": 61 - }, - { - "key": "chris", - "doc_count": 5 - }, - { - "key": "mike", - "doc_count": 5 - }, - ... - ] - } - } - }, - { - "key": "windows-8", - "doc_count": 3601, - "to-answers": { - "doc_count": 4263, - "top-names": { - "buckets": [ - { - "key": "molly7244", - "doc_count": 3 - }, - { - "key": "msft", - "doc_count": 2 - }, - { - "key": "user172132", - "doc_count": 2 - }, - ... 
- ] - } - } - } - ] - } - } -} --------------------------------------------------- - -<1> The number of question documents with the tag `windows-server-2003`. -<2> The number of answer documents that are related to question documents with the tag `windows-server-2003`. diff --git a/docs/reference/search/aggregations/bucket/datehistogram-aggregation.asciidoc b/docs/reference/search/aggregations/bucket/datehistogram-aggregation.asciidoc deleted file mode 100644 index 256ef62d76..0000000000 --- a/docs/reference/search/aggregations/bucket/datehistogram-aggregation.asciidoc +++ /dev/null @@ -1,125 +0,0 @@ -[[search-aggregations-bucket-datehistogram-aggregation]] -=== Date Histogram Aggregation - -A multi-bucket aggregation similar to the <> except it can -only be applied on date values. Since dates are represented in elasticsearch internally as long values, it is possible -to use the normal `histogram` on dates as well, though accuracy will be compromised. The reason for this is in the fact -that time based intervals are not fixed (think of leap years and on the number of days in a month). For this reason, -we need special support for time based data. From a functionality perspective, this histogram supports the same features -as the normal <>. The main difference is that the interval can be specified by date/time expressions. - -Requesting bucket intervals of a month. - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "articles_over_time" : { - "date_histogram" : { - "field" : "date", - "interval" : "month" - } - } - } -} --------------------------------------------------- - -Available expressions for interval: `year`, `quarter`, `month`, `week`, `day`, `hour`, `minute`, `second` - - -Fractional values are allowed for seconds, minutes, hours, days and weeks. 
For example 1.5 hours: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "articles_over_time" : { - "date_histogram" : { - "field" : "date", - "interval" : "1.5h" - } - } - } -} --------------------------------------------------- - -See <> for accepted abbreviations. - -==== Time Zone - -By default, times are stored as UTC milliseconds since the epoch. Thus, all computation and "bucketing" / "rounding" is -done on UTC. It is possible to provide a time zone value, which will cause all bucket -computations to take place in the specified zone. The time returned for each bucket/entry is milliseconds since the -epoch in UTC. The parameter is called `time_zone`. It accepts either a numeric value for the hours offset, for example: -`"time_zone" : -2`. It also accepts a format of hours and minutes, like `"time_zone" : "-02:30"`. -Another option is to provide a time zone accepted as one of the values listed here. - -Let's take an example: `2012-04-01T04:15:30Z` (UTC), with a `time_zone` of `"-08:00"`. For a day interval, the time obtained by -applying the time zone and rounding falls under `2012-03-31`, so the returned value will be (in millis) -`2012-03-31T08:00:00Z` (UTC). For an hour interval, internally applying the time zone results in `2012-03-31T20:15:30`, so rounding it -in the time zone results in `2012-03-31T20:00:00`, but to be consistent we return that rounded value converted back to UTC, as -`2012-04-01T04:00:00Z` (UTC). - -==== Offset - -The `offset` option can be provided for shifting the date bucket interval boundaries after any other shifts because of -time zones are applied. This for example makes it possible that daily buckets go from 6AM to 6AM the next day instead of starting at 12AM -or that monthly buckets go from the 10th of the month to the 10th of the next month instead of the 1st. - -The `offset` option accepts positive or negative time durations like "1h" for an hour or "1M" for a month.
See <> for more -possible time duration options. - -==== Keys - -Since internally, dates are represented as 64bit numbers, these numbers are returned as the bucket keys (each key -representing a date - milliseconds since the epoch). It is also possible to define a date format, which will result in -returning the dates as formatted strings next to the numeric key values: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "articles_over_time" : { - "date_histogram" : { - "field" : "date", - "interval" : "1M", - "format" : "yyyy-MM-dd" <1> - } - } - } -} --------------------------------------------------- - -<1> Supports expressive date <> - -Response: - -[source,js] --------------------------------------------------- -{ - "aggregations": { - "articles_over_time": { - "buckets": [ - { - "key_as_string": "2013-02-02", - "key": 1328140800000, - "doc_count": 1 - }, - { - "key_as_string": "2013-03-02", - "key": 1330646400000, - "doc_count": 2 - }, - ... - ] - } - } -} --------------------------------------------------- - -Like with the normal <>, both document level scripts and -value level scripts are supported. It is also possible to control the order of the returned buckets using the `order` -settings and filter the returned buckets based on a `min_doc_count` setting (by default all buckets between the first -bucket that matches documents and the last one are returned). This histogram also supports the `extended_bounds` -setting, which enables extending the bounds of the histogram beyond the data itself (to read more on why you'd want to -do that please refer to the explanation <>). 
diff --git a/docs/reference/search/aggregations/bucket/daterange-aggregation.asciidoc b/docs/reference/search/aggregations/bucket/daterange-aggregation.asciidoc deleted file mode 100644 index 7c5d6cc86f..0000000000 --- a/docs/reference/search/aggregations/bucket/daterange-aggregation.asciidoc +++ /dev/null @@ -1,113 +0,0 @@ -[[search-aggregations-bucket-daterange-aggregation]] -=== Date Range Aggregation - -A range aggregation that is dedicated to date values. The main difference between this aggregation and the normal <> aggregation is that the `from` and `to` values can be expressed in <> expressions, and it is also possible to specify a date format by which the `from` and `to` response fields will be returned. -Note that this aggregation includes the `from` value and excludes the `to` value for each range. - -Example: - -[source,js] --------------------------------------------------- -{ - "aggs": { - "range": { - "date_range": { - "field": "date", - "format": "MM-yyy", - "ranges": [ - { "to": "now-10M/M" }, <1> - { "from": "now-10M/M" } <2> - ] - } - } - } -} --------------------------------------------------- -<1> < now minus 10 months, rounded down to the start of the month. -<2> >= now minus 10 months, rounded down to the start of the month. - -In the example above, we created two range buckets: the first will "bucket" all documents dated prior to 10 months ago and -the second will "bucket" all documents dated since 10 months ago. - -Response: - -[source,js] --------------------------------------------------- -{ - ...
- - "aggregations": { - "range": { - "buckets": [ - { - "to": 1.3437792E+12, - "to_as_string": "08-2012", - "doc_count": 7 - }, - { - "from": 1.3437792E+12, - "from_as_string": "08-2012", - "doc_count": 2 - } - ] - } - } -} --------------------------------------------------- - -[[date-format-pattern]] -==== Date Format/Pattern - -NOTE: this information was copied from http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html[JodaDate] - -All ASCII letters are reserved as format pattern letters, which are defined as follows: - -[options="header"] -|======= -|Symbol |Meaning |Presentation |Examples -|G |era |text |AD -|C |century of era (>=0) |number |20 -|Y |year of era (>=0) |year |1996 - -|x |weekyear |year |1996 -|w |week of weekyear |number |27 -|e |day of week |number |2 -|E |day of week |text |Tuesday; Tue - -|y |year |year |1996 -|D |day of year |number |189 -|M |month of year |month |July; Jul; 07 -|d |day of month |number |10 - -|a |halfday of day |text |PM -|K |hour of halfday (0~11) |number |0 -|h |clockhour of halfday (1~12) |number |12 - -|H |hour of day (0~23) |number |0 -|k |clockhour of day (1~24) |number |24 -|m |minute of hour |number |30 -|s |second of minute |number |55 -|S |fraction of second |number |978 - -|z |time zone |text |Pacific Standard Time; PST -|Z |time zone offset/id |zone |-0800; -08:00; America/Los_Angeles - -|' |escape for text |delimiter -|'' |single quote |literal |' -|======= - -The count of pattern letters determine the format. - -Text:: If the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used if available. - -Number:: The minimum number of digits. Shorter numbers are zero-padded to this amount. - -Year:: Numeric presentation for year and weekyear fields are handled specially. For example, if the count of 'y' is 2, the year will be displayed as the zero-based year of the century, which is two digits. 
- -Month:: 3 or over, use text, otherwise use number. - -Zone:: 'Z' outputs offset without a colon, 'ZZ' outputs the offset with a colon, 'ZZZ' or more outputs the zone id. - -Zone names:: Time zone names ('z') cannot be parsed. - -Any characters in the pattern that are not in the ranges of ['a'..'z'] and ['A'..'Z'] will be treated as quoted text. For instance, characters like ':', '.', ' ', '#' and '?' will appear in the resulting time text even if they are not enclosed within single quotes. diff --git a/docs/reference/search/aggregations/bucket/filter-aggregation.asciidoc b/docs/reference/search/aggregations/bucket/filter-aggregation.asciidoc deleted file mode 100644 index cc2e104354..0000000000 --- a/docs/reference/search/aggregations/bucket/filter-aggregation.asciidoc +++ /dev/null @@ -1,38 +0,0 @@ -[[search-aggregations-bucket-filter-aggregation]] -=== Filter Aggregation - -Defines a single bucket of all the documents in the current document set context that match a specified filter. Often this will be used to narrow down the current aggregation context to a specific set of documents. - -Example: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "in_stock_products" : { - "filter" : { "range" : { "stock" : { "gt" : 0 } } }, - "aggs" : { - "avg_price" : { "avg" : { "field" : "price" } } - } - } - } -} --------------------------------------------------- - -In the above example, we calculate the average price of all the products that are currently in-stock. - -Response: - -[source,js] --------------------------------------------------- -{ - ...
- - "aggs" : { - "in_stock_products" : { - "doc_count" : 100, - "avg_price" : { "value" : 56.3 } - } - } -} --------------------------------------------------- \ No newline at end of file diff --git a/docs/reference/search/aggregations/bucket/filters-aggregation.asciidoc b/docs/reference/search/aggregations/bucket/filters-aggregation.asciidoc deleted file mode 100644 index 2553758d77..0000000000 --- a/docs/reference/search/aggregations/bucket/filters-aggregation.asciidoc +++ /dev/null @@ -1,128 +0,0 @@ -[[search-aggregations-bucket-filters-aggregation]] -=== Filters Aggregation - -Defines a multi bucket aggregation where each bucket is associated with a -filter. Each bucket will collect all documents that match its associated -filter. - -Example: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "messages" : { - "filters" : { - "filters" : { - "errors" : { "term" : { "body" : "error" }}, - "warnings" : { "term" : { "body" : "warning" }} - } - }, - "aggs" : { - "monthly" : { - "histogram" : { - "field" : "timestamp", - "interval" : "1M" - } - } - } - } - } -} --------------------------------------------------- - -In the above example, we analyze log messages. The aggregation will build two -collections (buckets) of log messages - one for all those containing an error, -and another for all those containing a warning. And for each of these buckets -it will break them down by month. - -Response: - -[source,js] --------------------------------------------------- -... - "aggs" : { - "messages" : { - "buckets" : { - "errors" : { - "doc_count" : 34, - "monthly" : { - "buckets" : [ - ... // the histogram monthly breakdown - ] - } - }, - "warnings" : { - "doc_count" : 439, - "monthly" : { - "buckets" : [ - ... // the histogram monthly breakdown - ] - } - } - } - } - } - } -...
--------------------------------------------------- - -==== Anonymous filters - -The filters field can also be provided as an array of filters, as in the -following request: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "messages" : { - "filters" : { - "filters" : [ - { "term" : { "body" : "error" }}, - { "term" : { "body" : "warning" }} - ] - }, - "aggs" : { - "monthly" : { - "histogram" : { - "field" : "timestamp", - "interval" : "1M" - } - } - } - } - } -} --------------------------------------------------- - -The filtered buckets are returned in the same order as provided in the -request. The response for this example would be: - -[source,js] --------------------------------------------------- -... - "aggs" : { - "messages" : { - "buckets" : [ - { - "doc_count" : 34, - "monthly" : { - "buckets" : [ - ... // the histogram monthly breakdown - ] - } - }, - { - "doc_count" : 439, - "monthly" : { - "buckets" : [ - ... // the histogram monthly breakdown - ] - } - } - ] - } - } -... --------------------------------------------------- \ No newline at end of file diff --git a/docs/reference/search/aggregations/bucket/geodistance-aggregation.asciidoc b/docs/reference/search/aggregations/bucket/geodistance-aggregation.asciidoc deleted file mode 100644 index 2120c0bec9..0000000000 --- a/docs/reference/search/aggregations/bucket/geodistance-aggregation.asciidoc +++ /dev/null @@ -1,106 +0,0 @@ -[[search-aggregations-bucket-geodistance-aggregation]] -=== Geo Distance Aggregation - -A multi-bucket aggregation that works on `geo_point` fields and conceptually works very similarly to the <> aggregation. The user can define a point of origin and a set of distance range buckets.
The aggregation evaluates the distance of each document value from the origin point and determines the buckets it belongs to based on the ranges (a document belongs to a bucket if the distance between the document and the origin falls within the distance range of the bucket). - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "rings_around_amsterdam" : { - "geo_distance" : { - "field" : "location", - "origin" : "52.3760, 4.894", - "ranges" : [ - { "to" : 100 }, - { "from" : 100, "to" : 300 }, - { "from" : 300 } - ] - } - } - } -} --------------------------------------------------- - -Response: - -[source,js] --------------------------------------------------- -{ - "aggregations": { - "rings_around_amsterdam" : { - "buckets": [ - { - "key": "*-100.0", - "from": 0, - "to": 100.0, - "doc_count": 3 - }, - { - "key": "100.0-300.0", - "from": 100.0, - "to": 300.0, - "doc_count": 1 - }, - { - "key": "300.0-*", - "from": 300.0, - "doc_count": 7 - } - ] - } - } -} --------------------------------------------------- - -The specified field must be of type `geo_point` (which can only be set explicitly in the mappings). It can also hold an array of `geo_point` fields, in which case all will be taken into account during aggregation. The origin point can accept all formats supported by the `geo_point` <>: - -* Object format: `{ "lat" : 52.3760, "lon" : 4.894 }` - this is the safest format as it is the most explicit about the `lat` & `lon` values -* String format: `"52.3760, 4.894"` - where the first number is the `lat` and the second is the `lon` -* Array format: `[4.894, 52.3760]` - which is based on the `GeoJson` standard and where the first number is the `lon` and the second one is the `lat` - -By default, the distance unit is `m` (metres) but it can also accept: `mi` (miles), `in` (inches), `yd` (yards), `km` (kilometers), `cm` (centimeters), `mm` (millimeters).
- -[source,js] --------------------------------------------------- -{ - "aggs" : { - "rings" : { - "geo_distance" : { - "field" : "location", - "origin" : "52.3760, 4.894", - "unit" : "mi", <1> - "ranges" : [ - { "to" : 100 }, - { "from" : 100, "to" : 300 }, - { "from" : 300 } - ] - } - } - } -} --------------------------------------------------- - -<1> The distances will be computed as miles - -There are three distance calculation modes: `sloppy_arc` (the default), `arc` (most accurate) and `plane` (fastest). The `arc` calculation is the most accurate one but also the most expensive one in terms of performance. The `sloppy_arc` is faster but less accurate. The `plane` is the fastest but least accurate distance function. Consider using `plane` when your search context is "narrow" and spans smaller geographical areas (like cities or even countries). `plane` may return higher error margins for searches across very large areas (e.g. cross continent search). The distance calculation type can be set using the `distance_type` parameter: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "rings" : { - "geo_distance" : { - "field" : "location", - "origin" : "52.3760, 4.894", - "distance_type" : "plane", - "ranges" : [ - { "to" : 100 }, - { "from" : 100, "to" : 300 }, - { "from" : 300 } - ] - } - } - } -} --------------------------------------------------- diff --git a/docs/reference/search/aggregations/bucket/geohashgrid-aggregation.asciidoc b/docs/reference/search/aggregations/bucket/geohashgrid-aggregation.asciidoc deleted file mode 100644 index e74e3e96d1..0000000000 --- a/docs/reference/search/aggregations/bucket/geohashgrid-aggregation.asciidoc +++ /dev/null @@ -1,131 +0,0 @@ -[[search-aggregations-bucket-geohashgrid-aggregation]] -=== GeoHash grid Aggregation - -A multi-bucket aggregation that works on `geo_point` fields and groups points into buckets that represent cells in a grid.
-The resulting grid can be sparse and only contains cells that have matching data. Each cell is labeled using a http://en.wikipedia.org/wiki/Geohash[geohash] which is of user-definable precision. - -* High precision geohashes have a long string length and represent cells that cover only a small area. -* Low precision geohashes have a short string length and represent cells that each cover a large area. - -Geohashes used in this aggregation can have a choice of precision between 1 and 12. - -WARNING: The highest-precision geohash of length 12 produces cells that cover less than a square metre of land and so high-precision requests can be very costly in terms of RAM and result sizes. -Please see the example below on how to first filter the aggregation to a smaller geographic area before requesting high-levels of detail. - -The specified field must be of type `geo_point` (which can only be set explicitly in the mappings) and it can also hold an array of `geo_point` fields, in which case all points will be taken into account during aggregation. - - -==== Simple low-precision request - -[source,js] --------------------------------------------------- -{ - "aggregations" : { - "myLarge-GrainGeoHashGrid" : { - "geohash_grid" : { - "field" : "location", - "precision" : 3 - } - } - } -} --------------------------------------------------- - -Response: - -[source,js] --------------------------------------------------- -{ - "aggregations": { - "myLarge-GrainGeoHashGrid": { - "buckets": [ - { - "key": "svz", - "doc_count": 10964 - }, - { - "key": "sv8", - "doc_count": 3198 - } - ] - } - } -} --------------------------------------------------- - - - -==== High-precision requests - -When requesting detailed buckets (typically for displaying a "zoomed in" map) a filter like <> should be applied to narrow the subject area otherwise potentially millions of buckets will be created and returned. 
- -[source,js] --------------------------------------------------- -{ - "aggregations" : { - "zoomedInView" : { - "filter" : { - "geo_bounding_box" : { - "location" : { - "top_left" : "51.73, 0.9", - "bottom_right" : "51.55, 1.1" - } - } - }, - "aggregations":{ - "zoom1":{ - "geohash_grid" : { - "field":"location", - "precision":8 - } - } - } - } - } - } --------------------------------------------------- - -==== Cell dimensions at the equator -The table below shows the metric dimensions for cells covered by various string lengths of geohash. -Cell dimensions vary with latitude and so the table is for the worst-case scenario at the equator. - -[horizontal] -*GeoHash length*:: *Area width x height* -1:: 5,009.4km x 4,992.6km -2:: 1,252.3km x 624.1km -3:: 156.5km x 156km -4:: 39.1km x 19.5km -5:: 4.9km x 4.9km -6:: 1.2km x 609.4m -7:: 152.9m x 152.4m -8:: 38.2m x 19m -9:: 4.8m x 4.8m -10:: 1.2m x 59.5cm -11:: 14.9cm x 14.9cm -12:: 3.7cm x 1.9cm - - - -==== Options - -[horizontal] -field:: Mandatory. The name of the field indexed with GeoPoints. - -precision:: Optional. The string length of the geohashes used to define - cells/buckets in the results. Defaults to 5. - -size:: Optional. The maximum number of geohash buckets to return - (defaults to 10,000). When results are trimmed, buckets are - prioritised based on the volumes of documents they contain. - A value of `0` will return all buckets that - contain a hit; use with caution as this could use a lot of CPU - and network bandwidth if there are many buckets. - -shard_size:: Optional. To allow for more accurate counting of the top cells - returned in the final result the aggregation defaults to - returning `max(10,(size x number-of-shards))` buckets from each - shard. If this heuristic is undesirable, the number considered - from each shard can be over-ridden using this parameter. - A value of `0` makes the shard size unlimited.
- - diff --git a/docs/reference/search/aggregations/bucket/global-aggregation.asciidoc b/docs/reference/search/aggregations/bucket/global-aggregation.asciidoc deleted file mode 100644 index fa500e1ff8..0000000000 --- a/docs/reference/search/aggregations/bucket/global-aggregation.asciidoc +++ /dev/null @@ -1,51 +0,0 @@ -[[search-aggregations-bucket-global-aggregation]] -=== Global Aggregation - -Defines a single bucket of all the documents within the search execution context. This context is defined by the indices and the document types you're searching on, but is *not* influenced by the search query itself. - -NOTE: Global aggregators can only be placed as top level aggregators (it makes no sense to embed a global aggregator - within another bucket aggregator) - -Example: - -[source,js] --------------------------------------------------- -{ - "query" : { - "match" : { "title" : "shirt" } - }, - "aggs" : { - "all_products" : { - "global" : {}, <1> - "aggs" : { <2> - "avg_price" : { "avg" : { "field" : "price" } } - } - } - } -} --------------------------------------------------- - -<1> The `global` aggregation has an empty body -<2> The sub-aggregations that are registered for this `global` aggregation - -The above aggregation demonstrates how one would compute aggregations (`avg_price` in this example) on all the documents in the search context, regardless of the query (in our example, it will compute the average price over all products in our catalog, not just on the "shirts"). - -The response for the above aggregation: - -[source,js] --------------------------------------------------- -{ - ...
- - "aggregations" : { - "all_products" : { - "doc_count" : 100, <1> - "avg_price" : { - "value" : 56.3 - } - } - } -} --------------------------------------------------- - -<1> The number of documents that were aggregated (in our case, all documents within the search context) diff --git a/docs/reference/search/aggregations/bucket/histogram-aggregation.asciidoc b/docs/reference/search/aggregations/bucket/histogram-aggregation.asciidoc deleted file mode 100644 index cd1fd06dda..0000000000 --- a/docs/reference/search/aggregations/bucket/histogram-aggregation.asciidoc +++ /dev/null @@ -1,319 +0,0 @@ -[[search-aggregations-bucket-histogram-aggregation]] -=== Histogram Aggregation - -A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents. -It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field -that holds a price (numeric), we can configure this aggregation to dynamically build buckets with interval `5` -(in case of price it may represent $5). When the aggregation executes, the price field of every document will be -evaluated and will be rounded down to its closest bucket - for example, if the price is `32` and the bucket size is `5` -then the rounding will yield `30` and thus the document will "fall" into the bucket that is associated with the key `30`. 
-To make this more formal, here is the rounding function that is used: - -[source,java] --------------------------------------------------- -rem = value % interval -if (rem < 0) { - rem += interval -} -bucket_key = value - rem --------------------------------------------------- - -The following snippet "buckets" the products based on their `price` by interval of `50`: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "prices" : { - "histogram" : { - "field" : "price", - "interval" : 50 - } - } - } -} --------------------------------------------------- - -And the following may be the response: - -[source,js] --------------------------------------------------- -{ - "aggregations": { - "prices" : { - "buckets": [ - { - "key": 0, - "doc_count": 2 - }, - { - "key": 50, - "doc_count": 4 - }, - { - "key": 100, - "doc_count": 0 - }, - { - "key": 150, - "doc_count": 3 - } - ] - } - } -} --------------------------------------------------- - -==== Minimum document count - -The response above shows that no documents have a price that falls within the range of `[100 - 150)`. By default the -response will fill gaps in the histogram with empty buckets.
It is possible to change that and request buckets with -a higher minimum count thanks to the `min_doc_count` setting: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "prices" : { - "histogram" : { - "field" : "price", - "interval" : 50, - "min_doc_count" : 1 - } - } - } -} --------------------------------------------------- - -Response: - -[source,js] --------------------------------------------------- -{ - "aggregations": { - "prices" : { - "buckets": [ - { - "key": 0, - "doc_count": 2 - }, - { - "key": 50, - "doc_count": 4 - }, - { - "key": 150, - "doc_count": 3 - } - ] - } - } -} --------------------------------------------------- - -[[search-aggregations-bucket-histogram-aggregation-extended-bounds]] -By default the date_/histogram returns all the buckets within the range of the data itself, that is, the documents with -the smallest values (of the field the histogram is built on) will determine the min bucket (the bucket with the smallest key) and the -documents with the highest values will determine the max bucket (the bucket with the highest key). Often, when -requesting empty buckets, this causes confusion, specifically when the data is also filtered. - -To understand why, let's look at an example: - -Let's say you're filtering your request to get all docs with values between `0` and `500`, and in addition you'd like -to slice the data per price using a histogram with an interval of `50`. You also specify `"min_doc_count" : 0` as you'd -like to get all buckets, even the empty ones. If it happens that all products (documents) have prices higher than `100`, -the first bucket you'll get will be the one with `100` as its key. This is confusing, as many times you'd also like -to get those buckets between `0 - 100`.
- -With the `extended_bounds` setting, you can now "force" the histogram aggregation to start building buckets at a specific -`min` value and also keep on building buckets up to a `max` value (even if there are no documents anymore). Using -`extended_bounds` only makes sense when `min_doc_count` is 0 (the empty buckets will never be returned if `min_doc_count` -is greater than 0). - -Note that (as the name suggests) `extended_bounds` is **not** filtering buckets. Meaning, if the `extended_bounds.min` is higher -than the values extracted from the documents, the documents will still dictate what the first bucket will be (and the -same goes for the `extended_bounds.max` and the last bucket). For filtering buckets, one should nest the histogram aggregation -under a range `filter` aggregation with the appropriate `from`/`to` settings. - -Example: - -[source,js] --------------------------------------------------- -{ - "query" : { - "filtered" : { "filter": { "range" : { "price" : { "to" : "500" } } } } - }, - "aggs" : { - "prices" : { - "histogram" : { - "field" : "price", - "interval" : 50, - "extended_bounds" : { - "min" : 0, - "max" : 500 - } - } - } - } -} --------------------------------------------------- - -==== Order - -By default the returned buckets are sorted by their `key` ascending, though the order behaviour can be controlled -using the `order` setting.
- -Ordering the buckets by their key - descending: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "prices" : { - "histogram" : { - "field" : "price", - "interval" : 50, - "order" : { "_key" : "desc" } - } - } - } -} --------------------------------------------------- - -Ordering the buckets by their `doc_count` - ascending: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "prices" : { - "histogram" : { - "field" : "price", - "interval" : 50, - "order" : { "_count" : "asc" } - } - } - } -} --------------------------------------------------- - -If the histogram aggregation has a direct metrics sub-aggregation, the latter can determine the order of the buckets: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "prices" : { - "histogram" : { - "field" : "price", - "interval" : 50, - "order" : { "price_stats.min" : "asc" } <1> - }, - "aggs" : { - "price_stats" : { "stats" : {} } <2> - } - } - } -} --------------------------------------------------- - -<1> The `{ "price_stats.min" : "asc" }` will sort the buckets based on the `min` value of their `price_stats` sub-aggregation. - -<2> There is no need to configure the `price` field for the `price_stats` aggregation as it will inherit it by default from its parent histogram aggregation. - -It is also possible to order the buckets based on a "deeper" aggregation in the hierarchy. This is supported as long -as the aggregations in the path are of a single-bucket type, where the last aggregation in the path may either be a single-bucket -one or a metrics one. If it's a single-bucket type, the order will be defined by the number of docs in the bucket (i.e. `doc_count`), -in case it's a metrics one, the same rules as above apply (where the path must indicate the metric name to sort by in case of -a multi-value metrics aggregation, and in case of a single-value metrics aggregation the sort will be applied on that value).
- -The path must be defined in the following form: - --------------------------------------------------- -AGG_SEPARATOR := '>' -METRIC_SEPARATOR := '.' -AGG_NAME := <the name of the aggregation> -METRIC := <the name of the metric (in case of multi-value metrics aggregation)> -PATH := <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR><METRIC>] --------------------------------------------------- - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "prices" : { - "histogram" : { - "field" : "price", - "interval" : 50, - "order" : { "promoted_products>rating_stats.avg" : "desc" } <1> - }, - "aggs" : { - "promoted_products" : { - "filter" : { "term" : { "promoted" : true }}, - "aggs" : { - "rating_stats" : { "stats" : { "field" : "rating" }} - } - } - } - } - } -} --------------------------------------------------- - -The above will sort the buckets based on the avg rating among the promoted products - - -==== Offset - -By default the bucket keys start with 0 and then continue in evenly spaced steps of `interval`, e.g. if the interval is 10 the first buckets -(assuming there is data inside them) will be [0 - 9], [10-19], [20-29]. The bucket boundaries can be shifted by using the `offset` option. - -This can be best illustrated with an example. If there are 10 documents with values ranging from 5 to 14, using interval `10` will result in -two buckets with 5 documents each. If an additional offset `5` is used, there will be only one single bucket [5-14] containing all the 10 -documents. - -==== Response Format - -By default, the buckets are returned as an ordered array.
It is also possible to request the response as a hash -instead keyed by the buckets keys: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "prices" : { - "histogram" : { - "field" : "price", - "interval" : 50, - "keyed" : true - } - } - } -} --------------------------------------------------- - -Response: - -[source,js] --------------------------------------------------- -{ - "aggregations": { - "prices": { - "buckets": { - "0": { - "key": 0, - "doc_count": 2 - }, - "50": { - "key": 50, - "doc_count": 4 - }, - "150": { - "key": 150, - "doc_count": 3 - } - } - } - } -} --------------------------------------------------- diff --git a/docs/reference/search/aggregations/bucket/iprange-aggregation.asciidoc b/docs/reference/search/aggregations/bucket/iprange-aggregation.asciidoc deleted file mode 100644 index 6d06743644..0000000000 --- a/docs/reference/search/aggregations/bucket/iprange-aggregation.asciidoc +++ /dev/null @@ -1,98 +0,0 @@ -[[search-aggregations-bucket-iprange-aggregation]] -=== IPv4 Range Aggregation - -Just like the dedicated <> range aggregation, there is also a dedicated range aggregation for IPv4 typed fields: - -Example: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "ip_ranges" : { - "ip_range" : { - "field" : "ip", - "ranges" : [ - { "to" : "10.0.0.5" }, - { "from" : "10.0.0.5" } - ] - } - } - } -} --------------------------------------------------- - -Response: - -[source,js] --------------------------------------------------- -{ - ... 
- - "aggregations": { - "ip_ranges": { - "buckets" : [ - { - "to": 167772165, - "to_as_string": "10.0.0.5", - "doc_count": 4 - }, - { - "from": 167772165, - "from_as_string": "10.0.0.5", - "doc_count": 6 - } - ] - } - } -} --------------------------------------------------- - -IP ranges can also be defined as CIDR masks: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "ip_ranges" : { - "ip_range" : { - "field" : "ip", - "ranges" : [ - { "mask" : "10.0.0.0/25" }, - { "mask" : "10.0.0.127/25" } - ] - } - } - } -} --------------------------------------------------- - -Response: - -[source,js] --------------------------------------------------- -{ - "aggregations": { - "ip_ranges": { - "buckets": [ - { - "key": "10.0.0.0/25", - "from": 1.6777216E+8, - "from_as_string": "10.0.0.0", - "to": 167772287, - "to_as_string": "10.0.0.127", - "doc_count": 127 - }, - { - "key": "10.0.0.127/25", - "from": 1.6777216E+8, - "from_as_string": "10.0.0.0", - "to": 167772287, - "to_as_string": "10.0.0.127", - "doc_count": 127 - } - ] - } - } -} --------------------------------------------------- \ No newline at end of file diff --git a/docs/reference/search/aggregations/bucket/missing-aggregation.asciidoc b/docs/reference/search/aggregations/bucket/missing-aggregation.asciidoc deleted file mode 100644 index f0b8fb4ac3..0000000000 --- a/docs/reference/search/aggregations/bucket/missing-aggregation.asciidoc +++ /dev/null @@ -1,34 +0,0 @@ -[[search-aggregations-bucket-missing-aggregation]] -=== Missing Aggregation - -A field data based single bucket aggregation, that creates a bucket of all documents in the current document set context that are missing a field value (effectively, missing a field or having the configured NULL value set). 
This aggregator will often be used in conjunction with other field data bucket aggregators (such as ranges) to return information for all the documents that could not be placed in any of the other buckets due to missing field data values. - -Example: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "products_without_a_price" : { - "missing" : { "field" : "price" } - } - } -} --------------------------------------------------- - -In the above example, we get the total number of products that do not have a price. - -Response: - -[source,js] --------------------------------------------------- -{ - ... - - "aggs" : { - "products_without_a_price" : { - "doc_count" : 10 - } - } -} --------------------------------------------------- diff --git a/docs/reference/search/aggregations/bucket/nested-aggregation.asciidoc b/docs/reference/search/aggregations/bucket/nested-aggregation.asciidoc deleted file mode 100644 index f5872bdc5d..0000000000 --- a/docs/reference/search/aggregations/bucket/nested-aggregation.asciidoc +++ /dev/null @@ -1,67 +0,0 @@ -[[search-aggregations-bucket-nested-aggregation]] -=== Nested Aggregation - -A special single bucket aggregation that enables aggregating nested documents. - -For example, lets say we have a index of products, and each product holds the list of resellers - each having its own -price for the product. The mapping could look like: - -[source,js] --------------------------------------------------- -{ - ... - - "product" : { - "properties" : { - "resellers" : { <1> - "type" : "nested", - "properties" : { - "name" : { "type" : "string" }, - "price" : { "type" : "double" } - } - } - } - } -} --------------------------------------------------- - -<1> The `resellers` is an array that holds nested documents under the `product` object. 
- -The following aggregation will return the minimum price at which products can be purchased: - -[source,js] --------------------------------------------------- -{ - "query" : { - "match" : { "name" : "led tv" } - }, - "aggs" : { - "resellers" : { - "nested" : { - "path" : "resellers" - }, - "aggs" : { - "min_price" : { "min" : { "field" : "resellers.price" } } - } - } - } -} --------------------------------------------------- - -As you can see above, the nested aggregation requires the `path` of the nested documents within the top level documents. -Then one can define any type of aggregation over these nested documents. - -Response: - -[source,js] --------------------------------------------------- -{ - "aggregations": { - "resellers": { - "min_price": { - "value" : 350 - } - } - } -} --------------------------------------------------- diff --git a/docs/reference/search/aggregations/bucket/range-aggregation.asciidoc b/docs/reference/search/aggregations/bucket/range-aggregation.asciidoc deleted file mode 100644 index f7bfcab064..0000000000 --- a/docs/reference/search/aggregations/bucket/range-aggregation.asciidoc +++ /dev/null @@ -1,277 +0,0 @@ -[[search-aggregations-bucket-range-aggregation]] -=== Range Aggregation - -A multi-bucket value source based aggregation that enables the user to define a set of ranges - each representing a bucket. During the aggregation process, the values extracted from each document are checked against each bucket range and the relevant/matching documents are "bucketed" accordingly. -Note that this aggregation includes the `from` value and excludes the `to` value for each range.
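The inclusive `from` / exclusive `to` rule can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (one numeric value per document, ranges given as plain dictionaries); `range_buckets` and the sample prices are hypothetical names and data, not an Elasticsearch API.

```python
def range_buckets(values, ranges):
    """Illustrative only: count values per range.

    `from` is inclusive and `to` is exclusive; a missing bound is open-ended.
    """
    buckets = []
    for r in ranges:
        lower = r.get("from", float("-inf"))
        upper = r.get("to", float("inf"))
        # Echo back whichever bounds the range defined.
        bucket = {k: r[k] for k in ("from", "to") if k in r}
        bucket["doc_count"] = sum(1 for v in values if lower <= v < upper)
        buckets.append(bucket)
    return buckets

ranges = [{"to": 50}, {"from": 50, "to": 100}, {"from": 100}]
prices = [20, 47, 50, 99.9, 100, 134]  # hypothetical data
for bucket in range_buckets(prices, ranges):
    print(bucket)
```

A price of exactly `50` is counted in the `50`-`100` bucket and not in the bucket that ends at `50`; likewise `100` belongs only to the last bucket.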
- -Example: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "price_ranges" : { - "range" : { - "field" : "price", - "ranges" : [ - { "to" : 50 }, - { "from" : 50, "to" : 100 }, - { "from" : 100 } - ] - } - } - } -} --------------------------------------------------- - -Response: - -[source,js] --------------------------------------------------- -{ - ... - - "aggregations": { - "price_ranges" : { - "buckets": [ - { - "to": 50, - "doc_count": 2 - }, - { - "from": 50, - "to": 100, - "doc_count": 4 - }, - { - "from": 100, - "doc_count": 4 - } - ] - } - } -} --------------------------------------------------- - -==== Keyed Response - -Setting the `keyed` flag to `true` will associate a unique string key with each bucket and return the ranges as a hash rather than an array: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "price_ranges" : { - "range" : { - "field" : "price", - "keyed" : true, - "ranges" : [ - { "to" : 50 }, - { "from" : 50, "to" : 100 }, - { "from" : 100 } - ] - } - } - } -} --------------------------------------------------- - -Response: - -[source,js] --------------------------------------------------- -{ - ... 
- - "aggregations": { - "price_ranges" : { - "buckets": { - "*-50.0": { - "to": 50, - "doc_count": 2 - }, - "50.0-100.0": { - "from": 50, - "to": 100, - "doc_count": 4 - }, - "100.0-*": { - "from": 100, - "doc_count": 4 - } - } - } - } -} --------------------------------------------------- - -It is also possible to customize the key for each range: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "price_ranges" : { - "range" : { - "field" : "price", - "keyed" : true, - "ranges" : [ - { "key" : "cheap", "to" : 50 }, - { "key" : "average", "from" : 50, "to" : 100 }, - { "key" : "expensive", "from" : 100 } - ] - } - } - } -} --------------------------------------------------- - -==== Script - -TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "price_ranges" : { - "range" : { - "script" : "doc['price'].value", - "ranges" : [ - { "to" : 50 }, - { "from" : 50, "to" : 100 }, - { "from" : 100 } - ] - } - } - } -} --------------------------------------------------- - -==== Value Script - -Lets say the product prices are in USD but we would like to get the price ranges in EURO. 
We can use value script to convert the prices prior the aggregation (assuming conversion rate of 0.8) - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "price_ranges" : { - "range" : { - "field" : "price", - "script" : "_value * conversion_rate", - "params" : { - "conversion_rate" : 0.8 - }, - "ranges" : [ - { "to" : 35 }, - { "from" : 35, "to" : 70 }, - { "from" : 70 } - ] - } - } - } -} --------------------------------------------------- - -==== Sub Aggregations - -The following example, not only "bucket" the documents to the different buckets but also computes statistics over the prices in each price range - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "price_ranges" : { - "range" : { - "field" : "price", - "ranges" : [ - { "to" : 50 }, - { "from" : 50, "to" : 100 }, - { "from" : 100 } - ] - }, - "aggs" : { - "price_stats" : { - "stats" : { "field" : "price" } - } - } - } - } -} --------------------------------------------------- - -Response: - -[source,js] --------------------------------------------------- -{ - "aggregations": { - "price_ranges" : { - "buckets": [ - { - "to": 50, - "doc_count": 2, - "price_stats": { - "count": 2, - "min": 20, - "max": 47, - "avg": 33.5, - "sum": 67 - } - }, - { - "from": 50, - "to": 100, - "doc_count": 4, - "price_stats": { - "count": 4, - "min": 60, - "max": 98, - "avg": 82.5, - "sum": 330 - } - }, - { - "from": 100, - "doc_count": 4, - "price_stats": { - "count": 4, - "min": 134, - "max": 367, - "avg": 216, - "sum": 864 - } - } - ] - } - } -} --------------------------------------------------- - -If a sub aggregation is also based on the same value source as the range aggregation (like the `stats` aggregation in the example above) it is possible to leave out the value source definition for it. 
The following will return the same response as above: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "price_ranges" : { - "range" : { - "field" : "price", - "ranges" : [ - { "to" : 50 }, - { "from" : 50, "to" : 100 }, - { "from" : 100 } - ] - }, - "aggs" : { - "price_stats" : { - "stats" : {} <1> - } - } - } - } -} --------------------------------------------------- - -<1> We don't need to specify the `price` as we "inherit" it by default from the parent `range` aggregation diff --git a/docs/reference/search/aggregations/bucket/reverse-nested-aggregation.asciidoc b/docs/reference/search/aggregations/bucket/reverse-nested-aggregation.asciidoc deleted file mode 100644 index a25fc83733..0000000000 --- a/docs/reference/search/aggregations/bucket/reverse-nested-aggregation.asciidoc +++ /dev/null @@ -1,118 +0,0 @@ -[[search-aggregations-bucket-reverse-nested-aggregation]] -=== Reverse nested Aggregation - -A special single bucket aggregation that enables aggregating on parent docs from nested documents. Effectively this -aggregation can break out of the nested block structure and link to other nested structures or the root document, -which allows nesting other aggregations that aren't part of the nested object in a nested aggregation. - -The `reverse_nested` aggregation must be defined inside a `nested` aggregation. - -.Options: -* `path` - Which defines to what nested object field should be joined back. The default is empty, -which means that it joins back to the root / main document level. The path cannot contain a reference to -a nested object field that falls outside the `nested` aggregation's nested structure a `reverse_nested` is in. - -For example, lets say we have an index for a ticket system with issues and comments. The comments are inlined into -the issue documents as nested documents. The mapping could look like: - -[source,js] --------------------------------------------------- -{ - ... 
- - "issue" : { - "properties" : { - "tags" : { "type" : "string" }, - "comments" : { <1> - "type" : "nested", - "properties" : { - "username" : { "type" : "string", "index" : "not_analyzed" }, - "comment" : { "type" : "string" } - } - } - } - } -} --------------------------------------------------- - -<1> The `comments` is an array that holds nested documents under the `issue` object. - -The following aggregations will return the usernames of the top commenters and, for each top commenter, the top -tags of the issues the user has commented on: - -[source,js] --------------------------------------------------- -{ - "query": { - "match": { - "name": "led tv" - } - }, - "aggs": { - "comments": { - "nested": { - "path": "comments" - }, - "aggs": { - "top_usernames": { - "terms": { - "field": "comments.username" - }, - "aggs": { - "comment_to_issue": { - "reverse_nested": {}, <1> - "aggs": { - "top_tags_per_comment": { - "terms": { - "field": "tags" - } - } - } - } - } - } - } - } - } -} --------------------------------------------------- - -As you can see above, the `reverse_nested` aggregation is placed inside a `nested` aggregation as this is the only place -in the dsl where the `reverse_nested` aggregation can be used. Its sole purpose is to join back to a parent doc higher -up in the nested structure. - -<1> A `reverse_nested` aggregation that joins back to the root / main document level, because no `path` has been defined. -Via the `path` option the `reverse_nested` aggregation can join back to a different level, if multiple layered nested -object types have been defined in the mapping - -Possible response snippet: - -[source,js] --------------------------------------------------- -{ - "aggregations": { - "comments": { - "top_usernames": { - "buckets": [ - { - "key": "username_1", - "doc_count": 12, - "comment_to_issue": { - "top_tags_per_comment": { - "buckets": [ - { - "key": "tag1", - "doc_count": 9 - }, - ... - ] - } - } - }, - ...
- ] - } - } - } -} --------------------------------------------------- diff --git a/docs/reference/search/aggregations/bucket/sampler-aggregation.asciidoc b/docs/reference/search/aggregations/bucket/sampler-aggregation.asciidoc deleted file mode 100644 index 5ad9dbc019..0000000000 --- a/docs/reference/search/aggregations/bucket/sampler-aggregation.asciidoc +++ /dev/null @@ -1,154 +0,0 @@ -[[search-aggregations-bucket-sampler-aggregation]] -=== Sampler Aggregation - -experimental[] - -A filtering aggregation used to limit any sub aggregations' processing to a sample of the top-scoring documents. -Optionally, diversity settings can be used to limit the number of matches that share a common value such as an "author". - -.Example use cases: -* Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches -* Removing bias from analytics by ensuring fair representation of content from different sources -* Reducing the running cost of aggregations that can produce useful results using only samples e.g. `significant_terms` - - -Example: - -[source,js] --------------------------------------------------- -{ - "query": { - "match": { - "text": "iphone" - } - }, - "aggs": { - "sample": { - "sampler": { - "shard_size": 200, - "field" : "user.id" - }, - "aggs": { - "keywords": { - "significant_terms": { - "field": "text" - } - } - } - } - } -} --------------------------------------------------- - -Response: - -[source,js] --------------------------------------------------- -{ - ... - "aggregations": { - "sample": { - "doc_count": 1000,<1> - "keywords": {<2> - "doc_count": 1000, - "buckets": [ - ... - { - "key": "bend", - "doc_count": 58, - "score": 37.982536582524276, - "bg_count": 103 - }, - .... -} --------------------------------------------------- - -<1> 1000 documents were sampled in total because we asked for a maximum of 200 from an index with 5 shards.
The cost of performing the nested significant_terms aggregation was therefore limited rather than unbounded. -<2> The results of the significant_terms aggregation are not skewed by any single over-active Twitter user because we asked for a maximum of one tweet from any one user in our sample. - - -==== shard_size - -The `shard_size` parameter limits how many top-scoring documents are collected in the sample processed on each shard. -The default value is 100. - -=== Controlling diversity -Optionally, you can use the `field` or `script` and `max_docs_per_value` settings to control the maximum number of documents collected on any one shard which share a common value. -The choice of value (e.g. `author`) is loaded from a regular `field` or derived dynamically by a `script`. - -The aggregation will throw an error if the choice of field or script produces multiple values for a document. -It is currently not possible to offer this form of de-duplication using many values, primarily due to concerns over efficiency. - -NOTE: Any good market researcher will tell you that when working with samples of data it is important -that the sample represents a healthy variety of opinions rather than being skewed by any single voice. -The same is true with aggregations and sampling with these diversify settings can offer a way to remove the bias in your content (an over-populated geography, a large spike in a timeline or an over-active forum spammer). - -==== Field - -Controlling diversity using a field: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "sample" : { - "sampler" : { - "field" : "author", - "max_docs_per_value" : 3 - } - } - } -} --------------------------------------------------- - -Note that the `max_docs_per_value` setting applies on a per-shard basis only for the purposes of shard-local sampling. -It is not intended as a way of providing a global de-duplication feature on search results. 
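The per-shard nature of these limits can be made concrete with a small Python sketch. It is a simplified model, not how Elasticsearch implements the sampler: `sample_shard` is a hypothetical helper that greedily keeps the top-scoring hits of one shard while allowing at most `max_docs_per_value` hits per diversifying value, and the scores and authors are invented.

```python
def sample_shard(hits, shard_size=100, max_docs_per_value=1):
    """Illustrative only: diversified top-scoring sample for ONE shard.

    hits: list of (score, value) pairs, where value is the diversifying
    field value (for example an author).
    """
    taken_per_value = {}
    sample = []
    for score, value in sorted(hits, key=lambda h: h[0], reverse=True):
        if len(sample) >= shard_size:
            break
        if taken_per_value.get(value, 0) >= max_docs_per_value:
            continue  # this value already reached its per-shard quota
        taken_per_value[value] = taken_per_value.get(value, 0) + 1
        sample.append((score, value))
    return sample

# Two hypothetical shards; "alice" is over-active on both of them.
shard_a = [(0.9, "alice"), (0.8, "alice"), (0.7, "bob")]
shard_b = [(0.95, "alice"), (0.5, "carol")]

combined = sample_shard(shard_a) + sample_shard(shard_b)
print(combined)
```

Each shard keeps only one hit for `alice`, yet the combined sample still contains her twice, once per shard, which is why the setting is not a global de-duplication feature.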
- - - -==== Script - -Controlling diversity using a script: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "sample" : { - "sampler" : { - "script" : "doc['author'].value + '/' + doc['genre'].value" - } - } - } -} --------------------------------------------------- -Note in the above example we chose to use the default `max_docs_per_value` setting of 1 and combine author and genre fields to ensure -each shard sample has, at most, one match for an author/genre pair. - - -==== execution_hint - -When using the settings to control diversity, the optional `execution_hint` setting can influence the management of the values used for de-duplication. -Each option will hold up to `shard_size` values in memory while performing de-duplication but the type of value held can be controlled as follows: - - - hold field values directly (`map`) - - hold ordinals of the field as determined by the Lucene index (`global_ordinals`) - - hold hashes of the field values - with potential for hash collisions (`bytes_hash`) - -The default setting is to use `global_ordinals` if this information is available from the Lucene index and reverting to `map` if not. -The `bytes_hash` setting may prove faster in some cases but introduces the possibility of false positives in de-duplication logic due to the possibility of hash collisions. -Please note that Elasticsearch will ignore the choice of execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints. - -=== Limitations - -==== Cannot be nested under `breadth_first` aggregations -Being a quality-based filter the sampler aggregation needs access to the relevance score produced for each document. -It therefore cannot be nested under a `terms` aggregation which has the `collect_mode` switched from the default `depth_first` mode to `breadth_first` as this discards scores. -In this situation an error will be thrown. - -==== Limited de-dup logic. 
-The de-duplication logic in the diversify settings applies only at a shard level so will not apply across shards. - -==== No specialized syntax for geo/date fields -Currently the syntax for defining the diversifying values is defined by a choice of `field` or `script` - there is no added syntactical sugar for expressing geo or date units such as "1w" (1 week). -This support may be added in a later release and users will currently have to create these sorts of values using a script. \ No newline at end of file diff --git a/docs/reference/search/aggregations/bucket/significantterms-aggregation.asciidoc b/docs/reference/search/aggregations/bucket/significantterms-aggregation.asciidoc deleted file mode 100644 index 1e329db1df..0000000000 --- a/docs/reference/search/aggregations/bucket/significantterms-aggregation.asciidoc +++ /dev/null @@ -1,524 +0,0 @@ -[[search-aggregations-bucket-significantterms-aggregation]] -=== Significant Terms Aggregation - -An aggregation that returns interesting or unusual occurrences of terms in a set. - -experimental[The `significant_terms` aggregation can be very heavy when run on large indices. Work is in progress to provide more lightweight sampling techniques. As a result, the API for this feature may change in non-backwards compatible ways] - -.Example use cases: -* Suggesting "H5N1" when users search for "bird flu" in text -* Identifying the merchant that is the "common point of compromise" from the transaction history of credit card owners reporting loss -* Suggesting keywords relating to stock symbol $ATI for an automated news classifier -* Spotting the fraudulent doctor who is diagnosing more than his fair share of whiplash injuries -* Spotting the tire manufacturer who has a disproportionate number of blow-outs - -In all these cases the terms being selected are not simply the most popular terms in a set. 
-They are the terms that have undergone a significant change in popularity measured between a _foreground_ and _background_ set. -If the term "H5N1" only exists in 5 documents in a 10 million document index and yet is found in 4 of the 100 documents that make up a user's search results -that is significant and probably very relevant to their search. 5/10,000,000 vs 4/100 is a big swing in frequency. - -==== Single-set analysis - -In the simplest case, the _foreground_ set of interest is the search results matched by a query and the _background_ -set used for statistical comparisons is the index or indices from which the results were gathered. - -Example: - -[source,js] --------------------------------------------------- -{ - "query" : { - "terms" : {"force" : [ "British Transport Police" ]} - }, - "aggregations" : { - "significantCrimeTypes" : { - "significant_terms" : { "field" : "crime_type" } - } - } -} --------------------------------------------------- - -Response: - -[source,js] --------------------------------------------------- -{ - ... - - "aggregations" : { - "significantCrimeTypes" : { - "doc_count": 47347, - "buckets" : [ - { - "key": "Bicycle theft", - "doc_count": 3640, - "score": 0.371235374214817, - "bg_count": 66799 - } - ... - ] - } - } -} --------------------------------------------------- - -When querying an index of all crimes from all police forces, what these results show is that the British Transport Police force -stand out as a force dealing with a disproportionately large number of bicycle thefts. Ordinarily, bicycle thefts represent only 1% of crimes (66799/5064554) -but for the British Transport Police, who handle crime on railways and stations, 7% of crimes (3640/47347) is -a bike theft. This is a significant seven-fold increase in frequency and so this anomaly was highlighted as the top crime type. - -The problem with using a query to spot anomalies is it only gives us one subset to use for comparisons. 
-To discover all the other police forces' anomalies we would have to repeat the query for each of the different forces. - -This can be a tedious way to look for unusual patterns in an index - - - -==== Multi-set analysis -A simpler way to perform analysis across multiple categories is to use a parent-level aggregation to segment the data ready for analysis. - - -Example using a parent aggregation for segmentation: - -[source,js] --------------------------------------------------- -{ - "aggregations": { - "forces": { - "terms": {"field": "force"}, - "aggregations": { - "significantCrimeTypes": { - "significant_terms": {"field": "crime_type"} - } - } - } - } -} --------------------------------------------------- - -Response: - -[source,js] --------------------------------------------------- -{ - ... - - "aggregations": { - "forces": { - "buckets": [ - { - "key": "Metropolitan Police Service", - "doc_count": 894038, - "significantCrimeTypes": { - "doc_count": 894038, - "buckets": [ - { - "key": "Robbery", - "doc_count": 27617, - "score": 0.0599, - "bg_count": 53182 - }, - ... - ] - } - }, - { - "key": "British Transport Police", - "doc_count": 47347, - "significantCrimeTypes": { - "doc_count": 47347, - "buckets": [ - { - "key": "Bicycle theft", - "doc_count": 3640, - "score": 0.371, - "bg_count": 66799 - }, - ... - ] - } - } - ] - } -} - --------------------------------------------------- - -Now we have anomaly detection for each of the police forces using a single request. 
- -We can use other forms of top-level aggregations to segment our data, for example segmenting by geographic -area to identify unusual hot-spots of a particular crime type: - -[source,js] --------------------------------------------------- -{ - "aggs": { - "hotspots": { - "geohash_grid" : { - "field":"location", - "precision":5 - }, - "aggs": { - "significantCrimeTypes": { - "significant_terms": {"field": "crime_type"} - } - } - } - } -} --------------------------------------------------- - -This example uses the `geohash_grid` aggregation to create result buckets that represent geographic areas, and inside each -bucket we can identify anomalous levels of a crime type in these tightly-focused areas e.g. - -* Airports exhibit unusual numbers of weapon confiscations -* Universities show uplifts of bicycle thefts - -At a higher geohash_grid zoom-level with larger coverage areas we would start to see where an entire police-force may be -tackling an unusual volume of a particular crime type. - - -Obviously a time-based top-level segmentation would help identify current trends for each point in time -where a simple `terms` aggregation would typically show the very popular "constants" that persist across all time slots. - - - -.How are the scores calculated? -********************************** -The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than something easily understood by end users. The scores are derived from the doc frequencies in _foreground_ and _background_ sets. In brief, a term is considered significant if there is a noticeable difference in the frequency with which a term appears in the subset and in the background. The way the terms are ranked can be configured, see the "Parameters" section.
- -********************************** - - -==== Use on free-text fields - -The significant_terms aggregation can be used effectively on tokenized free-text fields to suggest: - -* keywords for refining end-user searches -* keywords for use in percolator queries - -WARNING: Picking a free-text field as the subject of a significant terms analysis can be expensive! It will attempt -to load every unique word into RAM. It is recommended to only use this on smaller indices. - -.Use the _"like this but not this"_ pattern -********************************** -You can spot mis-categorized content by first searching a structured field e.g. `category:adultMovie` and use significant_terms on the -free-text "movie_description" field. Take the suggested words (I'll leave them to your imagination) and then search for all movies NOT marked as category:adultMovie but containing these keywords. -You now have a ranked list of badly-categorized movies that you should reclassify or at least remove from the "familyFriendly" category. - -The significance score from each term can also provide a useful `boost` setting to sort matches. -Using the `minimum_should_match` setting of the `terms` query with the keywords will help control the balance of precision/recall in the result set i.e -a high setting would have a small number of relevant results packed full of keywords and a setting of "1" would produce a more exhaustive results set with all documents containing _any_ keyword. - -********************************** - -[TIP] -============ -.Show significant_terms in context - -Free-text significant_terms are much more easily understood when viewed in context. Take the results of `significant_terms` suggestions from a -free-text field and use them in a `terms` query on the same field with a `highlight` clause to present users with example snippets of documents. 
When the terms -are presented unstemmed, highlighted, with the right case, in the right order and with some context, their significance/meaning is more readily apparent. -============ - -==== Custom background sets - -Ordinarily, the foreground set of documents is "diffed" against a background set of all the documents in your index. -However, sometimes it may prove useful to use a narrower background set as the basis for comparisons. -For example, a query on documents relating to "Madrid" in an index with content from all over the world might reveal that "Spanish" -was a significant term. This may be true but if you want some more focused terms you could use a `background_filter` -on the term 'spain' to establish a narrower set of documents as context. With this as a background "Spanish" would now -be seen as commonplace and therefore not as significant as words like "capital" that relate more strongly with Madrid. -Note that using a background filter will slow things down - each term's background frequency must now be derived on-the-fly from filtering posting lists rather than reading the index's pre-computed count for a term. - -==== Limitations - -===== Significant terms must be indexed values -Unlike the terms aggregation it is currently not possible to use script-generated terms for counting purposes. -Because of the way the significant_terms aggregation must consider both _foreground_ and _background_ frequencies -it would be prohibitively expensive to use a script on the entire index to obtain background frequencies for comparisons. -Also DocValues are not supported as sources of term data for similar reasons. - -===== No analysis of floating point fields -Floating point fields are currently not supported as the subject of significant_terms analysis. 
-While integer or long fields can be used to represent concepts like bank account numbers or category numbers which -can be interesting to track, floating point fields are usually used to represent quantities of something. -As such, individual floating point terms are not useful for this form of frequency analysis. - -===== Use as a parent aggregation -If there is the equivalent of a `match_all` query or no query criteria providing a subset of the index the significant_terms aggregation should not be used as the -top-most aggregation - in this scenario the _foreground_ set is exactly the same as the _background_ set and -so there is no difference in document frequencies to observe and from which to make sensible suggestions. - -Another consideration is that the significant_terms aggregation produces many candidate results at shard level -that are only later pruned on the reducing node once all statistics from all shards are merged. As a result, -it can be inefficient and costly in terms of RAM to embed large child aggregations under a significant_terms -aggregation that later discards many candidate terms. It is advisable in these cases to perform two searches - the first to provide a rationalized list of -significant_terms and then add this shortlist of terms to a second query to go back and fetch the required child aggregations. - -===== Approximate counts -The counts of how many documents contain a term provided in results are based on summing the samples returned from each shard and -as such may be: - -* low if certain shards did not provide figures for a given term in their top sample -* high when considering the background frequency as it may count occurrences found in deleted documents - -Like most design decisions, this is the basis of a trade-off in which we have chosen to provide fast performance at the cost of some (typically small) inaccuracies. 
-However, the `size` and `shard size` settings covered in the next section provide tools to help control the accuracy levels. - -==== Parameters - -===== JLH score - -The scores are derived from the doc frequencies in _foreground_ and _background_ sets. The _absolute_ change in popularity (foregroundPercent - backgroundPercent) would favor common terms whereas the _relative_ change in popularity (foregroundPercent/ backgroundPercent) would favor rare terms. Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall. - -===== mutual information -Mutual information as described in "Information Retrieval", Manning et al., Chapter 13.5.1 can be used as significance score by adding the parameter - -[source,js] --------------------------------------------------- - - "mutual_information": { - "include_negatives": true - } --------------------------------------------------- - -Mutual information does not differentiate between terms that are descriptive for the subset or for documents outside the subset. The significant terms therefore can contain terms that appear more or less frequent in the subset than outside the subset. To filter out the terms that appear less often in the subset than in documents outside the subset, `include_negatives` can be set to `false`. - -Per default, the assumption is that the documents in the bucket are also contained in the background. 
If instead you defined a custom background filter that represents a different set of documents that you want to compare to, set - -[source,js] --------------------------------------------------- - -"background_is_superset": false --------------------------------------------------- - - -===== Chi square -Chi square as described in "Information Retrieval", Manning et al., Chapter 13.5.2 can be used as significance score by adding the parameter - -[source,js] --------------------------------------------------- - - "chi_square": { - } --------------------------------------------------- - -Chi square behaves like mutual information and can be configured with the same parameters `include_negatives` and `background_is_superset`. - - -===== google normalized distance -Google normalized distance as described in "The Google Similarity Distance", Cilibrasi and Vitanyi, 2007 (http://arxiv.org/pdf/cs/0412098v3.pdf) can be used as significance score by adding the parameter - -[source,js] --------------------------------------------------- - - "gnd": { - } --------------------------------------------------- - -`gnd` also accepts the `background_is_superset` parameter. - - -===== Percentage -A simple calculation of the number of documents in the foreground sample with a term divided by the number of documents in the background with the term. -By default this produces a score greater than zero and less than one. - -The benefit of this heuristic is that the scoring logic is simple to explain to anyone familiar with a "per capita" statistic. However, for fields with high cardinality there is a tendency for this heuristic to select the rarest terms such as typos that occur only once because they score 1/1 = 100%. - -It would be hard for a seasoned boxer to win a championship if the prize was awarded purely on the basis of percentage of fights won - by these rules a newcomer with only one fight under his belt would be impossible to beat. 
-Multiple observations are typically required to reinforce a view so it is recommended in these cases to set both `min_doc_count` and `shard_min_doc_count` to a higher value such as 10 in order to filter out the low-frequency terms that otherwise take precedence. - -[source,js] --------------------------------------------------- - - "percentage": { - } --------------------------------------------------- - - -===== Which one is best? - - -Roughly, `mutual_information` prefers high-frequency terms even if they also occur frequently in the background. For example, in an analysis of natural language text this might lead to selection of stop words. `mutual_information` is unlikely to select very rare terms like misspellings. `gnd` prefers terms with a high co-occurrence and avoids selection of stop words. It might be better suited for synonym detection. However, `gnd` has a tendency to select very rare terms that are, for example, a result of misspelling. `chi_square` and `jlh` are somewhat in-between. - -It is hard to say which one of the different heuristics will be the best choice as it depends on what the significant terms are used for (see for example [Yang and Pedersen, "A Comparative Study on Feature Selection in Text Categorization", 1997](http://courses.ischool.berkeley.edu/i256/f06/papers/yang97comparative.pdf) for a study on using significant terms for feature selection for text classification). - -If none of the above measures suits your use case, then another option is to implement a custom significance measure: - -===== scripted -Customized scores can be implemented via a script: - -[source,js] --------------------------------------------------- - - "script_heuristic": { - "script": "_subset_freq/(_superset_freq - _subset_freq + 1)" - } --------------------------------------------------- - -Scripts can be inline (as in the above example), indexed or stored on disk. For details on the options, see <>.
-Parameters need to be set as follows: - -[horizontal] -`script`:: Inline script, name of script file or name of indexed script. Mandatory. -`script_type`:: One of "inline" (default), "indexed" or "file". -`lang`:: Script language (default "groovy") -`params`:: Script parameters (default empty). - -Available parameters in the script are - -[horizontal] -`_subset_freq`:: Number of documents the term appears in in the subset. -`_superset_freq`:: Number of documents the term appears in in the superset. -`_subset_size`:: Number of documents in the subset. -`_superset_size`:: Number of documents in the superset. - -===== Size & Shard Size - -The `size` parameter can be set to define how many term buckets should be returned out of the overall terms list. By -default, the node coordinating the search process will request each shard to provide its own top term buckets -and once all shards respond, it will reduce the results to the final list that will then be returned to the client. -If the number of unique terms is greater than `size`, the returned list can be slightly off and not accurate -(it could be that the term counts are slightly off and it could even be that a term that should have been in the top -size buckets was not returned). - -If set to `0`, the `size` will be set to `Integer.MAX_VALUE`. - -To ensure better accuracy a multiple of the final `size` is used as the number of terms to request from each shard -using a heuristic based on the number of shards. To take manual control of this setting the `shard_size` parameter -can be used to control the volumes of candidate terms produced by each shard. - -Low-frequency terms can turn out to be the most interesting ones once all results are combined so the -significant_terms aggregation can produce higher-quality results when the `shard_size` parameter is set to -values significantly higher than the `size` setting. 
This ensures that a bigger volume of promising candidate terms are given -a consolidated review by the reducing node before the final selection. Obviously large candidate term lists -will cause extra network traffic and RAM usage so this is quality/cost trade off that needs to be balanced. If `shard_size` is set to -1 (the default) then `shard_size` will be automatically estimated based on the number of shards and the `size` parameter. - - -If set to `0`, the `shard_size` will be set to `Integer.MAX_VALUE`. - - -NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, elasticsearch will - override it and reset it to be equal to `size`. - -===== Minimum document count - -It is possible to only return terms that match more than a configured number of hits using the `min_doc_count` option: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "tags" : { - "significant_terms" : { - "field" : "tag", - "min_doc_count": 10 - } - } - } -} --------------------------------------------------- - -The above aggregation would only return tags which have been found in 10 hits or more. Default value is `3`. - - - - -Terms that score highly will be collected on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global term frequencies available. The decision if a term is added to a candidate list depends only on the score computed on the shard using local shard frequencies, not the global frequencies of the word. The `min_doc_count` criterion is only applied after merging local terms statistics of all shards. In a way the decision to add the term as a candidate is made without being very _certain_ about if the term will actually reach the required `min_doc_count`. This might cause many (globally) high frequent terms to be missing in the final result if low frequent but high scoring terms populated the candidate lists. 
To avoid this, the `shard_size` parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic. - -`shard_min_doc_count` parameter - -The parameter `shard_min_doc_count` regulates the _certainty_ a shard has if the term should actually be added to the candidate list or not with respect to the `min_doc_count`. Terms will only be considered if their local shard frequency within the set is higher than the `shard_min_doc_count`. If your dictionary contains many low frequent words and you are not interested in these (for example misspellings), then you can set the `shard_min_doc_count` parameter to filter out candidate terms on a shard level that will with a reasonable certainty not reach the required `min_doc_count` even after merging the local frequencies. `shard_min_doc_count` is set to `1` per default and has no effect unless you explicitly set it. - - - - -WARNING: Setting `min_doc_count` to `1` is generally not advised as it tends to return terms that - are typos or other bizarre curiosities. Finding more than one instance of a term helps - reinforce that, while still rare, the term was not the result of a one-off accident. The - default value of 3 is used to provide a minimum weight-of-evidence. - Setting `shard_min_doc_count` too high will cause significant candidate terms to be filtered out on a shard level. This value should be set much lower than `min_doc_count/#shards`. 
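The interaction between `shard_size`, `min_doc_count` and `shard_min_doc_count` described above can be illustrated with a small Python simulation. This is only a sketch under stated assumptions, not how Elasticsearch is implemented: the function name, the sample terms, their counts and their scores are all invented for illustration, and both thresholds are treated as "at least N documents".

```python
from collections import Counter

def significant_candidates(shards, shard_size, min_doc_count, shard_min_doc_count=1):
    """Hypothetical model of the two-step collection described in the text.

    `shards` is a list of shards; each shard is a list of
    (term, local_doc_count, local_score) tuples.
    """
    merged = Counter()
    for scored_terms in shards:
        # Shard-level filter: drop terms below shard_min_doc_count (assumed to mean ">=").
        eligible = [t for t in scored_terms if t[1] >= shard_min_doc_count]
        # Each shard ranks by its *local* score and keeps only shard_size candidates.
        eligible.sort(key=lambda t: t[2], reverse=True)
        for term, local_count, _score in eligible[:shard_size]:
            merged[term] += local_count
    # min_doc_count is applied only after the shard results have been merged.
    return {term: count for term, count in merged.items() if count >= min_doc_count}

# Invented data: rare misspellings score highly on each shard, "capital" is frequent.
SHARDS = [
    [("madird", 1, 0.9), ("mardid", 1, 0.8), ("capital", 6, 0.5)],
    [("madrdi", 1, 0.9), ("mdrid", 1, 0.85), ("capital", 5, 0.4)],
]

# Misspellings fill the candidate lists, then fail min_doc_count after merging.
default_result = significant_candidates(SHARDS, shard_size=2, min_doc_count=3)

# Filtering on the shard leaves room for the globally frequent term.
filtered_result = significant_candidates(
    SHARDS, shard_size=2, min_doc_count=3, shard_min_doc_count=2
)
```

In this toy data the first call returns nothing because the low-frequency, high-scoring terms crowd out "capital", while the second call keeps it. Raising `shard_size` to 3 would have the same effect at the cost of larger candidate lists.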
- - - -===== Custom background context - -The default source of statistical information for background term frequencies is the entire index and this -scope can be narrowed through the use of a `background_filter` to focus in on significant terms within a narrower -context: - -[source,js] --------------------------------------------------- -{ - "query" : { - "match" : "madrid" - }, - "aggs" : { - "tags" : { - "significant_terms" : { - "field" : "tag", - "background_filter": { - "term" : { "text" : "spain"} - } - } - } - } -} --------------------------------------------------- - -The above filter would help focus in on terms that were peculiar to the city of Madrid rather than revealing -terms like "Spanish" that are unusual in the full index's worldwide context but commonplace in the subset of documents containing the -word "Spain". - -WARNING: Use of background filters will slow the query as each term's postings must be filtered to determine a frequency - - -===== Filtering Values - -It is possible (although rarely required) to filter the values for which buckets will be created. This can be done using the `include` and -`exclude` parameters which are based on a regular expression string or arrays of exact terms. This functionality mirrors the features -described in the <> documentation. - - -===== Execution hint - - -There are different mechanisms by which terms aggregations can be executed: - - - by using field values directly in order to aggregate data per-bucket (`map`) - - by using ordinals of the field and preemptively allocating one bucket per ordinal value (`global_ordinals`) - - by using ordinals of the field and dynamically allocating one bucket per ordinal value (`global_ordinals_hash`) - -Elasticsearch tries to have sensible defaults so this is something that generally doesn't need to be configured. - -`map` should only be considered when very few documents match a query. Otherwise the ordinals-based execution modes -are significantly faster. 
By default, `map` is only used when running an aggregation on scripts, since they don't have -ordinals. - -`global_ordinals` is the second fastest option, but the fact that it preemptively allocates buckets can be memory-intensive, -especially if you have one or more sub aggregations. It is used by default on top-level terms aggregations. - -`global_ordinals_hash` on the contrary to `global_ordinals` and `global_ordinals_low_cardinality` allocates buckets dynamically -so memory usage is linear to the number of values of the documents that are part of the aggregation scope. It is used by default -in inner aggregations. - - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "tags" : { - "significant_terms" : { - "field" : "tags", - "execution_hint": "map" <1> - } - } - } -} --------------------------------------------------- - -<1> the possible values are `map`, `global_ordinals` and `global_ordinals_hash` - -Please note that Elasticsearch will ignore this execution hint if it is not applicable. - diff --git a/docs/reference/search/aggregations/bucket/terms-aggregation.asciidoc b/docs/reference/search/aggregations/bucket/terms-aggregation.asciidoc deleted file mode 100644 index 58a6ca2449..0000000000 --- a/docs/reference/search/aggregations/bucket/terms-aggregation.asciidoc +++ /dev/null @@ -1,657 +0,0 @@ -[[search-aggregations-bucket-terms-aggregation]] -=== Terms Aggregation - -A multi-bucket value source based aggregation where buckets are dynamically built - one per unique value. - -Example: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "genders" : { - "terms" : { "field" : "gender" } - } - } -} --------------------------------------------------- - -Response: - -[source,js] --------------------------------------------------- -{ - ... 
- - "aggregations" : { - "genders" : { - "doc_count_error_upper_bound": 0, <1> - "sum_other_doc_count": 0, <2> - "buckets" : [ <3> - { - "key" : "male", - "doc_count" : 10 - }, - { - "key" : "female", - "doc_count" : 10 - } - ] - } - } -} --------------------------------------------------- - -<1> an upper bound of the error on the document counts for each term, see <> -<2> when there are lots of unique terms, Elasticsearch only returns the top terms; this number is the sum of the document counts for all buckets that are not part of the response -<3> the list of the top buckets, the meaning of `top` being defined by the <> - -By default, the `terms` aggregation will return the buckets for the top ten terms ordered by the `doc_count`. One can -change this default behaviour by setting the `size` parameter. - -==== Size - -The `size` parameter can be set to define how many term buckets should be returned out of the overall terms list. By -default, the node coordinating the search process will request each shard to provide its own top `size` term buckets -and once all shards respond, it will reduce the results to the final list that will then be returned to the client. -This means that if the number of unique terms is greater than `size`, the returned list may be slightly off and inaccurate -(it could be that the term counts are slightly off and it could even be that a term that should have been in the top -size buckets was not returned). If set to `0`, the `size` will be set to `Integer.MAX_VALUE`. - -[[search-aggregations-bucket-terms-aggregation-approximate-counts]] -==== Document counts are approximate - -As described above, the document counts (and the results of any sub aggregations) in the terms aggregation are not always -accurate. This is because each shard provides its own view of what the ordered list of terms should be and these are -combined to give a final view.
Consider the following scenario: - -A request is made to obtain the top 5 terms in the field product, ordered by descending document count from an index with -3 shards. In this case each shard is asked to give its top 5 terms. - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "products" : { - "terms" : { - "field" : "product", - "size" : 5 - } - } - } -} --------------------------------------------------- - -The terms for each of the three shards are shown below with their -respective document counts in brackets: - -[width="100%",cols="^2,^2,^2,^2",options="header"] -|========================================================= -| | Shard A | Shard B | Shard C - -| 1 | Product A (25) | Product A (30) | Product A (45) -| 2 | Product B (18) | Product B (25) | Product C (44) -| 3 | Product C (6) | Product F (17) | Product Z (36) -| 4 | Product D (3) | Product Z (16) | Product G (30) -| 5 | Product E (2) | Product G (15) | Product E (29) -| 6 | Product F (2) | Product H (14) | Product H (28) -| 7 | Product G (2) | Product I (10) | Product Q (2) -| 8 | Product H (2) | Product Q (6) | Product D (1) -| 9 | Product I (1) | Product J (8) | -| 10 | Product J (1) | Product C (4) | - -|========================================================= - -The shards will return their top 5 terms so the results from the shards will be: - - -[width="100%",cols="^2,^2,^2,^2",options="header"] -|========================================================= -| | Shard A | Shard B | Shard C - -| 1 | Product A (25) | Product A (30) | Product A (45) -| 2 | Product B (18) | Product B (25) | Product C (44) -| 3 | Product C (6) | Product F (17) | Product Z (36) -| 4 | Product D (3) | Product Z (16) | Product G (30) -| 5 | Product E (2) | Product G (15) | Product E (29) - -|========================================================= - -Taking the top 5 results from each of the shards (as requested) and combining them to make a final top 5 list produces -the following: - 
-[width="40%",cols="^2,^2"] -|========================================================= - -| 1 | Product A (100) -| 2 | Product Z (52) -| 3 | Product C (50) -| 4 | Product G (45) -| 5 | Product B (43) - -|========================================================= - -Because Product A was returned from all shards we know that its document count value is accurate. Product C was only -returned by shards A and C so its document count is shown as 50 but this is not an accurate count. Product C exists on -shard B, but its count of 4 was not high enough to put Product C into the top 5 list for that shard. Product Z was also -returned only by 2 shards but the third shard does not contain the term. There is no way of knowing, at the point of -combining the results to produce the final list of terms, that there is an error in the document count for Product C and -not for Product Z. Product H has a document count of 44 across all 3 shards but was not included in the final list of -terms because it did not make it into the top five terms on any of the shards. - -==== Shard Size - -The higher the requested `size` is, the more accurate the results will be, but also, the more expensive it will be to -compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data -transfers between the nodes and the client). - -The `shard_size` parameter can be used to minimize the extra work that comes with bigger requested `size`. When defined, -it will determine how many terms the coordinating node will request from each shard. Once all the shards responded, the -coordinating node will then reduce them to a final result which will be based on the `size` parameter - this way, -one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to -the client. If set to `0`, the `shard_size` will be set to `Integer.MAX_VALUE`. 
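The effect of `shard_size` on the worked example above can be reproduced with a short Python sketch. It is a simplified model of the reduce step rather than the actual Elasticsearch implementation: the per-shard counts are copied from the tables above, while the function names and the tie-break by term name are assumptions made for illustration.

```python
from collections import Counter

# Per-shard document counts copied from the three-shard example above.
SHARDS = {
    "A": {"Product A": 25, "Product B": 18, "Product C": 6, "Product D": 3,
          "Product E": 2, "Product F": 2, "Product G": 2, "Product H": 2,
          "Product I": 1, "Product J": 1},
    "B": {"Product A": 30, "Product B": 25, "Product F": 17, "Product Z": 16,
          "Product G": 15, "Product H": 14, "Product I": 10, "Product Q": 6,
          "Product J": 8, "Product C": 4},
    "C": {"Product A": 45, "Product C": 44, "Product Z": 36, "Product G": 30,
          "Product E": 29, "Product H": 28, "Product Q": 2, "Product D": 1},
}

def rank(counts):
    """Order (term, count) pairs by count descending, then by term (assumed tie-break)."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def merge_top_terms(shards, size, shard_size):
    """Ask every shard for its top `shard_size` terms, sum them, keep the top `size`."""
    merged = Counter()
    for counts in shards.values():
        for term, count in rank(counts)[:shard_size]:
            merged[term] += count
    return rank(merged)[:size]

# shard_size == size: the approximate list shown in the example.
approximate = merge_top_terms(SHARDS, size=5, shard_size=5)

# A larger shard_size lets every shard report all of its terms in this data set.
accurate = merge_top_terms(SHARDS, size=5, shard_size=10)
```

With `shard_size=5` the sketch returns Product A (100), Product Z (52), Product C (50), Product G (45) and Product B (43), matching the table above. With `shard_size=10` it returns Product A (100), Product C (54), Product Z (52), Product G (47) and Product H (44), while the client still receives only five buckets.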
- - -NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, Elasticsearch will - override it and reset it to be equal to `size`. - -It is possible to not limit the number of terms that are returned by setting `size` to `0`. Don't use this -on high-cardinality fields as this will kill both your CPU, since terms need to be returned sorted, and your network. - -The default `shard_size` is a multiple of the `size` parameter which is dependent on the number of shards. - -==== Calculating Document Count Error - -There are two error values which can be shown on the terms aggregation. The first gives a value for the aggregation as -a whole which represents the maximum potential document count for a term which did not make it into the final list of -terms. This is calculated as the sum of the document count from the last term returned from each shard. For the example -given above the value would be 46 (2 + 15 + 29). This means that in the worst case scenario a term which was not returned -could have the 4th highest document count. - -[source,js] --------------------------------------------------- -{ - ... - - "aggregations" : { - "products" : { - "doc_count_error_upper_bound" : 46, - "buckets" : [ - { - "key" : "Product A", - "doc_count" : 100 - }, - { - "key" : "Product Z", - "doc_count" : 52 - }, - ... - ] - } - } -} --------------------------------------------------- - -==== Per bucket document count error - -experimental[] - -The second error value can be enabled by setting the `show_term_doc_count_error` parameter to true. This shows an error value -for each term returned by the aggregation which represents the 'worst case' error in the document count and can be useful when -deciding on a value for the `shard_size` parameter. This is calculated by summing the document counts for the last term returned -by all shards which did not return the term.
In the example above the error in the document count for Product C would be 15 as -Shard B was the only shard not to return the term and the document count of the last term it did return was 15. The actual document -count of Product C was 54 so the document count was only actually off by 4 even though the worst case was that it would be off by -15. Product A, however, has an error of 0 for its document count: since every shard returned it, we can be confident that the count -returned is accurate. - -[source,js] --------------------------------------------------- -{ - ... - - "aggregations" : { - "products" : { - "doc_count_error_upper_bound" : 46, - "buckets" : [ - { - "key" : "Product A", - "doc_count" : 100, - "doc_count_error_upper_bound" : 0 - }, - { - "key" : "Product Z", - "doc_count" : 52, - "doc_count_error_upper_bound" : 2 - }, - ... - ] - } - } -} --------------------------------------------------- - -These errors can only be calculated in this way when the terms are ordered by descending document count. When the aggregation is -ordered by the terms values themselves (either ascending or descending) there is no error in the document count since if a shard -does not return a particular term which appears in the results from another shard, it must not have that term in its index. When the -aggregation is either sorted by a sub aggregation or in order of ascending document count, the error in the document counts cannot be -determined and is given a value of -1 to indicate this. - -[[search-aggregations-bucket-terms-aggregation-order]] -==== Order - -The order of the buckets can be customized by setting the `order` parameter. By default, the buckets are ordered by -their `doc_count` descending.
It is also possible to change this behaviour as follows: - -Ordering the buckets by their `doc_count` in an ascending manner: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "genders" : { - "terms" : { - "field" : "gender", - "order" : { "_count" : "asc" } - } - } - } -} --------------------------------------------------- - -Ordering the buckets alphabetically by their terms in an ascending manner: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "genders" : { - "terms" : { - "field" : "gender", - "order" : { "_term" : "asc" } - } - } - } -} --------------------------------------------------- - - -Ordering the buckets by single value metrics sub-aggregation (identified by the aggregation name): - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "genders" : { - "terms" : { - "field" : "gender", - "order" : { "avg_height" : "desc" } - }, - "aggs" : { - "avg_height" : { "avg" : { "field" : "height" } } - } - } - } -} --------------------------------------------------- - -Ordering the buckets by multi value metrics sub-aggregation (identified by the aggregation name): - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "genders" : { - "terms" : { - "field" : "gender", - "order" : { "height_stats.avg" : "desc" } - }, - "aggs" : { - "height_stats" : { "stats" : { "field" : "height" } } - } - } - } -} --------------------------------------------------- - -It is also possible to order the buckets based on a "deeper" aggregation in the hierarchy. This is supported as long -as the aggregations path are of a single-bucket type, where the last aggregation in the path may either be a single-bucket -one or a metrics one. If it's a single-bucket type, the order will be defined by the number of docs in the bucket (i.e. 
`doc_count`), -in case it's a metrics one, the same rules as above apply (where the path must indicate the metric name to sort by in case of -a multi-value metrics aggregation, and in case of a single-value metrics aggregation the sort will be applied on that value). - -The path must be defined in the following form: - --------------------------------------------------- -AGG_SEPARATOR := '>' -METRIC_SEPARATOR := '.' -AGG_NAME := <the name of the aggregation> -METRIC := <the name of the metric (in case of multi-value metrics aggregation)> -PATH := <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR><METRIC>] --------------------------------------------------- - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "countries" : { - "terms" : { - "field" : "address.country", - "order" : { "females>height_stats.avg" : "desc" } - }, - "aggs" : { - "females" : { - "filter" : { "term" : { "gender" : "female" }}, - "aggs" : { - "height_stats" : { "stats" : { "field" : "height" }} - } - } - } - } - } -} --------------------------------------------------- - -The above will sort the countries buckets based on the average height among the female population. - -Multiple criteria can be used to order the buckets by providing an array of order criteria such as the following: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "countries" : { - "terms" : { - "field" : "address.country", - "order" : [ { "females>height_stats.avg" : "desc" }, { "_count" : "desc" } ] - }, - "aggs" : { - "females" : { - "filter" : { "term" : { "gender" : "female" }}, - "aggs" : { - "height_stats" : { "stats" : { "field" : "height" }} - } - } - } - } - } -} --------------------------------------------------- - -The above will sort the countries buckets based on the average height among the female population and then by -their `doc_count` in descending order. - -NOTE: In the event that two buckets share the same values for all order criteria, the bucket's term value is used as a -tie-breaker in ascending alphabetical order to prevent non-deterministic ordering of buckets.
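The multi-criteria ordering and the term tie-breaker described above can be mimicked with a composite sort key. The following Python sketch only illustrates the ordering rule and does not reflect Elasticsearch internals: the bucket values are invented, and `females_avg_height` is a hypothetical flattened stand-in for the `females>height_stats.avg` path.

```python
# Invented buckets; keys and numbers are for illustration only.
buckets = [
    {"key": "france", "doc_count": 8, "females_avg_height": 165.0},
    {"key": "brazil", "doc_count": 8, "females_avg_height": 165.0},
    {"key": "japan", "doc_count": 12, "females_avg_height": 158.0},
    {"key": "chile", "doc_count": 5, "females_avg_height": 165.0},
]

def order_buckets(buckets):
    """Sort by average height desc, then doc_count desc, then term asc as the tie-breaker."""
    return sorted(
        buckets,
        key=lambda b: (-b["females_avg_height"], -b["doc_count"], b["key"]),
    )

ordered_keys = [b["key"] for b in order_buckets(buckets)]
```

Here "brazil" and "france" share both the average height and the document count, so the term value alone decides that "brazil" comes first, which keeps the result stable between runs.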
- -==== Minimum document count - -It is possible to only return terms that match more than a configured number of hits using the `min_doc_count` option: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "tags" : { - "terms" : { - "field" : "tags", - "min_doc_count": 10 - } - } - } -} --------------------------------------------------- - -The above aggregation would only return tags which have been found in 10 hits or more. Default value is `1`. - - -Terms are collected and ordered on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global document count available. The decision if a term is added to a candidate list depends only on the order computed on the shard using local shard frequencies. The `min_doc_count` criterion is only applied after merging local terms statistics of all shards. In a way the decision to add the term as a candidate is made without being very _certain_ about if the term will actually reach the required `min_doc_count`. This might cause many (globally) high frequent terms to be missing in the final result if low frequent terms populated the candidate lists. To avoid this, the `shard_size` parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic. - -`shard_min_doc_count` parameter - -The parameter `shard_min_doc_count` regulates the _certainty_ a shard has if the term should actually be added to the candidate list or not with respect to the `min_doc_count`. Terms will only be considered if their local shard frequency within the set is higher than the `shard_min_doc_count`. 
If your dictionary contains many low-frequency terms and you are not interested in those (for example misspellings), then you can set the `shard_min_doc_count` parameter to filter out candidate terms on a shard level that will with a reasonable certainty not reach the required `min_doc_count` even after merging the local counts. `shard_min_doc_count` is set to `0` by default and has no effect unless you explicitly set it. - - - -NOTE: Setting `min_doc_count`=`0` will also return buckets for terms that didn't match any hit. However, some of - the returned terms which have a document count of zero might only belong to deleted documents or documents - from other types, so there is no guarantee that a `match_all` query would find a positive document count for - those terms. - -WARNING: When NOT sorting on `doc_count` descending, high values of `min_doc_count` may return a number of buckets - which is less than `size` because not enough data was gathered from the shards. Missing buckets can be - brought back by increasing `shard_size`. - Setting `shard_min_doc_count` too high will cause terms to be filtered out on a shard level. This value should be set much lower than `min_doc_count/#shards`. - -[[search-aggregations-bucket-terms-aggregation-script]] -==== Script - -Generating the terms using a script: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "genders" : { - "terms" : { - "script" : "doc['gender'].value" - } - } - } -} --------------------------------------------------- - -TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory.
- - -==== Value Script - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "genders" : { - "terms" : { - "field" : "gender", - "script" : "'Gender: ' +_value" - } - } - } -} --------------------------------------------------- - - -==== Filtering Values - -It is possible to filter the values for which buckets will be created. This can be done using the `include` and -`exclude` parameters which are based on regular expression strings or arrays of exact values. - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "tags" : { - "terms" : { - "field" : "tags", - "include" : ".*sport.*", - "exclude" : "water_.*" - } - } - } -} --------------------------------------------------- - -In the above example, buckets will be created for all the tags that have the word `sport` in them, except those starting -with `water_` (so the tag `water_sports` will not be aggregated). The `include` regular expression will determine what -values are "allowed" to be aggregated, while the `exclude` determines the values that should not be aggregated. When -both are defined, the `exclude` has precedence, meaning, the `include` is evaluated first and only then the `exclude`. - -The syntax is the same as <<regexp-syntax,regexp queries>>. - -For matching based on exact values the `include` and `exclude` parameters can simply take an array of -strings that represent the terms as they are found in the index: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "JapaneseCars" : { - "terms" : { - "field" : "make", - "include" : ["mazda", "honda"] - } - }, - "ActiveCarManufacturers" : { - "terms" : { - "field" : "make", - "exclude" : ["rover", "jensen"] - } - } - } -} --------------------------------------------------- - -==== Multi-field terms aggregation - -The `terms` aggregation does not support collecting terms from multiple fields -in the same document.
The reason is that the `terms` agg doesn't collect the -string term values themselves, but rather uses -<<search-aggregations-bucket-terms-aggregation-execution-hint,global ordinals>> -to produce a list of all of the unique values in the field. Global ordinals -result in an important performance boost which would not be possible across -multiple fields. - -There are two approaches that you can use to perform a `terms` agg across -multiple fields: - -<<search-aggregations-bucket-terms-aggregation-script,Script>>:: - -Use a script to retrieve terms from multiple fields. This disables the global -ordinals optimization and will be slower than collecting terms from a single -field, but it gives you the flexibility to implement this option at search -time. - -<<copy-to,`copy_to` field>>:: - -If you know ahead of time that you want to collect the terms from two or more -fields, then use `copy_to` in your mapping to create a new dedicated field at -index time which contains the values from both fields. You can aggregate on -this single field, which will benefit from the global ordinals optimization. - -==== Collect mode - -Deferring calculation of child aggregations - -For fields with many unique terms and a small number of required results it can be more efficient to delay the calculation -of child aggregations until the top parent-level aggs have been pruned. Ordinarily, all branches of the aggregation tree -are expanded in one depth-first pass and only then does any pruning occur. In some rare scenarios this can be very wasteful and can hit memory constraints.
-An example problem scenario is querying a movie database for the 10 most popular actors and their 5 most common co-stars: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "actors" : { - "terms" : { - "field" : "actors", - "size" : 10 - }, - "aggs" : { - "costars" : { - "terms" : { - "field" : "actors", - "size" : 5 - } - } - } - } - } -} --------------------------------------------------- - -Even though the number of movies may be comparatively small and we want only 50 result buckets there is a combinatorial explosion of buckets -during calculation - a single movie will produce n² buckets where n is the number of actors. The sane option would be to first determine -the 10 most popular actors and only then examine the top co-stars for these 10 actors. This alternative strategy is what we call the `breadth_first` collection -mode as opposed to the default `depth_first` mode: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "actors" : { - "terms" : { - "field" : "actors", - "size" : 10, - "collect_mode" : "breadth_first" - }, - "aggs" : { - "costars" : { - "terms" : { - "field" : "actors", - "size" : 5 - } - } - } - } - } -} --------------------------------------------------- - - -When using `breadth_first` mode the set of documents that fall into the uppermost buckets are -cached for subsequent replay so there is a memory overhead in doing this which is linear with the number of matching documents. -In most requests the volume of buckets generated is smaller than the number of documents that fall into them so the default `depth_first` -collection mode is normally the best bet but occasionally the `breadth_first` strategy can be significantly more efficient. Currently -elasticsearch will always use the `depth_first` collect_mode unless explicitly instructed to use `breadth_first` as in the above example. 
-Note that the `order` parameter can still be used to refer to data from a child aggregation when using the `breadth_first` setting - the parent -aggregation understands that this child aggregation will need to be called first before any of the other child aggregations. - -WARNING: It is not possible to nest aggregations such as `top_hits` which require access to match score information under an aggregation that uses -the `breadth_first` collection mode. This is because this would require a RAM buffer to hold the float score value for every document and -this would typically be too costly in terms of RAM. - -[[search-aggregations-bucket-terms-aggregation-execution-hint]] -==== Execution hint - -experimental[The automated execution optimization is experimental, so this parameter is provided temporarily as a way to override the default behaviour] - -There are different mechanisms by which terms aggregations can be executed: - - - by using field values directly in order to aggregate data per-bucket (`map`) - - by using ordinals of the field and preemptively allocating one bucket per ordinal value (`global_ordinals`) - - by using ordinals of the field and dynamically allocating one bucket per ordinal value (`global_ordinals_hash`) - - by using per-segment ordinals to compute counts and remap these counts to global counts using global ordinals (`global_ordinals_low_cardinality`) - -Elasticsearch tries to have sensible defaults so this is something that generally doesn't need to be configured. - -`map` should only be considered when very few documents match a query. Otherwise the ordinals-based execution modes -are significantly faster. By default, `map` is only used when running an aggregation on scripts, since they don't have -ordinals. - -`global_ordinals_low_cardinality` only works for leaf terms aggregations but is usually the fastest execution mode. 
Memory -usage is linear with the number of unique values in the field, so it is only enabled by default on low-cardinality fields. - -`global_ordinals` is the second fastest option, but the fact that it preemptively allocates buckets can be memory-intensive, -especially if you have one or more sub aggregations. It is used by default on top-level terms aggregations. - -`global_ordinals_hash`, in contrast to `global_ordinals` and `global_ordinals_low_cardinality`, allocates buckets dynamically -so memory usage is linear in the number of values of the documents that are part of the aggregation scope. It is used by default -in inner aggregations. - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "tags" : { - "terms" : { - "field" : "tags", - "execution_hint": "map" <1> - } - } - } -} --------------------------------------------------- - -<1> experimental[] the possible values are `map`, `global_ordinals`, `global_ordinals_hash` and `global_ordinals_low_cardinality` - -Please note that Elasticsearch will ignore this execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints.
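The shard-level candidate selection described under "Minimum document count" can be illustrated with a small simulation. This is a simplified Python sketch, not Elasticsearch's implementation: the function `merged_terms` and the sample data are hypothetical, each shard is modelled as a plain dictionary of local term counts, and ordering is assumed to be by descending count.

```python
from collections import Counter


def merged_terms(shard_counts, shard_size, min_doc_count):
    """Merge per-shard top terms, then apply min_doc_count globally."""
    merged = Counter()
    for counts in shard_counts:
        # Each shard returns only its `shard_size` most frequent local terms.
        for term, count in Counter(counts).most_common(shard_size):
            merged[term] += count
    # min_doc_count is applied only after the shard results are merged.
    return {term: count for term, count in merged.items() if count >= min_doc_count}


shards = [
    {"sport": 5, "sprot": 6},   # "sprot" is a locally frequent misspelling
    {"sport": 5, "spotr": 6},
]
small = merged_terms(shards, shard_size=1, min_doc_count=10)
large = merged_terms(shards, shard_size=2, min_doc_count=10)
print(small, large)
```

With `shard_size=1` the globally frequent term never reaches the merge step because a locally more frequent term crowds it out on each shard; raising `shard_size` to 2 lets it through, at the cost of sending more candidates between nodes.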
diff --git a/docs/reference/search/aggregations/metrics.asciidoc b/docs/reference/search/aggregations/metrics.asciidoc deleted file mode 100644 index 7dbbd090bb..0000000000 --- a/docs/reference/search/aggregations/metrics.asciidoc +++ /dev/null @@ -1,27 +0,0 @@ -[[search-aggregations-metrics]] - -include::metrics/min-aggregation.asciidoc[] - -include::metrics/max-aggregation.asciidoc[] - -include::metrics/sum-aggregation.asciidoc[] - -include::metrics/avg-aggregation.asciidoc[] - -include::metrics/stats-aggregation.asciidoc[] - -include::metrics/extendedstats-aggregation.asciidoc[] - -include::metrics/valuecount-aggregation.asciidoc[] - -include::metrics/percentile-aggregation.asciidoc[] - -include::metrics/percentile-rank-aggregation.asciidoc[] - -include::metrics/cardinality-aggregation.asciidoc[] - -include::metrics/geobounds-aggregation.asciidoc[] - -include::metrics/tophits-aggregation.asciidoc[] - -include::metrics/scripted-metric-aggregation.asciidoc[] diff --git a/docs/reference/search/aggregations/metrics/avg-aggregation.asciidoc b/docs/reference/search/aggregations/metrics/avg-aggregation.asciidoc deleted file mode 100644 index 3f029984ba..0000000000 --- a/docs/reference/search/aggregations/metrics/avg-aggregation.asciidoc +++ /dev/null @@ -1,75 +0,0 @@ -[[search-aggregations-metrics-avg-aggregation]] -=== Avg Aggregation - -A `single-value` metrics aggregation that computes the average of numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script. - -Assuming the data consists of documents representing exams grades (between 0 and 100) of students - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "avg_grade" : { "avg" : { "field" : "grade" } } - } -} --------------------------------------------------- - -The above aggregation computes the average grade over all documents. 
The aggregation type is `avg` and the `field` setting defines the numeric field of the documents the average will be computed on. The above will return the following: - - -[source,js] --------------------------------------------------- -{ - ... - - "aggregations": { - "avg_grade": { - "value": 75 - } - } -} --------------------------------------------------- - -The name of the aggregation (`avg_grade` above) also serves as the key by which the aggregation result can be retrieved from the returned response. - -==== Script - -Computing the average grade based on a script: - -[source,js] --------------------------------------------------- -{ - ..., - - "aggs" : { - "avg_grade" : { "avg" : { "script" : "doc['grade'].value" } } - } -} --------------------------------------------------- - -TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. - -===== Value Script - -It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use a value script to get the new average: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - ... - - "aggs" : { - "avg_corrected_grade" : { - "avg" : { - "field" : "grade", - "script" : "_value * correction", - "params" : { - "correction" : 1.2 - } - } - } - } - } -} --------------------------------------------------- \ No newline at end of file diff --git a/docs/reference/search/aggregations/metrics/cardinality-aggregation.asciidoc b/docs/reference/search/aggregations/metrics/cardinality-aggregation.asciidoc deleted file mode 100644 index 07943a06c2..0000000000 --- a/docs/reference/search/aggregations/metrics/cardinality-aggregation.asciidoc +++ /dev/null @@ -1,157 +0,0 @@ -[[search-aggregations-metrics-cardinality-aggregation]] -=== Cardinality Aggregation - -A `single-value` metrics aggregation that calculates an approximate count of -distinct values.
Values can be extracted either from specific fields in the -document or generated by a script. - -Assume you are indexing books and would like to count the unique authors that -match a query: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "author_count" : { - "cardinality" : { - "field" : "author" - } - } - } -} --------------------------------------------------- - -==== Precision control - -This aggregation also supports the `precision_threshold` and `rehash` options: - -experimental[The `precision_threshold` and `rehash` options are specific to the current internal implementation of the `cardinality` agg, which may change in the future] - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "author_count" : { - "cardinality" : { - "field" : "author_hash", - "precision_threshold": 100, <1> - "rehash": false <2> - } - } - } -} --------------------------------------------------- - -<1> The `precision_threshold` options allows to trade memory for accuracy, and -defines a unique count below which counts are expected to be close to -accurate. Above this value, counts might become a bit more fuzzy. The maximum -supported value is 40000, thresholds above this number will have the same -effect as a threshold of 40000. -Default value depends on the number of parent aggregations that multiple -create buckets (such as terms or histograms). -<2> If you computed a hash on client-side, stored it into your documents and want -Elasticsearch to use them to compute counts using this hash function without -rehashing values, it is possible to specify `rehash: false`. Default value is -`true`. Please note that the hash must be indexed as a long when `rehash` is -false. - -==== Counts are approximate - -Computing exact counts requires loading values into a hash set and returning its -size. 
This doesn't scale when working on high-cardinality sets and/or large -values as the required memory usage and the need to communicate those -per-shard sets between nodes would utilize too many resources of the cluster. - -This `cardinality` aggregation is based on the -http://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++] -algorithm, which counts based on the hashes of the values with some interesting -properties: - - * configurable precision, which decides on how to trade memory for accuracy, - * excellent accuracy on low-cardinality sets, - * fixed memory usage: no matter if there are tens or billions of unique values, - memory usage only depends on the configured precision. - -For a precision threshold of `c`, the implementation that we are using requires -about `c * 8` bytes. - -The following chart shows how the error varies before and after the threshold: - -image:images/cardinality_error.png[] - -For all 3 thresholds, counts have been accurate up to the configured threshold -(although not guaranteed, this is likely to be the case). Please also note that -even with a threshold as low as 100, the error remains under 5%, even when -counting millions of items. - -==== Pre-computed hashes - -If you don't want Elasticsearch to re-compute hashes on every run of this -aggregation, it is possible to use pre-computed hashes, either by computing a -hash on client-side, indexing it and specifying `rehash: false`, or by using -the special `murmur3` field mapper, typically in the context of a `multi-field` -in the mapping: - -[source,js] --------------------------------------------------- -{ - "author": { - "type": "string", - "fields": { - "hash": { - "type": "murmur3" - } - } - } -} --------------------------------------------------- - -With such a mapping, Elasticsearch is going to compute hashes of the `author` -field at indexing time and store them in the `author.hash` field. 
This -way, unique counts can be computed using the cardinality aggregation by only -loading the hashes into memory, not the values of the `author` field, and -without computing hashes on the fly: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "author_count" : { - "cardinality" : { - "field" : "author.hash" - } - } - } -} --------------------------------------------------- - -NOTE: `rehash` is automatically set to `false` when computing unique counts on -a `murmur3` field. - -NOTE: Pre-computing hashes is usually only useful on very large and/or -high-cardinality fields as it saves CPU and memory. However, on numeric -fields, hashing is very fast and storing the original values requires as much -or less memory than storing the hashes. This is also true on low-cardinality -string fields, especially given that those have an optimization in order to -make sure that hashes are computed at most once per unique value per segment. - -==== Script - -The `cardinality` metric supports scripting, with a noticeable performance hit -however since hashes need to be computed on the fly. - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "author_count" : { - "cardinality" : { - "script": "doc['author.first_name'].value + ' ' + doc['author.last_name'].value" - } - } - } -} --------------------------------------------------- - -TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. 
- diff --git a/docs/reference/search/aggregations/metrics/extendedstats-aggregation.asciidoc b/docs/reference/search/aggregations/metrics/extendedstats-aggregation.asciidoc deleted file mode 100644 index 07d25fac65..0000000000 --- a/docs/reference/search/aggregations/metrics/extendedstats-aggregation.asciidoc +++ /dev/null @@ -1,119 +0,0 @@ -[[search-aggregations-metrics-extendedstats-aggregation]] -=== Extended Stats Aggregation - -A `multi-value` metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script. - -The `extended_stats` aggregation is an extended version of the <<search-aggregations-metrics-stats-aggregation,`stats`>> aggregation, where additional metrics are added such as `sum_of_squares`, `variance`, `std_deviation` and `std_deviation_bounds`. - -Assuming the data consists of documents representing exam grades (between 0 and 100) of students - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "grades_stats" : { "extended_stats" : { "field" : "grade" } } - } -} --------------------------------------------------- - -The above aggregation computes the grades statistics over all documents. The aggregation type is `extended_stats` and the `field` setting defines the numeric field of the documents the stats will be computed on. The above will return the following: - - -[source,js] --------------------------------------------------- -{ - ...
- - "aggregations": { - "grades_stats": { - "count": 9, - "min": 72, - "max": 99, - "avg": 86, - "sum": 774, - "sum_of_squares": 67028, - "variance": 51.55555555555556, - "std_deviation": 7.180219742846005, - "std_deviation_bounds": { - "upper": 100.36043948569201, - "lower": 71.63956051430799 - } - } - } -} --------------------------------------------------- - -The name of the aggregation (`grades_stats` above) also serves as the key by which the aggregation result can be retrieved from the returned response. - -==== Standard Deviation Bounds -By default, the `extended_stats` metric will return an object called `std_deviation_bounds`, which provides an interval of plus/minus two standard -deviations from the mean. This can be a useful way to visualize variance of your data. If you want a different boundary, for example -three standard deviations, you can set `sigma` in the request: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "grades_stats" : { - "extended_stats" : { - "field" : "grade", - "sigma" : 3 <1> - } - } - } -} --------------------------------------------------- -<1> `sigma` controls how many standard deviations +/- from the mean should be displayed - -`sigma` can be any non-negative double, meaning you can request non-integer values such as `1.5`. A value of `0` is valid, but will simply -return the average for both `upper` and `lower` bounds. - -.Standard Deviation and Bounds require normality -[NOTE] -===== -The standard deviation and its bounds are displayed by default, but they are not always applicable to all data-sets. Your data must -be normally distributed for the metrics to make sense. The statistics behind standard deviations assume normally distributed data, so -if your data is skewed heavily left or right, the value returned will be misleading.
-===== - -==== Script - -Computing the grades stats based on a script: - -[source,js] --------------------------------------------------- -{ - ..., - - "aggs" : { - "grades_stats" : { "extended_stats" : { "script" : "doc['grade'].value" } } - } -} --------------------------------------------------- - -TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. - -===== Value Script - -It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use value script to get the new stats: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - ... - - "aggs" : { - "grades_stats" : { - "extended_stats" : { - "field" : "grade", - "script" : "_value * correction", - "params" : { - "correction" : 1.2 - } - } - } - } - } -} --------------------------------------------------- \ No newline at end of file diff --git a/docs/reference/search/aggregations/metrics/geobounds-aggregation.asciidoc b/docs/reference/search/aggregations/metrics/geobounds-aggregation.asciidoc deleted file mode 100644 index ade59477ee..0000000000 --- a/docs/reference/search/aggregations/metrics/geobounds-aggregation.asciidoc +++ /dev/null @@ -1,53 +0,0 @@ -[[search-aggregations-metrics-geobounds-aggregation]] -=== Geo Bounds Aggregation - -A metric aggregation that computes the bounding box containing all geo_point values for a field. 
- - -Example: - -[source,js] --------------------------------------------------- -{ - "query" : { - "match" : { "business_type" : "shop" } - }, - "aggs" : { - "viewport" : { - "geo_bounds" : { - "field" : "location", <1> - "wrap_longitude" : true <2> - } - } - } -} --------------------------------------------------- - -<1> The `geo_bounds` aggregation specifies the field to use to obtain the bounds -<2> `wrap_longitude` is an optional parameter which specifies whether the bounding box should be allowed to overlap the international date line. The default value is `true` - -The above aggregation demonstrates how one would compute the bounding box of the location field for all documents with a business type of shop - -The response for the above aggregation: - -[source,js] --------------------------------------------------- -{ - ... - - "aggregations": { - "viewport": { - "bounds": { - "top_left": { - "lat": 80.45, - "lon": -160.22 - }, - "bottom_right": { - "lat": 40.65, - "lon": 42.57 - } - } - } - } -} --------------------------------------------------- diff --git a/docs/reference/search/aggregations/metrics/max-aggregation.asciidoc b/docs/reference/search/aggregations/metrics/max-aggregation.asciidoc deleted file mode 100644 index facefc1201..0000000000 --- a/docs/reference/search/aggregations/metrics/max-aggregation.asciidoc +++ /dev/null @@ -1,69 +0,0 @@ -[[search-aggregations-metrics-max-aggregation]] -=== Max Aggregation - -A `single-value` metrics aggregation that keeps track and returns the maximum value among the numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script. 
- -Computing the max price value across all documents: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "max_price" : { "max" : { "field" : "price" } } - } -} --------------------------------------------------- - -Response: - -[source,js] --------------------------------------------------- -{ - ... - - "aggregations": { - "max_price": { - "value": 35 - } - } -} --------------------------------------------------- - -As can be seen, the name of the aggregation (`max_price` above) also serves as the key by which the aggregation result can be retrieved from the returned response. - -==== Script - -Computing the max price value across all documents, this time using a script: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "max_price" : { "max" : { "script" : "doc['price'].value" } } - } -} --------------------------------------------------- - -TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. - -==== Value Script - -Let's say that the prices of the documents in our index are in USD, but we would like to compute the max in EURO (and for the sake of this example, let's say the conversion rate is 1.2).
We can use a value script to apply the conversion rate to every value before it is aggregated: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "max_price_in_euros" : { - "max" : { - "field" : "price", - "script" : "_value * conversion_rate", - "params" : { - "conversion_rate" : 1.2 - } - } - } - } -} --------------------------------------------------- - diff --git a/docs/reference/search/aggregations/metrics/min-aggregation.asciidoc b/docs/reference/search/aggregations/metrics/min-aggregation.asciidoc deleted file mode 100644 index 1383cc0832..0000000000 --- a/docs/reference/search/aggregations/metrics/min-aggregation.asciidoc +++ /dev/null @@ -1,68 +0,0 @@ -[[search-aggregations-metrics-min-aggregation]] -=== Min Aggregation - -A `single-value` metrics aggregation that keeps track and returns the minimum value among numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script. - -Computing the min price value across all documents: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "min_price" : { "min" : { "field" : "price" } } - } -} --------------------------------------------------- - -Response: - -[source,js] --------------------------------------------------- -{ - ... - - "aggregations": { - "min_price": { - "value": 10 - } - } -} --------------------------------------------------- - -As can be seen, the name of the aggregation (`min_price` above) also serves as the key by which the aggregation result can be retrieved from the returned response. 
- -==== Script - -Computing the min price value across all documents, this time using a script: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "min_price" : { "min" : { "script" : "doc['price'].value" } } - } -} --------------------------------------------------- - -TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. - -==== Value Script - -Let's say that the prices of the documents in our index are in USD, but we would like to compute the min in EURO (and for the sake of this example, let's say the conversion rate is 1.2). We can use a value script to apply the conversion rate to every value before it is aggregated: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "min_price_in_euros" : { - "min" : { - "field" : "price", - "script" : "_value * conversion_rate", - "params" : { - "conversion_rate" : 1.2 - } - } - } - } -} --------------------------------------------------- diff --git a/docs/reference/search/aggregations/metrics/percentile-aggregation.asciidoc b/docs/reference/search/aggregations/metrics/percentile-aggregation.asciidoc deleted file mode 100644 index 6bd1011007..0000000000 --- a/docs/reference/search/aggregations/metrics/percentile-aggregation.asciidoc +++ /dev/null @@ -1,192 +0,0 @@ -[[search-aggregations-metrics-percentile-aggregation]] -=== Percentiles Aggregation - -A `multi-value` metrics aggregation that calculates one or more percentiles -over numeric values extracted from the aggregated documents. These values -can be extracted either from specific numeric fields in the documents, or -be generated by a provided script. - -Percentiles show the point at which a certain percentage of observed values -occur. For example, the 95th percentile is the value which is greater than 95% -of the observed values. - -Percentiles are often used to find outliers.
In normal distributions, the -0.13th and 99.87th percentiles represent three standard deviations from the -mean. Any data which falls outside three standard deviations is often considered -an anomaly. - -When a range of percentiles is retrieved, they can be used to estimate the -data distribution and determine if the data is skewed, bimodal, etc. - -Assume your data consists of website load times. The average and median -load times are not overly useful to an administrator. The max may be interesting, -but it can be easily skewed by a single slow response. - -Let's look at a range of percentiles representing load time: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "load_time_outlier" : { - "percentiles" : { - "field" : "load_time" <1> - } - } - } -} --------------------------------------------------- -<1> The field `load_time` must be a numeric field - -By default, the `percentile` metric will generate a range of -percentiles: `[ 1, 5, 25, 50, 75, 95, 99 ]`. The response will look like this: - -[source,js] --------------------------------------------------- -{ - ... - - "aggregations": { - "load_time_outlier": { - "values" : { - "1.0": 15, - "5.0": 20, - "25.0": 23, - "50.0": 25, - "75.0": 29, - "95.0": 60, - "99.0": 150 - } - } - } -} --------------------------------------------------- - -As you can see, the aggregation will return a calculated value for each percentile -in the default range. If we assume response times are in milliseconds, it is -immediately obvious that the webpage normally loads in 15-30ms, but occasionally -spikes to 60-150ms. - -Often, administrators are only interested in outliers -- the extreme percentiles.
-We can specify just the percents we are interested in (requested percentiles -must be a value between 0-100 inclusive): - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "load_time_outlier" : { - "percentiles" : { - "field" : "load_time", - "percents" : [95, 99, 99.9] <1> - } - } - } -} --------------------------------------------------- -<1> Use the `percents` parameter to specify particular percentiles to calculate - - - -==== Script - -The percentile metric supports scripting. For example, if our load times -are in milliseconds but we want percentiles calculated in seconds, we could use -a script to convert them on-the-fly: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "load_time_outlier" : { - "percentiles" : { - "script" : "doc['load_time'].value / timeUnit", <1> - "params" : { - "timeUnit" : 1000 <2> - } - } - } - } -} --------------------------------------------------- -<1> The `field` parameter is replaced with a `script` parameter, which uses the -script to generate values which percentiles are calculated on -<2> Scripting supports parameterized input just like any other script - -TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. - -[[search-aggregations-metrics-percentile-aggregation-approximation]] -==== Percentiles are (usually) approximate - -There are many different algorithms to calculate percentiles. The naive -implementation simply stores all the values in a sorted array. To find the 50th -percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`. - -Clearly, the naive implementation does not scale -- the sorted array grows -linearly with the number of values in your dataset. To calculate percentiles -across potentially billions of values in an Elasticsearch cluster, _approximate_ -percentiles are calculated. 
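As a rough illustration of the naive approach described above, the following Python sketch sorts every value and indexes into the sorted array. The function name `naive_percentile` and the sample load times are hypothetical and exist only for this example; this is not how Elasticsearch computes percentiles, it only shows why the memory cost grows linearly with the number of values.

```python
def naive_percentile(values, percent):
    """Naive percentile: keep every value, sort, then index into the array."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100 inclusive")
    my_array = sorted(values)  # the whole dataset is held in memory
    # Mirrors my_array[count(my_array) * 0.5] for the median; the index is
    # clamped so that percent=100 maps to the last element.
    index = min(int(len(my_array) * percent / 100.0), len(my_array) - 1)
    return my_array[index]


# Hypothetical load times in milliseconds.
load_times = [15, 20, 23, 25, 29, 60, 150]
median = naive_percentile(load_times, 50)
slowest = naive_percentile(load_times, 100)
```

Because the sorted array must contain every observed value, this approach is only practical for small datasets, which is the motivation for an approximate algorithm.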
- -The algorithm used by the `percentile` metric is called TDigest (introduced by -Ted Dunning in -https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf[Computing Accurate Quantiles using T-Digests]). - -When using this metric, there are a few guidelines to keep in mind: - -- Accuracy is proportional to `q(1-q)`. This means that extreme percentiles (e.g. 99%) -are more accurate than less extreme percentiles, such as the median -- For small sets of values, percentiles are highly accurate (and potentially -100% accurate if the data is small enough). -- As the quantity of values in a bucket grows, the algorithm begins to approximate -the percentiles. It is effectively trading accuracy for memory savings. The -exact level of inaccuracy is difficult to generalize, since it depends on your -data distribution and volume of data being aggregated - -The following chart shows the relative error on a uniform distribution depending -on the number of collected values and the requested percentile: - -image:images/percentiles_error.png[] - -It shows how precision is better for extreme percentiles. The reason why error diminishes -for large number of values is that the law of large numbers makes the distribution of -values more and more uniform and the t-digest tree can do a better job at summarizing -it. It would not be the case on more skewed distributions. - -[[search-aggregations-metrics-percentile-aggregation-compression]] -==== Compression - -experimental[The `compression` parameter is specific to the current internal implementation of percentiles, and may change in the future] - -Approximate algorithms must balance memory utilization with estimation accuracy. 
-This balance can be controlled using a `compression` parameter: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "load_time_outlier" : { - "percentiles" : { - "field" : "load_time", - "compression" : 200 <1> - } - } - } -} --------------------------------------------------- -<1> Compression controls memory usage and approximation error - -The TDigest algorithm uses a number of "nodes" to approximate percentiles -- the -more nodes available, the higher the accuracy (and large memory footprint) proportional -to the volume of data. The `compression` parameter limits the maximum number of -nodes to `20 * compression`. - -Therefore, by increasing the compression value, you can increase the accuracy of -your percentiles at the cost of more memory. Larger compression values also -make the algorithm slower since the underlying tree data structure grows in size, -resulting in more expensive operations. The default compression value is -`100`. - -A "node" uses roughly 32 bytes of memory, so under worst-case scenarios (large amount -of data which arrives sorted and in-order) the default settings will produce a -TDigest roughly 64KB in size. In practice data tends to be more random and -the TDigest will use less memory. diff --git a/docs/reference/search/aggregations/metrics/percentile-rank-aggregation.asciidoc b/docs/reference/search/aggregations/metrics/percentile-rank-aggregation.asciidoc deleted file mode 100644 index d327fc6630..0000000000 --- a/docs/reference/search/aggregations/metrics/percentile-rank-aggregation.asciidoc +++ /dev/null @@ -1,88 +0,0 @@ -[[search-aggregations-metrics-percentile-rank-aggregation]] -=== Percentile Ranks Aggregation - -A `multi-value` metrics aggregation that calculates one or more percentile ranks -over numeric values extracted from the aggregated documents. These values -can be extracted either from specific numeric fields in the documents, or -be generated by a provided script. 
- -[NOTE] -================================================== -Please see <> -and <> for advice -regarding approximation and memory use of the percentile ranks aggregation -================================================== - -Percentile ranks show the percentage of observed values which are below a certain -value. For example, if a value is greater than or equal to 95% of the observed values -it is said to be at the 95th percentile rank. - -Assume your data consists of website load times. You may have a service agreement that -95% of page loads complete within 15ms and 99% of page loads complete within 30ms. - -Let's look at the percentile ranks of these two load time values: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "load_time_outlier" : { - "percentile_ranks" : { - "field" : "load_time", <1> - "values" : [15, 30] - } - } - } -} --------------------------------------------------- -<1> The field `load_time` must be a numeric field - -The response will look like this: - -[source,js] --------------------------------------------------- -{ - ... - - "aggregations": { - "load_time_outlier": { - "values" : { - "15": 92, - "30": 100 - } - } - } -} --------------------------------------------------- - -From this information you can determine you are hitting the 99% load time target but not quite -hitting the 95% load time target. - - -==== Script - -The percentile rank metric supports scripting.
For example, if our load times -are in milliseconds but we want to specify values in seconds, we could use -a script to convert them on-the-fly: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "load_time_outlier" : { - "percentile_ranks" : { - "values" : [3, 5], - "script" : "doc['load_time'].value / timeUnit", <1> - "params" : { - "timeUnit" : 1000 <2> - } - } - } - } -} --------------------------------------------------- -<1> The `field` parameter is replaced with a `script` parameter, which uses the -script to generate values which percentile ranks are calculated on -<2> Scripting supports parameterized input just like any other script - -TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. diff --git a/docs/reference/search/aggregations/metrics/scripted-metric-aggregation.asciidoc b/docs/reference/search/aggregations/metrics/scripted-metric-aggregation.asciidoc deleted file mode 100644 index a775d54540..0000000000 --- a/docs/reference/search/aggregations/metrics/scripted-metric-aggregation.asciidoc +++ /dev/null @@ -1,237 +0,0 @@ -[[search-aggregations-metrics-scripted-metric-aggregation]] -=== Scripted Metric Aggregation - -experimental[] - -A metric aggregation that executes using scripts to provide a metric output. 
- -Example: - -[source,js] --------------------------------------------------- -{ - "query" : { - "match_all" : {} - }, - "aggs": { - "profit": { - "scripted_metric": { - "init_script" : "_agg['transactions'] = []", - "map_script" : "if (doc['type'].value == \"sale\") { _agg.transactions.add(doc['amount'].value) } else { _agg.transactions.add(-1 * doc['amount'].value) }", <1> - "combine_script" : "profit = 0; for (t in _agg.transactions) { profit += t }; return profit", - "reduce_script" : "profit = 0; for (a in _aggs) { profit += a }; return profit" - } - } - } -} --------------------------------------------------- - -<1> `map_script` is the only required parameter - -The above aggregation demonstrates how one would use the scripted metric aggregation to compute the total profit from sale and cost transactions. - -The response for the above aggregation: - -[source,js] --------------------------------------------------- -{ - ... - - "aggregations": { - "profit": { - "value": 170 - } - } -} --------------------------------------------------- - -==== Scope of scripts - -The scripted metric aggregation uses scripts at 4 stages of its execution: - -init_script:: Executed prior to any collection of documents. Allows the aggregation to set up any initial state. -+ -In the above example, the `init_script` creates an array `transactions` in the `_agg` object. - -map_script:: Executed once per document collected. This is the only required script. If no combine_script is specified, the resulting state - needs to be stored in an object named `_agg`. -+ -In the above example, the `map_script` checks the value of the type field. If the value is 'sale' the value of the amount field -is added to the transactions array. If the value of the type field is not 'sale' the negated value of the amount field is added -to transactions. - -combine_script:: Executed once on each shard after document collection is complete. Allows the aggregation to consolidate the state returned from - each shard.
If a combine_script is not provided the combine phase will return the aggregation variable. -+ -In the above example, the `combine_script` iterates through all the stored transactions, summing the values in the `profit` variable -and finally returns `profit`. - -reduce_script:: Executed once on the coordinating node after all shards have returned their results. The script is provided with access to a - variable `_aggs` which is an array of the result of the combine_script on each shard. If a reduce_script is not provided - the reduce phase will return the `_aggs` variable. -+ -In the above example, the `reduce_script` iterates through the `profit` returned by each shard summing the values before returning the -final combined profit which will be returned in the response of the aggregation. - -==== Worked Example - -Imagine a situation where you index the following documents into an index with 2 shards: - -[source,js] --------------------------------------------------- -$ curl -XPUT 'http://localhost:9200/transactions/stock/1' -d ' -{ - "type": "sale", - "amount": 80 -} -' - -$ curl -XPUT 'http://localhost:9200/transactions/stock/2' -d ' -{ - "type": "cost", - "amount": 10 -} -' - -$ curl -XPUT 'http://localhost:9200/transactions/stock/3' -d ' -{ - "type": "cost", - "amount": 30 -} -' - -$ curl -XPUT 'http://localhost:9200/transactions/stock/4' -d ' -{ - "type": "sale", - "amount": 130 -} -' --------------------------------------------------- - -Let's say that documents 1 and 3 end up on shard A and documents 2 and 4 end up on shard B. The following is a breakdown of what the aggregation result is -at each stage of the example above.
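Before stepping through that breakdown, the same flow can be mimicked outside Elasticsearch. The Python sketch below is only an analogy: the function names mirror the four script stages, the `run_shard` helper is hypothetical, and the assignment of documents to shards is the one assumed in the text. It does not reflect how Elasticsearch executes scripts internally.

```python
def init_script():
    # Runs once per shard before any document is collected.
    return {"transactions": []}


def map_script(agg, doc):
    # Runs once per collected document: sales count positive, costs negative.
    if doc["type"] == "sale":
        agg["transactions"].append(doc["amount"])
    else:
        agg["transactions"].append(-1 * doc["amount"])


def combine_script(agg):
    # Runs once per shard after collection: one profit figure per shard.
    return sum(agg["transactions"])


def reduce_script(aggs):
    # Runs once on the coordinating node over the per-shard results.
    return sum(aggs)


def run_shard(docs):
    # Hypothetical helper that chains init, map and combine for one shard.
    agg = init_script()
    for doc in docs:
        map_script(agg, doc)
    return combine_script(agg)


# Documents 1 and 3 on shard A, documents 2 and 4 on shard B.
shard_a = [{"type": "sale", "amount": 80}, {"type": "cost", "amount": 30}]
shard_b = [{"type": "cost", "amount": 10}, {"type": "sale", "amount": 130}]

per_shard = [run_shard(shard_a), run_shard(shard_b)]
profit = reduce_script(per_shard)
```

Only the per-shard results travel to the coordinating node in this model, which is why the combine stage is worth providing when the per-shard state is large.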
- -===== Before init_script - -No params object was specified so the default params object is used: - -[source,js] --------------------------------------------------- -"params" : { - "_agg" : {} -} --------------------------------------------------- - -===== After init_script - -This is run once on each shard before any document collection is performed, and so we will have a copy on each shard: - -Shard A:: -+ -[source,js] --------------------------------------------------- -"params" : { - "_agg" : { - "transactions" : [] - } -} --------------------------------------------------- - -Shard B:: -+ -[source,js] --------------------------------------------------- -"params" : { - "_agg" : { - "transactions" : [] - } -} --------------------------------------------------- - -===== After map_script - -Each shard collects its documents and runs the map_script on each document that is collected: - -Shard A:: -+ -[source,js] --------------------------------------------------- -"params" : { - "_agg" : { - "transactions" : [ 80, -30 ] - } -} --------------------------------------------------- - -Shard B:: -+ -[source,js] --------------------------------------------------- -"params" : { - "_agg" : { - "transactions" : [ -10, 130 ] - } -} --------------------------------------------------- - -===== After combine_script - -The combine_script is executed on each shard after document collection is complete and reduces all the transactions down to a single profit figure for each -shard (by summing the values in the transactions array) which is passed back to the coordinating node: - -Shard A:: 50 -Shard B:: 120 - -===== After reduce_script - -The reduce_script receives an `_aggs` array containing the result of the combine script for each shard: - -[source,js] --------------------------------------------------- -"_aggs" : [ - 50, - 120 -] --------------------------------------------------- - -It reduces the responses for the shards down to a final overall profit figure (by summing the 
values) and returns this as the result of the aggregation to -produce the response: - -[source,js] --------------------------------------------------- -{ - ... - - "aggregations": { - "profit": { - "value": 170 - } - } -} --------------------------------------------------- - -==== Other Parameters - -[horizontal] -params:: Optional. An object whose contents will be passed as variables to the `init_script`, `map_script` and `combine_script`. This can be - useful to allow the user to control the behavior of the aggregation and for storing state between the scripts. If this is not specified, - the default is the equivalent of providing: -+ -[source,js] --------------------------------------------------- -"params" : { - "_agg" : {} -} --------------------------------------------------- -reduce_params:: Optional. An object whose contents will be passed as variables to the `reduce_script`. This can be useful to allow the user to control - the behavior of the reduce phase. If this is not specified the variable will be undefined in the reduce_script execution. -lang:: Optional. The script language used for the scripts. If this is not specified the default scripting language is used. -init_script_file:: Optional. Can be used in place of the `init_script` parameter to provide the script using in a file. -init_script_id:: Optional. Can be used in place of the `init_script` parameter to provide the script using an indexed script. -map_script_file:: Optional. Can be used in place of the `map_script` parameter to provide the script using in a file. -map_script_id:: Optional. Can be used in place of the `map_script` parameter to provide the script using an indexed script. -combine_script_file:: Optional. Can be used in place of the `combine_script` parameter to provide the script using in a file. -combine_script_id:: Optional. Can be used in place of the `combine_script` parameter to provide the script using an indexed script. -reduce_script_file:: Optional. 
Can be used in place of the `reduce_script` parameter to provide the script using in a file. -reduce_script_id:: Optional. Can be used in place of the `reduce_script` parameter to provide the script using an indexed script. - diff --git a/docs/reference/search/aggregations/metrics/stats-aggregation.asciidoc b/docs/reference/search/aggregations/metrics/stats-aggregation.asciidoc deleted file mode 100644 index 7fbdecd601..0000000000 --- a/docs/reference/search/aggregations/metrics/stats-aggregation.asciidoc +++ /dev/null @@ -1,81 +0,0 @@ -[[search-aggregations-metrics-stats-aggregation]] -=== Stats Aggregation - -A `multi-value` metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script. - -The stats that are returned consist of: `min`, `max`, `sum`, `count` and `avg`. - -Assuming the data consists of documents representing exams grades (between 0 and 100) of students - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "grades_stats" : { "stats" : { "field" : "grade" } } - } -} --------------------------------------------------- - -The above aggregation computes the grades statistics over all documents. The aggregation type is `stats` and the `field` setting defines the numeric field of the documents the stats will be computed on. The above will return the following: - - -[source,js] --------------------------------------------------- -{ - ... - - "aggregations": { - "grades_stats": { - "count": 6, - "min": 60, - "max": 98, - "avg": 78.5, - "sum": 471 - } - } -} --------------------------------------------------- - -The name of the aggregation (`grades_stats` above) also serves as the key by which the aggregation result can be retrieved from the returned response. 
- -==== Script - -Computing the grades stats based on a script: - -[source,js] --------------------------------------------------- -{ - ..., - - "aggs" : { - "grades_stats" : { "stats" : { "script" : "doc['grade'].value" } } - } -} --------------------------------------------------- - -TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. - -===== Value Script - -It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use a value script to get the new stats: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - ... - - "aggs" : { - "grades_stats" : { - "stats" : { - "field" : "grade", - "script" : "_value * correction", - "params" : { - "correction" : 1.2 - } - } - } - } - } -} --------------------------------------------------- \ No newline at end of file diff --git a/docs/reference/search/aggregations/metrics/sum-aggregation.asciidoc b/docs/reference/search/aggregations/metrics/sum-aggregation.asciidoc deleted file mode 100644 index 8857ff306e..0000000000 --- a/docs/reference/search/aggregations/metrics/sum-aggregation.asciidoc +++ /dev/null @@ -1,79 +0,0 @@ -[[search-aggregations-metrics-sum-aggregation]] -=== Sum Aggregation - -A `single-value` metrics aggregation that sums up numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script. - -Assuming the data consists of documents representing stock ticks, where each tick holds the change in the stock price from the previous tick. 
- -[source,js] --------------------------------------------------- -{ - "query" : { - "filtered" : { - "query" : { "match_all" : {}}, - "filter" : { - "range" : { "timestamp" : { "from" : "now/1d+9.5h", "to" : "now/1d+16h" }} - } - } - }, - "aggs" : { - "intraday_return" : { "sum" : { "field" : "change" } } - } -} --------------------------------------------------- - -The above aggregation sums up all changes in the today's trading stock ticks which accounts for the intraday return. The aggregation type is `sum` and the `field` setting defines the numeric field of the documents of which values will be summed up. The above will return the following: - - -[source,js] --------------------------------------------------- -{ - ... - - "aggregations": { - "intraday_return": { - "value": 2.18 - } - } -} --------------------------------------------------- - -The name of the aggregation (`intraday_return` above) also serves as the key by which the aggregation result can be retrieved from the returned response. - -==== Script - -Computing the intraday return based on a script: - -[source,js] --------------------------------------------------- -{ - ..., - - "aggs" : { - "intraday_return" : { "sum" : { "script" : "doc['change'].value" } } - } -} --------------------------------------------------- - -TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. - -===== Value Script - -Computing the sum of squares over all stock tick changes: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - ... 
- - "aggs" : { - "daytime_return" : { - "sum" : { - "field" : "change", - "script" : "_value * _value" } - } - } - } -} --------------------------------------------------- diff --git a/docs/reference/search/aggregations/metrics/tophits-aggregation.asciidoc b/docs/reference/search/aggregations/metrics/tophits-aggregation.asciidoc deleted file mode 100644 index b6e9c2caba..0000000000 --- a/docs/reference/search/aggregations/metrics/tophits-aggregation.asciidoc +++ /dev/null @@ -1,275 +0,0 @@ -[[search-aggregations-metrics-top-hits-aggregation]] -=== Top hits Aggregation - -A `top_hits` metric aggregator keeps track of the most relevant document being aggregated. This aggregator is intended -to be used as a sub aggregator, so that the top matching documents can be aggregated per bucket. - -The `top_hits` aggregator can effectively be used to group result sets by certain fields via a bucket aggregator. -One or more bucket aggregators determines by which properties a result set get sliced into. - -==== Options - -* `from` - The offset from the first result you want to fetch. -* `size` - The maximum number of top matching hits to return per bucket. By default the top three matching hits are returned. -* `sort` - How the top matching hits should be sorted. By default the hits are sorted by the score of the main query. - -==== Supported per hit features - -The top_hits aggregation returns regular search hits, because of this many per hit features can be supported: - -* <> -* <> -* <> -* <> -* <> -* <> -* <> - -==== Example - -In the following example we group the questions by tag and per tag we show the last active question. For each question -only the title field is being included in the source. 
- -[source,js] --------------------------------------------------- -{ - "aggs": { - "top-tags": { - "terms": { - "field": "tags", - "size": 3 - }, - "aggs": { - "top_tag_hits": { - "top_hits": { - "sort": [ - { - "last_activity_date": { - "order": "desc" - } - } - ], - "_source": { - "include": [ - "title" - ] - }, - "size" : 1 - } - } - } - } - } -} --------------------------------------------------- - -Possible response snippet: - -[source,js] --------------------------------------------------- -"aggregations": { - "top-tags": { - "buckets": [ - { - "key": "windows-7", - "doc_count": 25365, - "top_tags_hits": { - "hits": { - "total": 25365, - "max_score": 1, - "hits": [ - { - "_index": "stack", - "_type": "question", - "_id": "602679", - "_score": 1, - "_source": { - "title": "Windows port opening" - }, - "sort": [ - 1370143231177 - ] - } - ] - } - } - }, - { - "key": "linux", - "doc_count": 18342, - "top_tags_hits": { - "hits": { - "total": 18342, - "max_score": 1, - "hits": [ - { - "_index": "stack", - "_type": "question", - "_id": "602672", - "_score": 1, - "_source": { - "title": "Ubuntu RFID Screensaver lock-unlock" - }, - "sort": [ - 1370143379747 - ] - } - ] - } - } - }, - { - "key": "windows", - "doc_count": 18119, - "top_tags_hits": { - "hits": { - "total": 18119, - "max_score": 1, - "hits": [ - { - "_index": "stack", - "_type": "question", - "_id": "602678", - "_score": 1, - "_source": { - "title": "If I change my computers date / time, what could be affected?" - }, - "sort": [ - 1370142868283 - ] - } - ] - } - } - } - ] - } -} --------------------------------------------------- - -==== Field collapse example - -Field collapsing or result grouping is a feature that logically groups a result set into groups and per group returns -top documents. The ordering of the groups is determined by the relevancy of the first document in a group. In -Elasticsearch this can be implemented via a bucket aggregator that wraps a `top_hits` aggregator as sub-aggregator. 
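The grouping logic behind field collapsing can be sketched in a few lines of Python. This is a conceptual model only: the `collapse` function, the `score` key and the sample hits are hypothetical, and the default of three hits per group simply mirrors the default `size` mentioned earlier.

```python
from collections import defaultdict


def collapse(hits, field, size=3):
    """Group hits by `field`, keep the top `size` hits per group, and
    order the groups by the score of their best hit."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit[field]].append(hit)

    collapsed = []
    for key, group in groups.items():
        group.sort(key=lambda h: h["score"], reverse=True)
        collapsed.append(
            {"key": key, "doc_count": len(group), "top_hits": group[:size]}
        )
    # Groups are ordered by the relevancy of their first (best) document.
    collapsed.sort(key=lambda g: g["top_hits"][0]["score"], reverse=True)
    return collapsed


# Hypothetical scored hits.
hits = [
    {"domain": "a.com", "score": 0.9},
    {"domain": "b.com", "score": 1.5},
    {"domain": "a.com", "score": 0.4},
    {"domain": "b.com", "score": 0.2},
    {"domain": "c.com", "score": 1.1},
]
result = collapse(hits, "domain")
group_order = [g["key"] for g in result]
```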
- -In the example below we search across crawled webpages. For each webpage we store the body and the domain the webpage -belongs to. By defining a `terms` aggregator on the `domain` field we group the result set of webpages by domain. The -`top_hits` aggregator is then defined as sub-aggregator, so that the top matching hits are collected per bucket. - -Also a `max` aggregator is defined which is used by the `terms` aggregator's order feature to return the buckets by -relevancy order of the most relevant document in a bucket. - -[source,js] --------------------------------------------------- -{ - "query": { - "match": { - "body": "elections" - } - }, - "aggs": { - "top-sites": { - "terms": { - "field": "domain", - "order": { - "top_hit": "desc" - } - }, - "aggs": { - "top_tags_hits": { - "top_hits": {} - }, - "top_hit" : { - "max": { - "script": "_score" - } - } - } - } - } -} --------------------------------------------------- - -At the moment the `max` (or `min`) aggregator is needed to make sure the buckets from the `terms` aggregator are -ordered according to the score of the most relevant webpage per domain. The `top_hits` aggregator isn't a metric aggregator -and therefore can't be used in the `order` option of the `terms` aggregator. - -==== top_hits support in a nested or reverse_nested aggregator - -If the `top_hits` aggregator is wrapped in a `nested` or `reverse_nested` aggregator then nested hits are returned. -Nested hits are in a sense hidden mini documents that are part of a regular document where in the mapping a nested field type -has been configured. The `top_hits` aggregator has the ability to un-hide these documents if it is wrapped in a `nested` -or `reverse_nested` aggregator. Read more about nested in the <>. - -If a nested type has been configured a single document is actually indexed as multiple Lucene documents and they share -the same id.
Determining the identity of a nested hit requires more than just the id, which is why -nested hits also include their nested identity. The nested identity is kept under the `_nested` field in the search hit -and includes the array field and the offset in the array field the nested hit belongs to. The offset is zero based. - -Top hits response snippet with a nested hit, which resides in the third slot of array field `nested_field1` in document with id `1`: - -[source,js] --------------------------------------------------- -... -"hits": { - "total": 25365, - "max_score": 1, - "hits": [ - { - "_index": "a", - "_type": "b", - "_id": "1", - "_score": 1, - "_nested" : { - "field" : "nested_field1", - "offset" : 2 - }, - "_source": ... - }, - ... - ] -} -... --------------------------------------------------- - -If `_source` is requested then just the part of the source of the nested object is returned, not the entire source of the document. -Also stored fields on the *nested* inner object level are accessible via `top_hits` aggregator residing in a `nested` or `reverse_nested` aggregator. - -Only nested hits will have a `_nested` field in the hit, non nested (regular) hits will not have a `_nested` field. - -The information in `_nested` can also be used to parse the original source somewhere else if `_source` isn't enabled. - -If there are multiple levels of nested object types defined in mappings then the `_nested` information can also be hierarchical -in order to express the identity of nested hits that are two layers deep or more. - -In the example below a nested hit resides in the first slot of the field `nested_grand_child_field` which then resides in -the second slot of the `nested_child_field` field: - -[source,js] --------------------------------------------------- -...
-"hits": { - "total": 2565, - "max_score": 1, - "hits": [ - { - "_index": "a", - "_type": "b", - "_id": "1", - "_score": 1, - "_nested" : { - "field" : "nested_child_field", - "offset" : 1, - "_nested" : { - "field" : "nested_grand_child_field", - "offset" : 0 - } - } - "_source": ... - }, - ... - ] -} -... --------------------------------------------------- \ No newline at end of file diff --git a/docs/reference/search/aggregations/metrics/valuecount-aggregation.asciidoc b/docs/reference/search/aggregations/metrics/valuecount-aggregation.asciidoc deleted file mode 100644 index ed5e23ee33..0000000000 --- a/docs/reference/search/aggregations/metrics/valuecount-aggregation.asciidoc +++ /dev/null @@ -1,51 +0,0 @@ -[[search-aggregations-metrics-valuecount-aggregation]] -=== Value Count Aggregation - -A `single-value` metrics aggregation that counts the number of values that are extracted from the aggregated documents. -These values can be extracted either from specific fields in the documents, or be generated by a provided script. Typically, -this aggregator will be used in conjunction with other single-value aggregations. For example, when computing the `avg` -one might be interested in the number of values the average is computed over. - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "grades_count" : { "value_count" : { "field" : "grade" } } - } -} --------------------------------------------------- - -Response: - -[source,js] --------------------------------------------------- -{ - ... - - "aggregations": { - "grades_count": { - "value": 10 - } - } -} --------------------------------------------------- - -The name of the aggregation (`grades_count` above) also serves as the key by which the aggregation result can be -retrieved from the returned response. 
- -==== Script - -Counting the values generated by a script: - -[source,js] --------------------------------------------------- -{ - ..., - - "aggs" : { - "grades_count" : { "value_count" : { "script" : "doc['grade'].value" } } - } -} --------------------------------------------------- - -TIP: The `script` parameter expects an inline script. Use `script_id` for indexed scripts and `script_file` for scripts in the `config/scripts/` directory. diff --git a/docs/reference/search/aggregations/reducer.asciidoc b/docs/reference/search/aggregations/reducer.asciidoc deleted file mode 100644 index a725bc77e3..0000000000 --- a/docs/reference/search/aggregations/reducer.asciidoc +++ /dev/null @@ -1,6 +0,0 @@ -[[search-aggregations-reducer]] - -include::reducer/derivative-aggregation.asciidoc[] -include::reducer/max-bucket-aggregation.asciidoc[] -include::reducer/min-bucket-aggregation.asciidoc[] -include::reducer/movavg-aggregation.asciidoc[] diff --git a/docs/reference/search/aggregations/reducer/derivative-aggregation.asciidoc b/docs/reference/search/aggregations/reducer/derivative-aggregation.asciidoc deleted file mode 100644 index be644091b5..0000000000 --- a/docs/reference/search/aggregations/reducer/derivative-aggregation.asciidoc +++ /dev/null @@ -1,194 +0,0 @@ -[[search-aggregations-reducer-derivative-aggregation]] -=== Derivative Aggregation - -A parent reducer aggregation which calculates the derivative of a specified metric in a parent histogram (or date_histogram) -aggregation. The specified metric must be numeric and the enclosing histogram must have `min_doc_count` set to `0` (default -for `histogram` aggregations). 
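Conceptually, the derivative of a metric over equally spaced histogram buckets is the difference between each bucket's value and the value of the bucket before it. The Python sketch below illustrates that idea; the `derivative` function and the monthly totals are illustrative assumptions, not the Elasticsearch implementation.

```python
def derivative(values):
    """First differences between consecutive bucket values.

    The first bucket has no predecessor, so its derivative is None. A None
    input (no value available) also yields None for the affected buckets.
    """
    if not values:
        return []
    result = [None]
    for previous, current in zip(values, values[1:]):
        if previous is None or current is None:
            result.append(None)
        else:
            result.append(current - previous)
    return result


# Sample monthly sales totals, one value per histogram bucket.
sales = [550, 60, 375]
sales_deriv = derivative(sales)
# Chaining the function onto its own output gives the second derivative.
sales_2nd_deriv = derivative(sales_deriv)
```

Applying the function to its own output shows why a second order derivative needs at least three buckets before it produces a value.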
- -The following snippet calculates the derivative of the total monthly `sales`: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "sales_per_month" : { - "date_histogram" : { - "field" : "date", - "interval" : "month" - }, - "aggs": { - "sales": { - "sum": { - "field": "price" - } - }, - "sales_deriv": { - "derivative": { - "buckets_path": "sales" <1> - } - } - } - } - } -} --------------------------------------------------- - -<1> `buckets_path` instructs this derivative aggregation to use the output of the `sales` aggregation for the derivative - -And the following may be the response: - -[source,js] --------------------------------------------------- -{ - "aggregations": { - "sales_per_month": { - "buckets": [ - { - "key_as_string": "2015/01/01 00:00:00", - "key": 1420070400000, - "doc_count": 3, - "sales": { - "value": 550 - } <1> - }, - { - "key_as_string": "2015/02/01 00:00:00", - "key": 1422748800000, - "doc_count": 2, - "sales": { - "value": 60 - }, - "sales_deriv": { - "value": -490 <2> - } - }, - { - "key_as_string": "2015/03/01 00:00:00", - "key": 1425168000000, - "doc_count": 2, <3> - "sales": { - "value": 375 - }, - "sales_deriv": { - "value": 315 - } - } - ] - } - } -} --------------------------------------------------- - -<1> No derivative for the first bucket since we need at least 2 data points to calculate the derivative -<2> Derivative value units are implicitly defined by the `sales` aggregation and the parent histogram so in this case the units -would be $/month assuming the `price` field has units of $. 
-<3> The number of documents in the bucket is represented by the `doc_count` value - -==== Second Order Derivative - -A second order derivative can be calculated by chaining the derivative reducer aggregation onto the result of another derivative -reducer aggregation, as in the following example, which calculates both the first and the second order derivative of the total -monthly sales: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "sales_per_month" : { - "date_histogram" : { - "field" : "date", - "interval" : "month" - }, - "aggs": { - "sales": { - "sum": { - "field": "price" - } - }, - "sales_deriv": { - "derivative": { - "buckets_path": "sales" - } - }, - "sales_2nd_deriv": { - "derivative": { - "buckets_path": "sales_deriv" <1> - } - } - } - } - } -} --------------------------------------------------- - -<1> `buckets_path` for the second derivative points to the name of the first derivative - -And the following may be the response: - -[source,js] --------------------------------------------------- -{ - "aggregations": { - "sales_per_month": { - "buckets": [ - { - "key_as_string": "2015/01/01 00:00:00", - "key": 1420070400000, - "doc_count": 3, - "sales": { - "value": 550 - } <1> - }, - { - "key_as_string": "2015/02/01 00:00:00", - "key": 1422748800000, - "doc_count": 2, - "sales": { - "value": 60 - }, - "sales_deriv": { - "value": -490 - } <1> - }, - { - "key_as_string": "2015/03/01 00:00:00", - "key": 1425168000000, - "doc_count": 2, - "sales": { - "value": 375 - }, - "sales_deriv": { - "value": 315 - }, - "sales_2nd_deriv": { - "value": 805 - } - } - ] - } - } -} --------------------------------------------------- -<1> No second derivative for the first two buckets since we need at least 2 data points from the first derivative to calculate the -second derivative - -==== Dealing with gaps in the data - -There are a couple of reasons why the data output by the enclosing histogram may have gaps: - -* There are no
documents matching the query for some buckets -* The data for a metric is missing in all of the documents falling into a bucket (this is most likely with either a small interval -on the enclosing histogram or with a query matching only a small number of documents) - -Where there is no data available in a bucket for a given metric, it presents a problem for calculating the derivative value for both -the current bucket and the next bucket. The derivative reducer aggregation has a `gap_policy` parameter to define what the behavior -should be when a gap in the data is found. There are currently two options for controlling the gap policy: - -_ignore_:: - This option will not produce a derivative value for any buckets where the value in the current or previous bucket is - missing - -_insert_zeros_:: - This option will assume the missing value is `0` and calculate the derivative with the value `0`. - - diff --git a/docs/reference/search/aggregations/reducer/max-bucket-aggregation.asciidoc b/docs/reference/search/aggregations/reducer/max-bucket-aggregation.asciidoc deleted file mode 100644 index a93c7ed803..0000000000 --- a/docs/reference/search/aggregations/reducer/max-bucket-aggregation.asciidoc +++ /dev/null @@ -1,82 +0,0 @@ -[[search-aggregations-reducer-max-bucket-aggregation]] -=== Max Bucket Aggregation - -A sibling reducer aggregation which identifies the bucket(s) with the maximum value of a specified metric in a sibling aggregation -and outputs both the value and the key(s) of the bucket(s). The specified metric must be numeric and the sibling aggregation must -be a multi-bucket aggregation. 
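The selection described above, returning the maximum value together with every bucket key that reaches it, can be sketched in Python. This is an illustrative assumption about the behavior, not the Elasticsearch implementation; `max_bucket` and the bucket keys below are hypothetical.

```python
from typing import Dict, List, Tuple


def max_bucket(buckets: List[Tuple[str, float]]) -> Dict[str, object]:
    """Return the maximum metric value and the keys of all buckets holding it.

    Hypothetical helper: `buckets` is a list of (key, metric value) pairs,
    standing in for the output of a multi-bucket sibling aggregation.
    """
    if not buckets:
        return {"keys": [], "value": None}
    best = max(value for _, value in buckets)
    # Several buckets may share the maximum, so collect every matching key.
    keys = [key for key, value in buckets if value == best]
    return {"keys": keys, "value": best}


# Assumed monthly totals keyed by bucket label.
result = max_bucket([("2015/01", 550), ("2015/02", 60), ("2015/03", 375)])
```

Returning a list of keys rather than a single key keeps ties visible to the caller.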
- -The following snippet calculates the maximum of the total monthly `sales`: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "sales_per_month" : { - "date_histogram" : { - "field" : "date", - "interval" : "month" - }, - "aggs": { - "sales": { - "sum": { - "field": "price" - } - } - } - }, - "max_monthly_sales": { - "max_bucket": { - "buckets_path": "sales_per_month>sales" <1> - } - } - } -} --------------------------------------------------- - -<1> `buckets_path` instructs this max_bucket aggregation that we want the maximum value of the `sales` aggregation in the -`sales_per_month` date histogram. - -And the following may be the response: - -[source,js] --------------------------------------------------- -{ - "aggregations": { - "sales_per_month": { - "buckets": [ - { - "key_as_string": "2015/01/01 00:00:00", - "key": 1420070400000, - "doc_count": 3, - "sales": { - "value": 550 - } - }, - { - "key_as_string": "2015/02/01 00:00:00", - "key": 1422748800000, - "doc_count": 2, - "sales": { - "value": 60 - } - }, - { - "key_as_string": "2015/03/01 00:00:00", - "key": 1425168000000, - "doc_count": 2, - "sales": { - "value": 375 - } - } - ] - }, - "max_monthly_sales": { - "keys": ["2015/01/01 00:00:00"], <1> - "value": 550 - } - } -} --------------------------------------------------- - -<1> `keys` is an array of strings since the maximum value may be present in multiple buckets - diff --git a/docs/reference/search/aggregations/reducer/min-bucket-aggregation.asciidoc b/docs/reference/search/aggregations/reducer/min-bucket-aggregation.asciidoc deleted file mode 100644 index 558d0c1998..0000000000 --- a/docs/reference/search/aggregations/reducer/min-bucket-aggregation.asciidoc +++ /dev/null @@ -1,82 +0,0 @@ -[[search-aggregations-reducer-min-bucket-aggregation]] -=== Min Bucket Aggregation - -A sibling reducer aggregation which identifies the bucket(s) with the minimum value of a specified metric in a sibling aggregation -and outputs
both the value and the key(s) of the bucket(s). The specified metric must be numeric and the sibling aggregation must -be a multi-bucket aggregation. - -The following snippet calculates the minimum of the total monthly `sales`: - -[source,js] --------------------------------------------------- -{ - "aggs" : { - "sales_per_month" : { - "date_histogram" : { - "field" : "date", - "interval" : "month" - }, - "aggs": { - "sales": { - "sum": { - "field": "price" - } - } - } - }, - "min_monthly_sales": { - "min_bucket": { - "buckets_path": "sales_per_month>sales" <1> - } - } - } -} --------------------------------------------------- - -<1> `buckets_path` instructs this min_bucket aggregation that we want the minimum value of the `sales` aggregation in the -`sales_per_month` date histogram. - -And the following may be the response: - -[source,js] --------------------------------------------------- -{ - "aggregations": { - "sales_per_month": { - "buckets": [ - { - "key_as_string": "2015/01/01 00:00:00", - "key": 1420070400000, - "doc_count": 3, - "sales": { - "value": 550 - } - }, - { - "key_as_string": "2015/02/01 00:00:00", - "key": 1422748800000, - "doc_count": 2, - "sales": { - "value": 60 - } - }, - { - "key_as_string": "2015/03/01 00:00:00", - "key": 1425168000000, - "doc_count": 2, - "sales": { - "value": 375 - } - } - ] - }, - "min_monthly_sales": { - "keys": ["2015/02/01 00:00:00"], <1> - "value": 60 - } - } -} --------------------------------------------------- - -<1> `keys` is an array of strings since the minimum value may be present in multiple buckets - diff --git a/docs/reference/search/aggregations/reducer/movavg-aggregation.asciidoc b/docs/reference/search/aggregations/reducer/movavg-aggregation.asciidoc deleted file mode 100644 index 03f6b7e9fa..0000000000 --- a/docs/reference/search/aggregations/reducer/movavg-aggregation.asciidoc +++ /dev/null @@ -1,294 +0,0 @@ -[[search-aggregations-reducers-movavg-reducer]] -=== Moving Average Aggregation - -Given an
ordered series of data, the Moving Average aggregation will slide a window across the data and emit the average -value of that window. For example, given the data `[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]`, we can calculate a simple moving -average with a window size of `5` as follows: - -- (1 + 2 + 3 + 4 + 5) / 5 = 3 -- (2 + 3 + 4 + 5 + 6) / 5 = 4 -- (3 + 4 + 5 + 6 + 7) / 5 = 5 -- etc - -Moving averages are a simple method to smooth sequential data. Moving averages are typically applied to time-based data, -such as stock prices or server metrics. The smoothing can be used to eliminate high frequency fluctuations or random noise, -which allows the lower frequency trends to be more easily visualized, such as seasonality. - -==== Syntax - -A `moving_avg` aggregation looks like this in isolation: - -[source,js] --------------------------------------------------- -{ - "moving_avg": { - "buckets_path": "the_sum", - "model": "double_exp", - "window": 5, - "gap_policy": "insert_zero", - "settings": { - "alpha": 0.8 - } - } -} --------------------------------------------------- - -.`moving_avg` Parameters -|=== -|Parameter Name |Description |Required |Default - -|`buckets_path` |The path to the metric that we wish to calculate a moving average for |Required | -|`model` |The moving average weighting model that we wish to use |Optional |`simple` -|`gap_policy` |Determines what should happen when a gap in the data is encountered. |Optional |`insert_zero` -|`window` |The size of window to "slide" across the histogram. |Optional |`5` -|`settings` |Model-specific settings, the contents of which differ depending on the model specified. |Optional | -|=== - - -`moving_avg` aggregations must be embedded inside of a `histogram` or `date_histogram` aggregation. 
They can be -embedded like any other metric aggregation: - -[source,js] --------------------------------------------------- -{ - "my_date_histo":{ <1> - "date_histogram":{ - "field":"timestamp", - "interval":"day" - }, - "aggs":{ - "the_sum":{ - "sum":{ "field": "lemmings" } <2> - }, - "the_movavg":{ - "moving_avg":{ "buckets_path": "the_sum" } <3> - } - } - } -} --------------------------------------------------- -<1> A `date_histogram` named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals -<2> A `sum` metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc) -<3> Finally, we specify a `moving_avg` aggregation which uses "the_sum" metric as its input. - -Moving averages are built by first specifying a `histogram` or `date_histogram` over a field. You can then optionally -add normal metrics, such as a `sum`, inside of that histogram. Finally, the `moving_avg` is embedded inside the histogram. -The `buckets_path` parameter is then used to "point" at one of the sibling metrics inside of the histogram. - -A moving average can also be calculated on the document count of each bucket, instead of a metric: - -[source,js] --------------------------------------------------- -{ - "my_date_histo":{ - "date_histogram":{ - "field":"timestamp", - "interval":"day" - }, - "aggs":{ - "the_movavg":{ - "moving_avg":{ "buckets_path": "_count" } <1> - } - } - } -} --------------------------------------------------- -<1> By using `_count` instead of a metric name, we can calculate the moving average of document counts in the histogram - -==== Models - -The `moving_avg` aggregation includes four different moving average "models". The main difference is how the values in the -window are weighted. As data-points become "older" in the window, they may be weighted differently. This will -affect the final average for that window. - -Models are specified using the `model` parameter. 
Some models may have optional configurations which are specified inside -the `settings` parameter. - -===== Simple - -The `simple` model calculates the sum of all values in the window, then divides by the size of the window. It is effectively -a simple arithmetic mean of the window. The simple model does not perform any time-dependent weighting, which means -the values from a `simple` moving average tend to "lag" behind the real data. - -[source,js] --------------------------------------------------- -{ - "the_movavg":{ - "moving_avg":{ - "buckets_path": "the_sum", - "model" : "simple" - } - } -} --------------------------------------------------- - -A `simple` model has no special settings to configure - -The window size can change the behavior of the moving average. For example, a small window (`"window": 10`) will closely -track the data and only smooth out small scale fluctuations: - -[[movavg_10window]] -.Moving average with window of size 10 -image::images/reducers_movavg/movavg_10window.png[] - -In contrast, a `simple` moving average with larger window (`"window": 100`) will smooth out all higher-frequency fluctuations, -leaving only low-frequency, long term trends. It also tends to "lag" behind the actual data by a substantial amount: - -[[movavg_100window]] -.Moving average with window of size 100 -image::images/reducers_movavg/movavg_100window.png[] - - -==== Linear - -The `linear` model assigns a linear weighting to points in the series, such that "older" datapoints (e.g. those at -the beginning of the window) contribute a linearly less amount to the total average. The linear weighting helps reduce -the "lag" behind the data's mean, since older points have less influence. 
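The linear weighting described above can be sketched in Python as a weighted mean in which the oldest value in the window gets weight 1 and the newest gets weight `n`. This is a textbook linearly weighted average offered as an assumption; the exact weights used by the `linear` model may differ, and `linear_weighted_average` is a hypothetical name.

```python
from typing import Sequence


def linear_weighted_average(window: Sequence[float]) -> float:
    """Average a window so that newer values count linearly more.

    Hypothetical helper: `window` is ordered oldest to newest, and the
    value at position i (1-based) is given weight i.
    """
    if not window:
        raise ValueError("window must contain at least one value")
    weights = range(1, len(window) + 1)
    weighted_sum = sum(weight * value for weight, value in zip(weights, window))
    return weighted_sum / sum(weights)


# For a rising series the linear average sits above the plain mean (3.0),
# because the most recent values carry more weight.
linear_value = linear_weighted_average([1, 2, 3, 4, 5])
```

For a steadily rising series the weighted result lies closer to the latest value than the plain mean does, which is the reduced "lag" the paragraph refers to.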
- -[source,js] --------------------------------------------------- -{ - "the_movavg":{ - "moving_avg":{ - "buckets_path": "the_sum", - "model" : "linear" - } - } -} --------------------------------------------------- - -A `linear` model has no special settings to configure. - -Like the `simple` model, window size can change the behavior of the moving average. For example, a small window (`"window": 10`) -will closely track the data and only smooth out small scale fluctuations: - -[[linear_10window]] -.Linear moving average with window of size 10 -image::images/reducers_movavg/linear_10window.png[] - -In contrast, a `linear` moving average with larger window (`"window": 100`) will smooth out all higher-frequency fluctuations, -leaving only low-frequency, long term trends. It also tends to "lag" behind the actual data by a substantial amount, -although typically less than the `simple` model: - -[[linear_100window]] -.Linear moving average with window of size 100 -image::images/reducers_movavg/linear_100window.png[] - -==== Single Exponential - -The `single_exp` model is similar to the `linear` model, except older data-points become exponentially less important, -rather than linearly less important. The speed at which the importance decays can be controlled with an `alpha` -setting. Small values make the weight decay slowly, which provides greater smoothing and takes into account a larger -portion of the window. Larger values make the weight decay quickly, which reduces the impact of older values on the -moving average. This tends to make the moving average track the data more closely but with less smoothing. - -The default value of `alpha` is `0.5`, and the setting accepts any float from 0-1 inclusive. 
- -[source,js] --------------------------------------------------- -{ - "the_movavg":{ - "moving_avg":{ - "buckets_path": "the_sum", - "model" : "single_exp", - "settings" : { - "alpha" : 0.5 - } - } - } -} --------------------------------------------------- - - - -[[single_0.2alpha]] -.Single Exponential moving average with window of size 10, alpha = 0.2 -image::images/reducers_movavg/single_0.2alpha.png[] - -[[single_0.7alpha]] -.Single Exponential moving average with window of size 10, alpha = 0.7 -image::images/reducers_movavg/single_0.7alpha.png[] - -==== Double Exponential - -The `double_exp` model, sometimes called "Holt's Linear Trend" model, incorporates a second exponential term which -tracks the data's trend. Single exponential does not perform well when the data has an underlying linear trend. The -double exponential model calculates two values internally: a "level" and a "trend". - -The level calculation is similar to `single_exp`, and is an exponentially weighted view of the data. The difference is -that the previously smoothed value is used instead of the raw value, which allows it to stay close to the original series. -The trend calculation looks at the difference between the current and last value (e.g. the slope, or trend, of the -smoothed data). The trend value is also exponentially weighted. - -Values are produced by multiplying the level and trend components. - -The default value of `alpha` and `beta` is `0.5`, and the settings accept any float from 0-1 inclusive. - -[source,js] --------------------------------------------------- -{ - "the_movavg":{ - "moving_avg":{ - "buckets_path": "the_sum", - "model" : "double_exp", - "settings" : { - "alpha" : 0.5, - "beta" : 0.5 - } - } - } -} --------------------------------------------------- - -In practice, the `alpha` value behaves very similarly in `double_exp` as in `single_exp`: small values produce more smoothing -and more lag, while larger values produce closer tracking and less lag. 
The effect of `beta` is often difficult -to see. Small values emphasize long-term trends (such as a constant linear trend in the whole series), while larger -values emphasize short-term trends. This will become more apparent when you are predicting values. - -[[double_0.2beta]] -.Double Exponential moving average with window of size 100, alpha = 0.5, beta = 0.2 -image::images/reducers_movavg/double_0.2beta.png[] - -[[double_0.7beta]] -.Double Exponential moving average with window of size 100, alpha = 0.5, beta = 0.7 -image::images/reducers_movavg/double_0.7beta.png[] - -=== Prediction - -All the moving average models support a "prediction" mode, which will attempt to extrapolate into the future given the -current smoothed, moving average. Depending on the model and parameter, these predictions may or may not be accurate. - -Predictions are enabled by adding a `predict` parameter to any moving average aggregation, specifying the number of -predictions you would like appended to the end of the series. These predictions will be spaced out at the same interval -as your buckets: - -[source,js] --------------------------------------------------- -{ - "the_movavg":{ - "moving_avg":{ - "buckets_path": "the_sum", - "model" : "simple", - "predict" : 10 - } - } -} --------------------------------------------------- - -The `simple`, `linear` and `single_exp` models all produce "flat" predictions: they essentially converge on the mean -of the last value in the series, producing a flat line: - -[[simple_prediction]] -.Simple moving average with window of size 10, predict = 50 -image::images/reducers_movavg/simple_prediction.png[] - -In contrast, the `double_exp` model can extrapolate based on local or global constant trends. 
If we set a high `beta` -value, we can extrapolate based on local constant trends (in this case the predictions head down, because the data at the end -of the series was heading in a downward direction): - -[[double_prediction_local]] -.Double Exponential moving average with window of size 100, predict = 20, alpha = 0.5, beta = 0.8 -image::images/reducers_movavg/double_prediction_local.png[] - -In contrast, if we choose a small `beta`, the predictions are based on the global constant trend. In this series, the -global trend is slightly positive, so the prediction makes a sharp u-turn and begins a positive slope: - -[[double_prediction_global]] -.Double Exponential moving average with window of size 100, predict = 20, alpha = 0.5, beta = 0.1 -image::images/reducers_movavg/double_prediction_global.png[] -- cgit v1.2.3