diff options
Diffstat (limited to 'docs/reference/aggregations/bucket/sampler-aggregation.asciidoc')
-rw-r--r-- | docs/reference/aggregations/bucket/sampler-aggregation.asciidoc | 154 |
1 files changed, 154 insertions, 0 deletions
diff --git a/docs/reference/aggregations/bucket/sampler-aggregation.asciidoc b/docs/reference/aggregations/bucket/sampler-aggregation.asciidoc new file mode 100644 index 0000000000..5ad9dbc019 --- /dev/null +++ b/docs/reference/aggregations/bucket/sampler-aggregation.asciidoc @@ -0,0 +1,154 @@ +[[search-aggregations-bucket-sampler-aggregation]] +=== Sampler Aggregation + +experimental[] + +A filtering aggregation used to limit any sub aggregations' processing to a sample of the top-scoring documents. +Optionally, diversity settings can be used to limit the number of matches that share a common value such as an "author". + +.Example use cases: +* Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches +* Removing bias from analytics by ensuring fair representation of content from different sources +* Reducing the running cost of aggregations that can produce useful results using only samples e.g. `significant_terms` + + +Example: + +[source,js] +-------------------------------------------------- +{ + "query": { + "match": { + "text": "iphone" + } + }, + "aggs": { + "sample": { + "sampler": { + "shard_size": 200, + "field" : "user.id" + }, + "aggs": { + "keywords": { + "significant_terms": { + "field": "text" + } + } + } + } + } +} +-------------------------------------------------- + +Response: + +[source,js] +-------------------------------------------------- +{ + ... + "aggregations": { + "sample": { + "doc_count": 1000,<1> + "keywords": {<2> + "doc_count": 1000, + "buckets": [ + ... + { + "key": "bend", + "doc_count": 58, + "score": 37.982536582524276, + "bg_count": 103 + }, + .... +} +-------------------------------------------------- + +<1> 1000 documents were sampled in total becase we asked for a maximum of 200 from an index with 5 shards. The cost of performing the nested significant_terms aggregation was therefore limited rather than unbounded. +<2> The results of the significant_terms aggregation are not skewed by any single over-active Twitter user because we asked for a maximum of one tweet from any one user in our sample. + + +==== shard_size + +The `shard_size` parameter limits how many top-scoring documents are collected in the sample processed on each shard. +The default value is 100. + +=== Controlling diversity +Optionally, you can use the `field` or `script` and `max_docs_per_value` settings to control the maximum number of documents collected on any one shard which share a common value. +The choice of value (e.g. `author`) is loaded from a regular `field` or derived dynamically by a `script`. + +The aggregation will throw an error if the choice of field or script produces multiple values for a document. +It is currently not possible to offer this form of de-duplication using many values, primarily due to concerns over efficiency. + +NOTE: Any good market researcher will tell you that when working with samples of data it is important +that the sample represents a healthy variety of opinions rather than being skewed by any single voice. +The same is true with aggregations and sampling with these diversify settings can offer a way to remove the bias in your content (an over-populated geography, a large spike in a timeline or an over-active forum spammer). + +==== Field + +Controlling diversity using a field: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "sample" : { + "sampler" : { + "field" : "author", + "max_docs_per_value" : 3 + } + } + } +} +-------------------------------------------------- + +Note that the `max_docs_per_value` setting applies on a per-shard basis only for the purposes of shard-local sampling. +It is not intended as a way of providing a global de-duplication feature on search results. + + + +==== Script + +Controlling diversity using a script: + +[source,js] +-------------------------------------------------- +{ + "aggs" : { + "sample" : { + "sampler" : { + "script" : "doc['author'].value + '/' + doc['genre'].value" + } + } + } +} +-------------------------------------------------- +Note in the above example we chose to use the default `max_docs_per_value` setting of 1 and combine author and genre fields to ensure +each shard sample has, at most, one match for an author/genre pair. + + +==== execution_hint + +When using the settings to control diversity, the optional `execution_hint` setting can influence the management of the values used for de-duplication. +Each option will hold up to `shard_size` values in memory while performing de-duplication but the type of value held can be controlled as follows: + + - hold field values directly (`map`) + - hold ordinals of the field as determined by the Lucene index (`global_ordinals`) + - hold hashes of the field values - with potential for hash collisions (`bytes_hash`) + +The default setting is to use `global_ordinals` if this information is available from the Lucene index and reverting to `map` if not. +The `bytes_hash` setting may prove faster in some cases but introduces the possibility of false positives in de-duplication logic due to the possibility of hash collisions. +Please note that Elasticsearch will ignore the choice of execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints. + +=== Limitations + +==== Cannot be nested under `breadth_first` aggregations +Being a quality-based filter the sampler aggregation needs access to the relevance score produced for each document. +It therefore cannot be nested under a `terms` aggregation which has the `collect_mode` switched from the default `depth_first` mode to `breadth_first` as this discards scores. +In this situation an error will be thrown. + +==== Limited de-dup logic. +The de-duplication logic in the diversify settings applies only at a shard level so will not apply across shards. + +==== No specialized syntax for geo/date fields +Currently the syntax for defining the diversifying values is defined by a choice of `field` or `script` - there is no added syntactical sugar for expressing geo or date units such as "1w" (1 week). +This support may be added in a later release and users will currently have to create these sorts of values using a script.
\ No newline at end of file |