[[query-dsl-common-terms-query]]
=== Common Terms Query

The `common` terms query is a modern alternative to stopwords which
improves the precision and recall of search results (by taking stopwords
into account), without sacrificing performance.

[float]
==== The problem

Every term in a query has a cost. A search for `"The brown fox"`
requires three term queries, one for each of `"the"`, `"brown"` and
`"fox"`, all of which are executed against all documents in the index.
The query for `"the"` is likely to match many documents and thus has a
much smaller impact on relevance than the other two terms.

Previously, the solution to this problem was to ignore terms with high
frequency. By treating `"the"` as a _stopword_, we reduce the index size
and reduce the number of term queries that need to be executed.

The problem with this approach is that, while stopwords have a small
impact on relevance, they are still important. If we remove stopwords,
we lose precision (e.g. we are unable to distinguish between `"happy"`
and `"not happy"`) and we lose recall (e.g. text like `"The The"` or
`"To be or not to be"` would simply not exist in the index).

[float]
==== The solution

The `common` terms query divides the query terms into two groups: more
important (i.e. _low frequency_ terms) and less important (i.e. _high
frequency_ terms which would previously have been stopwords).

First it searches for documents which match the more important terms.
These are the terms which appear in fewer documents and have a greater
impact on relevance.

Then, it executes a second query for the less important terms -- terms
which appear frequently and have a low impact on relevance. But instead
of calculating the relevance score for *all* matching documents, it only
calculates the `_score` for documents already matched by the first
query. In this way the high frequency terms can improve the relevance
calculation without paying the cost of poor performance.
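
The two steps described above can be illustrated with a small Python model.
This is only a sketch of the idea, not how Elasticsearch or Lucene implement
it: the toy corpus `DOCS`, the function `score_common` and the count-based
score are all assumptions made for illustration.

```python
# Toy model of the two-step evaluation; every name here is hypothetical.
DOCS = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "a brown dog and a fox",
}

def score_common(low_freq_terms, high_freq_terms, docs=DOCS):
    """Return {doc_id: score} for documents matching a low frequency term."""
    scores = {}
    for doc_id, text in docs.items():
        tokens = set(text.split())
        # Step 1: only the low frequency terms decide whether a document matches.
        low_hits = sum(1 for t in low_freq_terms if t in tokens)
        if low_hits == 0:
            continue
        # Step 2: high frequency terms are scored only for documents
        # that were already matched in step 1.
        high_hits = sum(1 for t in high_freq_terms if t in tokens)
        scores[doc_id] = low_hits + 0.1 * high_hits
    return scores

result = score_common(low_freq_terms=["brown", "fox"], high_freq_terms=["the"])
```

In this model document 2 contains `"the"` but is never scored, because it
matches none of the low frequency terms. Documents 1 and 3 both match, and
document 1 ranks higher because it also contains the high frequency term.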

If a query consists only of high frequency terms, then a single query is
executed as an `AND` (conjunction) query, in other words all terms are
required. Even though each individual term will match many documents,
the combination of terms narrows down the result set to only the most
relevant documents. The single query can also be executed as an `OR` with a
specific
<<query-dsl-minimum-should-match,`minimum_should_match`>>;
in this case a high enough value should probably be used.

Terms are allocated to the high or low frequency groups based on the
`cutoff_frequency`, which can be specified as an absolute frequency
(`>=1`) or as a relative frequency (`0.0 .. 1.0`). (Remember that document
frequencies are computed on a per shard level as explained in the blog post
{defguide}/relevance-is-broken.html[Relevance is broken].)
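
The allocation of terms by `cutoff_frequency` can be sketched in Python as
follows. The function `split_by_cutoff` and its inputs are hypothetical, and
the exact comparison is an assumption: here a term is treated as high
frequency when its document frequency exceeds the threshold.

```python
import math

def split_by_cutoff(doc_freqs, max_doc, cutoff_frequency):
    """Split terms into (low_freq, high_freq) lists.

    doc_freqs: {term: number of documents containing the term} (hypothetical input)
    max_doc: total number of documents on the shard
    cutoff_frequency: absolute (>= 1) or relative (0.0 .. 1.0)
    """
    if cutoff_frequency >= 1:
        # Absolute form: the value is a document count.
        threshold = cutoff_frequency
    else:
        # Relative form: the value is a fraction of the documents on the shard.
        threshold = math.ceil(cutoff_frequency * max_doc)
    low, high = [], []
    for term, df in doc_freqs.items():
        # Assumption: "high frequency" means strictly above the threshold.
        (high if df > threshold else low).append(term)
    return low, high

doc_freqs = {"this": 9500, "is": 9000, "bonsai": 4, "cool": 7}
low, high = split_by_cutoff(doc_freqs, max_doc=10000, cutoff_frequency=0.001)
```

With the made-up frequencies above, `"this"` and `"is"` land in the high
frequency group and `"bonsai"` and `"cool"` in the low frequency group.
Because the counts are per shard, the same term may be allocated differently
on different shards.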

Perhaps the most interesting property of this query is that it adapts to
domain specific stopwords automatically. For example, on a video hosting
site, common terms like `"clip"` or `"video"` will automatically behave
as stopwords without the need to maintain a manual list.

[float]
==== Examples

In this example, words that have a document frequency greater than 0.1%
(e.g. `"this"` and `"is"`) will be treated as _common terms_.

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "this is bonsai cool",
                "cutoff_frequency": 0.001
            }
        }
    }
}
--------------------------------------------------
// CONSOLE

The number of terms which should match can be controlled with the
<<query-dsl-minimum-should-match,`minimum_should_match`>>
(`high_freq`, `low_freq`), `low_freq_operator` (default `"or"`) and
`high_freq_operator` (default `"or"`) parameters.

For low frequency terms, set the `low_freq_operator` to `"and"` to make
all terms required:

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "nelly the elephant as a cartoon",
                "cutoff_frequency": 0.001,
                "low_freq_operator": "and"
            }
        }
    }
}
--------------------------------------------------
// CONSOLE

which is roughly equivalent to:

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "bool": {
            "must": [
                { "term": { "body": "nelly"}},
                { "term": { "body": "elephant"}},
                { "term": { "body": "cartoon"}}
            ],
            "should": [
                { "term": { "body": "the"}},
                { "term": { "body": "as"}},
                { "term": { "body": "a"}}
            ]
        }
    }
}
--------------------------------------------------
// CONSOLE

Alternatively use
<<query-dsl-minimum-should-match,`minimum_should_match`>>
to specify a minimum number or percentage of low frequency terms which
must be present, for instance:

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "nelly the elephant as a cartoon",
                "cutoff_frequency": 0.001,
                "minimum_should_match": 2
            }
        }
    }
}
--------------------------------------------------
// CONSOLE

which is roughly equivalent to:

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "bool": {
            "must": {
                "bool": {
                    "should": [
                        { "term": { "body": "nelly"}},
                        { "term": { "body": "elephant"}},
                        { "term": { "body": "cartoon"}}
                    ],
                    "minimum_should_match": 2
                }
            },
            "should": [
                { "term": { "body": "the"}},
                { "term": { "body": "as"}},
                { "term": { "body": "a"}}
            ]
        }
    }
}
--------------------------------------------------
// CONSOLE

[float]
===== `minimum_should_match`

A different
<<query-dsl-minimum-should-match,`minimum_should_match`>>
can be applied for low and high frequency terms with the additional
`low_freq` and `high_freq` parameters. Here is an example when providing
additional parameters (note the change in structure):

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "nelly the elephant not as a cartoon",
                "cutoff_frequency": 0.001,
                "minimum_should_match": {
                    "low_freq" : 2,
                    "high_freq" : 3
                }
            }
        }
    }
}
--------------------------------------------------
// CONSOLE

which is roughly equivalent to:

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "bool": {
            "must": {
                "bool": {
                    "should": [
                        { "term": { "body": "nelly"}},
                        { "term": { "body": "elephant"}},
                        { "term": { "body": "cartoon"}}
                    ],
                    "minimum_should_match": 2
                }
            },
            "should": {
                "bool": {
                    "should": [
                        { "term": { "body": "the"}},
                        { "term": { "body": "not"}},
                        { "term": { "body": "as"}},
                        { "term": { "body": "a"}}
                    ],
                    "minimum_should_match": 3
                }
            }
        }
    }
}
--------------------------------------------------
// CONSOLE

In this case the high frequency terms only have an impact on relevance
when at least three of them are present. But the most interesting use of
<<query-dsl-minimum-should-match,`minimum_should_match`>>
for high frequency terms is when there are only high frequency terms:

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "how not to be",
                "cutoff_frequency": 0.001,
                "minimum_should_match": {
                    "low_freq" : 2,
                    "high_freq" : 3
                }
            }
        }
    }
}
--------------------------------------------------
// CONSOLE

which is roughly equivalent to:

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "bool": {
            "should": [
                { "term": { "body": "how"}},
                { "term": { "body": "not"}},
                { "term": { "body": "to"}},
                { "term": { "body": "be"}}
            ],
            "minimum_should_match": "3<50%"
        }
    }
}
--------------------------------------------------
// CONSOLE

The query generated for the high frequency terms is then slightly less
restrictive than it would be with an `AND`.
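
The "roughly equivalent" `bool` queries shown in the examples that contain
both low and high frequency terms follow a regular pattern, which can be
sketched in Python. The helper `common_to_bool` and its parameter names are
hypothetical. It only mirrors the shapes of those examples, it does not cover
queries made up solely of high frequency terms, and it is not the rewrite
that Elasticsearch performs internally.

```python
def common_to_bool(field, low_terms, high_terms,
                   low_freq_operator="or", low_freq_msm=None, high_freq_msm=None):
    """Build the approximate bool query for already grouped terms."""
    low = [{"term": {field: t}} for t in low_terms]
    high = [{"term": {field: t}} for t in high_terms]

    if low_freq_operator == "and":
        # Every low frequency term is required.
        must = low
    else:
        # Low frequency terms are optional, subject to minimum_should_match.
        inner = {"should": low}
        if low_freq_msm is not None:
            inner["minimum_should_match"] = low_freq_msm
        must = {"bool": inner}

    if high_freq_msm is None:
        # High frequency terms only contribute to the score.
        should = high
    else:
        should = {"bool": {"should": high,
                           "minimum_should_match": high_freq_msm}}

    return {"query": {"bool": {"must": must, "should": should}}}

rewritten = common_to_bool(
    "body",
    low_terms=["nelly", "elephant", "cartoon"],
    high_terms=["the", "not", "as", "a"],
    low_freq_msm=2,
    high_freq_msm=3,
)
```

Here `rewritten` has the same structure as the `bool` query given for the
`low_freq` / `high_freq` example. The split into low and high frequency
terms is passed in by hand, whereas the real query derives it from
`cutoff_frequency` on each shard.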

The `common` terms query also supports `boost` and `analyzer` as
parameters.
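
As a minimal sketch, the request body for a query that sets these parameters
could be assembled in Python like this. The helper `common_query` is
hypothetical, and the values `2.0` and `"standard"` are illustrative
assumptions rather than recommended settings.

```python
import json

def common_query(field, text, cutoff_frequency, **options):
    """Build a `common` terms query body; extra options are passed through as-is."""
    params = {"query": text, "cutoff_frequency": cutoff_frequency}
    params.update(options)
    return {"query": {"common": {field: params}}}

body = common_query(
    "body",
    "nelly the elephant as a cartoon",
    0.001,
    boost=2.0,            # illustrative value
    analyzer="standard",  # illustrative value
)

# Serialize the body so it can be sent with GET /_search.
payload = json.dumps(body, indent=4)
```

The resulting `payload` places `boost` and `analyzer` alongside `query` and
`cutoff_frequency`, inside the object keyed by the field name.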