[[index-modules-similarity]]
== Similarity module

A similarity (scoring / ranking model) defines how matching documents
are scored. Similarity is configured per field, so the mapping can
assign a different similarity to each field.

Configuring a custom similarity is considered an expert feature, and the
built-in similarities are most likely sufficient, as described in
<<similarity>>.

[float]
[[configuration]]
=== Configuring a similarity

Most existing or custom similarities have configuration options, which
are set via the index settings as shown below. The index options can be
provided when creating an index or when updating index settings.

[source,js]
--------------------------------------------------
"similarity" : {
  "my_similarity" : {
    "type" : "DFR",
    "basic_model" : "g",
    "after_effect" : "l",
    "normalization" : "h2",
    "normalization.h2.c" : "3.0"
  }
}
--------------------------------------------------

Here we configure the DFRSimilarity so it can be referenced as
`my_similarity` in mappings, as illustrated in the example below:

[source,js]
--------------------------------------------------
{
  "book" : {
    "properties" : {
      "title" : { "type" : "text", "similarity" : "my_similarity" }
    }
  }
}
--------------------------------------------------
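
The settings and mapping fragments above can be combined into a single
create-index request. The following is a minimal sketch, assuming an
index named `my_index` and the `book` mapping type from the example
above; adjust both to match your own index:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "settings": {
    "index": {
      "similarity": {
        "my_similarity": {
          "type": "DFR",
          "basic_model": "g",
          "after_effect": "l",
          "normalization": "h2",
          "normalization.h2.c": "3.0"
        }
      }
    }
  },
  "mappings": {
    "book": {
      "properties": {
        "title": { "type": "text", "similarity": "my_similarity" }
      }
    }
  }
}
--------------------------------------------------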

[float]
=== Available similarities

[float]
[[bm25]]
==== BM25 similarity (*default*)

TF/IDF based similarity that has built-in tf normalization and
is supposed to work better for short fields (like names). See
http://en.wikipedia.org/wiki/Okapi_BM25[Okapi_BM25] for more details.
This similarity has the following options:

[horizontal]
`k1`::
    Controls non-linear term frequency normalization
    (saturation). The default value is `1.2`.

`b`::
    Controls to what degree document length normalizes tf values.
    The default value is `0.75`.

`discount_overlaps`::
    Determines whether overlap tokens (tokens with
    0 position increment) are ignored when computing the norm. By default this
    is true, meaning overlap tokens do not count when computing norms.

Type name: `BM25`
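
As an illustration, a custom BM25 similarity could be declared in the
index settings along the following lines. The name `my_bm25` is
hypothetical, and the values shown are simply the defaults listed above
written out explicitly:

[source,js]
--------------------------------------------------
"similarity" : {
  "my_bm25" : {
    "type" : "BM25",
    "k1" : 1.2,
    "b" : 0.75,
    "discount_overlaps" : true
  }
}
--------------------------------------------------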

[float]
[[classic-similarity]]
==== Classic similarity

The classic similarity that is based on the TF/IDF model. This
similarity has the following option:

`discount_overlaps`::
    Determines whether overlap tokens (tokens with
    0 position increment) are ignored when computing the norm. By default this
    is true, meaning overlap tokens do not count when computing norms.

Type name: `classic`

[float]
[[drf]]
==== DFR similarity

Similarity that implements the
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
from randomness] framework. This similarity has the following options:

[horizontal]
`basic_model`::
    Possible values: `be`, `d`, `g`, `if`, `in`, `ine` and `p`.

`after_effect`::
    Possible values: `no`, `b` and `l`.

`normalization`::
    Possible values: `no`, `h1`, `h2`, `h3` and `z`.

All `normalization` options except the first (`no`) need a normalization
value, such as `normalization.h2.c` in the configuration example above.

Type name: `DFR`

[float]
[[dfi]]
==== DFI similarity

Similarity that implements the http://trec.nist.gov/pubs/trec21/papers/irra.web.nb.pdf[divergence from independence]
model. This similarity has the following option:

[horizontal]
`independence_measure`:: Possible values: `standardized`, `saturated` and `chisquared`.

Type name: `DFI`
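
A DFI similarity might be configured as in the sketch below. The name
`my_dfi` is hypothetical, and `standardized` is picked only as one of
the possible values listed above:

[source,js]
--------------------------------------------------
"similarity" : {
  "my_dfi" : {
    "type" : "DFI",
    "independence_measure" : "standardized"
  }
}
--------------------------------------------------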

[float]
[[ib]]
==== IB similarity

Similarity that implements the
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/IBSimilarity.html[information
based model]. The algorithm is based on the concept that the information content in any symbolic 'distribution'
sequence is primarily determined by the repetitive usage of its basic elements.
For written texts, this challenge would correspond to comparing the writing styles of different authors.
This similarity has the following options:

[horizontal]
`distribution`::  Possible values: `ll` and `spl`.
`lambda`::        Possible values: `df` and `ttf`.
`normalization`:: Same as in `DFR` similarity.

Type name: `IB`
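
The following sketch shows one way an IB similarity could be declared.
The name `my_ib` is hypothetical. The normalization settings are copied
from the DFR configuration example at the top of this page, relying on
the statement above that IB accepts the same `normalization` options as
DFR:

[source,js]
--------------------------------------------------
"similarity" : {
  "my_ib" : {
    "type" : "IB",
    "distribution" : "ll",
    "lambda" : "df",
    "normalization" : "h2",
    "normalization.h2.c" : "3.0"
  }
}
--------------------------------------------------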

[float]
[[lm_dirichlet]]
==== LM Dirichlet similarity

http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
Dirichlet similarity]. This similarity has the following option:

[horizontal]
`mu`::  Defaults to `2000`.

Type name: `LMDirichlet`
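
For example, an LM Dirichlet similarity could be declared as sketched
below. The name `my_lm_dirichlet` is hypothetical, and `mu` is set to
its default only to show where the option goes:

[source,js]
--------------------------------------------------
"similarity" : {
  "my_lm_dirichlet" : {
    "type" : "LMDirichlet",
    "mu" : 2000
  }
}
--------------------------------------------------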

[float]
[[lm_jelinek_mercer]]
==== LM Jelinek Mercer similarity

http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
Jelinek Mercer similarity]. The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following option:

[horizontal]
`lambda`::  The optimal value depends on both the collection and the query. The optimal value is around `0.1`
for title queries and `0.7` for long queries. Defaults to `0.1`. As the value approaches `0`, documents that match more query terms are ranked higher than those that match fewer terms.

Type name: `LMJelinekMercer`
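
As a sketch, a field that mostly serves long queries might use a
similarity like the one below. The name `my_lm_jelinek_mercer` is
hypothetical, and `0.7` is the value suggested above for long queries,
not a general recommendation:

[source,js]
--------------------------------------------------
"similarity" : {
  "my_lm_jelinek_mercer" : {
    "type" : "LMJelinekMercer",
    "lambda" : 0.7
  }
}
--------------------------------------------------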

[float]
[[default-base]]
==== Default and Base Similarities

By default, Elasticsearch will use whatever similarity is configured as
`default`. However, the similarity functions `queryNorm()` and `coord()`
are not per-field. Consequently, for expert users wanting to change the
implementation used for these two methods, while not changing the
`default`, it is possible to configure a similarity with the name
`base`. This similarity will then be used for the two methods.

You can change the default similarity for all fields in an index when
it is <<indices-create-index,created>>:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "classic"
        }
      }
    }
  }
}
--------------------------------------------------

If you want to change the default similarity after creating the index,
you must <<indices-open-close,close>> your index, send the following
request and <<indices-open-close,open>> it again afterwards:

[source,js]
--------------------------------------------------
PUT /my_index/_settings
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "classic"
        }
      }
    }
  }
}
--------------------------------------------------
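
Put together, the full close, update and reopen sequence might look like
the following sketch. It assumes the `my_index` index from the examples
above and the standard `_close` and `_open` index endpoints; note that
the index cannot serve searches while it is closed:

[source,js]
--------------------------------------------------
POST /my_index/_close

PUT /my_index/_settings
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "classic"
        }
      }
    }
  }
}

POST /my_index/_open
--------------------------------------------------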