I have this index:
{
  "settings": {
    "index": {
      "number_of_replicas": 0,
      "analysis": {
        "analyzer": {
          "default": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": [
              "lowercase",
              "my_ngram"
            ]
          }
        },
        "filter": {
          "my_ngram": {
            "type": "nGram",
            "min_gram": 2,
            "max_gram": 20
          }
        }
      }
    }
  }
}
and I'm performing this search through the tire gem:
{
  "query": {
    "query_string": {
      "query": "xyz",
      "default_operator": "AND"
    }
  },
  "sort": [
    {
      "count": "desc"
    }
  ],
  "filter": {
    "term": {
      "active": true,
      "_type": null
    }
  },
  "highlight": {
    "fields": {
      "name": {}
    },
    "pre_tags": [
      "<strong>"
    ],
    "post_tags": [
      "</strong>"
    ]
  }
}
I have two posts that should match, named 'xyz post' and 'xyz question'. When I perform this search, I get the highlighted fields back properly:
<strong>xyz</strong> question
<strong>xyz</strong> post
Now here's the thing: as soon as I change min_gram to 1 in my index and reindex, the highlighted fields start coming back like this:
<strong>x</strong><strong>y</strong><strong>z</strong> pos<strong>xyz</strong>t
<strong>x</strong><strong>y</strong><strong>z</strong> questio<strong>xyz</strong>n
I simply cannot understand why.
You need to check your mapping and see whether you are using the fast vector highlighter. Even then, you need to be quite careful about your queries.
The following assumes a fresh instance of ES 0.20.4 on localhost.
Building on your example, let's add explicit mappings. Note that I set up two sub-fields for the code field. The only difference between them is "term_vector":"with_positions_offsets" on code.ngram.
curl -X PUT localhost:9200/myindex -d '
{
  "settings" : {
    "index" : {
      "number_of_replicas" : 0,
      "number_of_shards" : 1,
      "analysis" : {
        "analyzer" : {
          "default" : {
            "type" : "custom",
            "tokenizer" : "keyword",
            "filter" : [
              "lowercase",
              "my_ngram"
            ]
          }
        },
        "filter" : {
          "my_ngram" : {
            "type" : "nGram",
            "min_gram" : 1,
            "max_gram" : 20
          }
        }
      }
    }
  },
  "mappings" : {
    "product" : {
      "properties" : {
        "code" : {
          "type" : "multi_field",
          "fields" : {
            "code" : {
              "type" : "string",
              "analyzer" : "default",
              "store" : "yes"
            },
            "code.ngram" : {
              "type" : "string",
              "analyzer" : "default",
              "store" : "yes",
              "term_vector" : "with_positions_offsets"
            }
          }
        }
      }
    }
  }
}'
Index some data.
curl -X POST 'localhost:9200/myindex/product' -d '{
  "code" : "Samsung Galaxy i7500"
}'
curl -X POST 'localhost:9200/myindex/product' -d '{
  "code" : "Samsung Galaxy 5 Europa"
}'
curl -X POST 'localhost:9200/myindex/product' -d '{
  "code" : "Samsung Galaxy Mini"
}'
And now we can run queries.
curl -X GET 'localhost:9200/myindex/product/_search?pretty' -d '{
  "fields" : [ "code" ],
  "query" : {
    "term" : {
      "code" : "i"
    }
  },
  "highlight" : {
    "number_of_fragments" : 0,
    "fields" : {
      "code" : {},
      "code.ngram" : {}
    }
  }
}'
This yields two search hits:
# 1
...
"fields" : {
  "code" : "Samsung Galaxy Mini"
},
"highlight" : {
  "code.ngram" : [ "Samsung Galaxy M<em>i</em>n<em>i</em>" ],
  "code" : [ "Samsung Galaxy M<em>i</em>n<em>i</em>" ]
}
# 2
...
"fields" : {
  "code" : "Samsung Galaxy i7500"
},
"highlight" : {
  "code.ngram" : [ "Samsung Galaxy <em>i</em>7500" ],
  "code" : [ "Samsung Galaxy <em>i</em>7500" ]
}
Both the code and code.ngram fields were highlighted correctly this time. But things change quickly when a longer query is used:
curl -X GET 'localhost:9200/myindex/product/_search?pretty' -d '{
  "fields" : [ "code" ],
  "query" : {
    "term" : {
      "code" : "y m"
    }
  },
  "highlight" : {
    "number_of_fragments" : 0,
    "fields" : {
      "code" : {},
      "code.ngram" : {}
    }
  }
}'
This yields:
"fields" : {
  "code" : "Samsung Galaxy Mini"
},
"highlight" : {
  "code.ngram" : [ "Samsung Galax<em>y M</em>ini" ],
  "code" : [ "Samsung Galaxy Min<em>y M</em>i" ]
}
The code field is not highlighted correctly (similar to your case), while code.ngram, the field with "term_vector":"with_positions_offsets", is.
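To make the role of character offsets more concrete, here is a minimal Python sketch of offset-based tagging. It is only an illustration, not Elasticsearch's highlighter code: the helper name highlight and the way the span is located (a plain substring search) are assumptions made for the example. The point is that once the start and end offsets of the matched term are known, the tags land in the right place.

```python
def highlight(text, spans, pre="<em>", post="</em>"):
    """Wrap each (start, end) character span of `text` in pre/post tags.

    Hypothetical helper for illustration; assumes the spans do not overlap.
    """
    out = []
    cursor = 0
    for start, end in sorted(spans):
        out.append(text[cursor:start])              # untouched text before the match
        out.append(pre + text[start:end] + post)    # the tagged match
        cursor = end
    out.append(text[cursor:])                       # untouched tail
    return "".join(out)


text = "Samsung Galaxy Mini"
# Stand-in for a stored offset: locate the matched gram in the original text.
start = text.find("y M")
print(highlight(text, [(start, start + len("y M"))]))
# Samsung Galax<em>y M</em>ini
```

With the correct offsets the result matches what code.ngram returned above. The broken output on the code field looks like the same kind of tagging applied at the wrong position.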
One important detail: a term query is used here instead of query_string. A term query is not analyzed, so the query text is looked up as a single term.
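To see why min_gram matters, it can help to enumerate the terms this analyzer chain would roughly produce. The Python sketch below is an approximation, not Lucene's implementation: the function names ngrams and analyze are made up for the example, and the order in which the real nGram filter emits grams may differ. It assumes the keyword tokenizer keeps the whole input as one token, which is then lowercased and split into grams.

```python
def ngrams(token, min_gram, max_gram):
    """Return every substring of `token` with length in [min_gram, max_gram]."""
    grams = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


def analyze(text, min_gram, max_gram=20):
    # keyword tokenizer: the whole input is a single token
    # lowercase filter, then the n-gram filter
    return ngrams(text.lower(), min_gram, max_gram)


print(analyze("xyz", min_gram=2))  # ['xy', 'yz', 'xyz']
print(analyze("xyz", min_gram=1))  # ['x', 'y', 'z', 'xy', 'yz', 'xyz']
```

With min_gram set to 1, the single characters x, y and z become terms of their own. If the query_string text is analyzed with the same default analyzer, the query contains those single-character terms too, which seems consistent with each letter being wrapped in its own tag in the question's output.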