Dots in a field aren't used to break up words for the analyzer

I have the following mapping for the index documents (simplified):

{
    "documents": {
        "mappings": {
            "document": {
                "properties": {
                    "filename": {
                        "type": "string",
                        "fields": {
                            "lower_case_sort": {
                                "type": "string",
                                "analyzer": "case_insensitive_sort"
                            },
                            "raw": {
                                "type": "string",
                                "index": "not_analyzed"
                            }
                        }
                    }
                }
            }
        }
    }
}

I put two documents into this index:

{
    "_index": "documents",
    "_type": "document",
    "_id": "777",
    "_source": {
        "filename": "text.txt",
    }
}

...

{
    "_index": "documents",
    "_type": "document",
    "_id": "888",
    "_source": {
        "filename": "text 123.txt",
    }
}
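
For reference, a minimal sketch of how these two documents could have been indexed (IDs and filenames taken from the hits above; this is just one way of putting them in, the bulk API would work as well):

curl -XPUT 'http://localhost:9200/documents/document/777' -d '{
  "filename": "text.txt"
}'

curl -XPUT 'http://localhost:9200/documents/document/888' -d '{
  "filename": "text 123.txt"
}'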

Doing a query_string or simple_query_string query for "text", I would have expected to get both documents back. They should match because the filenames are "text.txt" and "text 123.txt".

http://localhost:9200/documents/_search?q=text

However, I only find the document with the name "text 123.txt" - "text.txt" is only found if I search for "text.*" or "text.txt" or "text.???" - I have to include the dot in the filename.

This is my explain result for document id 777 (text.txt):

curl -XGET 'http://localhost:9200/documents/document/777/_explain' -d '{"query": {"query_string" : {"query" : "text"}}}'

-->

{
    "_index": "documents",
    "_type": "document",
    "_id": "777",
    "matched": false,
    "explanation": {
        "value": 0.0,
        "description": "Failure to meet condition(s) of required/prohibited clause(s)",
        "details": [{
            "value": 0.0,
            "description": "no match on required clause (_all:text)",
            "details": [{
                "value": 0.0,
                "description": "no matching term",
                "details": []
            }]
        }, {
            "value": 0.0,
            "description": "match on required clause, product of:",
            "details": [{
                "value": 0.0,
                "description": "# clause",
                "details": []
            }, {
                "value": 0.47650534,
                "description": "_type:document, product of:",
                "details": [{
                    "value": 1.0,
                    "description": "boost",
                    "details": []
                }, {
                    "value": 0.47650534,
                    "description": "queryNorm",
                    "details": []
                }]
            }]
        }]
    }
}

Did I screw up the mapping? I would have thought that the '.' is treated as a term separator when the document is indexed...

Edit: settings of the case_insensitive_sort analyzer

{
    "documents": {
        "settings": {
            "index": {
                "creation_date": "1473169458336",
                "analysis": {
                    "analyzer": {
                        "case_insensitive_sort": {
                            "filter": [
                                "lowercase"
                            ],
                            "tokenizer": "keyword"
                        }
                    }
                }
            }
        }
    }
}
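
(For completeness: this analyzer uses the keyword tokenizer, so it keeps the whole filename as a single lowercased token and only helps for sorting, not for matching individual words. A sketch of checking that against the index, using an arbitrary mixed-case value:

curl -XGET 'localhost:9200/documents/_analyze' -d '
{
  "analyzer" : "case_insensitive_sort",
  "text" : "Text 123.TXT"
}'
# expected: a single token, "text 123.txt"
)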


1 Answer

This is the expected behavior of the standard analyzer (the default analyzer). It uses the standard tokenizer, and according to the word-segmentation algorithm the tokenizer implements, a dot between letters is not considered a separating character.

You can verify this with the help of the _analyze API:

curl -XGET 'localhost:9200/_analyze' -d '
{
  "analyzer" : "standard",
  "text" : "test.txt"
}'

Only a single token is generated:

{
  "tokens": [
    {
      "token": "test.txt",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
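
For contrast, running the same request against the other filename from the question (a sketch; exact offsets omitted) shows why that document did match: the space splits the value, and the dot between the digits and "txt" is also a break point, so "text" ends up as a token of its own.

curl -XGET 'localhost:9200/_analyze' -d '
{
  "analyzer" : "standard",
  "text" : "text 123.txt"
}'
# expected tokens: "text", "123", "txt"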

You could use a pattern_replace character filter to replace dots with spaces:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "replace_dot"
          ]
        }
      },
      "char_filter": {
        "replace_dot": {
          "type": "pattern_replace",
          "pattern": "\\.",
          "replacement": " "
        }
      }
    }
  }
}
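
To actually use it, the filename field has to reference the analyzer. A sketch of the adjusted mapping (field name taken from the question, my_analyzer from the settings above; the lower_case_sort and raw sub-fields from the question can stay as they were):

{
  "mappings": {
    "document": {
      "properties": {
        "filename": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}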

You would have to reindex your documents, and then you will get the desired results. The _analyze API is very handy for checking how your documents are stored in the inverted index.

UPDATE

You would have to specify the name of the field you want to search on. The following request searches for "text" in the _all field, which by default uses the standard analyzer.

http://localhost:9200/documents/_search?q=text

I think the following query should give you the desired result:

curl -XGET 'http://localhost:9200/documents/_search?q=filename:text'
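
Equivalently, as a request body (a sketch; default_field scopes the query_string to the filename field):

curl -XGET 'http://localhost:9200/documents/_search' -d '
{
  "query": {
    "query_string": {
      "query": "text",
      "default_field": "filename"
    }
  }
}'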