Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Elasticsearch custom analyzer for hyphens, underscores, and numbers

Admittedly, I'm not that well versed on the analysis part of ES. Here's the index layout:

{
    "mappings": {
        "event": {
            "properties": {
                "ipaddress": {
                    "type": "string"
                },
                "hostname": {
                    "type": "string",
                    "analyzer": "my_analyzer",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "filter": {
                "my_filter": {
                    "type": "word_delimiter",
                    "preserve_original": true
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "my_filter"]
                }
            }
        }
    }
}

You can see that I've attempted to use a custom analyzer for the hostname field. This kind of works when I use this query to find the host named "WIN_1":

{
    "query": {
        "match": {
            "hostname": "WIN_1"
        }
    }
}

The issue is that it also returns any hostname that has a 1 in it. Using the _analyze endpoint, I can see that the numbers are tokenized as well.

{
    "tokens": [
        {
            "token": "win_1",
            "start_offset": 0,
            "end_offset": 5,
            "type": "word",
            "position": 1
        },
        {
            "token": "win",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 1
        },
        {
            "token": "1",
            "start_offset": 4,
            "end_offset": 5,
            "type": "word",
            "position": 2
        }
    ]
}

What I'd like to be able to do is search for WIN and get back any host that has WIN in it's name. But I also need to be able to search for WIN_1 and get back that exact host or any host with WIN_1 in it's name. Below is some test data.

{
    "ipaddress": "192.168.1.253",
    "hostname": "WIN_8_ENT_1"
}
{
    "ipaddress": "10.0.0.1",
    "hostname": "server1"
}
{
    "ipaddress": "172.20.10.36",
    "hostname": "ServA-1"
}

Hopefully someone can point me in the right direction. It could be that my simple query isn't the right approach either. I've poured over the ES docs, but they aren't real good with examples.

like image 830
Deviation Avatar asked Sep 03 '25 08:09

Deviation


1 Answers

You could change your analysis to use a pattern analyzer that discards the digits and under scores:

{
   "analysis": {
      "analyzer": {
          "word_only": {
              "type": "pattern",
              "pattern": "([^\p{L}]+)"
          }
       }
    }
}

Using the analyze API:

curl -XGET 'localhost:9200/{yourIndex}/_analyze?analyzer=word_only&pretty=true' -d 'WIN_8_ENT_1'

returns:

"tokens" : [ {
    "token" : "win",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
}, {
    "token" : "ent",
    "start_offset" : 6,
    "end_offset" : 9,
    "type" : "word",
    "position" : 2
} ]

Your mapping would become:

{
    "event": {
        "properties": {
            "ipaddress": {
                 "type": "string"
             },
             "hostname": {
                 "type": "string",
                 "analyzer": "word_only",
                 "fields": {
                     "raw": {
                         "type": "string",
                         "index": "not_analyzed"
                     }
                 }
             }
         }
    }
}

You can use a multi_match query to get the results you want:

{
    "query": {
        "multi_match": {
            "fields": [
                "hostname",
                "hostname.raw"
            ],
            "query": "WIN_1"
       }
   }
}
like image 79
Dan Tuffery Avatar answered Sep 05 '25 01:09

Dan Tuffery