
Force match_phrase to exclude full email addresses when searching only for a domain

I'd like to find the string outlook.com inside a text field of my Elasticsearch index using a match_phrase query. However, I don't want results where it is only part of an email address such as something@outlook.com, which this query also returns:

GET /my_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "should": [],
      "must": [
        {
          "match_phrase": {
            "message": {
              "query": "outlook.com",
              "slop": 0
            }
          }
        }
      ]
    }
  }
}

I think these results are returned because the tokenizer of the standard analyzer splits an address like something@outlook.com into the tokens [something] and [outlook.com], treating @ as a separator.
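
One way to check this hypothesis is the _analyze API, which returns the tokens an analyzer produces for a given text. A minimal sketch, assuming the index name my_index from above and a made-up sample sentence:

GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "write to something@outlook.com"
}

If the explanation above is right, the response lists something and outlook.com as two separate tokens, which is why a phrase query for outlook.com also matches the address. Running the same request with "analyzer": "whitespace" should instead return the whole address as a single token.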

I tried setting the whitespace analyzer so that the address is kept as the single token [something@outlook.com] and full email addresses are no longer returned. But this query:

GET /my_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "should": [],
      "must": [
        {
          "match_phrase": {
            "message": {
              "query": "outlook.com",
              "slop": 0,
              "analyzer": "whitespace"
            }
          }
        }
      ]
    }
  }
}

still finds results like something@outlook.com. How can I exclude them?

UPDATE:

In my mapping, I set the standard analyzer some time ago. So my guess is that even if I use the whitespace analyzer at search time, the documents were already tokenized with the standard one, and that tokenization can no longer be changed after indexing.
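
If that intuition holds, the whitespace tokenization has to exist at index time as well. A possible way to get it without rebuilding the index from scratch is to add a multi-field analyzed with whitespace and then re-process the existing documents so the new subfield is populated. This is only a sketch: the subfield name ws is hypothetical, and the existing parameters of message (here type text with the standard analyzer) are assumed and would have to be restated exactly as they appear in the current mapping.

PUT /my_index/_mapping
{
  "properties": {
    "message": {
      "type": "text",
      "analyzer": "standard",
      "fields": {
        "ws": {
          "type": "text",
          "analyzer": "whitespace"
        }
      }
    }
  }
}

POST /my_index/_update_by_query?conflicts=proceed

GET /my_index/_search
{
  "size": 1,
  "query": {
    "match_phrase": {
      "message.ws": "outlook.com"
    }
  }
}

On message.ws an address such as something@outlook.com is indexed as one token, so the phrase outlook.com should no longer match it. The trade-off is that the whitespace analyzer neither lowercases nor strips punctuation, so Outlook.com or a domain followed directly by a comma would not match on this subfield either.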

I tried a Painless script to match a certain pattern, but my field is of type text, so the search takes too long.

Otherwise, a regexp query can do something similar:

GET /my_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "should": [],
      "must": [
        {
          "regexp": {
            "message": ".*[^A-Za-z0-9\\@]outlook\\.com[^A-Za-z0-9\\@].*"
          }
        }
      ]
    }
  }
}

Unfortunately, according to the regexp syntax documentation, only a limited set of operators is available. With [^A-Za-z0-9\\@] I mean any character that is neither @ nor alphanumeric, placed before and after outlook.com (this simulates the word boundary that the match_phrase query would give). My problem is that if the field starts or ends with outlook.com, the document is not retrieved, because the regex requires a character before and after it ([^A-Za-z0-9\\@] doesn't match the empty string).
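
One way to cover the start and end of the field with the operators that are available is to make the surrounding context optional with a group followed by ?. A hedged sketch: because regexp is a term-level query and is matched against individual indexed terms, this assumes a hypothetical keyword subfield message.raw that holds the whole message as a single term. The escaped dot and the case_insensitive flag (available in recent Elasticsearch versions) are additions as well.

GET /my_index/_search
{
  "size": 1,
  "query": {
    "regexp": {
      "message.raw": {
        "value": "(.*[^A-Za-z0-9\\@])?outlook\\.com([^A-Za-z0-9\\@].*)?",
        "case_insensitive": true
      }
    }
  }
}

Each optional group either matches nothing, which covers a value that begins or ends with outlook.com, or matches arbitrary text whose character next to the domain is neither alphanumeric nor @. Patterns with a leading .* can be slow on large indices.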

Paolo Magnani asked Dec 05 '25 18:12


1 Answer

You can use a regexp query instead of match_phrase, like this:

{
  "query": {
    "bool": {
      "must": [
        {
          "regexp": {
            "message": ".*[^@]outlook\\.com"
          }
        }
      ]
    }
  }
}

Mouad Slimane answered Dec 09 '25 17:12


