
What is the best solution for filtering out Elasticsearch results with hate words?

I want to filter documents containing hate words out of my Elasticsearch results. Currently every search query carries a bool filter that lists all of the words. Because the list of hate words is long (so much hatred around :( ), this results in tons of slow queries.

I was wondering what the best practices are for this kind of spam/hate-word filtering.

Here is what we are considering:

  1. Pre-process: scan each document before indexing and either mark it as bad or skip indexing it. Problem: the documents are indexed by several processes, and it is difficult to enforce the rule on every new component someone writes.

  2. Create a percolator and run it periodically (not sure of the best frequency and timing) to tag all documents containing bad words with "badDoc" : true, then add a filter on that field to all queries. Problem: I am not sure of the performance impact of running the percolator periodically, and there is the same problem of discipline, since every query has to remember to exclude badDoc.
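Both options end up with the same two pieces: something that sets a "badDoc" flag, and a query wrapper that excludes flagged documents. A minimal Python sketch of those pieces follows, assuming the text lives in a hypothetical `body` field and that simple word matching is good enough; the names `tag_document` and `exclude_bad_docs` and the placeholder word list are illustrative only, not part of any Elasticsearch client.

```python
import re

# Placeholder entries; the real list would be loaded from wherever it is maintained.
HATE_WORDS = frozenset({"slur1", "slur2"})


def tag_document(doc, text_field="body", hate_words=HATE_WORDS):
    """Return a copy of doc with a boolean badDoc flag set.

    Assumption: text_field holds plain text and lowercase word tokens
    are an adequate match for the hate-word list.
    """
    tokens = set(re.findall(r"\w+", str(doc.get(text_field, "")).lower()))
    tagged = dict(doc)
    tagged["badDoc"] = not tokens.isdisjoint(hate_words)
    return tagged


def exclude_bad_docs(query):
    """Wrap any query clause so that documents flagged badDoc are dropped."""
    return {
        "bool": {
            "must": [query],
            "must_not": [{"term": {"badDoc": True}}],
        }
    }


if __name__ == "__main__":
    print(tag_document({"body": "contains slur1 somewhere"}))
    print(exclude_bad_docs({"match": {"body": "weather"}}))
```

Routing every search through one wrapper like `exclude_bad_docs` is one way to reduce the discipline problem on the query side, since the exclusion then becomes a single term clause instead of the full word list.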

Personally I would favor a pure ES solution. I am sure this is not a new problem, so I am seeking expert guidance and best practices.

Thanks and regards, Varun

lazywiz asked Oct 21 '25 15:10



1 Answer

Using a percolator to tag bad documents would still require defining a percolator query that includes the search criteria for all the "hate words".

One possible solution without a percolator is to define a synonym list (if you are not using one already) or to extend the existing synonym file in your analyzer. You can define a synonym for all the "hate words" so that they get replaced by a single term, say "badbaddocument". Then at query time you can filter out the bad documents using a simple bool filter containing that single term.
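As a rough illustration of the synonym idea, the Python sketch below builds the index settings and the per-query filter as plain dictionaries. The field name `body`, the filter name `hate_word_marker`, the analyzer name `hate_aware`, and the placeholder words are all assumptions made for the example, and the mapping layout assumes the typeless mappings of recent Elasticsearch versions.

```python
import json

MARKER = "badbaddocument"


def build_index_body(hate_words, marker=MARKER):
    """Index settings whose analyzer rewrites every hate word to one marker token.

    Uses the explicit-mapping form of a synonym rule: "a, b => marker".
    """
    rule = ", ".join(sorted(hate_words)) + " => " + marker
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "hate_word_marker": {"type": "synonym", "synonyms": [rule]}
                },
                "analyzer": {
                    "hate_aware": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # lowercase first so the rule matches any casing
                        "filter": ["lowercase", "hate_word_marker"],
                    }
                },
            }
        },
        # Assumption: the searchable text is in a field called "body".
        "mappings": {
            "properties": {"body": {"type": "text", "analyzer": "hate_aware"}}
        },
    }


def without_marked_docs(user_query, field="body", marker=MARKER):
    """Exclude documents whose indexed text contains the marker token."""
    return {
        "query": {
            "bool": {
                "must": [user_query],
                "must_not": [{"term": {field: marker}}],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_index_body({"slur1", "slur2"}), indent=2))
    print(json.dumps(without_marked_docs({"match": {"body": "weather"}}), indent=2))
```

Two caveats with this sketch: synonyms applied at index time only affect documents indexed after the analyzer is in place, so existing documents would need reindexing whenever the word list changes, and because the hate words are replaced in the index, they can no longer be searched for individually on that field.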

Prabin Meitei answered Oct 23 '25 07:10



