Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient algorithm to search a buffer for any string from a list

I am looking for an efficient search algorithm, that, for a given set of strings searches a large buffer for any one match from the set of strings. Currently i know a few efficient single-string algorithms (i have used the Knuth before), but i don't know if they really help.

Here is what i am actually doing:

  • I have around 6-10 predefined strings, each around 200-300 characters (actually bytes, since i`m processing binary data)
  • The input is a large, sometimes few megabyte buffer
  • I would like to process the buffer, and when i have a match, i would like to stop the search

I have looked for multiple-string searching algorithms using a finite set of predefined patterns, but they all seem to revolve around matching ALL of the predefined strings in the buffer.

This post: Fast algorithm for searching for substrings in a string, suggested using the Aho–Corasick or the Rabin–Karp alogirthm.

I thought, that since i only need one match, i could find other methods, that are similar to the mentioned algorithms, but the constrains given by the problem can improve the performance.

like image 545
user2281752 Avatar asked Mar 11 '26 23:03

user2281752


1 Answers

Aho-Corasick is a good choice here. After building an automaton the input string is traversed from left to right so it is possible to stop immediately after the first match is found. The time complexity is O(sum of lengths of all patterns + the position of the first occurrence). It is optimal because it is not possible to find the first match without reading all patterns and all the bytes from the buffer before the first occurrence.

like image 146
kraskevich Avatar answered Mar 14 '26 16:03

kraskevich