Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding Matching Strings Algorithm

I have very long 5 strings (the number of strings may change).There is no fixed format for these strings. I will provide a number which will indicate the length of the substring. I want to find the matching substrings with the given length. For example the strings are:

      1.     abcabcabc
      2.     abcasdfklop
     

string length: 3

Given these values the output will be something like this:

Match #1:

 Matched string :               "abc"

 Matches in first string:        3

 Matching positions:             0,3,6

 Matches in second string:       1

 Match positions:                0

Match #2:

 Matched string :               "bca"

 Matches in first string:        2

 Matching positions:             1,4

 Matches in second string:       1

 Match    positions:             1

I managed to do it in 4 foreach statement. But it seemed to me too unefficient. Especially if the input sizes are very big.Is there any suggestion or short way to manage this more efficient in c#?

like image 331
Ozgur Dogus Avatar asked Jan 25 '26 20:01

Ozgur Dogus


1 Answers

You can do this with a suffix array. (Suffix trees will work fine too, but they require a bit more space, time, and care in implementation.)

Concatenate your two strings, separating them with a character that occurs in neither one. Then build a suffix array. Then you can read off your answer.

Standard suffix arrays give you a lexicographically sorted array of pointers to suffixes of the string together with a "longest common prefix length" array telling you how long the longest common prefix of two lexicographically consecutive suffixes is.

It is fairly straightforward to use the longest common prefix length array to get the information you want; find all maximal subarrays of the longest common prefix length array for which the longest common prefix length is at least the query length, then, for each one that has a match both in the first string and in the second string, report the appropriate prefix and report that it occurs K+1 times, where K is the length of the maximal subarray.

Another approach that's easier to code is to hash all substrings of the appropriate length. You can do this easily with any rolling hash function. Store a dynamic array of pointers into the strings for each hash; once you've hashed all the strings, iterate over all of the hashes that came up and look for matches. You'll need to deal with the false positives somehow; one (probabilistic) approach is to use several hash functions until the false positive probability is acceptably small. Another approach, which is likely only acceptable in the case where you have few matches, is to compare the strings directly.

like image 190
tmyklebu Avatar answered Jan 27 '26 10:01

tmyklebu



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!