Finding Matching Strings Algorithm

Question

I have very long 5 strings (the number of strings may change).There is no fixed format for these strings. I will provide a number which will indicate the length of the substring. I want to find the matching substrings with the given length. For example the strings are:

      1.     abcabcabc
      2.     abcasdfklop

string length: 3

Given these values the output will be something like this:

Match #1:

 Matched string :               "abc"

 Matches in first string:        3

 Matching positions:             0,3,6

 Matches in second string:       1

 Match positions:                0

Match #2:

 Matched string :               "bca"

 Matches in first string:        2

 Matching positions:             1,4

 Matches in second string:       1

 Match    positions:             1

I managed to do it in 4 foreach statement. But it seemed to me too unefficient. Especially if the input sizes are very big.Is there any suggestion or short way to manage this more efficient in c#?

tmyklebu · Accepted Answer

You can do this with a suffix array. (Suffix trees will work fine too, but they require a bit more space, time, and care in implementation.)

Concatenate your two strings, separating them with a character that occurs in neither one. Then build a suffix array. Then you can read off your answer.

Standard suffix arrays give you a lexicographically sorted array of pointers to suffixes of the string together with a "longest common prefix length" array telling you how long the longest common prefix of two lexicographically consecutive suffixes is.

It is fairly straightforward to use the longest common prefix length array to get the information you want; find all maximal subarrays of the longest common prefix length array for which the longest common prefix length is at least the query length, then, for each one that has a match both in the first string and in the second string, report the appropriate prefix and report that it occurs K+1 times, where K is the length of the maximal subarray.

Another approach that's easier to code is to hash all substrings of the appropriate length. You can do this easily with any rolling hash function. Store a dynamic array of pointers into the strings for each hash; once you've hashed all the strings, iterate over all of the hashes that came up and look for matches. You'll need to deal with the false positives somehow; one (probabilistic) approach is to use several hash functions until the false positive probability is acceptably small. Another approach, which is likely only acceptable in the case where you have few matches, is to compare the strings directly.

Finding Matching Strings Algorithm

Tags:

performance

substring

c#

.net

algorithm

Ozgur Dogus

1 Answers

tmyklebu

Recent Activity

Donate For Us

Finding Matching Strings Algorithm

Tags:

performance

substring

c#

.net

algorithm

Ozgur Dogus

1 Answers

tmyklebu

Related questions

Recent Activity

Donate For Us