I have very long 5 strings (the number of strings may change).There is no fixed format for these strings. I will provide a number which will indicate the length of the substring. I want to find the matching substrings with the given length. For example the strings are:
1. abcabcabc
2. abcasdfklop
string length: 3
Given these values the output will be something like this:
Match #1:
Matched string : "abc"
Matches in first string: 3
Matching positions: 0,3,6
Matches in second string: 1
Match positions: 0
Match #2:
Matched string : "bca"
Matches in first string: 2
Matching positions: 1,4
Matches in second string: 1
Match positions: 1
I managed to do it in 4 foreach statement. But it seemed to me too unefficient. Especially if the input sizes are very big.Is there any suggestion or short way to manage this more efficient in c#?
You can do this with a suffix array. (Suffix trees will work fine too, but they require a bit more space, time, and care in implementation.)
Concatenate your two strings, separating them with a character that occurs in neither one. Then build a suffix array. Then you can read off your answer.
Standard suffix arrays give you a lexicographically sorted array of pointers to suffixes of the string together with a "longest common prefix length" array telling you how long the longest common prefix of two lexicographically consecutive suffixes is.
It is fairly straightforward to use the longest common prefix length array to get the information you want; find all maximal subarrays of the longest common prefix length array for which the longest common prefix length is at least the query length, then, for each one that has a match both in the first string and in the second string, report the appropriate prefix and report that it occurs K+1 times, where K is the length of the maximal subarray.
Another approach that's easier to code is to hash all substrings of the appropriate length. You can do this easily with any rolling hash function. Store a dynamic array of pointers into the strings for each hash; once you've hashed all the strings, iterate over all of the hashes that came up and look for matches. You'll need to deal with the false positives somehow; one (probabilistic) approach is to use several hash functions until the false positive probability is acceptably small. Another approach, which is likely only acceptable in the case where you have few matches, is to compare the strings directly.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With