I'm trying to find a substring of a string text that is an anagram of a string pattern.
My Question: Could the Rabin-Karp algorithm be adjusted to this purpose? Or are there better algorithms?
I have tried out a brute-force algorithm, which did not work in my case because the text and the pattern can each be up to one million characters.
Update: I've heard there is a worst-case O(n^2) algorithm that uses O(1) space. Does anyone know what this algorithm is?
Update 2: For reference, here is pseudocode for the Rabin-Karp algorithm:
function RabinKarp(string s[1..n], string sub[1..m])
    hsub := hash(sub[1..m]);  hs := hash(s[1..m])
    for i from 1 to n-m+1
       if hs = hsub
          if s[i..i+m-1] = sub
              return i
       hs := hash(s[i+1..i+m])
    return not found
This uses a rolling hash function so that each new hash can be calculated in O(1). The overall search is O(nm) in the worst case, but with a good hash function it is O(m + n) in the best case. Is there a rolling hash function that would produce few collisions when searching for anagrams of the string?
Compute a hash of the pattern that doesn't depend on the order of the letters in the pattern (for example, use the sum of the character codes of its letters). Then apply the same hash function in "rolling" fashion to the text, as in Rabin-Karp. If the hashes match, you still need to perform a full anagram test of the pattern against the current window in the text, because different sets of letters can produce the same hash.
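A minimal Python sketch of the rolling-sum idea above. The function name find_anagram_sum_hash and the use of collections.Counter for the full check are my own choices for illustration, not part of the answer; any exact anagram test would do for the verification step.

```python
from collections import Counter

def find_anagram_sum_hash(text, pattern):
    """Return the index of the first window of text that is an anagram
    of pattern, or -1 if there is none."""
    n, m = len(text), len(pattern)
    if m > n:
        return -1
    # Order-independent hash: the sum of the character codes.
    target = sum(map(ord, pattern))
    target_counts = Counter(pattern)
    window = sum(map(ord, text[:m]))
    for i in range(n - m + 1):
        # Equal sums only suggest a match ("ad" and "bc" have the same sum),
        # so confirm with a full letter-count comparison.
        if window == target and Counter(text[i:i + m]) == target_counts:
            return i
        if i + m < n:
            # Roll the hash: add the entering character, drop the leaving one.
            window += ord(text[i + m]) - ord(text[i])
    return -1
```

For example, find_anagram_sum_hash("adcb", "bc") first hits a false positive on "ad" and only reports the real match after the full check passes. The sum collides often, so on adversarial input the verification step dominates the running time.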
By associating each symbol in your alphabet to a prime number, then computing the product of those prime numbers as your hash code, you will have fewer collisions.
There is, however, a bit of mathematical trickery that will assist you if you want to compute a running product like this: each time you step the window, multiply the running hash-code by the multiplicative inverse of the code for the symbol that is leaving the window, then multiply by the code for the symbol that is entering the window.
As an example, suppose you are computing the hash of letters 'a'–'z' as an unsigned, 64-bit value. Use a table like this:
symbol | code | code^-1
-------+------+---------------------
a      |    3 | 12297829382473034411
b      |    5 | 14757395258967641293
c      |    7 |  7905747460161236407
d      |   11 |  3353953467947191203
e      |   13 |  5675921253449092805
...
z      |  103 | 15760325033848937303
The multiplicative inverse of n is the number that yields 1 when multiplied by n, modulo some number. The modulus here is 2^64, since you are using 64-bit numbers. So 5 * 14757395258967641293 should be 1 (mod 2^64), for example. This works because unsigned 64-bit multiplication is arithmetic in the ring of integers modulo 2^64, in which every odd number has an inverse.
Computing a list of the first primes is easy, and your platform should have a library to efficiently compute the multiplicative inverse of these numbers.
Start coding with the number 3 because 2 is not co-prime with the modulus (a power of 2 on whatever processor you are working on), so it cannot be inverted.
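Here is a rough Python sketch of the prime-product hash, assuming the text and pattern contain only the lowercase letters 'a' to 'z'. The names CODE, INVERSE, first_odd_primes and find_anagram_prime_hash are made up for this example. It relies on the built-in pow(p, -1, modulus), which computes modular inverses in Python 3.8 and later, and it masks to 64 bits to imitate unsigned overflow.

```python
MODULUS = 1 << 64
MASK = MODULUS - 1

def first_odd_primes(count):
    """Return the first `count` odd primes (3, 5, 7, ...) by trial division."""
    primes = []
    candidate = 3
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 2
    return primes

# One odd prime per letter, plus its inverse modulo 2^64.
CODE = {chr(ord("a") + i): p for i, p in enumerate(first_odd_primes(26))}
INVERSE = {ch: pow(p, -1, MODULUS) for ch, p in CODE.items()}

def find_anagram_prime_hash(text, pattern):
    """Return the index of the first anagram of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m > n:
        return -1
    target = 1
    for ch in pattern:
        target = (target * CODE[ch]) & MASK
    window = 1
    for ch in text[:m]:
        window = (window * CODE[ch]) & MASK
    sorted_pattern = sorted(pattern)
    for i in range(n - m + 1):
        # Products wrap around at 2^64, so a hash match is still verified.
        if window == target and sorted(text[i:i + m]) == sorted_pattern:
            return i
        if i + m < n:
            # Divide out the leaving symbol, multiply in the entering one.
            window = (window * INVERSE[text[i]] * CODE[text[i + m]]) & MASK
    return -1
```

The generated table agrees with the one above: INVERSE["a"] is 12297829382473034411 and INVERSE["b"] is 14757395258967641293. In a language with native unsigned 64-bit integers the masking is implicit in the multiplication.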
One option would be to maintain a sliding window holding a histogram of the letters contained within the window. If that histogram ever ends up equal to the character histogram for the string whose anagram should be found, then you know that what you are looking at is a match and can output it. If not, you know that what you have cannot possibly be a match.
More concretely, create an associative array A mapping from characters to their frequencies. If you want to search for an anagram of string P, then read the first |P| characters from the text string T into A and build the histogram appropriately. You can slide the window one step forward and update A in O(1) associative array operations by decrementing the frequency associated with the first character in the window, then incrementing the frequency associated with the new character that has slid into the window.
If the histograms of the current window and the pattern window are very different, then you should be able to compare them rather quickly. Specifically, let's say that your alphabet is Σ. In the worst case, comparing two histograms would take time O(|Σ|), since you'd have to check each character/frequency pair in the histogram A with the reference histogram. In the best case, though, you'd immediately find a character that causes a mismatch between A and the reference histogram, so you would not need to look at many characters overall.
In theory the worst-case runtime for this approach is O(|T||Σ| + |P|), since you have to do O(|P|) work to build the initial histograms, then worst-case O(|Σ|) work per character in T to compare them. However, I'd expect this to be a lot faster in practice.
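The sliding histogram can be sketched in Python as follows, using collections.Counter as the associative array A. The function name find_anagrams_histogram is hypothetical, and returning every match position rather than just the first is my own choice. Entries that drop to zero are deleted so that the two histograms compare equal exactly when the letter counts agree.

```python
from collections import Counter

def find_anagrams_histogram(text, pattern):
    """Return the start index of every window of text that is an
    anagram of pattern."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    target = Counter(pattern)    # reference histogram for P
    window = Counter(text[:m])   # histogram A of the current window
    matches = []
    for i in range(n - m + 1):
        if window == target:     # worst case O(|Σ|) per comparison
            matches.append(i)
        if i + m < n:
            leaving, entering = text[i], text[i + m]
            # Slide the window: drop the first character, add the new one.
            window[leaving] -= 1
            if window[leaving] == 0:
                del window[leaving]
            window[entering] += 1
    return matches
```

For example, find_anagrams_histogram("cbaebabacd", "abc") returns [0, 6]. No separate verification step is needed here, because equal histograms are an exact test rather than a hash.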
Hope this helps!