Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient algorithm for phrase anagrams

What is an efficient way to produce phrase anagrams given a string?

The problem I am trying to solve

Assume you have a word list with n words. Given an input string, say, "peanutbutter", produce all phrase anagrams. Some contenders are: pea nut butter, A But Ten Erupt, etc.

My solution

I have a trie that contains all words in the given word list. Given an input string, I calculate all permutations of it. For each permutation, I have a recursive solution (something like this) to determine if that specific permuted string can be broken in to words. For example, if one of the permutations of peanutbutter was "abuttenerupt", I used this method to break it into "a but ten erupt". I use the trie to determine if a string is a valid word.

What sucks

My problem is that because I calculate all permutations, my solution runs very slow for phrases that are longer than 10 characters, which is a big let down. I want to know if there is a way to do this in a different way. Websites like https://wordsmith.org/anagram/ can do the job in less than a second and I am curious to know how they do it.

like image 558
Ravi Avatar asked Nov 04 '25 21:11

Ravi


1 Answers

Your problem can be decomposed to 2 sub-problems:

  1. Find combination of words that use up all characters of the input string
  2. Find all permutations of the words found in the first sub-problem

Subproblem #2 is a basic algorithm and you can find existing standard implementation in most programming language. Let's focus on subproblem #1

First convert the input string to a "character pool". We can implement the character pool as an array oc, where oc[c] = number of occurrence of character c.

Then we use backtracking algorithm to find words that fit in the charpool as in this pseudo-code:

result = empty;

function findAnagram(pool)
  if (pool empty) then print result;
  for (word in dictionary) {
    if (word fit in charpool) {
      result = result + word;
      update pool to exclude characters in word;
      findAnagram(pool);

      // as with any backtracking algorithm, we have to restore global states
      restore pool;
      restore result;
    }
  }
}

Note: If we pass the charpool by value then we don't have to restore it. But as it is quite big, I prefer passing it by reference.

Now we remove redundant results and apply some optimizations:

  • Assuming A comes before B in the dictionary. If we choose the first word is B, then we don't have to consider word A in following steps, because those results (if we take A) would already be in the case where A is chosen as the first word

  • If the character set is small enough (< 64 characters is best), we can use a bitmask to quickly filter words that cannot fit in the pool. A bitmask mask which character is in a word, no matter how many time it occurs.

Update the pseudo-code to reflect those optimizations:

function findAnagram(charpool, minDictionaryIndex)
  pool_bitmask <- bitmask(charpool);
  if (pool empty) then print result;
  for (word in dictionary AND word's index >= minDictionaryIndex) {
    // bitmask of every words in the dictionary should be pre-calculated
    word_bitmask <- bitmask(word)
    if (word_bitmask contains bit(s) that is not in pool_bitmask)
      then skip this for iteration
    if (word fit in charpool) {
      result = result + word;
      update charpool to exclude characters in word;
      findAnagram(charpool, word's index);

      // as with any backtracking algorithm, we have to restore global states
      restore pool;
      restore result;
    }
  }
}

My C++ implementation of subproblem #1 where the character set contains only lowercase 'a'..'z': http://ideone.com/vf7Rpl .

like image 187
Can Nguyen Avatar answered Nov 07 '25 01:11

Can Nguyen



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!