Text segmentation: Algorithm to match input with the longest words from the dictionary

Question

I need to split a string into words such that each word comes from a dictionary. Also make sure that longest possible word from the left is chosen. Hence

thisisinsane => this is insane (correct as longest possible word from left)
thisisinsane => this is in sane(wrong) 
Assuming 'this', 'is', 'in', 'insane' are all words in the dictionary.

I managed to solved this problem by traversing from the end of the string to the beginning matching longest word possible. But problem started cropping us for problems like these..

shareasale => share as ale(wrong as 'ale' not in the dictionary)
shareasale => share a sale(correct)
Assuming 'share', 'a', 'sale' are all words in the dictionary unlike 'ale'.

I tried to solve this problem by removing the valid segments found before encountering the error i.e.

shareasale => 'share' and 'as' (error = 'ale')

and removing them once from the dictionary and then solve the problem. So

shareasale => no valid segments when share is removed
shareasale => share a sale (when 'as' is removed from the dictionary.

Thus I managed to solve this problem too. But then I am unable to solve this

asignas => 'as' ( error = 'ignas')

My solution will then remove 'as' from the dictionary and try to solve it

asignas => 'a' 'sign' (error = 'as')

Because in the new recursive call 'as' has been removed from the dictionary. The function I wrote is in this link. I hope someone can go through it and help me find a better algorithm to solve this else suggest modification to my existing algorithm.

Jakub Hampl · Accepted Answer

Essentially your problem is a tree problem, where at every level all the words that form a prefix of the tree form branches. The branch that leaves no part of the string is a correct solution.

                      thisisinsane
                          |
                          |
                     (this)isinsane
                     /            \
                    /              \
          (this,i)sinsane         (this,is)insane
              /                     /           \
             /                     /             \
  (this,i,sin)ane          (this,is,in)sane    (this,is,insane)
                                /
                               /
                       (this,is,in,sane)

So in this example there are two solutions, but we want to select the solution using the longest words, that is we want to explore the tree from the right using a depth-first-search strategy.

So our algorithm should:

Sort the dictionary by descending length.
Find all prefixes of the current string. If there are none, return False.
Set prefix to the longest unexplored prefix.
Remove it from the string. If the string is empty, we found a solution, return a list of all prefixes.
Recurse to 2.
This branch has failed, return False.

A sample implementation of this solution:

def segment(string,wset):
    """Segments a string into words prefering longer words givens
    a dictionary wset."""
    # Sort wset in decreasing string order
    wset.sort(key=len, reverse=True)
    result = tokenize(string, wset, "")
    if result:
        result.pop() # Remove the empty string token
        result.reverse() # Put the list into correct order
        return result
    else:
        raise Exception("No possible segmentation!")

def tokenize(string, wset, token):
    """Returns either false if the string can't be segmented by 
    the current wset or a list of words that segment the string
    in reverse order."""
    # Are we done yet?
    if string == "":
        return [token]
    # Find all possible prefixes
    for pref in wset:
        if string.startswith(pref):
            res = tokenize(string.replace(pref, '', 1), wset, pref)
            if res:
                res.append(token)
                return res
    # Not possible
    return False

print segment("thisisinsane", ['this', 'is', 'in', 'insane']) # this is insane
print segment("shareasale", ['share', 'a', 'sale', 'as'])     # share a sale
print segment("asignas", ['as', 'sign', 'a'])                 # a sign as

Text segmentation: Algorithm to match input with the longest words from the dictionary

Tags:

python

algorithm

text-segmentation

coding_pleasures

1 Answers

Jakub Hampl

Recent Activity

Donate For Us

Text segmentation: Algorithm to match input with the longest words from the dictionary

Tags:

python

algorithm

text-segmentation

coding_pleasures

1 Answers

Jakub Hampl

Related questions

Recent Activity

Donate For Us