How would I look for the shortest unique subsequence from a set of words in Python?

Question

If I have a set of similar words such as:

\bigoplus
\bigotimes
\bigskip
\bigsqcup
\biguplus
\bigvee
\bigwedge
...
\zebra
\zeta

I would like to find the shortest sequence of letters that characterizes each word uniquely, i.e.:

\bigop:
     \bigoplus
\bigot:
     \bigotimes
\bigsk:
     \bigskip

EDIT: Notice that the unique identifying sequence always starts from the beginning of the word. I am writing an app that gives snippet suggestions while typing, so in general users will start typing from the start of the word.

And so on; the sequence only needs to be long enough to characterize a word uniquely. EDIT: but it needs to start from the beginning of the word.

My thoughts: I was thinking of sorting the words and grouping them by their first letter, then probably using a longest common subsequence algorithm to find the longest subsequence in common, taking its length, and using length + 1 characters for the unique substring. But I'm stuck, since the algorithms I know for longest common subsequence usually take only two parameters at a time, and I may have more than two words in each group starting with a particular letter. Am I solving an already solved problem? Google was no help.

Niklas B. · Accepted Answer

I'm assuming you want to find the prefixes that uniquely identify the strings, because if you could pick any subsequence, then for example om would be enough to identify \bigotimes in your example.
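
To make the difference between a subsequence and a prefix concrete, here is a small sketch that checks which of the example words contain om as a subsequence. The helper name is_subsequence is made up for this illustration; it is not part of the question or of any library.

```python
def is_subsequence(needle, word):
    """Return True if the characters of needle occur in word in order."""
    remaining = iter(word)
    # `ch in remaining` consumes the iterator up to the first match,
    # so each later character must be found further to the right.
    return all(ch in remaining for ch in needle)


words = r"\bigoplus \bigotimes \bigskip \bigsqcup \biguplus \bigvee \bigwedge".split()

# Words from the example list that contain "om" as a subsequence.
matches = [w for w in words if is_subsequence("om", w)]
print(matches)
```

Only \bigotimes ends up in matches, which is why an arbitrary subsequence would be a poor fit for suggestions that are typed from the start of the word.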

You can make use of the fact that, for a given word, the word that shares the longest common prefix with it will be adjacent to it in lexicographical order. Since your dictionary seems to be sorted already, you can figure out the solution for every word by finding the shortest prefix that disambiguates it from both its neighbors, which is one character longer than the longer of the two common prefixes.

Example:

>>> import os
>>> lst = r"""
... \bigoplus
... \bigotimes
... \bigskip
... \bigsqcup
... \biguplus
... \bigvee
... \bigwedge
... """.split()
>>> lst.sort()      # necessary if lst is not already sorted
>>> lst = [""] + lst + [""]
>>> def cp(x): return len(os.path.commonprefix(x))
... 
>>> { lst[i]: 1 + max(cp(lst[i-1:i+1]), cp(lst[i:i+2])) for i in range(1,len(lst)-1) }
{'\\bigvee': 5,
 '\\bigsqcup': 6,
 '\\biguplus': 5,
 '\\bigwedge': 5,
 '\\bigotimes': 6,
 '\\bigoplus': 6,
 '\\bigskip': 6}

The numbers indicate how long the minimal uniquely identifying prefix of a word is.
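
If you want the prefixes themselves rather than their lengths, the same idea can be wrapped in a function. This is only a sketch: the name shortest_unique_prefixes is made up here, and clamping to the word length is my own assumption for the case where one word is a prefix of another (for example \big next to \bigskip), in which case no strictly unique prefix exists and the whole word is returned.

```python
import os


def shortest_unique_prefixes(words):
    """Map each word to its shortest prefix that no other word starts with."""
    ordered = sorted(set(words))
    padded = [""] + ordered + [""]  # sentinels so every word has two neighbours
    result = {}
    for i in range(1, len(padded) - 1):
        word = padded[i]
        # Length of the prefix shared with the previous and with the next word.
        before = len(os.path.commonprefix([padded[i - 1], word]))
        after = len(os.path.commonprefix([word, padded[i + 1]]))
        # One extra character breaks the tie. Clamp to the word length in
        # case the word is itself a prefix of its neighbour (assumption:
        # fall back to the full word then).
        result[word] = word[: min(len(word), 1 + max(before, after))]
    return result


words = r"\bigoplus \bigotimes \bigskip \bigsqcup \biguplus \bigvee \bigwedge".split()
prefixes = shortest_unique_prefixes(words)
for word in sorted(prefixes):
    print(prefixes[word], "->", word)
```

For the list above this gives \bigop, \bigot and \bigsk for the first three words, matching the output asked for in the question. Sorting dominates the cost, so the whole thing is O(n log n) comparisons for n words.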

Tags: string, algorithm, python-3.3

mike
