If i have a set of similar words such as:
\bigoplus
\bigotimes
\bigskip
\bigsqcup
\biguplus
\bigvee
\bigwedge
...
\zebra
\zeta
i would like to find the shortest unique set of letters that would characterize each word uniquely i.e.
\bigop:
\bigoplus
\bigot:
\bigotimes
\bigsk:
\bigskip
EDIT: notice the unique sequence identifier always starts from the begining of the word. I writting an app that gives snippet suggestions when typing. So in general users will start typing from the start of the word
and so on, the sequence needs only be as long as is enough to characterize a word uniquely. EDIT: but needs to start from the begining of the word. The characterization always begins from the beginning of the word. My thoughts: i was thinking of sorting the words, and grouping based on the fist alphabetical letter, then probably use a longest common subsequence algorithm to find the longest subsequence in common, take its length and use length+1 chars for that unique substring, but im stuck since the algorithms i know for longest subsequence will usually only take two parameters at a time, and i may have more than two words in each group starting with a particular alphabetical letter. Im i solving an already solved probelem? google was no help.
I'm assuming you want to find the prefixes that uniquely identify the strings, because if you could pick any subsequence, then for example om would be enough to identify \bigotimes in your example.
You can make use of the fact that for a given word, the word with the longest common prefix will be adjacent to it in lexicographical order. Since your dictionary seems to be sorted already, you can figure out the solution for every word by finding the longest prefix that disambiguates it from both its neighbors.
Example:
>>> lst = r"""
... \bigoplus
... \bigotimes
... \bigskip
... \bigsqcup
... \biguplus
... \bigvee
... \bigwedge
... """.split()
>>> lst.sort() # necessary if lst is not already sorted
>>> lst = [""] + lst + [""]
>>> def cp(x): return len(os.path.commonprefix(x))
...
>>> { lst[i]: 1 + max(cp(lst[i-1:i+1]), cp(lst[i:i+2])) for i in range(1,len(lst)-1) }
{'\\bigvee': 5,
'\\bigsqcup': 6,
'\\biguplus': 5,
'\\bigwedge': 5,
'\\bigotimes': 6,
'\\bigoplus': 6,
'\\bigskip': 6}
The numbers indicate how long the minimal uniquely identifying prefix of a word is.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With