I'm looking for an algorithm that will take a vector of strings v1 and return a similar vector of strings v2 where each string is less than x characters long and is unique. The strings in v1 may not be unique.
While I need to accept ASCII in v1, I'd prefer to only insert alphanumeric characters ([A-Za-z0-9]) when insertion of new characters is required.
Obviously there are three caveats here:
For some values of v1 and x, there is no possible unique v2. For example, when v1 has 37 elements and x == 1.
"Similar" as specified in the question is subjective. The strings will be user facing, and presumably short natural language phrases (e.g. "number of colours"). I want a human to be able to map the original to the shortened string as easily as possible. This probably means taking advantage of heuristics such as disemvoweling. Because there is probably no objective measure of my similarness construct (string distance probably won't be the most useful here, although it might) my judgement on what is good will be arbitrary. The method should be suitable for English - other languages are irrelevant.
Obviously this is a (programming) language-agnostic problem, but I'd look favourably on an implementation in python (because I find its string processing language straight-forward).
a few notes / pointers about doing this in python.
v1 is already sorted (e.g. name and enemy will collide after disenvoweling).translate(None, "aeiouyAEIOUY") on the string.["aaa", "aaA", "aAa", "aAA"] etc. and if this is not enough "incrementing" characters starting from the end, until a non-colliding identifier is found, eg. ["aa"]*7 would become ["aa", "aA", "Aa", "AA", "ab", "aB", "Ab"]
Sketch -
Develop a list of functions that reduce the size of an english string. Order the functions from least to most obscuring.
For each string in v1 repeatedly apply an obscuring function until it can no longer reduce the size of the string and then move on to the next function.
When desired size x is achieved, verify reduced string is unique with respect to strings already in v2. If so, add it to v2, if not, continue to apply obscuring functions.
Following are some ideas for size reducing functions subjectively ordered from least to most obscuring. (The random selections are intended to increase the probability of the reduced string being unique.)
(Note: the last two functions would require access to the initial unaltered string, and the correspondences between the unaltered and altered words.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With