Unique letters in alphabet

Question

I want to write a simple text classifier that identifies languages by their unique letters, just as an experiment.
For example, I have an alphabet for each language, stored as a dict of sets with the following keys: ['ru', 'uk', 'pl', 'en', 'de', 'be', ...]. The letters "ę" and "ś" are unique to Polish, for instance, while English has no unique letters. In other words, for each language I need to find all letters that don't belong to any other language. I did it like this (simple example):

alphabets = {'it': {'a', 'b', 'c', 'd', 'e', 'à', 'ì'},
             'en': {'a', 'b', 'c', 'd', 'e'},
             'pl': {'a', 'b', 'c', 'd', 'e', 'ę', 'ś'}}

def union_others(except_lang):
    # Union of every alphabet except the one for except_lang.
    res = set()
    for lang in alphabets:
        if lang != except_lang:
            res = res | alphabets[lang]
    return res

unique = {lang: set() for lang in alphabets}
for lang in alphabets:
    unique[lang] = alphabets[lang] - union_others(lang)

print(unique['pl'])

I get the following output: {'ę', 'ś'}

Is there a simple way (without a loop) to get the union of the sets of all languages except the current one, instead of using the union_others(lang) function?

Mad Physicist · Accepted Answer

You probably can't get away from a loop entirely, but you can make it slightly more efficient using short circuiting. In most cases, languages won't have any truly unique characters, so you can break out of your loop early, effectively avoiding the construction of the full set of other languages every time:

def delta(lang):
    d = set(alphabets[lang]) # make a copy
    for key, alphabet in alphabets.items():
        if key == lang:
            continue
        d -= alphabet
        if not d:
            break
    return d

unique = {lang: delta(lang) for lang in alphabets}

This will also be a bit faster because the set you are subtracting from shrinks almost immediately, which speeds up each subsequent difference operation.

Now, if you had some a priori knowledge about the similarity of languages, you could use it to pre-sort the alphabets for each language so that its unique set would be reduced to a minimum almost immediately.
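
As a rough sketch of that pre-sorting idea: the version below assumes a hypothetical similar_first mapping (not part of the original question) that lists, for each language, the other languages with the most similar ones first. The function name delta_ordered and the ordering values are illustrative assumptions only.

```python
alphabets = {'it': {'a', 'b', 'c', 'd', 'e', 'à', 'ì'},
             'en': {'a', 'b', 'c', 'd', 'e'},
             'pl': {'a', 'b', 'c', 'd', 'e', 'ę', 'ś'}}

# Hypothetical a priori knowledge: for each language, the other
# languages ordered from most similar to least similar.
similar_first = {'it': ['en', 'pl'],
                 'en': ['it', 'pl'],
                 'pl': ['en', 'it']}

def delta_ordered(lang, alphabets, order):
    remaining = set(alphabets[lang])  # copy, so the input is not mutated
    for other in order[lang]:
        remaining -= alphabets[other]
        if not remaining:
            # Nothing unique is left, so the other alphabets can be skipped.
            break
    return remaining

unique = {lang: delta_ordered(lang, alphabets, similar_first)
          for lang in alphabets}
```

With this ordering, 'en' is emptied after subtracting a single alphabet, so the loop stops before looking at the second one.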

Işık Kaplan · Answer

Use a comprehension (a set comprehension here; the commented-out line is the list version).

alphabets = {'it': {'a', 'b', 'c', 'd', 'e', 'à', 'ì'},
             'en': {'a', 'b', 'c', 'd', 'e'},
             'pl': {'a', 'b', 'c', 'd', 'e', 'ę', 'ś'}}

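def f(k, d):
    # Keep a letter only if it is absent from every other language's alphabet.
    #return [x for x in d[k] if all(x not in v for key, v in d.items() if key != k)]
    return {x for x in d[k] if all(x not in v for key, v in d.items() if key != k)}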


print(f('pl', alphabets))
print(f('en', alphabets))
print(f('it', alphabets))
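
For comparison, here is a minimal sketch that builds the union of the other alphabets without an explicit for statement, using set().union(*iterables). The helper name unique_letters is an assumption for illustration. The generator expression still iterates internally, so this only hides the loop and does not remove it.

```python
alphabets = {'it': {'a', 'b', 'c', 'd', 'e', 'à', 'ì'},
             'en': {'a', 'b', 'c', 'd', 'e'},
             'pl': {'a', 'b', 'c', 'd', 'e', 'ę', 'ś'}}

def unique_letters(lang, alphabets):
    # All alphabets except the one for lang.
    others = (letters for key, letters in alphabets.items() if key != lang)
    # set().union(*others) merges them into one set in a single call.
    return alphabets[lang] - set().union(*others)

unique = {lang: unique_letters(lang, alphabets) for lang in alphabets}
print(unique['pl'])
```

Unlike the short-circuiting approach, this always builds the full union, so it trades some speed for brevity.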

Tags:

python

Moris Huxley
