Unique letters in alphabet

Tags:

python

I want to write a simple text classifier that identifies a language by its unique letters, just as an experiment.
I have the alphabet of each language in a dict of sets with keys such as ['ru', 'uk', 'pl', 'en', 'de', 'be', ...]. For example, the unique Polish letters are "ę" and "ś", while English has no unique ones. In other words, for each language I need to find all letters that don't belong to any other language. I did it like this (simplified example):

alphabets = {'it': {'a', 'b', 'c', 'd', 'e', 'à', 'ì'},
             'en': {'a', 'b', 'c', 'd', 'e'},
             'pl': {'a', 'b', 'c', 'd', 'e', 'ę', 'ś'}}

def union_others(except_lang):
    res = set()
    for lang in alphabets:
        if lang != except_lang:
            res = res | alphabets[lang]
    return res

unique = {lang: set() for lang in alphabets}
for lang in alphabets:
    unique[lang] = alphabets[lang] - union_others(lang)

print(unique['pl'])

I get the following output: {'ę', 'ś'}

Is there a simple way (without a loop) to get the union of all the language sets except the current one, instead of using the union_others(lang) function?

Moris Huxley asked Jan 18 '26 17:01



2 Answers

You probably can't get away from a loop entirely, but you can make it slightly more efficient by short-circuiting. In most cases a language won't have any truly unique characters, so you can break out of the loop early and avoid constructing the full union of the other languages every time:

def delta(lang):
    d = set(alphabets[lang]) # make a copy
    for key, alphabet in alphabets.items():
        if key == lang:
            continue
        d -= alphabet
        if not d:
            break
    return d

unique = {lang: delta(lang) for lang in alphabets}


This will also be a bit faster because the set you are subtracting from shrinks almost immediately, which speeds up each subsequent difference operation.
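For comparison, the union of all other alphabets can be written without an explicit for statement by passing a generator to set().union(). This is only a sketch: the loop still happens inside the call, and the full union is built every time with no early exit.

```python
alphabets = {
    'it': {'a', 'b', 'c', 'd', 'e', 'à', 'ì'},
    'en': {'a', 'b', 'c', 'd', 'e'},
    'pl': {'a', 'b', 'c', 'd', 'e', 'ę', 'ś'},
}

def union_others(except_lang):
    # set().union(*iterables) merges any number of sets in one call;
    # the generator skips the language being examined.
    return set().union(*(letters for lang, letters in alphabets.items()
                         if lang != except_lang))

# Letters of each language that appear in no other alphabet.
unique = {lang: letters - union_others(lang)
          for lang, letters in alphabets.items()}

print(unique['pl'])
```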

Now, if you had some a priori knowledge about the similarity of the languages, you could use it to pre-sort the other alphabets for each language so that its candidate set of unique letters is reduced to a minimum almost immediately.
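A minimal sketch of that ordering idea follows. The name delta_sorted is hypothetical, and the overlap size computed here is only a stand-in for real prior knowledge: measuring it costs a pass over every alphabet, so in practice the order would have to be known or computed once in advance for this to pay off.

```python
alphabets = {
    'it': {'a', 'b', 'c', 'd', 'e', 'à', 'ì'},
    'en': {'a', 'b', 'c', 'd', 'e'},
    'pl': {'a', 'b', 'c', 'd', 'e', 'ę', 'ś'},
}

def delta_sorted(lang):
    d = set(alphabets[lang])  # make a copy
    others = [a for key, a in alphabets.items() if key != lang]
    # Assumption: subtracting the most similar alphabets first
    # removes the most letters per step.
    others.sort(key=lambda a: len(d & a), reverse=True)
    for alphabet in others:
        d -= alphabet
        if not d:
            break  # nothing unique is left, stop early
    return d

unique = {lang: delta_sorted(lang) for lang in alphabets}
print(unique['it'])
```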

Mad Physicist answered Jan 21 '26 07:01



You can use a set comprehension (or a list comprehension, shown commented out):

alphabets = {'it': {'a', 'b', 'c', 'd', 'e', 'à', 'ì'},
             'en': {'a', 'b', 'c', 'd', 'e'},
             'pl': {'a', 'b', 'c', 'd', 'e', 'ę', 'ś'}}

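def f(k, d):
    # A letter is unique to language k only if no other alphabet contains it,
    # so test all() over the other languages rather than any().
    #return [x for x in d[k] if all(x not in v for key, v in d.items() if key != k)]
    return {x for x in d[k] if all(x not in v for key, v in d.items() if key != k)}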


print(f('pl', alphabets))
print(f('en', alphabets))
print(f('it', alphabets))
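
As an alternative sketch, letters can be counted across all alphabets in a single pass: a letter that occurs in exactly one alphabet is unique to it. The function name unique_letters is hypothetical, and the approach relies on each alphabet being a set, so a letter is counted at most once per language.

```python
from collections import Counter
from itertools import chain

alphabets = {
    'it': {'a', 'b', 'c', 'd', 'e', 'à', 'ì'},
    'en': {'a', 'b', 'c', 'd', 'e'},
    'pl': {'a', 'b', 'c', 'd', 'e', 'ę', 'ś'},
}

def unique_letters(d):
    # Number of alphabets in which each letter appears.
    counts = Counter(chain.from_iterable(d.values()))
    # Keep only the letters that appear in exactly one alphabet.
    return {lang: {x for x in letters if counts[x] == 1}
            for lang, letters in d.items()}

unique = unique_letters(alphabets)
print(unique['pl'])
```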
Işık Kaplan answered Jan 21 '26 08:01
