I want to write a simple text classifier that identifies a language by its unique letters, just as an experiment.
I have the alphabet of each language in a dict of sets with the following keys: ['ru', 'uk', 'pl', 'en', 'de', 'be', ...].
For example, the unique Polish letters are "ę" and "ś", while English has no unique ones.
In effect, for each language I need to find all letters that don't belong to any other language. I did it like this (simplified example):
alphabets = {'it': {'a', 'b', 'c', 'd', 'e', 'à', 'ì'},
             'en': {'a', 'b', 'c', 'd', 'e'},
             'pl': {'a', 'b', 'c', 'd', 'e', 'ę', 'ś'}}

def union_others(except_lang):
    res = set()
    for lang in alphabets:
        if lang != except_lang:
            res = res | alphabets[lang]
    return res

unique = {lang: set() for lang in alphabets}
for lang in alphabets:
    unique[lang] = alphabets[lang] - union_others(lang)
print(unique['pl'])
I get the following output: {'ę', 'ś'}
Is there a simple way (without a loop) to get the union of the sets of all languages except the current one, instead of using the union_others(lang) function?
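For reference, the explicit loop in union_others can be folded into a single call, since set.union accepts any number of iterables. The following is only a minimal sketch on the sample data above; it still iterates internally, so it shortens the code rather than removing the work.

```python
alphabets = {
    'it': {'a', 'b', 'c', 'd', 'e', 'à', 'ì'},
    'en': {'a', 'b', 'c', 'd', 'e'},
    'pl': {'a', 'b', 'c', 'd', 'e', 'ę', 'ś'},
}

def union_others(except_lang):
    # set().union(*iterables) merges every alphabet except the excluded one.
    # Starting from an empty set keeps this safe even if there are no others.
    return set().union(
        *(letters for lang, letters in alphabets.items() if lang != except_lang)
    )

# Letters of each language that appear in no other language.
unique = {lang: letters - union_others(lang) for lang, letters in alphabets.items()}
print(unique['pl'] == {'ę', 'ś'})  # True
```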
You probably can't get away from a loop entirely, but you can make it slightly more efficient by short-circuiting. In most cases a language won't have any truly unique characters, so you can break out of the loop as soon as the candidate set is empty, which avoids building the full union of the other languages every time:
def delta(lang):
    d = set(alphabets[lang])  # make a copy
    for key, alphabet in alphabets.items():
        if key == lang:
            continue
        d -= alphabet
        if not d:
            break
    return d

unique = {lang: delta(lang) for lang in alphabets}
This will also be a bit faster because the set you are subtracting from shrinks almost immediately, which speeds up each subsequent difference operation.
If you had some a priori knowledge about the similarity of languages, you could use it to pre-sort the other alphabets for each language, so that its unique set is reduced to a minimum almost immediately.
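The pre-sorting idea could look roughly like the sketch below. The `similar_first` table and the `delta_ordered` name are hypothetical and the orderings are made up for illustration; in practice they would come from whatever prior knowledge about the languages is available. The order never changes the result, only how soon the loop can stop.

```python
alphabets = {
    'it': {'a', 'b', 'c', 'd', 'e', 'à', 'ì'},
    'en': {'a', 'b', 'c', 'd', 'e'},
    'pl': {'a', 'b', 'c', 'd', 'e', 'ę', 'ś'},
}

# Hypothetical prior knowledge: for each language, the other languages
# ordered from (assumed) most similar to least similar.
similar_first = {
    'it': ['en', 'pl'],
    'en': ['it', 'pl'],
    'pl': ['en', 'it'],
}

def delta_ordered(lang):
    d = set(alphabets[lang])  # copy, so the original alphabet is untouched
    for other in similar_first[lang]:
        d -= alphabets[other]  # the most similar alphabets remove the most letters
        if not d:
            break  # nothing unique left, no need to check the rest
    return d

unique = {lang: delta_ordered(lang) for lang in alphabets}
print(unique['en'] == set())  # True
```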
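Use a comprehension that keeps a letter only if no other language's alphabet contains it.

alphabets = {'it': {'a', 'b', 'c', 'd', 'e', 'à', 'ì'},
             'en': {'a', 'b', 'c', 'd', 'e'},
             'pl': {'a', 'b', 'c', 'd', 'e', 'ę', 'ś'}}

def f(k, d):
    # Skip the language's own alphabet (key != k) and require that the letter
    # is missing from all of the others, not just from any one of them.
    # List variant:
    # return [x for x in d[k] if all(x not in v for key, v in d.items() if key != k)]
    return {x for x in d[k] if all(x not in v for key, v in d.items() if key != k)}

print(f('pl', alphabets))
print(f('en', alphabets))
print(f('it', alphabets))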
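Another option, offered only as a sketch, is to count in how many alphabets each letter occurs and keep the letters whose count is exactly 1. This swaps the per-language set differences for a single counting pass using collections.Counter from the standard library.

```python
from collections import Counter

alphabets = {
    'it': {'a', 'b', 'c', 'd', 'e', 'à', 'ì'},
    'en': {'a', 'b', 'c', 'd', 'e'},
    'pl': {'a', 'b', 'c', 'd', 'e', 'ę', 'ś'},
}

# Each alphabet is a set, so a letter contributes at most once per language;
# counts[letter] is therefore the number of languages that use that letter.
counts = Counter(letter for letters in alphabets.values() for letter in letters)

# A letter is unique to a language when only one alphabet contains it.
unique = {
    lang: {letter for letter in letters if counts[letter] == 1}
    for lang, letters in alphabets.items()
}
print(unique['it'] == {'à', 'ì'})  # True
```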