I am trying to syllabify Devanagari words:
धर्मक्षेत्रे -> धर् मक् षेत् रे (dharmakshetre -> dhar mak shet re)
wd.split('्')
I get this result:
['धर', 'मक', 'षेत', 'रे']
This is only partially correct.
Then I try another word: कुरुक्षेत्र -> कु रुक् षेत् रे (kurukshetre -> ku ruk shet re)
['कुरुक', 'षेत', 'रे']
The result is obviously wrong.
How do I extract the syllables effectively?
If you look at your strings character by character:
>>> import re
>>> data = "कुरुक्षेत्र"
>>> re.findall(".", data)
['क', 'ु', 'र', 'ु', 'क', '्', 'ष', 'े', 'त', '्', 'र']
And your other string:
>>> data = "धर्मक्षेत्रे"
>>> re.findall(".", data)
['ध', 'र', '्', 'म', 'क', '्', 'ष', 'े', 'त', '्', 'र', 'े']
So what you probably want is to split on signs such as '्'. Let's call them notation characters for now. If you print ord(data[2]) for the first notation character, you get 2381. Now probe around this value:
>>> for i in range(2350, 2400):
...     print(i, chr(i))
...
2350 म
2351 य
2352 र
2353 ऱ
2354 ल
2355 ळ
2356 ऴ
2357 व
2358 श
2359 ष
2360 स
2361 ह
2362 ऺ
2363 ऻ
2364 ़
2365 ऽ
2366 ा
2367 ि
2368 ी
2369 ु
2370 ू
2371 ृ
2372 ॄ
2373 ॅ
2374 ॆ
2375 े
2376 ै
2377 ॉ
2378 ॊ
2379 ो
2380 ौ
2381 ्
2382 ॎ
2383 ॏ
2384 ॐ
2385 ॑
2386 ॒
2387 ॓
2388 ॔
2389 ॕ
2390 ॖ
2391 ॗ
2392 क़
2393 ख़
2394 ग़
2395 ज़
2396 ड़
2397 ढ़
2398 फ़
2399 य़
We are mostly interested in the values between 2362 and 2391, so we build a string of those characters:
>>> split = ""
>>> for i in range(2362, 2392):
...     split += chr(i)
Next, we want to find every character together with its notation character, if it has one:
>>> re.findall(".[" + split + "]?", "धर्मक्षेत्रे")
['ध', 'र्', 'म', 'क्', 'षे', 'त्', 'रे']
>>> re.findall(".[" + split + "]?", "कुरुक्षेत्र")
['कु', 'रु', 'क्', 'षे', 'त्', 'र']
This should get you close to what you are probably looking for. If you need more complex handling, you will have to go with the link @OphirYoktan posted.
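To get from these units to the split asked for in the question (धर् मक् षेत् रे), one option is to glue every unit that ends in the virama onto the unit before it. The following is only a minimal sketch under that assumption: the names VIRAMA, SIGNS, UNIT and syllabify are made up for illustration, it keeps the same 2362..2391 range used above, and it does not handle words that begin with a conjunct or letters that carry more than one sign.

```python
import re

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA, code point 2381

# Same range as in the answer above: code points 2362..2391.
SIGNS = "".join(chr(i) for i in range(2362, 2392))

# One character, optionally followed by a single sign from SIGNS.
UNIT = re.compile(".[" + SIGNS + "]?")


def syllabify(word):
    """Merge each virama-terminated unit into the unit before it (assumed rule)."""
    syllables = []
    for unit in UNIT.findall(word):
        if unit.endswith(VIRAMA) and syllables:
            # A consonant with virama has no vowel of its own,
            # so treat it as the coda of the previous syllable.
            syllables[-1] += unit
        else:
            syllables.append(unit)
    return syllables


print(syllabify("धर्मक्षेत्रे"))
print(syllabify("कुरुक्षेत्र"))
```

With the two words from the question this should print ['धर्', 'मक्', 'षेत्', 'रे'] and ['कु', 'रुक्', 'षेत्', 'र'].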
Check out the unicodedata module.
>>> import unicodedata
>>> word = 'कुरुक्षेत्र'
Names assigned to each character:
>>> for ch in word:
        print(unicodedata.name(ch))
    
DEVANAGARI LETTER KA
DEVANAGARI VOWEL SIGN U
DEVANAGARI LETTER RA
DEVANAGARI VOWEL SIGN U
DEVANAGARI LETTER KA
DEVANAGARI SIGN VIRAMA
DEVANAGARI LETTER SSA
DEVANAGARI VOWEL SIGN E
DEVANAGARI LETTER TA
DEVANAGARI SIGN VIRAMA
DEVANAGARI LETTER RA
General category assigned to each character:
>>> for ch in word:
        print(unicodedata.category(ch))
    
Lo
Mn
Lo
Mn
Lo
Mn
Lo
Mn
Lo
Mn
Lo
FileFormat.info has a list of Unicode character categories.
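Names and categories can also be printed side by side, which makes it easier to spot signs that are not in category Mn. As a rough sketch (describe is a hypothetical helper name, not part of unicodedata): the vowel sign in "की" is reported as Mc, a spacing mark, rather than Mn, so a check on the leading "M" of the category covers both kinds of mark.

```python
import unicodedata


def describe(text):
    """Return (character, code point, category, name) for each character."""
    return [
        (
            ch,
            "U+{:04X}".format(ord(ch)),
            unicodedata.category(ch),
            unicodedata.name(ch, "<unnamed>"),  # fallback for unnamed code points
        )
        for ch in text
    ]


# KA followed by the vowel sign II, which is a spacing mark (Mc).
for row in describe("की"):
    print(*row)
```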
See if this is what you want to achieve:
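import unicodedata
def split_clusters(txt):
    """Generate grapheme clusters for the Devanagari text."""
    stop = '्'  # DEVANAGARI SIGN VIRAMA
    cluster = ''
    end = None  # previous character seen
    for char in txt:
        category = unicodedata.category(char)
        # Extend the current cluster with a letter that follows a virama,
        # or with any combining mark (categories Mn, Mc, Me).
        if (category == 'Lo' and end == stop) or category[0] == 'M':
            cluster = cluster + char
        else:
            # Anything else starts a new cluster; emit the finished one first.
            if cluster:
                yield cluster
            cluster = char
        end = char
    # Emit the last cluster, if any.
    if cluster:
        yield cluster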
Testing the function:
>>> list(split_clusters('धर्मक्षेत्रे'))
['ध', 'र्म', 'क्षे', 'त्रे']
>>> list(split_clusters('कुरुक्षेत्र'))
['कु', 'रु', 'क्षे', 'त्र']