
Extract only text in Hindi from a file containing both Hindi and English

I have a file containing lines like

 ted    1-1 1.0 politicians do not have permission to do what needs to be done.

 राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह करने कि अनुमति नहीं है.

I have to write a program that reads the file line by line and writes only the Hindi part to an output file. The first word of each entry indicates the source of the two sentences that follow, and those two sentences are translations of each other. Basically, I am trying to create a parallel corpus out of this file.

Pritesh Ranjan asked Jan 23 '26 14:01



1 Answer

You can do this by checking each character's Unicode code point: Devanagari, the script used for Hindi, occupies the range U+0900 to U+097F.

import codecs

def detect_language(character):
    # Devanagari (the script used for Hindi) occupies U+0900 to U+097F.
    maxchar = max(character)
    if u'\u0900' <= maxchar <= u'\u097f':
        return 'hindi'
    return 'other'

with codecs.open('letter.txt', encoding='utf-8') as f:
    text = f.read()  # renamed from "input" to avoid shadowing the built-in

for ch in text:
    language = detect_language(ch)
    if language == 'hindi':
        # Hindi character: write it to another file here instead of printing
        print(ch, end="\t")
        print(language)
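
For the line-by-line output the question asks for, the same range check can be applied to whole lines rather than single characters. Below is a minimal sketch, assuming a line counts as Hindi when at least half of its letters are Devanagari. The names hindi_lines and extract_hindi, the 0.5 threshold, and the file paths are illustrative assumptions, not something given in the question.

```python
DEVANAGARI_START = "\u0900"
DEVANAGARI_END = "\u097f"


def is_devanagari(ch):
    """Return True if ch lies in the Devanagari block (U+0900 to U+097F)."""
    return DEVANAGARI_START <= ch <= DEVANAGARI_END


def hindi_lines(lines, threshold=0.5):
    """Yield stripped lines whose letters are mostly Devanagari.

    threshold is an assumed cut-off: the minimum share of Devanagari
    characters among the letters of a line.
    """
    for line in lines:
        text = line.strip()
        # Vowel signs (matras) are combining marks, not "alpha" characters,
        # so Devanagari characters are counted explicitly.
        letters = [ch for ch in text if ch.isalpha() or is_devanagari(ch)]
        if not letters:
            continue  # blank line, or only digits and punctuation
        share = sum(is_devanagari(ch) for ch in letters) / len(letters)
        if share >= threshold:
            yield text


def extract_hindi(src_path, dst_path):
    """Copy only the Hindi lines of src_path into dst_path (both UTF-8)."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in hindi_lines(src):
            dst.write(line + "\n")
```

Calling extract_hindi('letter.txt', 'hindi.txt') would then write only the Hindi sentences to hindi.txt (the output file name is hypothetical). Lowering the threshold lets through lines that mix in more English words.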

Hope this helps

Rehan Shikkalgar answered Jan 26 '26 06:01



