
Encode Python list in order to use re module

I have a text file in Spanish containing thousands of words, some of them with accents. I'm using the re module to extract some words, but in the resulting list some words are truncated.

This is the first part of my code:

projectsinline = open('projectsinline.txt', 'r')

for lines in projectsinline:

    pattern = r'\b[a-zA-Z]{6}\b'
    words = re.findall(pattern, lines)

    print words

This is an example of the output:

['creaci', 'Estado', 'relaci', 'Regula', 'estado', 'comisi', 'delito']

It should be like this:

['creación', 'Estado', 'relación', 'Regula', 'estado', 'comisión', 'delito']

I found this answer: Encode Python list to UTF-8, but it wasn't helpful, because my text comes from a text file, and I couldn't get this code to work:

import re
import codecs
import sys

sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)

projectsinline = open('projectsinline.txt', 'r')

for lines in projectsinline:

    pattern = ur'\b[a-zA-Z]{6}\b'
    unicode_pattern = re.compile(pattern, re.UNICODE)
    result = unicode_pattern.findall(lines)
    print result

Now the output skips words that have accents.

Any suggestions for solving the problem would be appreciated.

Thanks!

asked Nov 23 '25 21:11 by estebanpdl


1 Answer

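The pattern r'\b[a-zA-Z]{6}\b' picks runs of exactly 6 ASCII letters between word boundaries. Some of the words in your example are longer than that, but their accented letters are not in [a-zA-Z] and are not treated as word characters, so \b matches right before the accent. In 'creación', for instance, a word boundary is found between 'i' and 'ó', so 'creaci' is returned as if it were a 6-letter word and the rest is cut off.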
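The two behaviours described in the question can be reproduced in a minimal sketch. It assumes Python 3, where `re.ASCII` stands in for the byte-string matching of the original Python 2 code (an approximation, not the asker's exact setup); the sample string is made up from the words in the question.

```python
import re

text = "creación, Estado, relación"
pattern = r'\b[a-zA-Z]{6}\b'

# re.ASCII restricts \w and \b to ASCII, so "ó" counts as a non-word
# character and a word boundary appears right after "creaci".
ascii_matches = re.findall(pattern, text, flags=re.ASCII)

# Default Python 3 str matching is Unicode-aware: "ó" is a word character,
# so no boundary follows "creaci" and the accented words are skipped.
unicode_matches = re.findall(pattern, text)

print(ascii_matches)    # ['creaci', 'Estado', 'relaci']
print(unicode_matches)  # ['Estado']
```

The first result mirrors the truncated list from the first attempt. The second mirrors the "skips words that have accents" result from the attempt with re.UNICODE: the boundaries are now right, but [a-zA-Z] still cannot match the accented letters.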

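I would use \w instead if you want all words with 6 letters, and decode the file as UTF-8 so the pattern is applied to unicode text rather than raw bytes.

With the re.UNICODE flag set, \w will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database. Note that this means digits and underscores also count toward the 6 characters.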

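import re
import codecs

# Compile once, outside the loop. re.UNICODE makes \w and \b
# treat accented letters as word characters.
unicode_pattern = re.compile(r'\b\w{6}\b', re.UNICODE)

# codecs.open decodes each line to unicode, so the pattern sees
# whole characters instead of UTF-8 bytes.
with codecs.open('projectsinline.txt', 'r', encoding="utf-8") as f:
    for line in f:
        for word in unicode_pattern.findall(line):
            print word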

Example string:

creación, longstring, lación, Regula, estado, misión

Output:

lación
Regula
estado
misión
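For readers on Python 3, the same idea can be sketched as follows. This is an illustration under assumptions, not part of the original answer: the helper name `six_letter_words`, the `letters_only` option, and the extra `abc123` token in the sample are all hypothetical. In Python 3, `str` patterns are Unicode-aware by default, so no flag is needed.

```python
import re

# Compiled once; \w here matches Unicode letters, digits and underscore.
SIX_CHARS = re.compile(r'\b\w{6}\b')
# [^\W\d_] means "a word character that is neither a digit nor an
# underscore", which leaves only letters (accented ones included).
SIX_LETTERS = re.compile(r'\b[^\W\d_]{6}\b')

def six_letter_words(text, letters_only=False):
    """Return all 6-character words in text (hypothetical helper)."""
    pattern = SIX_LETTERS if letters_only else SIX_CHARS
    return pattern.findall(text)

sample = "creación, longstring, lación, Regula, estado, misión, abc123"
all_six = six_letter_words(sample)
letters_six = six_letter_words(sample, letters_only=True)

print(all_six)      # ['lación', 'Regula', 'estado', 'misión', 'abc123']
print(letters_six)  # ['lación', 'Regula', 'estado', 'misión']
```

To apply it to the file from the question, the lines could be read with `open('projectsinline.txt', encoding='utf-8')` and passed to the helper one at a time.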
answered Nov 26 '25 10:11 by midori


