Encode Python list in order to use re module

Question

I have a text file in Spanish with thousands of words, some of them accented. I'm using the re module to extract some of the words, but in the list I get back, some words are incomplete.

This is the first part of my code:

projectsinline = open('projectsinline.txt', 'r')

for lines in projectsinline:

    pattern = r'\b[a-zA-Z]{6}\b'
    words = re.findall(pattern, lines)

    print words

This is an example of the output:

['creaci', 'Estado', 'relaci', 'Regula', 'estado', 'comisi', 'delito']

It should be like this:

['creación', 'Estado', 'relación', 'Regula', 'estado', 'comisión', 'delito']

I found this answer: Encode Python list to UTF-8, but it wasn't helpful because my text comes from a text file, so I couldn't use its code directly. This is what I tried instead:

import re
import codecs
import sys

sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)

projectsinline = open('projectsinline.txt', 'r')

for lines in projectsinline:

    pattern = ur'\b[a-zA-Z]{6}\b'
    unicode_pattern = re.compile(pattern, re.UNICODE)
    result = unicode_pattern.findall(lines)
    print result

Now the output skips the words that have accents.

Any suggestions for solving the problem would be appreciated.

Thanks!

midori · Accepted Answer

Your pattern r'\b[a-zA-Z]{6}\b' picks out runs of exactly six ASCII letters. Some of the words in your example are longer than that, and they get cut off because the accented characters are not treated as word characters, so a word boundary (\b) matches right before them.
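Both outputs from the question can be reproduced without the file. This is only a minimal sketch: it assumes the file is UTF-8 encoded (the question does not say so), and the sample line is made up for illustration. On a byte string, the two bytes that encode ó are not word characters, so \b matches right after creaci. On decoded text, ó is a word character but is not in [a-zA-Z], so the whole word is skipped.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re

text = 'creación Estado'      # hypothetical sample line
raw = text.encode('utf-8')    # assumed encoding; bytes as read without decoding

# Byte string: the bytes of 'ó' are non-word bytes, so \b matches after 'creaci'.
byte_matches = re.findall(br'\b[a-zA-Z]{6}\b', raw)

# Decoded text: 'ó' is a word character, so there is no boundary after 'creaci'
# and the accented word is skipped entirely.
unicode_matches = re.findall(r'\b[a-zA-Z]{6}\b', text, re.UNICODE)

print(byte_matches)
print(unicode_matches)
```

The first list contains the truncated creaci plus Estado; the second contains only Estado.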

I would use \w instead if you want all words with 6 letters. With the re.UNICODE flag set, \w will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

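import re
import codecs

# Compile once, outside the loop. re.UNICODE makes \w and \b
# treat accented letters as word characters.
unicode_pattern = re.compile(r'\b\w{6}\b', re.UNICODE)

# codecs.open decodes each line from UTF-8 into a unicode string.
with codecs.open('projectsinline.txt', 'r', encoding="utf-8") as f:
    for line in f:
        for word in unicode_pattern.findall(line):
            print word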

Example string:

creación, longstring, lación, Regula, estado, misión

Output:

lación
Regula
estado
misión
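Because \w also accepts digits and underscores, a six-character token such as abc_12 would be returned as well. If only letters are wanted, one common idiom is the class [^\W\d_] (word characters minus digits and the underscore). The sketch below is an illustration rather than part of the original answer; it runs on an in-memory string, and the extra abc_12 token is hypothetical.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re

# The answer's example string plus one hypothetical token with digits.
text = 'creación, longstring, lación, Regula, estado, misión, abc_12'

# The answer's pattern: any six word characters (letters, digits, underscore).
word_chars = re.compile(r'\b\w{6}\b', re.UNICODE)

# Letters only: word characters that are neither digits nor underscore.
letters_only = re.compile(r'\b[^\W\d_]{6}\b', re.UNICODE)

with_digits = word_chars.findall(text)
only_letters = letters_only.findall(text)

for word in only_letters:
    print(word)
```

Here with_digits includes abc_12, while only_letters holds just the four six-letter words.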

Tags:

python

regex

encode

python-2.x

python-2.7

estebanpdl
