I have a text file in Spanish with thousands of words, some of them with accents. I'm using the re module to extract some words, but in the list I get back, some words are incomplete.
This is the first part of my code:
projectsinline = open('projectsinline.txt', 'r')
for lines in projectsinline:
    pattern = r'\b[a-zA-Z]{6}\b'
    words = re.findall(pattern, lines)
    print words
This is an example of the output:
['creaci', 'Estado', 'relaci', 'Regula', 'estado', 'comisi', 'delito']
It should be like this:
['creación', 'Estado', 'relación', 'Regula', 'estado', 'comisión', 'delito']
I found this answer: Encode Python list to UTF-8, but it wasn't helpful because my text comes from a text file. So I tried this code instead:
import re
import codecs
import sys
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
projectsinline = open('projectsinline.txt', 'r')
for lines in projectsinline:
    pattern = ur'\b[a-zA-Z]{6}\b'
    unicode_pattern = re.compile(pattern, re.UNICODE)
    result = unicode_pattern.findall(lines)
    print result
Now the output skips words that have accents.
Any suggestions to solve the problem would be appreciated.
Thanks!
The pattern r'\b[a-zA-Z]{6}\b' picks runs of exactly 6 ASCII letters.
Some of the words in your example have more letters, and those letters get cut off: accented characters such as ó are not in [a-zA-Z] and are not treated as word characters, so \b finds a word boundary right before them. That is why creación comes back as creaci.
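The two symptoms from the question can be reproduced in a minimal sketch. It assumes Python 3 (the code in this thread is Python 2), where a bytes pattern behaves like the Python 2 byte-string case: \b and [a-zA-Z] only know ASCII. The sample string is made up for illustration.

```python
import re

# Hypothetical sample text, encoded as UTF-8 bytes like raw file contents.
raw = "creación, Estado, delito".encode("utf-8")

# In a bytes pattern only ASCII letters, digits and "_" are word characters,
# so \b matches right before the first byte of "ó" and the word is cut.
truncated = re.findall(rb'\b[a-zA-Z]{6}\b', raw)
print(truncated)

# After decoding, "ó" is a word character: no boundary follows "creaci",
# so the accented word is skipped instead of being cut.
skipped = re.findall(r'\b[a-zA-Z]{6}\b', raw.decode("utf-8"))
print(skipped)
```

The first call shows the truncation from the first attempt, the second shows why the accented words disappeared once the UNICODE flag was used with the same [a-zA-Z] class.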
I would use \w instead if you want all words with 6 letters.
With the re.UNICODE flag set, \w will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
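import re
import codecs

# Compile once; re.UNICODE makes \w and \b Unicode-aware in Python 2.
unicode_pattern = re.compile(r'\b\w{6}\b', re.UNICODE)

# codecs.open decodes each line to unicode, so an accented letter is one character.
with codecs.open('projectsinline.txt', 'r', encoding="utf-8") as f:
    for line in f:
        result = unicode_pattern.findall(line)
        for word in result:
            print word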
Example string:
creación, longstring, lación, Regula, estado, misión
Output:
lación
Regula
estado
misión
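Because \w also accepts digits and the underscore, six-character tokens such as abc123 would be returned as well. If only letters are wanted, one possible variant is the class [^\W\d_], which reads as "a word character that is neither a digit nor an underscore". This is a sketch under the assumption of Python 3, where str patterns are Unicode-aware by default; the sample string is made up for illustration.

```python
import re

# Hypothetical sample text mixing accented words with digit/underscore tokens.
text = "creación, abc123, lación, a_b_cd, estado"

# \w{6} accepts any six word characters, including digits and "_".
any_word_chars = re.findall(r'\b\w{6}\b', text)
print(any_word_chars)

# [^\W\d_] keeps only the letters among the word characters.
letters_only = re.findall(r'\b[^\W\d_]{6}\b', text)
print(letters_only)
```

Which of the two is appropriate depends on whether the input can contain identifiers or numbers that happen to be six characters long.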