I have a file in UTF-8, where some lines contain the U+2028 Line Separator character (http://www.fileformat.info/info/unicode/char/2028/index.htm). I don't want it to be treated as a line break when I read lines from the file. Is there a way to exclude it from separators when I iterate over the file or use readlines()? (Besides reading the entire file into a string and then splitting by \n.) Thank you!
I can't duplicate this behaviour in python 2.5, 2.6 or 3.0 on mac os x - U+2028 is always treated as non-endline. Could you go into more detail about where you see this error?
That said, here is a subclass of the "file" class that might do what you want:
#/usr/bin/python
# -*- coding: utf-8 -*-
class MyFile (file):
def __init__(self, *arg, **kwarg):
file.__init__(self, *arg, **kwarg)
self.EOF = False
def next(self, catchEOF = False):
if self.EOF:
raise StopIteration("End of file")
try:
nextLine= file.next(self)
except StopIteration:
self.EOF = True
if not catchEOF:
raise
return ""
if nextLine.decode("utf8")[-1] == u'\u2028':
return nextLine+self.next(catchEOF = True)
else:
return nextLine
A = MyFile("someUnicode.txt")
for line in A:
print line.strip("\n").decode("utf8")
I couldn't reproduce that behavior but here's a naive solution that just merges readline results until they don't end with U+2028.
#!/usr/bin/env python
from __future__ import with_statement
def my_readlines(f):
buf = u""
for line in f.readlines():
uline = line.decode('utf8')
buf += uline
if uline[-1] != u'\u2028':
yield buf
buf = u""
if buf:
yield buf
with open("in.txt", "rb") as fin:
for l in my_readlines(fin):
print l
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With