I need to parse and output some data in a table-like format. The input is Unicode. Here is the test script:
#!/usr/bin/env python
s1 = u'abcd'
s2 = u'\u03b1\u03b2\u03b3\u03b4'
print '1234567890'
print '%5s' % s1
print '%5s' % s2
It works as expected for a simple call like test.py:
1234567890
 abcd
 αβγδ
But if I try to redirect the output to a file with test.py > a.txt, I get an error:
Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    print '%5s' % s2
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4: ordinal not in range(128)
If I convert the strings to UTF-8, like s2.encode('utf8'), the redirection works fine, but the data positions are broken:
1234567890
 abcd
αβγδ
How to force it to work properly in both cases?
It boils down to your output stream's encoding. In this particular case, since you're using print, the output file used is sys.stdout.
stdout not redirected

When you run Python in interactive mode, or when you don't redirect stdout to a file, Python uses an encoding based on the environment, namely the locale environment variables such as LC_CTYPE. For example, if you run your program like this:
$ LC_CTYPE='en_US' python test.py
...
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4: ordinal not in range(128)
it will use ANSI_X3.4-1968 for sys.stdout (see sys.stdout.encoding) and fail. However, if you use UTF-8 (as you obviously already do):
$ LC_CTYPE='en_US.UTF-8' python test.py
1234567890
 abcd
 αβγδ
you'll get the expected output.
stdout redirected to a file

When you redirect stdout to a file, Python does not try to detect the encoding from your locale; instead it checks another environment variable, PYTHONIOENCODING (check the source, initstdio() in Python/pylifecycle.c). For example, this will work as expected:
$ PYTHONIOENCODING=utf-8 python test.py >/tmp/output
since Python will use the UTF-8 encoding for the /tmp/output file.
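The effect of PYTHONIOENCODING is easy to verify from within Python itself. The following sketch (assuming subprocess is available and sys.executable points at a working interpreter) spawns a child interpreter with its stdout connected to a pipe, i.e. redirected away from a terminal:

```python
import os
import subprocess
import sys

# PYTHONIOENCODING controls the child interpreter's sys.stdout.encoding
# even when stdout is a pipe rather than a terminal.
env = dict(os.environ, PYTHONIOENCODING='utf-8')
out = subprocess.check_output(
    [sys.executable, '-c', 'import sys; sys.stdout.write(sys.stdout.encoding)'],
    env=env,
)

# The child reports a UTF-8 stdout regardless of the locale settings.
assert out.decode('ascii').lower().startswith('utf')
```

Without the PYTHONIOENCODING override, a Python 2 child with a piped stdout would typically report ASCII here, which is exactly the failure from the question.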
stdout encoding override

You can also manually re-open sys.stdout with the desired encoding (this approach comes up in other SO questions as well):
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
Now print will correctly output both str and unicode objects, since the underlying stream writer will convert them to UTF-8 on the fly.
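To see the writer's on-the-fly conversion in isolation, you can wrap an in-memory byte stream instead of sys.stdout. This is only a sketch: io.BytesIO stands in for the real byte-oriented stdout here.

```python
import codecs
import io

# io.BytesIO plays the role of a byte-oriented sys.stdout.
buf = io.BytesIO()
writer = codecs.getwriter('utf8')(buf)

# The writer accepts unicode text and encodes it before it hits the stream.
writer.write(u'\u03b1\u03b2\u03b3\u03b4')

# Each Greek letter became two UTF-8 bytes in the underlying stream.
assert buf.getvalue() == b'\xce\xb1\xce\xb2\xce\xb3\xce\xb4'
```

The same mechanism is what makes the wrapped sys.stdout accept unicode objects without an explicit encode call at every print.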
Of course, you can also manually encode each unicode string to a UTF-8 str prior to output:
print ('%5s' % s2).encode('utf8')
but that's tedious and error-prone.
For completeness: when opening files for writing with a specific encoding (like UTF-8) in Python 2, use either io.open or codecs.open, because, unlike the built-in open, they let you specify the encoding (this is covered in another SO question):
from codecs import open
myfile = open('filename', encoding='utf-8')
or:
from io import open
myfile = open('filename', encoding='utf-8')
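For instance, a round trip through a temporary file (a minimal sketch using tempfile; the file name out.txt is arbitrary) shows that such a stream encodes and decodes transparently:

```python
import io
import os
import tempfile

# io.open with an explicit encoding accepts unicode text directly.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\u03b1\u03b2\u03b3\u03b4\n')

# Reading it back with the same encoding restores the original text.
with io.open(path, 'r', encoding='utf-8') as f:
    content = f.read()

assert content == u'\u03b1\u03b2\u03b3\u03b4\n'
```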
You should encode '%5s' % s2, not s2. So the following will produce the expected output:
print ('%5s' % s2).encode('utf8')
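The difference is that %5s pads to five characters when applied to unicode, but to five bytes when applied to an already-encoded string; since the four Greek letters occupy eight bytes in UTF-8, encoding first leaves nothing to pad. A sketch of both orderings (using bytes %-formatting, which mirrors Python 2's str formatting and also exists in Python 3.5+):

```python
s2 = u'\u03b1\u03b2\u03b3\u03b4'

# Format first, then encode: padding is computed on 4 characters,
# so one leading space survives into the encoded output.
good = (u'%5s' % s2).encode('utf8')
assert good == b' ' + s2.encode('utf8')

# Encode first, then format: the string is already 8 bytes wide,
# so %5s adds no padding and the columns drift.
bad = b'%5s' % s2.encode('utf8')
assert bad == s2.encode('utf8')
```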