Formatting strings with a fixed width (Unicode and UTF-8)

Question

4

I need to parse some data and output it in a table-like format. The input is Unicode. Here is the test script:

#!/usr/bin/env python

s1 = u'abcd'
s2 = u'\u03b1\u03b2\u03b3\u03b4'

print '1234567890'
print '%5s' % s1
print '%5s' % s2

When it is run with a plain call such as test.py, it works as expected:

1234567890
 abcd
 αβγδ

However, if I redirect the output to a file (test.py > a.txt), I get an error:

Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    print '%5s' % s2
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4: ordinal not in range(128)

If I convert the strings to UTF-8, e.g. s2.encode('utf8'), redirection works but the column alignment breaks:

1234567890
 abcd
αβγδ

How can I make it work correctly in both cases?

- Abelisto

3 Answers

2

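You should encode '%5s' % s2, not s2. The padding is then applied to the Unicode string, where the width is counted in characters, and only the finished line is converted to bytes. The following gives the expected output: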

print ('%5s' % s2).encode('utf8')
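To see why the order matters, here is a minimal sketch that compares the lengths involved. The variable names are illustrative and not from the question, and the snippet is written so that it should run on Python 3 as well as Python 2.7.

```python
# Illustrative sketch; the variable names are assumptions, not from the post.
s2 = u'\u03b1\u03b2\u03b3\u03b4'

# Pad first: the width of 5 is counted in characters (1 space + 4 letters).
padded_text = u'%5s' % s2
padded_bytes = padded_text.encode('utf8')

# Encode first: each Greek letter takes 2 bytes in UTF-8, so the result is
# already longer than 5 and a later byte-level '%5s' would add no padding.
unpadded_bytes = s2.encode('utf8')

print(len(padded_text))
print(len(padded_bytes))
print(len(unpadded_bytes))
```

Since the encoded string is already 8 bytes long, formatting it with a width of 5 has nothing left to pad, which matches the misaligned output shown in the question.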

- JuniorCompressor

1

After your answer it became obvious :) Thanks. - Abelisto

1

print '%5s' % s1 is fine, but print '%5s' % s2 is wrong. You have to use print ('%5s' % s2).encode('utf8')

Try this code.

#!/usr/bin/env python

s1 = u'abcd'
s2 = u'\u03b1\u03b2\u03b3\u03b4'

print '1234567890' 
print '%5s' % s1
print ('%5s' % s2).encode('utf8')

- sameera lakshitha

- randomir · Accepted Answer

It depends on the encoding of your output stream. In this specific case, since you are using print, the output file used is sys.stdout.

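Interactive mode / `stdout` not redirected

When you run Python in interactive mode, or when stdout is not redirected to a file, Python uses an encoding based on the environment, i.e. on the locale environment variables such as LC_CTYPE. For example, if you run your program like this: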

$ LC_CTYPE='en_US' python test.py
...
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4: ordinal not in range(128)

that is, with ANSI_X3.4-1968 as the encoding of sys.stdout (see sys.stdout.encoding), it fails. But if you use UTF-8 (as you evidently do):

$ LC_CTYPE='en_US.UTF-8' python test.py
1234567890
 abcd
 αβγδ

you get the expected output.
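A quick way to check which case applies is to inspect the stream itself. The sketch below uses a hypothetical helper, describe_stream, which is not part of the original answer. It reports on stderr so that the message stays visible even when stdout is redirected.

```python
import sys

def describe_stream(stream):
    """Return (encoding, is_tty) for a file-like object (hypothetical helper)."""
    # Streams without an 'encoding' attribute are reported as None.
    encoding = getattr(stream, 'encoding', None)
    is_tty = hasattr(stream, 'isatty') and stream.isatty()
    return encoding, is_tty

# Write the report to stderr, not to the stream being inspected.
sys.stderr.write('stdout: encoding=%r, tty=%r\n' % describe_stream(sys.stdout))
```

Running the script once directly and once with stdout redirected to a file should show how the reported encoding differs between the two cases on a given system.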

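`stdout` redirected to a file

When you redirect stdout to a file, Python does not try to detect the encoding from the environment locale, but it does check another environment variable, PYTHONIOENCODING (see the source, initstdio() in Python/pylifecycle.c). For example, this works as expected: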

$ PYTHONIOENCODING=utf-8 python test.py >/tmp/output

since Python will then use UTF-8 encoding for the /tmp/output file.
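The same effect can be reproduced from within Python. This is a sketch under assumptions: it starts a child interpreter via sys.executable, with stdout captured through a pipe instead of a shell redirect, and the names child_code and child_env are illustrative.

```python
import os
import subprocess
import sys

# Child script: pad the Unicode string first, then print the finished line.
child_code = r"print(u'%5s' % u'\u03b1\u03b2\u03b3\u03b4')"

# Copy the current environment and force the child's stdio encoding.
child_env = dict(os.environ, PYTHONIOENCODING='utf-8')

# check_output captures stdout through a pipe, so the child's stdout is
# not a terminal, much like a redirect to a file.
raw_output = subprocess.check_output(
    [sys.executable, '-c', child_code], env=child_env)

# The captured bytes are UTF-8; decoding restores the 5-character line.
decoded_line = raw_output.decode('utf-8').rstrip('\r\n')
print(len(decoded_line))
```

Because the child pads the Unicode string and leaves the encoding to the stream, the alignment survives the redirect.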

Manually overriding the `stdout` encoding

You can also manually re-open sys.stdout with the desired encoding (this is covered in other SO questions):

import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)

Now print will correctly output both str and unicode objects, because the underlying stream writer converts them to UTF-8 on the fly.
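To illustrate what the stream writer does without touching the real sys.stdout, the sketch below wraps an in-memory byte buffer instead. The io.BytesIO stand-in is an assumption made for the example; it also lets the snippet run on Python 3, where sys.stdout is already a text stream.

```python
import codecs
import io

# A byte buffer stands in for a byte-oriented stdout (assumption for the demo).
byte_buffer = io.BytesIO()
utf8_writer = codecs.getwriter('utf8')(byte_buffer)

# Padding is applied to the Unicode string; the writer encodes on write.
utf8_writer.write(u'%5s\n' % u'abcd')
utf8_writer.write(u'%5s\n' % u'\u03b1\u03b2\u03b3\u03b4')

raw_output = byte_buffer.getvalue()
decoded_lines = raw_output.decode('utf8').splitlines()
print([len(line) for line in decoded_lines])
```

Both lines come out five characters wide after decoding, even though the second one takes more bytes on disk.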

Manually encoding strings for output

Of course, you can also manually encode each unicode object to a UTF-8 str for output:

print ('%5s' % s2).encode('utf8')

But that is tedious and error-prone.

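Explicit file opening

For completeness: when opening a file for writing with a specific encoding (such as UTF-8) in Python 2, you should use io.open or codecs.open, because they let you specify the encoding (this is also covered in another SO question), whereas the built-in open does not.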

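from codecs import open
# 'w' opens the file for writing; the default mode is read-only
myfile = open('filename', 'w', encoding='utf-8')

Or:

from io import open
# 'w' opens the file for writing; the default mode is read-only
myfile = open('filename', 'w', encoding='utf-8')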
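As a rough end-to-end sketch, the table from the question can be written to a file with io.open and read back to check the alignment. The temporary directory and the file name table.txt are placeholders chosen for the example.

```python
import io
import os
import tempfile

# Placeholder location for the demo output file.
output_path = os.path.join(tempfile.mkdtemp(), 'table.txt')

rows = [u'abcd', u'\u03b1\u03b2\u03b3\u03b4']

# The file object encodes to UTF-8; padding is done on Unicode strings.
with io.open(output_path, 'w', encoding='utf-8') as out:
    out.write(u'1234567890\n')
    for row in rows:
        out.write(u'%5s\n' % row)

# Read the file back as text to check the column widths.
with io.open(output_path, encoding='utf-8') as saved:
    saved_lines = saved.read().splitlines()

print([len(line) for line in saved_lines])
```

With this approach no explicit .encode() call is needed at each write, which avoids the error-prone manual encoding mentioned above.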
