Python utf-8, how to align printed output

Question

10

I have an array containing Japanese characters as well as "normal" characters. How do I align the printout of these?

#!/usr/bin/python
# coding=utf-8

a1=['する', 'します', 'trazan', 'した', 'しました']
a2=['dipsy', 'laa-laa', 'banarne', 'po', 'tinky winky']

for i,j in zip(a1,a2):
    print i.ljust(12),':',j

print '-'*8

for i,j in zip(a1,a2):
    print i,len(i)
    print j,len(j)

Output:

する       : dipsy
します    : laa-laa
trazan       : banarne
した       : po
しました : tinky winky
--------
する 6
dipsy 5
します 9
laa-laa 7
trazan 6
banarne 7
した 6
po 2
しました 12
tinky winky 11

Thanks,
//Fredrik

- Fredrik Pihl

I think that, to the Japanese, what you have is a mix of "normal" and Roman characters. And as for the Thai... - MtnViewMark

3 Answers

2

Use unicode objects instead of byte strings:

#!/usr/bin/python
# coding=utf-8

a1=[u'する', u'します', u'trazan', u'した', u'しました']
a2=[u'dipsy', u'laa-laa', u'banarne', u'po', u'tinky winky']

for i,j in zip(a1,a2):
    print i.ljust(12),':',j

print '-'*8

for i,j in zip(a1,a2):
    print i,len(i)
    print j,len(j)

Unicode objects deal with characters directly.
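
To make the difference in the len() values concrete, here is a minimal sketch. It assumes Python 3 (not the Python 2 used above), where str is already Unicode text and bytes plays the role of the old byte string; the variable names are only for illustration.

```python
# Python 3 sketch: str holds Unicode text, bytes holds the encoded form.
text = 'する'
encoded = text.encode('utf-8')

print(len(text))     # number of code points
print(len(encoded))  # number of UTF-8 bytes (three per kana here)

# ljust pads by code point count, so the result has 12 code points.
# Wide characters still take two terminal columns each, so this alone
# does not line the columns up on screen.
padded = text.ljust(12)
print(repr(padded), len(padded))
```

This explains the 6 reported for する in the question: that was the byte count, not the character count.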

- jcdyer

When using u'string', I got the error UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256). I worked around it with print j.encode('utf-8'), but that seems very clumsy... - Fredrik Pihl

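@jleedev: my console shows something different. Can you be more specific? What result did you get? @Fredrik: it sounds like your terminal wants to use Latin-1. You will have to find a way to convince it to use UTF-8, or write the output to a file instead of printing it (I suggest import codecs; f = codecs.open('output.txt', 'w', encoding='utf-8')). Good luck! - jcdyer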

@jleedev, ah, I see. That depends to some extent on your font, which Python can't fix, but it does fix the character count in the second "for" loop. - jcdyer

1

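You need to build the string manually, and build the format length manually. There is no easy way to do this.

The following three functions do this (they require unicodedata):

shortenStringCJK: shortens a string correctly so that it fits a given output width (rather than just cutting it to X characters).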

import unicodedata

def shortenStringCJK(string, width, placeholder='..'):
    # get the length, counting double-width characters twice
    string_len_cjk = stringLenCJK(str(string))
    # if the double-width length is too big
    if string_len_cjk > width:
        # set current length and output string
        cur_len = 0
        out_string = ''
        # loop through each character
        for char in str(string):
            # set the current length if we add the character
            cur_len += 2 if unicodedata.east_asian_width(char) in "WF" else 1
            # add the char only while it still fits in the width left after the placeholder
            if cur_len <= (width - len(placeholder)):
                out_string += char
        # return string with new width and placeholder
        return "{}{}".format(out_string, placeholder)
    else:
        return str(string)

stringLenCJK: gets the correct length (i.e. the space the string occupies on a terminal).

def stringLenCJK(string):
    # return string len including double count for double width characters
    return sum(1 + (unicodedata.east_asian_width(c) in "WF") for c in string)

formatLen: adjusts a format length for the width of double-byte characters. Without this, the columns will not line up.

def formatLen(string, length):
    # returns length updated for string with double byte characters
    # get string length normal, get string length including double byte characters
    # then subtract that from the original length
    return length - (stringLenCJK(string) - len(string))

To output some strings, predefine the format string:

format_str = "|{{:<{len}}}|"
format_len = 26
string_len = 26

and print it like this (where _string is the string to output):

print("Normal : {}".format(
    format_str.format(
        len=formatLen(shortenStringCJK(_string, width=string_len), format_len))
    ).format(
        shortenStringCJK(_string, width=string_len)
    )
)
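
The two chained .format() calls can be hard to follow, so here is a reduced sketch of the same idea, assuming Python 3. The names cjk_len, adjusted_len and render are made up for this illustration and stand in for stringLenCJK, formatLen and the print call above; shortening is left out.

```python
import unicodedata

def cjk_len(text):
    # Terminal columns: Wide (W) and Fullwidth (F) characters count as 2.
    return sum(2 if unicodedata.east_asian_width(c) in "WF" else 1 for c in text)

def adjusted_len(text, length):
    # Same arithmetic as formatLen above: shrink the pad width by the
    # number of extra columns the wide characters occupy.
    return length - (cjk_len(text) - len(text))

# Doubled braces survive the first format() as literal braces.
format_str = "|{{:<{len}}}|"

def render(text, columns=12):
    # Stage 1 fills in the pad width, e.g. "|{:<8}|".
    inner = format_str.format(len=adjusted_len(text, columns))
    # Stage 2 fills in the text itself.
    return inner.format(text)

for sample in ['しました', 'trazan']:
    print(render(sample))
```

The first format() only substitutes the width, and the second one does the actual padding, which is why the format string needs the doubled braces.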

- Clemens Schwaighofer

Thanks for answering my 8-year-old question :-) - Fredrik Pihl

- Josh Lee · Accepted Answer

Use the unicodedata.east_asian_width function to keep track of which characters are narrow and which are wide when computing the length of a string.

#!/usr/bin/python
# coding=utf-8

import sys
import codecs
import unicodedata

out = codecs.getwriter('utf-8')(sys.stdout)

def width(string):
    return sum(1+(unicodedata.east_asian_width(c) in "WF")
        for c in string)

a1=[u'する', u'します', u'trazan', u'した', u'しました']
a2=[u'dipsy', u'laa-laa', u'banarne', u'po', u'tinky winky']

for i,j in zip(a1,a2):
    out.write('%s %s: %s\n' % (i, ' '*(12-width(i)), j))

Output:

する          : dipsy
します        : laa-laa
trazan        : banarne
した          : po
しました      : tinky winky

In some web browser fonts it doesn't look right, but in a terminal window the columns line up fine.
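
For anyone on Python 3, the same approach might look roughly like the sketch below. It assumes str is already Unicode and that stdout is UTF-8, so the codecs writer is not needed. display_width and pad_to_width are hypothetical helper names, not part of unicodedata. Ambiguous-width ("A") and combining characters are simply counted as one column here, which may not match every terminal.

```python
import unicodedata

def display_width(text):
    # 2 columns for East Asian Wide (W) and Fullwidth (F) characters, 1 otherwise.
    return sum(2 if unicodedata.east_asian_width(ch) in "WF" else 1 for ch in text)

def pad_to_width(text, columns):
    # Hypothetical helper: left-justify text to a number of terminal columns.
    return text + " " * max(0, columns - display_width(text))

a1 = ['する', 'します', 'trazan', 'した', 'しました']
a2 = ['dipsy', 'laa-laa', 'banarne', 'po', 'tinky winky']

for left, right in zip(a1, a2):
    print(pad_to_width(left, 12) + " : " + right)
```

Padding with a computed number of spaces, rather than calling ljust, is the design choice that matters: ljust counts code points, not terminal columns.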