Stripping non-printable characters from a string in Python

Question

110

I used to run

$s =~ s/[^[:print:]]//g;

in Perl to get rid of non-printable characters.

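Python's re module has no POSIX character classes, so I can't write [:print:] and have it mean what I want. I also know of no way in Python to detect whether a character is printable.

What would you do?

EDIT: It has to support Unicode characters as well. The string.printable approach will happily strip them out of the output. curses.ascii.isprint will return false for any Unicode character.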

- Vinko Vrsalovic

With the PyPI regex module, you can simply use regex.sub(r'[^[:print:]]+', '', text). There are, of course, plenty of other options. - Wiktor Stribiżew

16 Answers

84

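As far as I know, the most Pythonic/efficient method would be:

import string

# On Python 2 this returns a str; on Python 3 filter() returns an iterator,
# so wrap the call in ''.join(...) to get a string back.
filtered_string = filter(lambda x: x in string.printable, myStr)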
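
A minimal sketch of how this one-liner behaves on Python 3, where filter() yields a lazy iterator rather than a string. The helper name keep_ascii_printable and the sample text are illustrative and not part of the original answer. Note that the accented letter is dropped, because string.printable only covers ASCII.

```python
import string

# Precompute a set so each membership test is a hash lookup instead of
# a linear scan over the 100 characters in string.printable.
ASCII_PRINTABLE = set(string.printable)


def keep_ascii_printable(text):
    """Keep only the characters listed in string.printable (ASCII only)."""
    return ''.join(ch for ch in text if ch in ASCII_PRINTABLE)


sample = 'caf\u00e9\x00 ok\n'

# On Python 3, filter() returns an iterator that still has to be joined.
lazy = filter(lambda ch: ch in string.printable, sample)
print(type(lazy).__name__)
print(repr(''.join(lazy)))

# The generator-expression version gives the same result in one step.
print(repr(keep_ascii_printable(sample)))
```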

- William Keller

22

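You probably want filtered_string = ''.join(filter(lambda x: x in string.printable, myStr)) so that you get back a string. - Nathan Shively-Sanders

21

Sadly, string.printable does not contain Unicode characters, so ü or ó will not be in the output... maybe there is something else? - Vinko Vrsalovic

19

You should use a list comprehension or a generator expression rather than filter + lambda. One of these will be faster 99.9% of the time. ''.join(s for s in myStr if s in string.printable) - habnabit

3

99.9% faster? Where do you get that figure from? The performance difference is nowhere near that bad. - Chris Morgan

4

Hi William. This method seems to remove all non-ASCII characters. There are many printable non-ASCII characters in Unicode! - dotancohen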

20

You could try setting up a filter using the unicodedata.category() function:

import unicodedata
printable = {'Lu', 'Ll'}
def filter_non_printable(s):
  return ''.join(c for c in s if unicodedata.category(c) in printable)

See Table 4-9 on page 175 of the Unicode database character properties for the available categories.
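
As a rough sketch of how the whitelist might be widened, the variant below keeps whole major classes (the first letter of the category code) plus the space separator. The function name keep_visible and the particular choice of classes are assumptions made for illustration; adjust them to whatever "printable" should mean for your data.

```python
import unicodedata

# Major classes to keep: Letter, Mark, Number, Punctuation, Symbol.
# 'Zs' (space separator) is listed explicitly so ordinary spaces survive,
# while line and paragraph separators ('Zl', 'Zp') are still dropped.
KEEP_MAJOR = {'L', 'M', 'N', 'P', 'S'}
KEEP_EXACT = {'Zs'}


def keep_visible(text):
    """Drop every character whose general category is not whitelisted."""
    kept = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat[0] in KEEP_MAJOR or cat in KEEP_EXACT:
            kept.append(ch)
    return ''.join(kept)


# The bell character and the newline are both category 'Cc' and are removed.
print(repr(keep_visible('Gr\u00fc\u00dfe, World 42!\x07\n')))
```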

- Ber

1

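You started a list comprehension that is never closed on your final line. I suggest you remove the opening bracket completely. - tzot

2

This looks like the most direct, straightforward method. Thanks. - dotancohen

It should be printable = set(['Lu', 'Ll']), shouldn't it? - Fabrizio Miano

2

@CsabaToth All three are valid and yield the same set. Yours is probably the nicest way to write a set literal. - Ber

3

You can add more Unicode categories to the filter. To keep spaces and digits in addition to letters, use printable = {'Lu', 'Ll', 'Zs', 'Nd'}. - Ber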

17

The following works with Unicode input and is rather fast...

import sys

# build a table mapping all non-printable characters to None
NOPRINT_TRANS_TABLE = {
    i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
}

def make_printable(s):
    """Remove non-printable characters from a string."""

    # the translate method on str removes characters
    # that map to None from the string
    return s.translate(NOPRINT_TRANS_TABLE)


assert make_printable('Café') == 'Café'
assert make_printable('\x00\x11Hello') == 'Hello'
assert make_printable('') == ''

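My own testing suggests this approach is faster than functions that iterate over the string and build the result with str.join.

- ChrisP

This is the only answer that works for me with Unicode characters. Great that you provided test cases! - pir

1

If you want to allow line breaks, add LINE_BREAK_CHARACTERS = set(["\n", "\r"]) and the condition and not chr(i) in LINE_BREAK_CHARACTERS when building the table. - pir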
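
A sketch of what that line-break suggestion could look like when spelled out in full. The names NOPRINT_KEEP_NEWLINES and make_printable_keep_newlines are made up for this example; the table is still built once, at import time, so each call is a single translate().

```python
import sys

# Characters to keep even though str.isprintable() reports them
# as non-printable.
LINE_BREAK_CHARACTERS = {'\n', '\r'}

# Map every non-printable code point, except the line breaks, to None.
NOPRINT_KEEP_NEWLINES = {
    i: None
    for i in range(sys.maxunicode + 1)
    if not chr(i).isprintable() and chr(i) not in LINE_BREAK_CHARACTERS
}


def make_printable_keep_newlines(s):
    """Remove non-printable characters but leave \\n and \\r in place."""
    return s.translate(NOPRINT_KEEP_NEWLINES)


print(repr(make_printable_keep_newlines('a\x00b\r\nc\x1b')))
```

Building the table walks every code point, so it takes a noticeable moment and holds a large dictionary in memory; that is the price for fast per-string calls.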

13

In Python 3,

def filter_nonprintable(text):
    import itertools
    # Use characters of control category
    nonprintable = itertools.chain(range(0x00,0x20),range(0x7f,0xa0))
    # Use translate to remove all non-printable characters
    return text.translate({character:None for character in nonprintable})

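See this StackOverflow post on removing punctuation for how .translate() compares to regex and .replace().

The ranges can be generated via nonprintable = (ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if unicodedata.category(c)=='Cc') using the Unicode character database categories, as shown by @Ants Aasma.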
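
A quick way to check that the hard-coded ranges and the database-derived set really describe the same characters, sketched under the assumption that a one-off scan of all code points is acceptable. The names hard_coded, from_database, CONTROL_TABLE and strip_controls are illustrative. Unlike the function above, the table here is built once rather than on every call.

```python
import itertools
import sys
import unicodedata

# The two literal ranges: C0 controls plus DEL and the C1 controls.
hard_coded = set(itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0)))

# The same set derived from the Unicode character database.
from_database = {
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)) == 'Cc'
}

print(hard_coded == from_database)

# dict.fromkeys maps every code point to None, which translate() treats
# as "delete this character".
CONTROL_TABLE = dict.fromkeys(from_database)


def strip_controls(text):
    """Remove category 'Cc' characters using a prebuilt table."""
    return text.translate(CONTROL_TABLE)


print(repr(strip_controls('a\x00b\x7fc\x9fd\u00e9')))
```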

- shawnrad

It would be better to use the Unicode ranges (see @Ants Aasma's answer). The result would be text.translate({c:None for c in itertools.chain(range(0x00,0x20),range(0x7f,0xa0))}). - darkdragon

8

This function uses list comprehensions and str.join, so it runs in linear time instead of O(n^2):

from curses.ascii import isprint

def printable(input):
    return ''.join(char for char in input if isprint(char))

- Just Some Guy

7

Yet another option in Python 3:

import re, string
re.sub(f'[^{re.escape(string.printable)}]', '', my_string)
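
To see what this pattern does in practice, here is a small sketch that compiles it once and compares it against a plain membership filter. The helper names strip_with_regex and strip_with_join and the sample strings are assumptions for illustration. Both helpers are ASCII-only, so the accented letter in the last sample is removed by each.

```python
import re
import string

# Compile once; re.escape() neutralises ']', '^', '-' and '\\', which would
# otherwise be special inside a character class.
ASCII_NONPRINTABLE_RE = re.compile(f'[^{re.escape(string.printable)}]')


def strip_with_regex(text):
    """Remove everything that is not in string.printable, via re.sub."""
    return ASCII_NONPRINTABLE_RE.sub('', text)


def strip_with_join(text):
    """Same filter written as a generator expression, for comparison."""
    return ''.join(ch for ch in text if ch in string.printable)


samples = [
    'plain text',
    'tab\tand\nnewline',
    'nul\x00 and esc\x1b',
    'caf\u00e9 [a-z]^\\',
]
for s in samples:
    print(repr(strip_with_regex(s)), strip_with_regex(s) == strip_with_join(s))
```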

- c6401

1

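For some reason this works great on Windows, but I can't use it on Linux. I had to change the "f" to an "r", but I'm not sure that is the solution. - Chop Labalagun

Sounds like your Linux Python is too old to support f-strings. r-strings are something quite different, though you could write r'[^' + re.escape(string.printable) + r']'. (I don't think re.escape() is entirely correct here, but if it works...) - tripleee

1

Sadly, string.printable does not contain Unicode characters, so ü or ó will not be in the output... - the_economist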

6

Based on @Ber's answer, I suggest removing only control characters as defined in the Unicode character database categories:

import unicodedata
def filter_non_printable(s):
    return ''.join(c for c in s if not unicodedata.category(c).startswith('C'))
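
A small sketch of which characters this test actually catches. Checking for a leading 'C' matches the whole "Other" class (Cc, Cf, Cs, Co and Cn), not just the 'Cc' control characters, so tabs and newlines go too. The name filter_other and the sample characters are chosen for illustration only.

```python
import unicodedata


def filter_other(s):
    """Drop every character whose general category starts with 'C'."""
    return ''.join(c for c in s if unicodedata.category(c)[0] != 'C')


samples = {
    'NUL': '\x00',
    'newline': '\n',
    'soft hyphen': '\u00ad',
    'zero width space': '\u200b',
    'private use': '\ue000',
    'no-break space': '\u00a0',
    'e with acute': '\u00e9',
}
for label, ch in samples.items():
    # An empty repr in the last column means the character was removed.
    print(label, unicodedata.category(ch), repr(filter_other(ch)))
```

Separators such as the no-break space are category 'Zs' and are kept by this filter.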

- darkdragon

1

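You might be onto something with startswith('C'), but in my testing this was far slower than any other solution. - Big McLargeHuge

1

big-mclargehuge: The goal of my solution was to combine completeness with simplicity and readability. You could try if unicodedata.category(c)[0] != 'C' instead. Does it perform better? If you care more about execution speed than memory use, you can pre-compute the table as shown in https://dev59.com/1XVD5IYBdhLWcg3wGHeu#93029. - darkdragon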

5

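An elegant, Pythonic way to strip "non-printable" characters from a string is to combine the isprintable() string method with a generator expression or a list comprehension (depending on the size of the string):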

    ''.join(c for c in my_string if c.isprintable())

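str.isprintable() returns True if all characters in the string are printable or the string is empty, and False otherwise. Non-printable characters are those defined in the Unicode character database as "Other" or "Separator", except the ASCII space (0x20), which is considered printable. (Note that "printable" in this context means characters that should not be escaped when repr() is invoked on a string. It has no bearing on the handling of strings written to sys.stdout or sys.stderr.)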
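
The rules quoted above can be checked directly. The sketch below probes a few characters and then applies the filter; strip_nonprintable is a name invented for this example. Note that tabs, newlines and the no-break space all count as non-printable here, which may or may not be what you want.

```python
def strip_nonprintable(text):
    """Keep only characters for which str.isprintable() is true."""
    return ''.join(c for c in text if c.isprintable())


# Probe a handful of characters: ASCII space, no-break space, newline,
# tab, an accented letter, a zero width space, and the empty string.
for ch in [' ', '\u00a0', '\n', '\t', '\u00e9', '\u200b', '']:
    print(repr(ch), ch.isprintable())

print(repr(strip_nonprintable('one\ttwo\nthree \u00e9')))
```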

- Thomas Juul Dyhr

3

The best I've come up with so far is (thanks to the python-izers above)

def filter_non_printable(s):
  return ''.join([c for c in s if ord(c) > 31 or ord(c) == 9])

This is the only way I've found that works with Unicode characters/strings.

Any better options?
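
A short sketch of what this threshold test lets through. It removes the C0 controls other than tab, but DEL, the C1 controls and invisible format characters all have code points above 31 and therefore survive. The sample string is made up for illustration.

```python
def filter_non_printable(s):
    """Keep tab and everything with a code point above 31."""
    return ''.join(c for c in s if ord(c) > 31 or ord(c) == 9)


# NUL and newline are dropped; tab, DEL (0x7f), NEL (0x85) and the
# zero width space (U+200B) are all kept.
leftovers = filter_non_printable('a\x00b\tc\nd\x7fe\x85f\u200bg')
print(repr(leftovers))

# Characters that survived although str.isprintable() rejects them.
print([c for c in leftovers if not c.isprintable()])
```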

- Vinko Vrsalovic

1

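Unless you're on Python 2.3, the inner []s are redundant. "return ''.join(c for c ...)" - habnabit

Not quite redundant: they have different meanings (and performance characteristics), although the end result is the same. - Miles

Shouldn't the other end of the range be protected too?: "ord(c) <= 126" - Gearoid Murphy

7

But there are Unicode characters that are not printable, too. - tripleee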

- Ants Aasma · Accepted Answer

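Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See the Unicode Character Database for descriptions of the categories.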

import unicodedata, re, itertools, sys

all_chars = (chr(i) for i in range(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For Python 2:

import unicodedata, re, sys

all_chars = (unichr(i) for i in xrange(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

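For some use cases, additional categories (e.g. all of the "Other" group) might be preferable, although this may slow down processing and increase memory usage significantly. Number of characters per category:

Cc (control): 65
Cf (format): 161
Cs (surrogate): 2048
Co (private-use): 137468
Cn (unassigned): 836601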
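
The per-category counts can be recomputed with a sketch like the one below. The Cf and Cn figures depend on the Unicode version bundled with your Python build, so they may differ from the numbers listed above. The second half builds a wider character class covering Cc and Cf; the names counts, control_format_re and remove_control_and_format are illustrative.

```python
import re
import sys
import unicodedata
from collections import Counter

# Count how many code points fall into each general category.
counts = Counter(
    unicodedata.category(chr(i)) for i in range(sys.maxunicode + 1)
)
for cat in ('Cc', 'Cf', 'Cs', 'Co', 'Cn'):
    print(cat, counts[cat])

# Build a character class that covers control and format characters.
wanted = {'Cc', 'Cf'}
chars = ''.join(
    chr(i) for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)) in wanted
)
control_format_re = re.compile('[%s]' % re.escape(chars))


def remove_control_and_format(s):
    """Strip 'Cc' and 'Cf' characters with a precompiled regex."""
    return control_format_re.sub('', s)


print(repr(remove_control_and_format('a\x00b\u200bc\u00e9')))
```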

Edit: Added suggestions from the comments.