Remove punctuation from Unicode formatted strings

Question

45

I have a function that removes punctuation from a list of strings:

def strip_punctuation(input):
    x = 0
    for word in input:
        input[x] = re.sub(r'[^A-Za-z0-9 ]', "", input[x])
        x += 1
    return input

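I recently modified my script to use Unicode strings so that I could handle other, non-Western characters. This function breaks when it encounters these characters and just returns empty Unicode strings. How can I reliably remove punctuation from Unicode formatted strings?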

- acpigeon
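To make the failure concrete, here is a minimal sketch of what the character class in the question does to non-ASCII input. The sample strings and the name ASCII_ONLY are illustrative assumptions, not part of the original script.

```python
import re

# Same character class as in the question: everything except ASCII
# letters, ASCII digits and the space character is deleted.
ASCII_ONLY = re.compile(r'[^A-Za-z0-9 ]')

# Hypothetical sample inputs, chosen only for illustration.
samples = ['hello, world!', '你好，世界', 'Привет, мир!']

stripped = [ASCII_ONLY.sub('', s) for s in samples]
print(stripped)  # ['hello world', '', ' ']
```

The pattern treats every non-ASCII letter as something to delete, so text written entirely in another script collapses to an empty (or whitespace-only) string.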

5

strip_punctuation() should accept a string instead of a list of strings; then, if needed, you can apply it to a whole list with list_of_strings = map(strip_punctuation, list_of_strings). - jfs

That is probably a better way to go. I like the implementations from you and F.C. that use Unicode categories. - acpigeon

4 Answers

28

If you want to use J.F. Sebastian's solution in Python 3:

import unicodedata
import sys

tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                      if unicodedata.category(chr(i)).startswith('P'))
def remove_punctuation(text):
    return text.translate(tbl)

- metakermit

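text.translate({i: None for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P')}) works as a one-off if you only rarely need it (the dictionary has roughly 800 keys), but it is better to build the dictionary once up front. - ingyhere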
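One way to follow the advice about building the dictionary only once is to cache it on first use. This is a sketch under assumptions: the helper name punctuation_table and the use of functools.lru_cache are illustrative choices, not something the thread prescribes.

```python
import sys
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=None)
def punctuation_table():
    """Build the {code point: None} mapping once, on first use."""
    return {
        i: None
        for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)).startswith('P')
    }


def remove_punctuation(text):
    # str.translate deletes every character whose code point maps to None.
    return text.translate(punctuation_table())


print(remove_punctuation('«Привет», мир! 你好，世界。'))  # Привет мир 你好世界
```

The scan over all code points happens only on the first call; later calls reuse the cached dictionary.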

9

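You can iterate over the string and use the category function from the unicodedata module to determine whether each character is punctuation. For the possible outputs of category, see the General Category Values documentation on unicode.org.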

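from unicodedata import category as cat
def strip_punctuation(word):
    # Keep only characters whose general category is not punctuation (P*);
    # without the `not`, only the punctuation would be returned.
    return "".join(char for char in word if not cat(char).startswith('P'))
# `input` is the list of strings from the question.
filtered = [strip_punctuation(word) for word in input]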

Additionally, make sure you are handling encodings and types correctly. This presentation is a good place to start (link).

- Daenyth
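To show what the category check looks at, the following sketch prints the general category of a few characters. The sample characters are arbitrary choices for illustration; every punctuation category starts with 'P', which is what the startswith('P') test relies on.

```python
import unicodedata

# Arbitrary sample characters; '\u2014' is written as an escape.
sample_chars = ['a', '7', ',', '«', '\u2014', '(', '_', '$', '。']

cats = {ch: unicodedata.category(ch) for ch in sample_chars}
for ch, cat in cats.items():
    print(repr(ch), cat)

# Only the punctuation characters have a category beginning with 'P'.
punctuation = [ch for ch, cat in cats.items() if cat.startswith('P')]
print(punctuation)
```

Note that '$' falls under a symbol category (Sc) rather than punctuation, so a check for 'P' alone leaves it in place.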

+1 for the unipain link. I'm trying to implement this, but I'm getting "IndexError: list assignment index out of range" on the result[i] line. I'll keep working on it. - acpigeon

1

@acpigeon: For some reason I thought you could assign to a list sparsely without pre-filling it. I've edited the answer to use a better method. - Daenyth

2

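There is a small but important bug in this answer: strip_punctuation actually does the opposite of what you intend and returns only the punctuation, because you forgot a "not" in the comprehension. I would edit the answer to fix it, but "edits must be at least 6 characters". - Edward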

8

A slightly shorter version based on Daenyth's answer:

import unicodedata

def strip_punctuation(text):
    """
    >>> strip_punctuation(u'something')
    u'something'

    >>> strip_punctuation(u'something.,:else really')
    u'somethingelse really'
    """
    punctuation_cats = set(['Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po'])
    return ''.join(x for x in text
                   if unicodedata.category(x) not in punctuation_cats)

input_data = [u'something', u'something, else', u'nothing.']
without_punctuation = map(strip_punctuation, input_data)

- Facundo Casco

The OP said input_data is a list of strings, not just a single string. (Of course, you can map your version over it.) - Daenyth
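A short sketch of applying a per-string function to a list follows. It assumes Python 3, where map() returns a lazy iterator rather than a list, so a list comprehension (or list(map(...))) is needed to get a list back. The function body is a compact stand-in for the answers above, not any author's exact code.

```python
import unicodedata


def strip_punctuation(text):
    # Drop every character whose general category starts with 'P'.
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith('P'))


input_data = ['something', 'something, else', 'nothing.']

# A list comprehension yields a real list on both Python 2 and Python 3.
without_punctuation = [strip_punctuation(s) for s in input_data]
print(without_punctuation)  # ['something', 'something else', 'nothing']
```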

- jfs · Accepted Answer

76

You can use the unicode.translate() method:

import unicodedata
import sys

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                      if unicodedata.category(unichr(i)).startswith('P'))
def remove_punctuation(text):
    return text.translate(tbl)

You can also use r'\p{P}', which is supported by the regex module:

import regex as re

def remove_punctuation(text):
    return re.sub(ur"\p{P}+", "", text)

- jfs

8

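+1 for suggesting regex; it is the best way to solve this. Note that it is not yet part of the standard library and has to be installed separately. Also, in Python 2 you need to define the pattern as a Unicode string (ur"..") to enable Unicode matching mode. - georg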

3

The re module (as opposed to regex) doesn't seem to support \p{P}, right? - ratsimihah

1

Oh, I didn't realize regex was a PyPI module. Thanks! - ratsimihah

4

@posdef This is Python 2 code (read the first comment). In Python 3, remove the u prefix in front of r'', or use u"\\p{P}+" (in which case you have to escape the backslash manually). - jfs
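Following the comment above, a Python 3 form of the regex-based answer might look like the sketch below. It assumes the third-party regex package is installed; the fallback to unicodedata when it is missing is an addition made here so the snippet still runs, not part of the original answer.

```python
import unicodedata

try:
    import regex  # third-party package from PyPI, not the stdlib re module
except ImportError:
    regex = None  # fallback path below is an assumption of this sketch


def remove_punctuation(text):
    if regex is not None:
        # Plain raw string: no u prefix is needed in Python 3.
        return regex.sub(r'\p{P}+', '', text)
    # Stdlib-only fallback based on general categories.
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith('P'))


print(remove_punctuation('something.,:else really'))  # somethingelse really
```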

1

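@DennisGolomazov: Correct. | (U+007C) is a math symbol, \p{Sm}; it is not Unicode punctuation. Perhaps you want \p{posix_punct} ([[:punct:]]) instead. Depending on your case, it may be simpler to specify which characters to keep. If you have a specific list of requirements (what to keep, what to remove), it could make a good separate question. - jfs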
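The point about | can be checked with unicodedata, and one possible way to also drop symbols is to filter on more than one category prefix. The helper name strip_categories and its prefixes parameter are hypothetical, introduced only for this sketch.

```python
import unicodedata


def strip_categories(text, prefixes=('P', 'S')):
    # Hypothetical helper: drop characters whose general category starts
    # with any of the given prefixes (P = punctuation, S = symbols).
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith(prefixes))


print(unicodedata.category('|'))                      # Sm
print(strip_categories('a|b, c+d', prefixes=('P',)))  # a|b c+d
print(strip_categories('a|b, c+d'))                   # ab cd
```

Whether symbols such as | or + should go depends on the data; as the comment suggests, listing the characters to keep can be simpler than listing the ones to remove.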
