What is the best way to remove accents (normalize) in a Python Unicode string?

790

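I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found an elegant way to do this on the web (in Java):

  1. Convert the Unicode string to its long normalized form (with separate characters for letters and diacritics).
  2. Remove all the characters whose Unicode type is "diacritic".

Do I need to install a library such as pyICU, or is this possible with just the Python standard library? And what about Python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterparts.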

14 Answers

6

There are already many answers here, but none of them considered using sklearn:

from sklearn.feature_extraction.text import strip_accents_ascii, strip_accents_unicode

accented_string = u'Málagueña®'

print(strip_accents_unicode(accented_string)) # output: Malaguena®
print(strip_accents_ascii(accented_string)) # output: Malaguena

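This is particularly useful if you are already using sklearn to process text. These are the functions called internally by classes such as CountVectorizer to normalize strings: with strip_accents='ascii', strip_accents_ascii is called, and with strip_accents='unicode', strip_accents_unicode is called.

More details

Finally, consider these details from the docstrings: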

Signature: strip_accents_ascii(s)
Transform accentuated unicode symbols into ascii or nothing

Warning: this solution is only suited for languages that have a direct
transliteration to ASCII symbols.

and

Signature: strip_accents_unicode(s)
Transform accentuated unicode symbols into their simple counterpart

Warning: the python-level loop and join operations make this
implementation 20 times slower than the strip_accents_ascii basic
normalization.
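
For readers who are not using sklearn, the behaviour described in the two docstrings can be approximated with the standard library alone. This is a minimal sketch and not sklearn's actual source: the function names ending in _sketch are made up for illustration, and it assumes both functions start from an NFKD decomposition, as the docstrings and the example output above suggest.

```python
import unicodedata


def strip_accents_unicode_sketch(s: str) -> str:
    """Decompose with NFKD, then drop combining marks in a Python-level loop.

    Base characters outside ASCII (for example the registered sign) are kept.
    """
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def strip_accents_ascii_sketch(s: str) -> str:
    """Decompose with NFKD, then discard everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", s)
    return decomposed.encode("ascii", "ignore").decode("ascii")


accented_string = "Málagueña®"
print(strip_accents_unicode_sketch(accented_string))  # Malaguena®
print(strip_accents_ascii_sketch(accented_string))    # Malaguena
```

The ASCII variant silently drops every character that has no ASCII form after decomposition, which is why the docstring warns that it only suits languages with a direct transliteration to ASCII.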

5

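Some languages use combining diacritics as part of their letters, and use accent diacritics separately to mark stress.

I think it is safer to specify explicitly which diacritics you want to strip: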

import unicodedata

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    # Resolve the Unicode character names to the actual combining characters.
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))
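
A short usage sketch may make the difference from a strip-everything approach clearer. It repeats the function so that it runs on its own; the sample strings are illustrative choices, not part of the original answer.

```python
import unicodedata


def strip_accents(string, accents=('COMBINING ACUTE ACCENT',
                                   'COMBINING GRAVE ACCENT',
                                   'COMBINING TILDE')):
    # Resolve the Unicode character names to the actual combining characters.
    accents = set(map(unicodedata.lookup, accents))
    # Decompose, drop only the listed marks, then recompose whatever is left.
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))


# The acute accent is in the default set, so it is removed.
print(strip_accents('café'))
# The diaeresis is not in the default set, so the letter is kept intact.
print(strip_accents('über'))
# Cyrillic short i decomposes to a base letter plus a breve; the breve is
# part of the letter and is not listed, so the letter survives.
print(strip_accents('\u0439'))
# A stress mark (combining acute) on a Cyrillic vowel is removed.
print(strip_accents('кни\u0301га'))
# Passing a different tuple changes which marks are stripped.
print(strip_accents('über', accents=('COMBINING DIAERESIS',)))
```

Recomposing with NFC at the end means letters whose marks were kept come back as single precomposed characters rather than as base-plus-mark sequences.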

1
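If you are hoping for functionality similar to Elasticsearch's asciifolding filter, you might want to consider fold-to-ascii, which is itself a Python port of the Apache Lucene ASCII Folding Filter. It converts alphabetic, numeric, and symbolic Unicode characters that are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if they exist. Here is an example from the page mentioned above: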
from fold_to_ascii import fold
s = u'Astroturf® paté'
fold(s)
> u'Astroturf pate'
fold(s, u'?')
> u'Astroturf? pate'

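Edit: The fold_to_ascii module seems to work well for normalizing Latin-based alphabets. However, unmappable characters are removed, which means that the module will reduce Chinese text, for example, to an empty string. If you want to preserve Chinese, Japanese, and other Unicode alphabets, consider using @mo-han's remove_accent_chars_regex implementation above.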


-2
I came up with this (intended especially for Latin letters and linguistic purposes):
import string
from functools import lru_cache

import unicodedata


# This can improve performance by avoiding redundant computations when the function is
# called multiple times with the same arguments.
@lru_cache
def lookup(
    l: str, case_sens: bool = True, replace: str = "", add_to_printable: str = ""
):
    r"""
    Look up information about a character and suggest a replacement.

    Args:
        l (str): The character to look up.
        case_sens (bool, optional): Whether to consider case sensitivity for replacements. Defaults to True.
        replace (str, optional): The default replacement character when not found. Defaults to ''.
        add_to_printable (str, optional): Additional uppercase characters to consider as printable. Defaults to ''.

    Returns:
        dict: A dictionary containing the following information:
            - 'all_data': A sorted list of words representing the character name.
            - 'is_printable_letter': True if the character is a printable letter, False otherwise.
            - 'is_printable': True if the character is printable, False otherwise.
            - 'is_capital': True if the character is a capital letter, False otherwise.
            - 'suggested': The suggested replacement for the character based on the provided criteria.
    Example:
        sen = "Montréal, über, 12.89, Mère, Françoise, noël, 889"
        norm = ''.join([lookup(k, case_sens=True, replace='x', add_to_printable='')['suggested'] for k in sen])
        print(norm)
        #########################
        sen2 = 'kožušček'
        norm2 = ''.join([lookup(k, case_sens=True, replace='x', add_to_printable='')['suggested'] for k in sen2])
        print(norm2)
        #########################

        sen3="Falsches Üben von Xylophonmusik quält jeden größeren Zwerg."
        norm3 = ''.join([lookup(k, case_sens=True, replace='x', add_to_printable='')['suggested'] for k in sen3]) # doesn't preserve ü - ue ...
        print(norm3)
        #########################
        sen4 = "cætera"
        norm4 = ''.join([lookup(k, case_sens=True, replace='x', add_to_printable='ae')['suggested'] for k in
                         sen4])  
        print(norm4)


        # Montreal, uber, 12.89, Mere, Francoise, noel, 889
        # kozuscek
        # Falsches Uben von Xylophonmusik qualt jeden groseren Zwerg.
        # caetera
    """
    # The name of the character l is retrieved using the unicodedata.name()
    # function and split into a list of words and sorted by len (shortest is the wanted letter)
    v = sorted(unicodedata.name(l).split(), key=len)
    sug = replace
    stri_pri = string.printable + add_to_printable.upper()
    is_printable_letter = v[0] in stri_pri
    is_printable = l in stri_pri
    is_capital = "CAPITAL" in v
    # Depending on the values of the boolean variables, the variable sug may be
    # updated to suggest a replacement for the character l. If the character is a printable letter,
    # the suggested replacement is set to the first word in the sorted list of names (v).
    # If case_sens is True and the character is a printable letter but not a capital,
    # the suggested replacement is set to the lowercase version of the first word in v.
    # If the character is printable, the suggested replacement is set to the character l itself.
    if is_printable_letter:
        sug = v[0]

        if case_sens:
            if not is_capital:
                sug = v[0].lower()
    elif is_printable:
        sug = l
    return {
        "all_data": v,
        "is_printable_letter": is_printable_letter,
        "is_printable": is_printable,
        "is_capital": is_capital,
        "suggested": sug,
    }

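I also came up with another solution, likewise based on lookup dictionaries and Numba, but the source code is too large to post here. Here is the GitHub link: https://github.com/hansalemaos/charchef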

