如何在.NET中从字符串中删除变音符号(重音符号)?

547
我尝试转换一些法裔加拿大的字符串,基本上,我想能够去掉字母上的法语重音标记并保留字母本身。(例如将é 转换为 e,因此 crème brûlée 将变为 creme brulee)最好的方法是什么?

21
警告:这种方法在某些特定情况下可能有效,但通常情况下不能仅仅去掉变音符号。在某些情况和某些语言中,这可能会改变文本的含义。你没有说明想要这么做的原因;如果是为了比较字符串或搜索,最好使用支持Unicode的库来完成。 - JacquesB
2
由于大多数实现此目的的技术都依赖于Unicode规范化,因此阅读描述该标准的文档可能会很有用:http://www.unicode.org/reports/tr15/ - LuddyPants
我认为Azure团队已经解决了这个问题,我尝试上传一个名为“Mémo de la réunion.pdf”的文件,操作成功了。 - Rady
1
在我们的情况下,限制来自于Postgres数据库中的ltree数据类型。其中ltree仅允许使用[a-zA-Z0-9_]。而对于我们的情况,确实需要进行快速搜索。 - Mike de Klerk
22个回答

1
这段代码对我起作用了:
var updatedText = text.Normalize(NormalizationForm.FormD)
     .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
     .ToArray();

然而,请不要这样处理名称。 这不仅是对于名字中带有umlauts/accents的人的一种侮辱,它在某些情况下也可能是非常错误和危险的(见下文)。除了删除重音符号之外,还有其他的书写方式。

此外,这是错误的和危险的,例如,如果用户必须提供与护照上完全相同的名称。

例如,我的名字写作Zuberbühler,而在护照的机器可读部分中,您会发现Zuberbuehler。通过删除重音符号,名称将无法与任何部分匹配。这可能会给用户带来问题。

相反,您应该禁止在输入表单中使用umlauts/accent,以便用户可以正确地书写其姓名而没有umlaut或accent。

举个实际的例子,如果申请 ESTA 的 Web 服务 (https://www.application-esta.co.uk/special-characters-and) 使用上述代码而不是正确地转换 umlauts,ESTA 申请要么会被拒绝,要么当旅客入境美国时会遇到问题。

另一个例子是机票。假设你有一个机票预订 Web 应用程序,用户使用带有音调符号的名字,而你的实现只是移除音调符号,然后使用航空公司的 Web 服务预订机票!由于姓名与护照上的任何部分都不匹配,你的客户可能无法登机。


这对于韩语不起作用,需要使用FormC。 - aepot

-3

显然,由于声望不足,我无法评论亚历山大的优秀链接。- Lucene似乎是在相当通用的情况下工作的唯一解决方案。

对于那些想要简单复制粘贴解决方案的人,这里有一个利用Lucene中的代码的解决方案:

字符串测试 =“ÁÂÄÅÇÉÍÎÓÖØÚÜÞàáâãäåæçèéêëìíîïðñóôöøúüāăčĐęğıŁłńŌōřŞşšźžșțệủ”;

Console.WriteLine(Lucene.latinizeLucene(testbed));

AAAACEIIOOOUUTHaaaaaaaeceeeeiiiidnoooouuaacDegiLlnOorSsszzsteu (将原文直接转换为中文没有意义)

//////////

public static class Lucene
{
    // source: https://raw.githubusercontent.com/apache/lucenenet/master/src/Lucene.Net.Analysis.Common/Analysis/Miscellaneous/ASCIIFoldingFilter.cs
    // idea: https://dev59.com/zXVC5IYBdhLWcg3wliGe (scroll down, search for lucene by Alexander)
    public static string latinizeLucene(string arg)
    {
        char[] argChar = arg.ToCharArray();

        // latinizeLuceneImpl can expand one char up to four chars - e.g. Þ to TH, or æ to ae, or in fact ⑽ to (10)
        char[] resultChar = new String(' ', arg.Length * 4).ToCharArray();

        int outputPos = Lucene.latinizeLuceneImpl(argChar, 0, ref resultChar, 0, arg.Length);

        string ret = new string(resultChar);
        ret = ret.Substring(0, outputPos);

        return ret;
    }

    /// <summary>
    /// Converts characters above ASCII to their ASCII equivalents.  For example,
    /// accents are removed from accented characters. 
    /// <para/>
    /// @lucene.internal
    /// </summary>
    /// <param name="input">     The characters to fold </param>
    /// <param name="inputPos">  Index of the first character to fold </param>
    /// <param name="output">    The result of the folding. Should be of size >= <c>length * 4</c>. </param>
    /// <param name="outputPos"> Index of output where to put the result of the folding </param>
    /// <param name="length">    The number of characters to fold </param>
    /// <returns> length of output </returns>
    private static int latinizeLuceneImpl(char[] input, int inputPos, ref char[] output, int outputPos, int length)
    {
        int end = inputPos + length;
        for (int pos = inputPos; pos < end; ++pos)
        {
            char c = input[pos];

            // Quick test: if it's not in range then just keep current character
            if (c < '\u0080')
            {
                output[outputPos++] = c;
            }
            else
            {
                switch (c)
                {
                    case '\u00C0': // À  [LATIN CAPITAL LETTER A WITH GRAVE]
                    case '\u00C1': // Á  [LATIN CAPITAL LETTER A WITH ACUTE]
                    case '\u00C2': // Â  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX]
                    case '\u00C3': // Ã  [LATIN CAPITAL LETTER A WITH TILDE]
                    case '\u00C4': // Ä  [LATIN CAPITAL LETTER A WITH DIAERESIS]
                    case '\u00C5': // Å  [LATIN CAPITAL LETTER A WITH RING ABOVE]
                    case '\u0100': // Ā  [LATIN CAPITAL LETTER A WITH MACRON]
                    case '\u0102': // Ă  [LATIN CAPITAL LETTER A WITH BREVE]
                    case '\u0104': // Ą  [LATIN CAPITAL LETTER A WITH OGONEK]
                    case '\u018F': // Ə  http://en.wikipedia.org/wiki/Schwa  [LATIN CAPITAL LETTER SCHWA]
                    case '\u01CD': // Ǎ  [LATIN CAPITAL LETTER A WITH CARON]
                    case '\u01DE': // Ǟ  [LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON]
                    case '\u01E0': // Ǡ  [LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON]
                    case '\u01FA': // Ǻ  [LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE]
                    case '\u0200': // Ȁ  [LATIN CAPITAL LETTER A WITH DOUBLE GRAVE]
                    case '\u0202': // Ȃ  [LATIN CAPITAL LETTER A WITH INVERTED BREVE]
                    case '\u0226': // Ȧ  [LATIN CAPITAL LETTER A WITH DOT ABOVE]
                    case '\u023A': // Ⱥ  [LATIN CAPITAL LETTER A WITH STROKE]
                    case '\u1D00': // ᴀ  [LATIN LETTER SMALL CAPITAL A]
                    case '\u1E00': // Ḁ  [LATIN CAPITAL LETTER A WITH RING BELOW]
                    case '\u1EA0': // Ạ  [LATIN CAPITAL LETTER A WITH DOT BELOW]
                    case '\u1EA2': // Ả  [LATIN CAPITAL LETTER A WITH HOOK ABOVE]
                    case '\u1EA4': // Ấ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE]
                    case '\u1EA6': // Ầ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE]
                    case '\u1EA8': // Ẩ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
                    case '\u1EAA': // Ẫ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE]
                    case '\u1EAC': // Ậ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
                    case '\u1EAE': // Ắ  [LATIN CAPITAL LETTER A WITH BREVE AND ACUTE]
                    case '\u1EB0': // Ằ  [LATIN CAPITAL LETTER A WITH BREVE AND GRAVE]
                    case '\u1EB2': // Ẳ  [LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE]
                    case '\u1EB4': // Ẵ  [LATIN CAPITAL LETTER A WITH BREVE AND TILDE]
                    case '\u1EB6': // Ặ  [LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW]
                    case '\u24B6': // Ⓐ  [CIRCLED LATIN CAPITAL LETTER A]
                    case '\uFF21': // A  [FULLWIDTH LATIN CAPITAL LETTER A]
                        output[outputPos++] = 'A';
                        break;
                    case '\u00E0': // à  [LATIN SMALL LETTER A WITH GRAVE]
                    case '\u00E1': // á  [LATIN SMALL LETTER A WITH ACUTE]
                    case '\u00E2': // â  [LATIN SMALL LETTER A WITH CIRCUMFLEX]
                    case '\u00E3': // ã  [LATIN SMALL LETTER A WITH TILDE]
                    case '\u00E4': // ä  [LATIN SMALL LETTER A WITH DIAERESIS]
                    case '\u00E5': // å  [LATIN SMALL LETTER A WITH RING ABOVE]
                    case '\u0101': // ā  [LATIN SMALL LETTER A WITH MACRON]
                    case '\u0103': // ă  [LATIN SMALL LETTER A WITH BREVE]
                    case '\u0105': // ą  [LATIN SMALL LETTER A WITH OGONEK]
                    case '\u01CE': // ǎ  [LATIN SMALL LETTER A WITH CARON]
                    case '\u01DF': // ǟ  [LATIN SMALL LETTER A WITH DIAERESIS AND MACRON]
                    case '\u01E1': // ǡ  [LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON]
                    case '\u01FB': // ǻ  [LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE]
                    case '\u0201': // ȁ  [LATIN SMALL LETTER A WITH DOUBLE GRAVE]
                    case '\u0203': // ȃ  [LATIN SMALL LETTER A WITH INVERTED BREVE]
                    case '\u0227': // ȧ  [LATIN SMALL LETTER A WITH DOT ABOVE]
                    case '\u0250': // ɐ  [LATIN SMALL LETTER TURNED A]
                    case '\u0259': // ə  [LATIN SMALL LETTER SCHWA]
                    case '\u025A': // ɚ  [LATIN SMALL LETTER SCHWA WITH HOOK]
                    case '\u1D8F': // ᶏ  [LATIN SMALL LETTER A WITH RETROFLEX HOOK]
                    case '\u1D95': // ᶕ  [LATIN SMALL LETTER SCHWA WITH RETROFLEX HOOK]
                    case '\u1E01': // ạ  [LATIN SMALL LETTER A WITH RING BELOW]
                    case '\u1E9A': // ả  [LATIN SMALL LETTER A WITH RIGHT HALF RING]
                    case '\u1EA1': // ạ  [LATIN SMALL LETTER A WITH DOT BELOW]
                    case '\u1EA3': // ả  [LATIN SMALL LETTER A WITH HOOK ABOVE]
                    case '\u1EA5': // ấ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE]
                    case '\u1EA7': // ầ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE]
                    case '\u1EA9': // ẩ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
                    case '\u1EAB': // ẫ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE]
                    case '\u1EAD': // ậ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
                    case '\u1EAF': // ắ  [LATIN SMALL LETTER A WITH BREVE AND ACUTE]
                    case '\u1EB1': // ằ  [LATIN SMALL LETTER A WITH BREVE AND GRAVE]
                    case '\u1EB3': // ẳ  [LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE]
                    case '\u1EB5': // ẵ  [LATIN SMALL LETTER A WITH BREVE AND TILDE]
                    case '\u1EB7': // ặ  [LATIN SMALL LETTER A WITH BREVE AND DOT BELOW]
                    case '\u2090': // ₐ  [LATIN SUBSCRIPT SMALL LETTER A]
                    case '\u2094': // ₔ  [LATIN SUBSCRIPT SMALL LETTER SCHWA]
                    case '\u24D0': // ⓐ  [CIRCLED LATIN SMALL LETTER A]
                    case '\u2C65': // ⱥ  [LATIN SMALL LETTER A WITH STROKE]
                    case '\u2C6F': // Ɐ  [LATIN CAPITAL LETTER TURNED A]
                    case '\uFF41': // a  [FULLWIDTH LATIN SMALL LETTER A]
                        output[outputPos++] = 'a';
                        break;
                    case '\uA732': // Ꜳ  [LATIN CAPITAL LETTER AA]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'A';
                        break;
                    case '\u00C6': // Æ  [LATIN CAPITAL LETTER AE]
                    case '\u01E2': // Ǣ  [LATIN CAPITAL LETTER AE WITH MACRON]
                    case '\u01FC': // Ǽ  [LATIN CAPITAL LETTER AE WITH ACUTE]
                    case '\u1D01': // ᴁ  [LATIN LETTER SMALL CAPITAL AE]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'E';
                        break;
                    case '\uA734': // Ꜵ  [LATIN CAPITAL LETTER AO]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'O';
                        break;
                    case '\uA736': // Ꜷ  [LATIN CAPITAL LETTER AU]
                        output[outputPos++] = 'A';
                        output[outputPos++] = 'U';
                        break;

        // etc. etc. etc.
        // see link above for complete source code
        // 
        // unfortunately, postings are limited, as in
        // "Body is limited to 30000 characters; you entered 136098."

                    [...]

                    case '\u2053': // ⁓  [SWUNG DASH]
                    case '\uFF5E': // ~  [FULLWIDTH TILDE]
                        output[outputPos++] = '~';
                        break;
                    default:
                        output[outputPos++] = c;
                        break;
                }
            }
        }
        return outputPos;
    }
}

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接