Replace special characters with their equivalents

Question

4

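How can I replace the following special characters with their equivalents?

Vowels: ÁÉÍÓÚáéíóú should be replaced by AEIOUaeiou, respectively, and the letter Ñ by N.

The expression:

str = regexprep(str,'[^a-zA-Z]','');

removes every non-letter character, but how can I substitute the characters as shown above instead of deleting them?

Thanks.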

- Jorge Zapata

3 Answers

5

You can write a series of regular expressions, for example:

s = regexprep(s,'(?:À|Á|Â|Ã|Ä|Å)','A')
s = regexprep(s,'(?:Ì|Í|Î|Ï)','I')

and so on for every accented character (both uppercase and lowercase).

Warning: even a small subset of the Latin alphabet has a great many variants.

A simpler example:

chars_old = 'ÁÉÍÓÚáéíóú';
chars_new = 'AEIOUaeiou';

str = 'Ámró';
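% tf marks the characters of str found in chars_old; loc is their index there
[tf,loc] = ismember(str, chars_old);
% replace each matched character with the one at the same index in chars_new
str(tf) = chars_new( loc(tf) )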

The string before:

>> str
str =
Ámró

And after:

>> str
str =
Amro

- Amro

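Thanks @Amro. I'm actually only dealing with the Spanish subset, so the special characters are limited to the ones shown above. Isn't there a simpler solution? Something like PHP's str_replace, where you can pass the equivalents as array arguments? - Jorge Zapata

Another possibility is to use Perl (which is available from MATLAB) with a module like Text::Unidecode. It is a very powerful solution that does interesting things such as transliterating Unicode to ASCII. It has been ported to other programming languages, such as Python and Java (I have used the Python version in the past). - Amro

@JorgeZapata: I added a simpler example. Each character in chars_old is replaced by its equivalent in chars_new. You can add the N with tilde to the lists in the same way. - Amro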
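
For comparison with the str_replace-style lookup asked about in the comments, here is a minimal Python sketch of the same one-to-one table idea. The names CHARS_OLD, CHARS_NEW and replace_spanish_accents are illustrative and not from the thread, and adding the lowercase ñ is an assumption. The sketch also assumes the input uses precomposed characters (for example a single Á rather than A followed by a combining accent).

```python
# Illustrative names; the lists mirror chars_old / chars_new from the answer,
# with Ñ (and, as an assumption, ñ) appended as suggested in the comments.
CHARS_OLD = "ÁÉÍÓÚáéíóúÑñ"
CHARS_NEW = "AEIOUaeiouNn"

# str.maketrans builds a one-to-one lookup table from two equal-length strings.
TABLE = str.maketrans(CHARS_OLD, CHARS_NEW)


def replace_spanish_accents(text: str) -> str:
    """Replace each character in CHARS_OLD with its counterpart in CHARS_NEW."""
    return text.translate(TABLE)


print(replace_spanish_accents("Ámró"))  # Amro
```

Characters that are not in the table pass through unchanged, which matches the behaviour of the ismember-based MATLAB snippet.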

3

In case anyone still needs this... I did, so I took the time to look up the most common diacritics:

function [clean_s] = removediacritics(s)
%REMOVEDIACRITICS Removes diacritics from text.
%   This function removes many common diacritics from strings, such as
%     á - the acute accent
%     à - the grave accent
%     â - the circumflex accent
%     ü - the diaeresis, or trema, or umlaut
%     ñ - the tilde
%     ç - the cedilla
%     å - the ring, or bolle
%     ø - the slash, or solidus, or virgule

% uppercase
s = regexprep(s,'(?:Á|À|Â|Ã|Ä|Å)','A');
s = regexprep(s,'(?:Æ)','AE');
s = regexprep(s,'(?:ß)','ss');   % note: ß is a lowercase letter
s = regexprep(s,'(?:Ç)','C');
s = regexprep(s,'(?:Ð)','D');
s = regexprep(s,'(?:É|È|Ê|Ë)','E');
s = regexprep(s,'(?:Í|Ì|Î|Ï)','I');
s = regexprep(s,'(?:Ñ)','N');
s = regexprep(s,'(?:Ó|Ò|Ô|Ö|Õ|Ø)','O');
s = regexprep(s,'(?:Œ)','OE');
s = regexprep(s,'(?:Ú|Ù|Û|Ü)','U');
s = regexprep(s,'(?:Ý|Ÿ)','Y');

% lowercase
s = regexprep(s,'(?:á|à|â|ä|ã|å)','a');
s = regexprep(s,'(?:æ)','ae');
s = regexprep(s,'(?:ç)','c');
s = regexprep(s,'(?:ð)','d');
s = regexprep(s,'(?:é|è|ê|ë)','e');
s = regexprep(s,'(?:í|ì|î|ï)','i');
s = regexprep(s,'(?:ñ)','n');
s = regexprep(s,'(?:ó|ò|ô|ö|õ|ø)','o');
s = regexprep(s,'(?:œ)','oe');
s = regexprep(s,'(?:ú|ù|ü|û)','u');
s = regexprep(s,'(?:ý|ÿ)','y');

% return cleaned string
clean_s = s;
end

Thanks to Amro for the simple solution!

- Jim Goodall
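
The same replacement list can also be kept as data rather than as one regexprep call per group. Below is a rough Python sketch of that idea, shown only for comparison. GROUPS and remove_diacritics are hypothetical names, and the table is an abbreviated subset of the function above. It relies on str.translate accepting a dict that maps a code point to a replacement string, which is what allows multi-character results such as AE or ss.

```python
# Abbreviated, illustrative subset of the replacement groups listed above.
GROUPS = [
    ("ÁÀÂÃÄÅ", "A"), ("Æ", "AE"), ("Ç", "C"), ("ÉÈÊË", "E"),
    ("ÍÌÎÏ", "I"), ("Ñ", "N"), ("ÓÒÔÖÕØ", "O"), ("Œ", "OE"),
    ("ÚÙÛÜ", "U"),
    ("áàâäãå", "a"), ("æ", "ae"), ("ß", "ss"), ("ç", "c"),
    ("éèêë", "e"), ("íìîï", "i"), ("ñ", "n"), ("óòôöõø", "o"),
    ("œ", "oe"), ("úùüû", "u"),
]

# Map every source character (as a code point) to its replacement string.
TABLE = {ord(ch): repl for chars, repl in GROUPS for ch in chars}


def remove_diacritics(text: str) -> str:
    """Apply the lookup table; characters not listed are left untouched."""
    return text.translate(TABLE)


print(remove_diacritics("Œuvre de Ñandú, Straße"))  # OEuvre de Nandu, Strasse
```

Adding a character then means editing one table entry instead of adding another pattern.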

- Otto Remse · Accepted Answer

The following PowerShell function normalizes all characters that carry diacritics, such as ÅÄÖ.

function inputWash {
    param(
        [string]$inputString
    )
    [string]$formD = $inputString.Normalize(
            [System.text.NormalizationForm]::FormD
    )
    $stringBuilder = new-object System.Text.StringBuilder
    for ($i = 0; $i -lt $formD.Length; $i++){
        $unicodeCategory = [System.Globalization.CharUnicodeInfo]::GetUnicodeCategory($formD[$i])
        $nonSPacingMark = [System.Globalization.UnicodeCategory]::NonSpacingMark
        if($unicodeCategory -ne $nonSPacingMark){
            $stringBuilder.Append($formD[$i]) | out-null
        }
    }
    $string = $stringBuilder.ToString().Normalize([System.text.NormalizationForm]::FormC)
    return $string.toLower()
}
Write-Host (inputWash "ÖÄÅÑÜ")

oaanu

If you don't need the conversion to lowercase, you can omit .toLower().
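
The decompose, filter, recompose approach used here is not specific to .NET. As a hedged illustration, the following Python sketch does the same with the standard unicodedata module. The function name strip_diacritics and its lowercase parameter are assumptions made for this example.

```python
import unicodedata


def strip_diacritics(text: str, lowercase: bool = False) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    # NFD splits a character such as "Ö" into "O" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFD", text)
    # "Mn" is the Unicode category for nonspacing (combining) marks.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left, mirroring the FormC step in the PowerShell code.
    result = unicodedata.normalize("NFC", kept)
    return result.lower() if lowercase else result


print(strip_diacritics("ÖÄÅÑÜ", lowercase=True))  # oaanu
```

Letters that have no canonical decomposition, such as Ø, Æ or ß, pass through unchanged with this technique, so an explicit replacement table is still needed if those must be mapped to ASCII.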