What is the length of the substring matched by the culture-sensitive String.IndexOf method?

Question

14

I tried writing a culture-aware string replacement method:

public static string Replace(string text, string oldValue, string newValue)
{
    int index = text.IndexOf(oldValue, StringComparison.CurrentCulture);
    return index >= 0
        ? text.Substring(0, index) + newValue + text.Substring(index + oldValue.Length)
        : text;
}

However, it fails to handle Unicode combining characters correctly:

// \u0301 is Combining Acute Accent
Console.WriteLine(Replace("déf", "é", "o"));       // 1. CORRECT: dof
Console.WriteLine(Replace("déf", "e\u0301", "o")); // 2. INCORRECT: do
Console.WriteLine(Replace("de\u0301f", "é", "o")); // 3. INCORRECT: dóf

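To fix my code, I need to know that in the second example, String.IndexOf matched only one character (é) even though it searched for two characters (e\u0301). Similarly, I need to know that in the third example, String.IndexOf matched two characters (e\u0301) even though it searched for only one (é).

How can I determine the actual length of the substring matched by String.IndexOf?

Note: Performing Unicode normalization on text and oldValue (as suggested by James Keesey) would accommodate combining characters, but ligatures are still a problem: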

Console.WriteLine(Replace("œf", "œ", "i"));  // 4. CORRECT: if
Console.WriteLine(Replace("œf", "oe", "i")); // 5. INCORRECT: i
Console.WriteLine(Replace("oef", "œ", "i")); // 6. INCORRECT: ief

- Michael Liu
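The note about normalization can be made concrete with a small sketch. It assumes only String.Normalize and NormalizationForm from the base class library; the class and method names are hypothetical and exist purely for illustration.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: true when NFC normalization makes the two
    // strings identical under an ordinal (char-by-char) comparison.
    public static bool EqualAfterNfc(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }

    // Hypothetical helper: true when none of the four normalization
    // forms changes the string at all.
    public static bool UnchangedByAllForms(string s)
    {
        foreach (NormalizationForm form in new[] {
            NormalizationForm.FormC, NormalizationForm.FormD,
            NormalizationForm.FormKC, NormalizationForm.FormKD })
        {
            if (!string.Equals(s.Normalize(form), s, StringComparison.Ordinal))
                return false;
        }
        return true;
    }
}
```

EqualAfterNfc("de\u0301f", "d\u00E9f") should be true, because NFC composes e + U+0301 into é. EqualAfterNfc("\u0153f", "oef") should be false: U+0153 (œ) has no Unicode decomposition, so no normalization form rewrites it to "oe", which is why ligatures remain a problem.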

3 Answers

2

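The following method works for your examples. It compares successively longer substrings of text, starting at the match index, until one compares equal to oldValue, and then uses that length instead of oldValue.Length.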

public static string Replace(string text, string oldValue, string newValue)
{
    int index = text.IndexOf(oldValue, StringComparison.CurrentCulture);
    if (index >= 0)
        return text.Substring(0, index) + newValue +
                 text.Substring(index + LengthInString(text, oldValue, index));
    else
        return text;
}
static int LengthInString(string text, string oldValue, int index)
{
    for (int length = 1; length <= text.Length - index; length++)
        if (string.Equals(text.Substring(index, length), oldValue,
                                            StringComparison.CurrentCulture))
            return length;
    throw new Exception("Oops!");
}

- Tim S.

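I was afraid I'd have to do something like this. Internally, String.IndexOf must know how many characters it matched, but I'm surprised there's no (obvious?) way to get at that information. - Michael Liu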

The LengthInString loop should count down. The match can be longer than oldValue, and we should probably take the longest possible match. - usr

for (int length = 1; length < text.Length - index; length++) should be changed to for (int length = 1; length <= text.Length - index; length++). For example, if text.Length is 3 and index is 2, the loop does nothing, but it should iterate once. - Jim W says reinstate Monica

@JimW Thanks! I hadn't noticed that problem. I've fixed it in my answer. Off-by-one errors have always been my nemesis. - Tim S.
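The count-down suggestion from the comments could be sketched as follows. This is only an illustration, not Tim S.'s code: the name LongestLengthInString and the -1 return value are assumptions made here. It tries the longest candidate first, which would matter if the text after the match contains characters that the culture-sensitive comparison ignores.

```csharp
using System;

public static class MatchLength
{
    // Hypothetical variant of LengthInString that counts down, so the
    // longest substring comparing equal to oldValue is returned.
    // Returns -1 if no substring starting at index compares equal.
    public static int LongestLengthInString(string text, string oldValue, int index)
    {
        for (int length = text.Length - index; length >= 1; length--)
        {
            if (string.Equals(text.Substring(index, length), oldValue,
                              StringComparison.CurrentCulture))
            {
                return length;
            }
        }
        return -1;
    }
}
```

Like the original loop, this allocates one substring per iteration, so it is quadratic in the worst case. That is acceptable for short inputs but worth keeping in mind for long texts.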

2

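I spoke too soon (and had never seen this method before), but there is an alternative. You can use the StringInfo.ParseCombiningCharacters() method to get the start of each actual character and use that to determine the length of the string to replace.

You will need to normalize both strings before calling IndexOf. This ensures that the source and target strings have the same length.

See the String.Normalize() reference page, which describes this exact problem.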

- James Keesey
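As a rough sketch of the ParseCombiningCharacters idea: the method returns the starting index of each text element, so the length of the element beginning at a given index can be read from the gap to the next start. The helper name TextElementLengthAt and its -1 return value are assumptions made for illustration.

```csharp
using System;
using System.Globalization;

public static class TextElements
{
    // Hypothetical helper: length in chars of the text element that
    // starts at index, or -1 if index is not the start of a text element.
    public static int TextElementLengthAt(string text, int index)
    {
        // Starting char index of every text element (base character
        // plus any combining characters that follow it).
        int[] starts = StringInfo.ParseCombiningCharacters(text);

        int pos = Array.IndexOf(starts, index);
        if (pos < 0)
            return -1;

        // The element ends where the next one starts, or at end of string.
        int end = pos + 1 < starts.Length ? starts[pos + 1] : text.Length;
        return end - index;
    }
}
```

For "de\u0301f" the element starts are 0, 1, and 3, so the element at index 1 spans two chars. This covers a single text element only; it says nothing about ligatures, because "oe" is two separate text elements.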

I don't want to normalize characters that don't need to be replaced. - Michael Liu

@MichaelLiu Why not? It represents the same text. - Tim S.

@TimS.: If the user deliberately chose to enter the original text in a particular way, I'd rather not interfere with that choice. - Michael Liu

@JamesKeesey: Can you point out exactly where the documentation says "there is no other way"? I can't see it. - Michael Liu

1

It turns out that normalization isn't sufficient either. For example, it doesn't handle ligatures: OE = Œ. - Michael Liu

- David Ewen · Accepted Answer

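You'll need to call FindNLSString or FindNLSStringEx directly. String.IndexOf uses FindNLSStringEx, but all the information you need is available from FindNLSString.

Here is an example of how to rewrite your Replace method so that it works against your test cases. Note that I'm using the current user locale; consult the API documentation if you want to use the system locale or supply your own. I'm also passing 0 for the flags, which means it uses the locale's default string comparison options; again, the documentation can help you supply different options.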

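// Requires: using System.Runtime.InteropServices;
// FindNLSString is a Win32 API, so this only works on Windows.
public const int LOCALE_USER_DEFAULT = 0x0400;

// Returns the index of the match in sourceString, or -1 if nothing was found.
// On success, 'found' receives the length of the matched substring.
[DllImport("kernel32.dll", SetLastError = true, ExactSpelling = true)]
internal static extern int FindNLSString(
    int locale, uint flags,
    [MarshalAs(UnmanagedType.LPWStr)] string sourceString, int sourceCount,
    [MarshalAs(UnmanagedType.LPWStr)] string findString, int findCount,
    out int found);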

public static string ReplaceWithCombiningCharSupport(string text, string oldValue, string newValue)
{
    int foundLength;
    int index = FindNLSString(LOCALE_USER_DEFAULT, 0, text, text.Length, oldValue, oldValue.Length, out foundLength);
    return index >= 0 ? text.Substring(0, index) + newValue + text.Substring(index + foundLength) : text;
}
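The P/Invoke approach above is tied to Windows. As an alternative sketch that is not part of the original answer, and assuming .NET 5 or later, CompareInfo.IndexOf has a span-based overload that reports the matched length through an out parameter. The class and method names below are hypothetical, and the explicit culture parameter is a choice made for this illustration.

```csharp
using System;
using System.Globalization;

public static class CultureAwareReplace
{
    // Hypothetical helper: replaces the first culture-sensitive match of
    // oldValue in text, using the length that IndexOf actually matched.
    public static string ReplaceFirst(string text, string oldValue,
                                      string newValue, CultureInfo culture)
    {
        // Span-based overload (assumed .NET 5+): matchLength receives the
        // number of chars in text covered by the match, which may differ
        // from oldValue.Length.
        int index = culture.CompareInfo.IndexOf(
            text.AsSpan(), oldValue.AsSpan(), CompareOptions.None,
            out int matchLength);

        return index >= 0
            ? text.Substring(0, index) + newValue + text.Substring(index + matchLength)
            : text;
    }
}
```

A call such as CultureAwareReplace.ReplaceFirst("de\u0301f", "é", "o", CultureInfo.CurrentCulture) uses the matched length rather than oldValue.Length. How ligatures such as œ compare against "oe" depends on the culture and on the underlying globalization library, so results for those cases should be checked on the target platform.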