字符串比较性能（与修剪相结合）

Question

字符串比较性能（与修剪相结合）

10

我需要做很多高性能的不区分大小写的字符串比较，并意识到我的方法.ToLower().Trim()由于所有新字符串都被分配了，所以非常愚蠢。

因此，我进行了一些探索，发现以下方法更可取：

String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)

唯一的问题在于我想忽略字符串开头或结尾的空格，也就是说需要使用 Trim() 函数，但如果使用 Trim() 函数，又会面临字符串分配的问题。我猜我可以检查每个字符串，看它是否以空格开头或结尾，只有那些字符串才进行 Trim 操作。要么就找出每个字符串的索引和长度，然后传递给 string.Compare 函数的一个重载版本。

public static int Compare
(
    string strA,
    int indexA,
    string strB,
    int indexB,
    int length,
    StringComparison comparisonType
)

但这似乎相当混乱，我可能必须使用一些整数，如果不为每个字符串的尾随和前导空格组合制作一个非常大的if-else语句...那么有没有什么优雅的解决方案？

这是我的当前提议：

public bool IsEqual(string a, string b)
    {
        return (string.Compare(a, b, StringComparison.OrdinalIgnoreCase) == 0);
    }

    public bool IsTrimEqual(string a, string b)
    {
        if (Math.Abs(a.Length- b.Length) > 2 ) // if length differs by more than 2, cant be equal
        {
            return  false;
        }
        else if (IsEqual(a,b))
        {
            return true;
        }
        else 
        {
            return (string.Compare(a.Trim(), b.Trim(), StringComparison.OrdinalIgnoreCase) == 0);
        }
    }

- Homde

5

你是怎么认为有问题的？过早地优化是一个不好的想法 —— 在你的应用程序变得“太慢”之前，没有必要进行优化。与其追求快速代码，不如先专注于清晰易懂的代码。 - Anon.

我也会质疑这是否真正需要微观优化？您在这个领域真的有性能问题吗？我想，在其他领域，您可以获得更大的性能改进。 - AdaTheDev

这是针对一个非常大的字符串集合的搜索引擎，因此我认为在这种情况下进行优化是相关的。此外，在工具箱中拥有一个良好的字符串比较方法也不是什么坏事。 - Homde

@Anon：我认为这不是过早优化。如果有大量的字符串，如果您为每个比较创建新的字符串实例，那么时间会显著延长。只需运行一些测试并自行查看即可... - Thomas Levesque

当然，有哈希表，但我需要它来完成其他任务，无论如何，我只是好奇是否有更优雅的方法来完成这个任务。嗯，如果输入字符串中包含大量空格，那么好吧，这只是一个夜晚的想法。 - Homde

显示剩余3条评论

9个回答

3

我会使用你提供的代码。

String.Compare(txt1,txt2, StringComparison.OrdinalIgnoreCase)

并根据需要添加任何 .Trim() 调用。这将在大多数情况下节省您最初的 4 个字符串（.ToLower().Trim()），并在所有情况下节省两个字符串（.ToLower()）。

如果在此之后遇到性能问题，则“混乱”的选项可能是最佳选择。

- Nate

这很有趣。Mattias：如果大部分字符串不需要trim()调用，那么你通常可以这样做，如果字符串不匹配，就回退并尝试使用trim()调用，然后“真正”返回它们不匹配。 - Mike Atlas

那么，我想我应该运行一些测试来查看所需的 IsPrefix() / IsSuffix() （四个）是否比仅执行 Trim 操作需要更多或更少的性能。 - Homde

啊，当然！先进行比较，如果不为0，则进行修剪比较（或混乱的方法），非常好。 - Homde

2

首先，请确保您确实需要优化此代码。也许创建字符串的副本不会对您的程序产生明显影响。

如果您确实需要进行优化，可以尝试在存储字符串时而不是比较字符串时处理它们（假设这些操作发生在程序的不同阶段）。例如，存储修剪和小写版本的字符串，以便在比较它们时可以简单地检查它们是否相等。

- Amnon

在这种情况下使用更有效的方法没有任何问题。使用String.Compare不是什么“聪明”的技巧，它是一种内置的比较字符串的方式，也比调用ToUpper().ToLower()更有效。它的意图也更清晰，因此我认为在这种情况下你不能提出一个有效的“过早优化”的案例。 - Ed S.

我理解你的意思是 Trim().ToLower()。 - Vinko Vrsalovic

2

你是否可以只在获取字符串时修剪（并可能将其小写化）每个字符串一次？做更多的事情听起来像是过早的优化...

- Fried Hoeben

当然，在某些情况下我可以这样做，只是想看看是否有一种优化的通用方法来做到这一点。 - Homde

1

我注意到你的第一个建议只是比较相等而不是排序，这可以带来进一步的效率节省。

public static bool TrimmedOrdinalIgnoreCaseEquals(string x, string y)
{
    //Always check for identity (same reference) first for
    //any comparison (equality or otherwise) that could take some time.
    //Identity always entails equality, and equality always entails
    //equivalence.
    if(ReferenceEquals(x, y))
        return true;
    //We already know they aren't both null as ReferenceEquals(null, null)
    //returns true.
    if(x == null || y == null)
        return false;
    int startX = 0;
    //note we keep this one further than the last char we care about.
    int endX = x.Length;
    int startY = 0;
    //likewise, one further than we care about.
    int endY = y.Length;
    while(startX != endX && char.IsWhiteSpace(x[startX]))
        ++startX;
    while(startY != endY && char.IsWhiteSpace(y[startY]))
        ++startY;
    if(startX == endX)      //Empty when trimmed.
        return startY == endY;
    if(startY == endY)
        return false;
    //lack of bounds checking is safe as we would have returned
    //already in cases where endX and endY can fall below zero.
    while(char.IsWhiteSpace(x[endX - 1]))
        --endX;
    while(char.IsWhiteSpace(y[endY - 1]))
        --endY;
    //From this point on I am assuming you do not care about
    //the complications of case-folding, based on your example
    //referencing the ordinal version of string comparison
    if(endX - startX != endY - startY)
        return false;
    while(startX != endX)
    {
        //trade-off: with some data a case-sensitive
        //comparison first
        //could be more efficient.
        if(
            char.ToLowerInvariant(x[startX++])
            != char.ToLowerInvariant(y[startY++])
        )
            return false;
    }
    return true;
}

当然，没有匹配的哈希码生成器，等式检查器又有何用：

public static int TrimmedOrdinalIgnoreCaseHashCode(string str)
{
    //Higher CMP_NUM (or get rid of it altogether) gives
    //better hash, at cost of taking longer to compute.
    const int CMP_NUM = 12;
    if(str == null)
        return 0;
    int start = 0;
    int end = str.Length;
    while(start != end && char.IsWhiteSpace(str[start]))
        ++start;
    if(start != end)
        while(char.IsWhiteSpace(str[end - 1]))
            --end;

    int skipOn = (end - start) / CMP_NUM + 1;
    int ret = 757602046; // no harm matching native .NET with empty string.
    while(start < end)
    {
            //prime numbers are our friends.
        ret = unchecked(ret * 251 + (int)(char.ToLowerInvariant(str[start])));
        start += skipOn;
    }
    return ret;
}

- Jon Hanna

0

警告关于过早优化的问题是正确的，但我假设你已经测试过了，并发现大量时间被浪费在复制字符串上。在这种情况下，我建议尝试以下方法：

int startIndex1, length1, startIndex2, length2;
FindStartAndLength(txt1, out startIndex1, out length1);
FindStartAndLength(txt2, out startIndex2, out length2);

int compareLength = Math.Max(length1, length2);
int result = string.Compare(txt1, startIndex1, txt2, startIndex2, compareLength);

FindStartAndLength是一个函数，用于查找“修剪”字符串的起始索引和长度（这尚未经过测试，但应该能够给出一般想法）：

static void FindStartAndLength(string text, out int startIndex, out int length)
{
    startIndex = 0;
    while(char.IsWhiteSpace(text[startIndex]) && startIndex < text.Length)
        startIndex++;

    length = text.Length - startIndex;
    while(char.IsWhiteSpace(text[startIndex + length - 1]) && length > 0)
        length--;
}

- Aaron

0

你可以实现自己的 StringComparer。以下是一个基本实现示例：

public class TrimmingStringComparer : StringComparer
{
    private StringComparison _comparisonType;

    public TrimmingStringComparer()
        : this(StringComparison.CurrentCulture)
    {
    }

    public TrimmingStringComparer(StringComparison comparisonType)
    {
        _comparisonType = comparisonType;
    }

    public override int Compare(string x, string y)
    {
        int indexX;
        int indexY;
        int lengthX = TrimString(x, out indexX);
        int lengthY = TrimString(y, out indexY);

        if (lengthX <= 0 && lengthY <= 0)
            return 0; // both strings contain only white space

        if (lengthX <= 0)
            return -1; // x contains only white space, y doesn't

        if (lengthY <= 0)
            return 1; // y contains only white space, x doesn't

        if (lengthX < lengthY)
            return -1; // x is shorter than y

        if (lengthY < lengthX)
            return 1; // y is shorter than x

        return String.Compare(x, indexX, y, indexY, lengthX, _comparisonType);
    }

    public override bool Equals(string x, string y)
    {
        return Compare(x, y) == 0;
    }

    public override int GetHashCode(string obj)
    {
        throw new NotImplementedException();
    }

    private int TrimString(string s, out int index)
    {
        index = 0;
        while (index < s.Length && Char.IsWhiteSpace(s, index)) index++;
        int last = s.Length - 1;
        while (last >= 0 && Char.IsWhiteSpace(s, last)) last--;
        return last - index + 1;
    }
}

备注：

它没有经过广泛测试，可能包含错误
性能尚未评估（但它可能比调用Trim和ToLower更好）
GetHashCode方法未实现，因此不要将其用作字典中的键

- Thomas Levesque

0

事实上，如果需要完成，就需要完成。我不认为任何您提出的不同解决方案会有所不同。在每种情况下，都需要进行一些比较才能找到空格或将其删除。

显然，删除空格是问题的一部分，所以您不必担心。

在比较之前将字符串转换为小写字母，如果您使用 Unicode 字符并且可能比复制一个字符串更慢，则会出现错误。

- Peter Stuifzand

0

使用现代版本的.NET和Span<char>，现在可以很容易地实现这一点，而不会牺牲性能。

public static bool EqualsIgnoreLeadingTrailingWhitespaces(
        string a, 
        string b,
        StringComparison comparison = StringComparison.OrdinalIgnoreCase)
{
    if (ReferenceEquals(a, b))
        return true;
    if (a is null || b is null)
        return false;

    // Memory allocation free trimming
    ReadOnlySpan<char> s1 = a.AsSpan().Trim();
    ReadOnlySpan<char> s2 = b.AsSpan().Trim();

    return s1.Equals(s2, comparison);
}

上述代码比较了两个字符串是否相等，但是可以很容易地通过使用CompareTo()而不是Equals()来对字符串进行排序。

- Isak Savo

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Guffa · Accepted Answer

这样的内容应该可以做到：

public static int TrimCompareIgnoreCase(string a, string b) {
   int indexA = 0;
   int indexB = 0;
   while (indexA < a.Length && Char.IsWhiteSpace(a[indexA])) indexA++;
   while (indexB < b.Length && Char.IsWhiteSpace(b[indexB])) indexB++;
   int lenA = a.Length - indexA;
   int lenB = b.Length - indexB;
   while (lenA > 0 && Char.IsWhiteSpace(a[indexA + lenA - 1])) lenA--;
   while (lenB > 0 && Char.IsWhiteSpace(b[indexB + lenB - 1])) lenB--;
   if (lenA == 0 && lenB == 0) return 0;
   if (lenA == 0) return 1;
   if (lenB == 0) return -1;
   int result = String.Compare(a, indexA, b, indexB, Math.Min(lenA, lenB), true);
   if (result == 0) {
      if (lenA < lenB) result--;
      if (lenA > lenB) result++;
   }
   return result;
}

例子：

string a = "  asdf ";
string b = " ASDF \t   ";

Console.WriteLine(TrimCompareIgnoreCase(a, b));

输出：

你应该对它进行简单的修剪和比较，并使用一些真实数据来进行分析，以查看在你要使用它的情况下是否真的有任何区别。