如何计算字符串中字符或子字符串的出现次数?

1038
我想要统计一个字符串中有多少个斜杠(/)。有几种方法可以做到这一点,但我无法决定哪种方法是最好(或最简单)的。
目前我正在使用类似以下的方法:
string source = "/once/upon/a/time/";
int count = source.Length - source.Replace("/", "").Length;

或者对于长度大于1的字符串:
string haystack = "/once/upon/a/time";
string needle = "/";
int needleCount = ( haystack.Length - haystack.Replace(needle,"").Length ) / needle.Length;

42
我必须说这是一种非常不同的计数方式。我对基准测试结果感到惊讶 :) - naveen
5
这并不是很不同寻常... 这是在SQL中实现此功能的典型方式:LEN(ColumnToCheck) - LEN(REPLACE(ColumnToCheck,"N","")) - Sheridan
7
事实上,你应该用 "/" 符号来进行除法运算。"Length"翻译为长度。 - Gerard
4
我可以问一下,您的要求是在"/////"中出现"//"的次数应该是多少呢?是2还是4? - Les
1
使用正则表达式可能是最好的方法。 - Adam Higgins
你说的“挖掘正则表达式”是什么意思?我猜你指的是Linq,因为它可能更加晦涩,而且可能没有任何减少开销的好处,对吗? - barlop
35个回答

2

一个用于字符串出现次数的通用函数:

public int getNumberOfOccurencies(String inputString, String checkString)
{
    if (checkString.Length > inputString.Length || checkString.Equals("")) { return 0; }
    int lengthDifference = inputString.Length - checkString.Length;
    int occurencies = 0;
    for (int i = 0; i < lengthDifference; i++) {
        if (inputString.Substring(i, checkString.Length).Equals(checkString)) { occurencies++; i += checkString.Length - 1; } }
    return occurencies;
}

2
这会创建大量的临时字符串,使垃圾回收器工作非常艰难。 - EricLaw

2

我想分享我的扩展方法(有关详细信息,请参见评论)。虽然我没有进行正式的基准测试,但我认为它在大多数情况下都应该非常快。

编辑:好吧-所以这个SO问题让我想知道我们当前实现的性能如何与这里提供的一些解决方案相比。我决定做一些基准测试,并发现我们的解决方案在进行大字符串(100 Kb +)、大子字符串(32 Kb +)和许多嵌入式重复(10K +)的积极搜索时,与Richard Watson提供的解决方案的性能非常接近。此时,我们的解决方案速度较慢,约为2倍至4倍。鉴于此事实和我们非常喜欢Richard Watson提出的解决方案,我们已经相应地重构了我们的解决方案。我只是想让任何可能受益于此的人可以使用它。

我们原来的解决方案:

    /// <summary>
    /// Counts the number of occurrences of the specified substring within
    /// the current string.
    /// </summary>
    /// <param name="s">The current string.</param>
    /// <param name="substring">The substring we are searching for.</param>
    /// <param name="aggressiveSearch">Indicates whether or not the algorithm 
    /// should be aggressive in its search behavior (see Remarks). Default 
    /// behavior is non-aggressive.</param>
    /// <remarks>This algorithm has two search modes - aggressive and 
    /// non-aggressive. When in aggressive search mode (aggressiveSearch = 
    /// true), the algorithm will try to match at every possible starting 
    /// character index within the string. When false, all subsequent 
    /// character indexes within a substring match will not be evaluated. 
    /// For example, if the string was 'abbbc' and we were searching for 
    /// the substring 'bb', then aggressive search would find 2 matches 
    /// with starting indexes of 1 and 2. Non aggressive search would find 
    /// just 1 match with starting index at 1. After the match was made, 
    /// the non aggressive search would attempt to make it's next match 
    /// starting at index 3 instead of 2.</remarks>
    /// <returns>The count of occurrences of the substring within the string.</returns>
    public static int CountOccurrences(this string s, string substring, 
        bool aggressiveSearch = false)
    {
        // if s or substring is null or empty, substring cannot be found in s
        if (string.IsNullOrEmpty(s) || string.IsNullOrEmpty(substring))
            return 0;

        // if the length of substring is greater than the length of s,
        // substring cannot be found in s
        if (substring.Length > s.Length)
            return 0;

        var sChars = s.ToCharArray();
        var substringChars = substring.ToCharArray();
        var count = 0;
        var sCharsIndex = 0;

        // substring cannot start in s beyond following index
        var lastStartIndex = sChars.Length - substringChars.Length;

        while (sCharsIndex <= lastStartIndex)
        {
            if (sChars[sCharsIndex] == substringChars[0])
            {
                // potential match checking
                var match = true;
                var offset = 1;
                while (offset < substringChars.Length)
                {
                    if (sChars[sCharsIndex + offset] != substringChars[offset])
                    {
                        match = false;
                        break;
                    }
                    offset++;
                }
                if (match)
                {
                    count++;
                    // if aggressive, just advance to next char in s, otherwise, 
                    // skip past the match just found in s
                    sCharsIndex += aggressiveSearch ? 1 : substringChars.Length;
                }
                else
                {
                    // no match found, just move to next char in s
                    sCharsIndex++;
                }
            }
            else
            {
                // no match at current index, move along
                sCharsIndex++;
            }
        }

        return count;
    }

这是我们修改后的解决方案:

    /// <summary>
    /// Counts the number of occurrences of the specified substring within
    /// the current string.
    /// </summary>
    /// <param name="s">The current string.</param>
    /// <param name="substring">The substring we are searching for.</param>
    /// <param name="aggressiveSearch">Indicates whether or not the algorithm 
    /// should be aggressive in its search behavior (see Remarks). Default 
    /// behavior is non-aggressive.</param>
    /// <remarks>This algorithm has two search modes - aggressive and 
    /// non-aggressive. When in aggressive search mode (aggressiveSearch = 
    /// true), the algorithm will try to match at every possible starting 
    /// character index within the string. When false, all subsequent 
    /// character indexes within a substring match will not be evaluated. 
    /// For example, if the string was 'abbbc' and we were searching for 
    /// the substring 'bb', then aggressive search would find 2 matches 
    /// with starting indexes of 1 and 2. Non aggressive search would find 
    /// just 1 match with starting index at 1. After the match was made, 
    /// the non aggressive search would attempt to make it's next match 
    /// starting at index 3 instead of 2.</remarks>
    /// <returns>The count of occurrences of the substring within the string.</returns>
    public static int CountOccurrences(this string s, string substring, 
        bool aggressiveSearch = false)
    {
        // if s or substring is null or empty, substring cannot be found in s
        if (string.IsNullOrEmpty(s) || string.IsNullOrEmpty(substring))
            return 0;

        // if the length of substring is greater than the length of s,
        // substring cannot be found in s
        if (substring.Length > s.Length)
            return 0;

        int count = 0, n = 0;
        while ((n = s.IndexOf(substring, n, StringComparison.InvariantCulture)) != -1)
        {
            if (aggressiveSearch)
                n++;
            else
                n += substring.Length;
            count++;
        }

        return count;
    }

2

我的初步看法给了我这样的东西:

public static int CountOccurrences(string original, string substring)
{
    if (string.IsNullOrEmpty(substring))
        return 0;
    if (substring.Length == 1)
        return CountOccurrences(original, substring[0]);
    if (string.IsNullOrEmpty(original) ||
        substring.Length > original.Length)
        return 0;
    int substringCount = 0;
    for (int charIndex = 0; charIndex < original.Length; charIndex++)
    {
        for (int subCharIndex = 0, secondaryCharIndex = charIndex; subCharIndex < substring.Length && secondaryCharIndex < original.Length; subCharIndex++, secondaryCharIndex++)
        {
            if (substring[subCharIndex] != original[secondaryCharIndex])
                goto continueOuter;
        }
        if (charIndex + substring.Length > original.Length)
            break;
        charIndex += substring.Length - 1;
        substringCount++;
    continueOuter:
        ;
    }
    return substringCount;
}

public static int CountOccurrences(string original, char @char)
{
    if (string.IsNullOrEmpty(original))
        return 0;
    int substringCount = 0;
    for (int charIndex = 0; charIndex < original.Length; charIndex++)
        if (@char == original[charIndex])
            substringCount++;
    return substringCount;
}

使用替换和分割的“大海捞针”方法需要21秒以上,而这种方法只需要约15.2秒。在添加一些内容后进行编辑,将substring.Length - 1添加到charIndex中(像应该的那样),时间为11.6秒。编辑2:我使用了一个包含26个双字符字符串的字符串,以下是更新为相同示例文本的时间:大海捞针(OP版本):7.8秒。建议的机制:4.6秒。编辑3:添加单个字符的角落情况,时间为1.2秒。编辑4:为了上下文,使用了5000万次迭代。

2

字符串中的字符串:

在 " .. JD JD JD JD etc. and etc. JDJDJDJDJDJDJDJD and etc." 中查找 "etc"。

var strOrigin = " .. JD JD JD JD etc. and etc. JDJDJDJDJDJDJDJD and etc.";
var searchStr = "etc";
int count = (strOrigin.Length - strOrigin.Replace(searchStr, "").Length)/searchStr.Length.

在将其视为不可靠/笨拙之前,请检查性能...

2

Split(可能)比IndexOf(对于字符串)更胜一筹。

上面的基准测试似乎表明Richard Watson是最快的字符串,但这是错误的(也许差异来自我们的测试数据,但由于以下原因似乎很奇怪)。

如果我们在.NET中深入研究这些方法的实现(对于Luke H、Richard Watson方法),

  • IndexOf依赖于文化,它将尝试检索/创建ReadOnlySpan,检查是否必须忽略大小写等等,然后最终进行不安全/本地调用。
  • Split能够处理多个分隔符,并具有一些StringSplitOptions,必须创建string[]数组并用拆分结果填充它(因此进行一些子字符串操作)。根据字符串出现的次数,Split可能比IndexOf更快。

顺便说一下,我制作了一个简化版的IndexOf(如果我使用指针和不安全代码,它可能会更快,但对于大多数人来说,未经检查应该没问题),它至少快了4个数量级

基准测试(源代码在GitHub上)

通过在莎士比亚·理查三世中搜索常用单词(the)或小句子来完成。

方法 平均值 误差 标准差 比率
Richard_LongInLong 67.721微秒 1.0278微秒 0.9614微秒 1.00
Luke_LongInLong 1.960微秒 0.0381微秒 0.0637微秒 0.03
Fab_LongInLong 1.198微秒 0.0160微秒 0.0142微秒 0.02
-------------------- -----------: ----------: ----------: ------:
Richard_ShortInLong 104.771微秒 2.8117微秒 7.9304微秒 1.00
Luke_ShortInLong 2.971微秒 0.0594微秒 0.0813微秒 0.03
Fab_ShortInLong 2.206微秒 0.0419微秒 0.0411微秒 0.02
--------------------- ----------: ---------: ---------: ------:
Richard_ShortInShort 115.53纳秒 1.359纳秒 1.135纳秒 1.00
Luke_ShortInShort 52.46纳秒 0.970纳秒 0.908纳秒 0.45
Fab_ShortInShort 28.47纳秒 0.552纳秒 0.542纳秒 0.25
public int GetOccurrences(string input, string needle)
{
    int count = 0;
    unchecked
    {
        if (string.IsNullOrEmpty(input) || string.IsNullOrEmpty(needle))
        {
            return 0;
        }

        for (var i = 0; i < input.Length - needle.Length + 1; i++)
        {
            var c = input[i];
            if (c == needle[0])
            {
                for (var index = 0; index < needle.Length; index++)
                {
                    c = input[i + index];
                    var n = needle[index];

                    if (c != n)
                    {
                        break;
                    }
                    else if (index == needle.Length - 1)
                    {
                        count++;
                    }
                }
            }
        }
    }

    return count;
}

2
string source = "/once/upon/a/time/";
int count = 0, n = 0;
while ((n = source.IndexOf('/', n) + 1) != 0) count++;

以下是对Richard Watson答案的一种变体,随着字符在字符串中出现的次数增多,效率逐渐提高,代码量也更少!

不过我必须说,在没有对每种情况进行广泛测试的情况下,我确实看到了非常显著的速度提升,方法是使用:

int count = 0;
for (int n = 0; n < source.Length; n++) if (source[n] == '/') count++;

2
            var conditionalStatement = conditionSetting.Value;

            //order of replace matters, remove == before =, incase of ===
            conditionalStatement = conditionalStatement.Replace("==", "~").Replace("!=", "~").Replace('=', '~').Replace('!', '~').Replace('>', '~').Replace('<', '~').Replace(">=", "~").Replace("<=", "~");

            var listOfValidConditions = new List<string>() { "!=", "==", ">", "<", ">=", "<=" };

            if (conditionalStatement.Count(x => x == '~') != 1)
            {
                result.InvalidFieldList.Add(new KeyFieldData(batch.DECurrentField, "The IsDoubleKeyCondition does not contain a supported conditional statement. Contact System Administrator."));
                result.Status = ValidatorStatus.Fail;
                return result;
            }

需要做类似的事情来测试字符串中的条件语句。

用单个字符替换我要查找的内容,并计算单个字符的出现次数。

显然,在进行此操作之前,需要检查使用的单个字符是否存在于字符串中,以避免错误计数。


1
string Name = "Very good nice one is very good but is very good nice one this is called the term";
bool valid=true;
int count = 0;
int k=0;
int m = 0;
while (valid)
{
    k = Name.Substring(m,Name.Length-m).IndexOf("good");
    if (k != -1)
    {
        count++;
        m = m + k + 4;
    }
    else
        valid = false;
}
Console.WriteLine(count + " Times accures");

1
str="aaabbbbjjja";
int count = 0;
int size = str.Length;

string[] strarray = new string[size];
for (int i = 0; i < str.Length; i++)
{
    strarray[i] = str.Substring(i, 1);
}
Array.Sort(strarray);
str = "";
for (int i = 0; i < strarray.Length - 1; i++)
{

    if (strarray[i] == strarray[i + 1])
    {

        count++;
    }
    else
    {
        count++;
        str = str + strarray[i] + count;
        count = 0;
    }

}
count++;
str = str + strarray[strarray.Length - 1] + count;

这是用于计算字符出现次数的程序。对于此示例,输出将为“a4b4j3”。


2
不完全是“计算字符串出现次数”,更多的是计算字符 - 有没有一种指定要匹配的字符串的方法,Narenda? - Paul Sullivan
1
int count = 0; string str = "我们有foo和foo,请在其中计算foo的数量"; string stroccurance = "foo"; string[] strarray = str.Split(' '); Array.Sort(strarray); str = ""; for (int i = 0; i < strarray.Length - 1; i++) { if (strarray[i] == stroccurance) { count++; } } str = "字符串" + stroccurance + "出现的次数为" + count + "次。通过这个例子,您可以计算任何字符串的出现次数。"; - Narendra Kumar

1
如果你查看这个网页,其中15种不同的方法进行了基准测试,包括使用并行循环。
最快的方法似乎是使用单线程for循环(如果你有.Net版本<4.0)或并行.for循环(如果使用的是.Net>4.0且具有数千个检查)。
假设"ss"是你的搜索字符串,"ch"是你的字符数组(如果你正在寻找多个字符),那么下面是具有最快运行时间单线程代码的基本要点:
for (int x = 0; x < ss.Length; x++)
{
    for (int y = 0; y < ch.Length; y++)
    {
        for (int a = 0; a < ss[x].Length; a++ )
        {
        if (ss[x][a] == ch[y])
            //it's found. DO what you need to here.
        }
    }
}

提供基准源代码,以便您可以运行自己的测试。

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接