如何计算字符串中字符或子字符串的出现次数？

Question

如何计算字符串中字符或子字符串的出现次数？

1038

我想要统计一个字符串中有多少个斜杠（/）。有几种方法可以做到这一点，但我无法决定哪种方法是最好（或最简单）的。

目前我正在使用类似以下的方法：

string source = "/once/upon/a/time/";
int count = source.Length - source.Replace("/", "").Length;

或者对于长度大于1的字符串：

string haystack = "/once/upon/a/time";
string needle = "/";
int needleCount = ( haystack.Length - haystack.Replace(needle,"").Length ) / needle.Length;

- in.spite

42

我必须说这是一种非常不同的计数方式。我对基准测试结果感到惊讶 :) - naveen

5

这并不是很不同寻常... 这是在SQL中实现此功能的典型方式：LEN(ColumnToCheck) - LEN(REPLACE(ColumnToCheck,"N",""))。 - Sheridan

7

事实上，你应该用 "/" 符号来进行除法运算。"Length"翻译为长度。 - Gerard

4

我可以问一下，您的要求是在"/////"中出现"//"的次数应该是多少呢？是2还是4？ - Les

1

使用正则表达式可能是最好的方法。 - Adam Higgins

你说的“挖掘正则表达式”是什么意思？我猜你指的是Linq，因为它可能更加晦涩，而且可能没有任何减少开销的好处，对吗？ - barlop

35个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Dave · Answer 1

16

在C＃中，一个不错的字符串子串计数器是这个出人意料的棘手家伙：

public static int CCount(String haystack, String needle)
{
    return haystack.Split(new[] { needle }, StringSplitOptions.None).Length - 1;
}

- Dave

1

不错的解决方案 - 并且适用于字符串（不仅仅是字符）！ - ChriPf

谢谢，当我们转换语言时，很容易忘记字符串处理的一些细节 - 就像我们大多数人现在必须做的那样！ - Dave

2

-1的原因是：你知道Count()和Count或Length之间的区别吗？如果有人使用Count()而不是Count或Length，我会感到不安。Count()创建IEnumerator，然后遍历IEnumerable的所有出现，而Count或Length是对象的已设置属性，它们已经保存了您想要的计数，无需迭代所有元素。 - aeroson

不错的发现，奇怪的是，在我的库中，我使用了"Length"函数。已编辑！ - Dave

3是我想要的答案——像单词计数一样——但你当然也是对的。两个答案都不是“错误”的，但开发人员需要知道哪个对他们的情况正确。 - Dave

显示剩余2条评论

- Ben · Answer 2

private int CountWords(string text, string word) {
    int count = (text.Length - text.Replace(word, "").Length) / word.Length;
    return count;
}

因为原始解决方案对于字符是最快的，我认为对于字符串也是一样。因此，这是我的贡献。

为了背景：我在日志文件中寻找像“失败”和“成功”这样的单词。

问候， Ben

- preetham · Answer 3

string s = "65 fght 6565 4665 hjk";
int count = 0;
foreach (Match m in Regex.Matches(s, "65"))
  count++;

- bmiller · Answer 4

自从.NET 5（Net core 2.1+和NetStandard 2.1）推出后，我们有了一个新的迭代速度之王。

"Span"是一个新类型 https://learn.microsoft.com/en-us/dotnet/api/system.span-1?view=net-5.0，它可以帮助我们更快地处理数据。

同时，String也有一个内置成员，可以返回Span类型的值。

int count = 0;
foreach( var c in source.AsSpan())
{
    if (c == '/')
        count++;
}

我的测试结果显示使用foreach要比普通的for循环快62%，我还将其与在Span<T>[i]上使用for()循环进行了比较，并且与其他一些在这里发布的方法进行了比较。请注意，对一个String进行反向for()迭代现在似乎比直接使用foreach运行得更慢。

Starting test, 10000000 iterations
(base) foreach =   673 ms

fastest to slowest
foreach Span =   252 ms   62.6%
  Span [i--] =   282 ms   58.1%
  Span [i++] =   402 ms   40.3%
   for [i++] =   454 ms   32.5%
   for [i--] =   867 ms  -28.8%
     Replace =  1905 ms -183.1%
       Split =  2109 ms -213.4%
  Linq.Count =  3797 ms -464.2%

更新：2021年12月，Visual Studio 2022，.NET 5和6

.NET 5
Starting test, 100000000 iterations set
(base) foreach =  7658 ms
fastest to slowest
  foreach Span =   3710 ms     51.6%
    Span [i--] =   3745 ms     51.1%
    Span [i++] =   3932 ms     48.7%
     for [i++] =   4593 ms     40.0%
     for [i--] =   7042 ms      8.0%
(base) foreach =   7658 ms      0.0%
       Replace =  18641 ms   -143.4%
         Split =  21469 ms   -180.3%
          Linq =  39726 ms   -418.8%
Regex Compiled = 128422 ms -1,577.0%
         Regex = 179603 ms -2,245.3%
         
         
.NET 6
Starting test, 100000000 iterations set
(base) foreach =  7343 ms
fastest to slowest
  foreach Span =   2918 ms     60.3%
     for [i++] =   2945 ms     59.9%
    Span [i++] =   3105 ms     57.7%
    Span [i--] =   5076 ms     30.9%
(base) foreach =   7343 ms      0.0%
     for [i--] =   8645 ms    -17.7%
       Replace =  18307 ms   -149.3%
         Split =  21440 ms   -192.0%
          Linq =  39354 ms   -435.9%
Regex Compiled = 114178 ms -1,454.9%
         Regex = 186493 ms -2,439.7%

我添加了更多的循环并使用了RegEx，以便我们可以看到在许多迭代中使用它是多么灾难性的。

我认为for(++) 循环比较可能在 .NET 6 中进行了优化，内部使用Span - 因为它的速度几乎与 foreach span 相同。

代码链接

- WhoIsRich · Answer 5

如果有人需要一个可以直接使用的字符串扩展方法，以下是我使用的基于最佳答案的方法:

public static class StringExtension
{    
    /// <summary> Returns the number of occurences of a string within a string, optional comparison allows case and culture control. </summary>
    public static int Occurrences(this System.String input, string value, StringComparison stringComparisonType = StringComparison.Ordinal)
    {
        if (String.IsNullOrEmpty(value)) return 0;

        int count    = 0;
        int position = 0;

        while ((position = input.IndexOf(value, position, stringComparisonType)) != -1)
        {
            position += value.Length;
            count    += 1;
        }

        return count;
    }

    /// <summary> Returns the number of occurences of a single character within a string. </summary>
    public static int Occurrences(this System.String input, char value)
    {
        int count = 0;
        foreach (char c in input) if (c == value) count += 1;
        return count;
    }
}

- user460847 · Answer 6

public static int GetNumSubstringOccurrences(string text, string search)
{
    int num = 0;
    int pos = 0;

    if (!string.IsNullOrEmpty(text) && !string.IsNullOrEmpty(search))
    {
        while ((pos = text.IndexOf(search, pos)) > -1)
        {
            num ++;
            pos += search.Length;
        }
    }
    return num;
}

- Beroc · Answer 7

我认为最简单的方法是使用正则表达式。这样，您可以像使用myVar.Split('x')一样获得相同的拆分计数，但在多个字符设置中。

string myVar = "do this to count the number of words in my wording so that I can word it up!";
int count = Regex.Split(myVar, "word").Length;

- Timo · Answer 8

自 .NET 7 起，我们拥有无需分配内存（并且高度优化）的正则表达式 API。计数尤为简单和高效。

    var input = "abcd abcabc ababc";
    var result = Regex.Count(input: input, pattern: "abc"); // 4

在匹配动态模式时，记得对它们进行转义：

public static int CountOccurences(string input, string pattern)
{
    pattern = Regex.Escape(pattern); // Aww, no way to avoid heap allocations here

    var result = Regex.Count(input: input, pattern: pattern);
    return result;
}

而且，作为固定模式的额外奖励，.NET 7引入了分析器，帮助将正则表达式字符串转换为源代码生成的代码。这不仅避免了正则表达式的运行时编译开销，还提供非常易读的代码来展示它是如何实现的。事实上，该代码通常至少与任何您手动编写的替代方案一样高效。

如果您的正则表达式调用符合条件，分析器会给出提示。只需选择"转换为'GeneratedRegexAttribute'"并享受结果：

[GeneratedRegex("abc")]
private static partial Regex MyRegex(); // Go To Definition to see the generated code

- Nicholas R. Grant · Answer 9

我感觉我们缺少某些子字符串计数方式，比如不安全的逐字节比较。我结合了原始海报的方法和我能想到的任何方法。

以下是我制作的字符串扩展。

namespace Example
{
    using System;
    using System.Text;

    public static class StringExtensions
    {
        public static int CountSubstr(this string str, string substr)
        {
            return (str.Length - str.Replace(substr, "").Length) / substr.Length;
        }

        public static int CountSubstr(this string str, char substr)
        {
            return (str.Length - str.Replace(substr.ToString(), "").Length);
        }

        public static int CountSubstr2(this string str, string substr)
        {
            int substrlen = substr.Length;
            int lastIndex = str.IndexOf(substr, 0, StringComparison.Ordinal);
            int count = 0;
            while (lastIndex != -1)
            {
                ++count;
                lastIndex = str.IndexOf(substr, lastIndex + substrlen, StringComparison.Ordinal);
            }

            return count;
        }

        public static int CountSubstr2(this string str, char substr)
        {
            int lastIndex = str.IndexOf(substr, 0);
            int count = 0;
            while (lastIndex != -1)
            {
                ++count;
                lastIndex = str.IndexOf(substr, lastIndex + 1);
            }

            return count;
        }

        public static int CountChar(this string str, char substr)
        {
            int length = str.Length;
            int count = 0;
            for (int i = 0; i < length; ++i)
                if (str[i] == substr)
                    ++count;

            return count;
        }

        public static int CountChar2(this string str, char substr)
        {
            int count = 0;
            foreach (var c in str)
                if (c == substr)
                    ++count;

            return count;
        }

        public static unsafe int CountChar3(this string str, char substr)
        {
            int length = str.Length;
            int count = 0;
            fixed (char* chars = str)
            {
                for (int i = 0; i < length; ++i)
                    if (*(chars + i) == substr)
                        ++count;
            }

            return count;
        }

        public static unsafe int CountChar4(this string str, char substr)
        {
            int length = str.Length;
            int count = 0;
            fixed (char* chars = str)
            {
                for (int i = length - 1; i >= 0; --i)
                    if (*(chars + i) == substr)
                        ++count;
            }

            return count;
        }

        public static unsafe int CountSubstr3(this string str, string substr)
        {
            int length = str.Length;
            int substrlen = substr.Length;
            int count = 0;
            fixed (char* strc = str)
            {
                fixed (char* substrc = substr)
                {
                    int n = 0;

                    for (int i = 0; i < length; ++i)
                    {
                        if (*(strc + i) == *(substrc + n))
                        {
                            ++n;
                            if (n == substrlen)
                            {
                                ++count;
                                n = 0;
                            }
                        }
                        else
                            n = 0;
                    }
                }
            }

            return count;
        }

        public static int CountSubstr3(this string str, char substr)
        {
            return CountSubstr3(str, substr.ToString());
        }

        public static unsafe int CountSubstr4(this string str, string substr)
        {
            int length = str.Length;
            int substrLastIndex = substr.Length - 1;
            int count = 0;
            fixed (char* strc = str)
            {
                fixed (char* substrc = substr)
                {
                    int n = substrLastIndex;

                    for (int i = length - 1; i >= 0; --i)
                    {
                        if (*(strc + i) == *(substrc + n))
                        {
                            if (--n == -1)
                            {
                                ++count;
                                n = substrLastIndex;
                            }
                        }
                        else
                            n = substrLastIndex;
                    }
                }
            }

            return count;
        }

        public static int CountSubstr4(this string str, char substr)
        {
            return CountSubstr4(str, substr.ToString());
        }
    }
}

跟随测试代码...

static void Main()
{
    const char matchA = '_';
    const string matchB = "and";
    const string matchC = "muchlongerword";
    const string testStrA = "_and_d_e_banna_i_o___pfasd__and_d_e_banna_i_o___pfasd_";
    const string testStrB = "and sdf and ans andeians andano ip and and sdf and ans andeians andano ip and";
    const string testStrC =
        "muchlongerword amuchlongerworsdfmuchlongerwordsdf jmuchlongerworijv muchlongerword sdmuchlongerword dsmuchlongerword";
    const int testSize = 1000000;
    Console.WriteLine(testStrA.CountSubstr('_'));
    Console.WriteLine(testStrA.CountSubstr2('_'));
    Console.WriteLine(testStrA.CountSubstr3('_'));
    Console.WriteLine(testStrA.CountSubstr4('_'));
    Console.WriteLine(testStrA.CountChar('_'));
    Console.WriteLine(testStrA.CountChar2('_'));
    Console.WriteLine(testStrA.CountChar3('_'));
    Console.WriteLine(testStrA.CountChar4('_'));
    Console.WriteLine(testStrB.CountSubstr("and"));
    Console.WriteLine(testStrB.CountSubstr2("and"));
    Console.WriteLine(testStrB.CountSubstr3("and"));
    Console.WriteLine(testStrB.CountSubstr4("and"));
    Console.WriteLine(testStrC.CountSubstr("muchlongerword"));
    Console.WriteLine(testStrC.CountSubstr2("muchlongerword"));
    Console.WriteLine(testStrC.CountSubstr3("muchlongerword"));
    Console.WriteLine(testStrC.CountSubstr4("muchlongerword"));
    var timer = new Stopwatch();
    timer.Start();
    for (int i = 0; i < testSize; ++i)
        testStrA.CountSubstr(matchA);
    timer.Stop();
    Console.WriteLine("CS1 chr: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrB.CountSubstr(matchB);
    timer.Stop();
    Console.WriteLine("CS1 and: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrC.CountSubstr(matchC);
    timer.Stop();
    Console.WriteLine("CS1 mlw: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrA.CountSubstr2(matchA);
    timer.Stop();
    Console.WriteLine("CS2 chr: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrB.CountSubstr2(matchB);
    timer.Stop();
    Console.WriteLine("CS2 and: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrC.CountSubstr2(matchC);
    timer.Stop();
    Console.WriteLine("CS2 mlw: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrA.CountSubstr3(matchA);
    timer.Stop();
    Console.WriteLine("CS3 chr: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrB.CountSubstr3(matchB);
    timer.Stop();
    Console.WriteLine("CS3 and: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrC.CountSubstr3(matchC);
    timer.Stop();
    Console.WriteLine("CS3 mlw: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrA.CountSubstr4(matchA);
    timer.Stop();
    Console.WriteLine("CS4 chr: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrB.CountSubstr4(matchB);
    timer.Stop();
    Console.WriteLine("CS4 and: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrC.CountSubstr4(matchC);
    timer.Stop();
    Console.WriteLine("CS4 mlw: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrA.CountChar(matchA);
    timer.Stop();
    Console.WriteLine("CC1 chr: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrA.CountChar2(matchA);
    timer.Stop();
    Console.WriteLine("CC2 chr: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrA.CountChar3(matchA);
    timer.Stop();
    Console.WriteLine("CC3 chr: " + timer.Elapsed.TotalMilliseconds + "ms");

    timer.Restart();
    for (int i = 0; i < testSize; ++i)
        testStrA.CountChar4(matchA);
    timer.Stop();
    Console.WriteLine("CC4 chr: " + timer.Elapsed.TotalMilliseconds + "ms");
}

结果：CSX对应于CountSubstrX，CCX对应于CountCharX。"chr"搜索字符串中的"_"，"and"搜索字符串中的"and"，"mlw"搜索字符串中的"muchlongerword"。

CS1 chr: 824.123ms
CS1 and: 586.1893ms
CS1 mlw: 486.5414ms
CS2 chr: 127.8941ms
CS2 and: 806.3918ms
CS2 mlw: 497.318ms
CS3 chr: 201.8896ms
CS3 and: 124.0675ms
CS3 mlw: 212.8341ms
CS4 chr: 81.5183ms
CS4 and: 92.0615ms
CS4 mlw: 116.2197ms
CC1 chr: 66.4078ms
CC2 chr: 64.0161ms
CC3 chr: 65.9013ms
CC4 chr: 65.8206ms

最后，我有一个包含360万个字符的文件。里面是“derp adfderdserp dfaerpderp deasderp”的重复出现100,000次。我使用以上方法在文件中搜索了100遍“derp”，以下是搜索结果。

CS1Derp: 1501.3444ms
CS2Derp: 1585.797ms
CS3Derp: 376.0937ms
CS4Derp: 271.1663ms

所以，我的第四种方法绝对是最好的选择，但是实际上，如果一个360万字符的文件在最坏情况下只需要1586毫秒处理100次，那么所有这些方法可以说都不重要。

顺便说一句，我还使用了CountSubstr和CountChar方法，在这个360万字符的文件中查找'd'字符，并进行了100次扫描。结果如下...

CS1  d : 2606.9513ms
CS2  d : 339.7942ms
CS3  d : 960.281ms
CS4  d : 233.3442ms
CC1  d : 302.4122ms
CC2  d : 280.7719ms
CC3  d : 299.1125ms
CC4  d : 292.9365ms

根据此文，原帖的方法对于大量数据中的单个字符的搜索非常低效。请注意：所有值都已更新到发布版本的输出。我第一次发布时不小心忘记了在发布模式下构建。我的某些声明已经被修正。

- Adam Higgins · Answer 10

string search = "/string";
var occurrences = (regex.Match(search, @"\/")).Count;

这将计算程序准确找到“/s”的次数（区分大小写），并将出现次数存储在变量“occurrences”中。