如何快速确定一个数字是否出现在字符串中？

Question

如何快速确定一个数字是否出现在字符串中？

5

这个简单的解决方案很快就浮现在我的脑海中。

#include <ctype.h>

int digit_exists_in
(
    const char *s
)
{
    while (*s)
    {
        if (isdigit(*s))
        {
            return 1;
        }
        else
        {
            s++;
        }
    }

    return 0;
}

int main(void)
{
    int foundDigit = digit_exists_in("abcdefg9ijklmn");

    return 0;
}

还有什么其他技术可以应用来提高速度？

实际被搜索的字符串是可变长度的，字符本身是ASCII码，而不是完整的字符集。这些字符串以NUL结尾。

- EvilTeach

16个回答

13

我会首先使用适当的库函数strcspn，假设该库已经被极端优化：

#include <string.h>
#include <stdio.h>

int digit_exists_in(const char *s)
{
    return s[strcspn(s, "0123456789")] != '\0';
}

int main(void)
{
    printf("%d\n", digit_exists_in("foobar"));
    printf("%d\n", digit_exists_in("foobar1"));
    return 0;
}

如果库还没有被充分优化，将优化放入库中让每个人受益是一个好主意。（你有源代码，对吧？）

- user25148

有趣。我拒绝了strspn，因为文档似乎表明，如果第一个字符不在搜索集中，则该函数将返回0。 - EvilTeach

strspn返回一个整数值，该值指定字符串中仅由strCharSet中的字符组成的子字符串的长度。如果字符串以不在strCharSet中的字符开头，则函数返回0。没有保留返回值来指示错误。 - EvilTeach

我将重新审视strspn和strcspn。+1 - EvilTeach

10

没有更快的算法，但你可以每条指令查看一个完整寄存器的字节或使用SIMD操作来加快速度。你可以使用掩码和零测试来查看在范围内是否有任何数字，或者如果你的SIMD操作在足够大的向量上快速，你可以比进行字符比较更快地迭代测试特定数字值的向量字节。

因此，例如，你可以这样做：

byte buffer[8] = { 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30 };
uint64 *mask = (uint64 *) buffer; //this is just for clarity

if (*((uint64 *) s) & *mask) == 0)
    //You now don't need to do the < '0' test for the next 8 bytes in s

一些优化器可能已经能够从您上面的代码示例中为您完成此操作。但是，只有比较大量的字节才需要考虑在这个级别进行优化。

- Christopher Smith

这会告诉你是否没有字符 0...，P...，p... 然后你会做什么？ - Mike Dunlavey

1

肯定会告诉您当前8个数字中没有数字，因此您会跳到下一个8个数字。 - justinhj

所有数字左半字节都有30。如果你和它们进行与运算，然后将其与掩码进行比较，如果它们相等，则为数字。 - EvilTeach

@Evil：明白，但这段代码并不是这样做的。它只告诉你字符串是否像“AbCdEfGh”，即只有字母和在P或p之前的字母。或者我漏掉了什么？ - Mike Dunlavey

如果您可以一次处理16个字节，那么最好使用特定的掩码来检测每个数字，并进行10次操作，这比逐个比较每个字节要快。您还可以使用第二个掩码来设置范围上限，这样在查找时就已经有很高的概率找到一个数字了。 - Christopher Smith

显示剩余5条评论

7

任何算法都是O(N)的。

我认为isdigit已经非常高效了。

- Brian R. Bondy

1

至少 O(N) :) 实现起来很容易出现悲观情况和“智能”算法 :) - Mihai Limbășan

好的观点，你可能有一些效率不高的算法，但它们可能会更难以实现。 - Brian R. Bondy

为了利用短路，你能不能写成"!(char < '0') && !(char > '9')"？ - Kibbee

isdigit()通常是一个表查找，即比两个比较更快。而且还是区域设置安全的。;-) - DevSolar

isdigit通常是一种位掩码操作 - 比两部分测试更快。 - Jonathan Leffler

在这种形式中+1，通常很难超越标准库的效率。 - DevSolar

2

只是为了好玩，也许是类似于这样的东西：

 // accumulator
unsigned char thereIsADigit = 0;
// lookup table
unsigned char IsDigit[256] = {0,0,0 ..., 1,1,1,1,1,1,1,1,1,0,0,0 ...};

// an unrolled loop, something like:
thereIsADigit |= IsDigit[s[0]];
thereIsADigit |= IsDigit[s[1]];
thereIsADigit |= IsDigit[s[2]];
thereIsADigit |= IsDigit[s[3]];
thereIsADigit |= IsDigit[s[4]];
thereIsADigit |= IsDigit[s[5]];
thereIsADigit |= IsDigit[s[6]];
thereIsADigit |= IsDigit[s[7]];
if (thereIsADigit) break;
s += 8;

在 IBM 360 上，有一个 "translate" 指令可以一步完成此操作。 好的，好的， 克里斯托弗·史密斯的回答让我想到了。假设您只使用 7 位 ASCII 码。以下是使用宽整数算术进行 SIMD 的方法。

假设 C 是包含 4 个字符的 32 位字。

 // compare by subtracting in 8-bit 2s complement arithmetic
( (C + ((0x3a3a3a3a ^ 0x7f7f7f7f) + 0x01010101)) // set high bit for any char <= '9'
  &
  (0x2f2f2f2f + ((C ^ 0x7f7f7f7f) + 0x01010101)) // set high bit for any char >= '0'
) // high bit is set for any char <= '9' and >= '0'
& 0x80808080 // look only at the high-order bits
// if any of these 4 bits is 1, there is a digit in C
// if the result is zero, there are no digits in C

这取决于每个字符的高阶位最初为零，以便进位不会传播到该位。（我相信这可以简化。）

- Mike Dunlavey

是的，输入集是7位ASCII码。 - EvilTeach

2

使用测试程序并在其中使用性能分析工具，得出以下结果。

      Count      %   % with           Time   Statement
                      child
-----------------------------------------------------------------------------------------------
                                             int test_isdigit(char *s)   
     20,000    0.0    0.0          2,160.4   {   
199,990,000   13.2    9.5     14,278,460.7       while (*s)  
                                                 {   
199,980,000   69.8   49.9     75,243,524.7            if (isdigit(*s)) return 1;  
199,970,000   17.0   12.1     18,312,971.5            s++;    
                                                 }   

     10,000    0.0    0.0          1,151.4       return 0;   
                                             }   

                                             int test_range(char *s)     
     20,000    0.0    0.0          1,734.2   {   
199,990,000   33.6    9.4     14,114,309.7       while (*s)  
                                                 {
199,980,000   32.2    9.0     13,534,938.6           if ((*s >= '0') && 
                                                        (*s <= '9')) return 1;   
199,970,000   34.2    9.5     14,367,161.9           s++;    
                                                 }   

     10,000    0.0    0.0          1,122.2       return 0;   
                                             }   

                                             int test_strcspn(char *s)   
     20,000    0.2    0.0          1,816.6   {   
     20,000   99.8    0.6        863,113.2       return s[strcspn(s, "0123456789")] 
                                                          == '0'; 
                                             }

strcspn可以胜任这个工作。查看它的汇编代码，我发现它构建了一个大小为256的位图，根据搜索字符设置位，然后处理字符串。

每次调用时，位图都会在堆栈上构建一次。

另一种方法是构建并保留位图，并重复使用它。

另一种方法是使用Chris Smith所讲述的技术并行执行操作。

目前来看，strcspn足够了。

- EvilTeach

2

从概念上讲，没有更快的方法。假设您有一个字符串，其中数字的位置似乎是随机的。这迫使您搜索字符串中的每个项目以查找数字，因此从前到后与任何其他搜索机制一样容易首先找到该数字。

- JaredPar

是的，数字位置没有模式可言，因此无法进行有根据或基于缓存的猜测。 - EvilTeach

2

你可以采用多线程的方式，但对于已经非常快速的算法来说，这可能会增加过多的复杂性。

- Kibbee

好的观点，但对于除了最长的字符串之外的所有字符串来说，由于线程开销，这将是一种负优化。 - Iraimbilanja

由于可怕的缓存效应，线程也可能会对自己造成伤害。对于这样的东西，希望真正的限制因素是内存延迟/带宽，使用多个核心可能会使情况变得更糟。 - Christopher Smith

如果字符串足够大，听起来像是一个赢家。 - Milhous

2

真正的问题在于，这个函数优化有多重要？我认为，应该保留简单的解决方案并对其进行性能分析。只有在它成为代码瓶颈时才需要进行优化。

- Patrick McDonald

1

这里有一个版本，可能会更快，但它可以处理空指针...

int digit_exists_in(const char *s)
{
    if (!s)
        return (0);
    while (*s)
        if (isdigit(*s++))
            return (1);
    return (0);
}

- dwc

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- NickZoic · Accepted Answer

liw.fi是对的，顺便说一句。我有点惊讶，因为strcspn要解决比isdigit()更一般的问题，但似乎是这样的：

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <assert.h>

#define NTESTS 10000
#define TESTSIZE 10000

char stest1[TESTSIZE];
char stest2[TESTSIZE];

int test_isdigit(char *s) {
    while (*s) {
        if (isdigit(*s)) return 1;
        s++;
    }
    return 0;
}

int test_range(char *s) {
    while (*s) {
        if ((*s >= '0') && (*s <= '9')) return 1;
        s++;
    }
    return 0;
}

int test_strcspn(char *s) {
    return s[strcspn(s, "0123456789")] != '\0';
}

int main(int argc, char **argv) {
    long int i;
    for (i=0; i<TESTSIZE; i++) {
        stest1[i] = stest2[i] = 'A' + i % 26;
    }
    stest2[TESTSIZE-1] = '5';

    int alg = atoi(argv[1]);

    switch (alg) {
        case 0:        
            printf("Testing strcspn\n");
            for (i=0; i<NTESTS; i++) {
                assert(test_strcspn(stest1) == 0);
                assert(test_strcspn(stest2) != 0);
            }
            break;
        case 1:
            printf("Testing isdigit() loop\n");
            for (i=0; i<NTESTS; i++) {
                assert(test_isdigit(stest1) == 0);
                assert(test_isdigit(stest2) != 0);
            }
            break;
        case 2:
            printf("Testing <= => loop\n");
            for (i=0; i<NTESTS; i++) {
                assert(test_range(stest1) == 0);
                assert(test_range(stest2) != 0);
            }
            break;
        default:
            printf("eh?\n");
            exit(1);
    }    

    return 0;
}

想要在标准库所擅长的领域中超越它们实在是非常困难...(通常的附带条件适用——个人体验可能有所不同)

$ gcc -O6 -Wall -o strcspn strcspn.c 

$ time ./strcspn 0
Testing strcspn

real    0m0.085s
user    0m0.090s
sys 0m0.000s

$ time ./strcspn 1
Testing isdigit() loop

real    0m0.753s
user    0m0.750s
sys 0m0.000s

$ time ./strcspn 2
Testing <= => loop

real    0m0.247s
user    0m0.250s
sys 0m0.000s

更新：只是为了好玩，我根据Mike Dunlavey的答案添加了一个基于位图查找的版本：

char bitmap[256] = {
        /* 0x00 */ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* 0x10 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* 0x20 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* 0x30 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
};

int test_bitmap(char *s) {
    while (!bitmap[*(unsigned char *)s]) s++;
    return (*s);
}

这个方法略微优于其它方法（大约0.170秒），但仍然无法与strcspn相比。