检查一个数组是否存在于一组数组中。

Question

检查一个数组是否存在于一组数组中。

3

我有像这样的固定长度数组：

char x[10] = "abcdefghij";   // could also contain non-printable char

如何有效地测试一个长度固定的数组是否存在于一组10,000个相同长度的数组中？（注意：这些数组是二进制数据数组或以 null 结尾的字符串数组，这在开始时是固定的）

在 Python 中，我会使用一个 set/哈希表/字典而不是列表（因为查找速度非常快，O(1)）。

s = "abcdefghij"
S = {"foo1234567", "bar1234567", ..., "baz9876543"}
print(s in S)  # True or False

如何在C语言中实现同样的功能？（不是C++）

注意：链接的问题如何在C语言中检查字符串是否存在于字符串数组中？是关于一种朴素的方法，没有性能要求（他们循环遍历所有字符串并使用 strcmp）。这里不同之处在于有10k个数组，需要使用另一种方法来提高性能（也许是哈希表？）。

- Basj

1

在C语言中没有标准的哈希表/集合/字典或任何类似的替代品。你有以下选择：a）从头开始自己实现（哈希表、trie等）；b）使用字符串数组并手动检查针是否在其中；c）寻找一个提供高效“字符串集合”抽象的库。 - yeputons

1

请注意，char x[10] = "abcdefghij";缺少NUL终止符，因此它不是一个“字符串”。如果您没有限制数组大小，它将具有一个终止符 - 不要这样做。您无法在此数组上使用字符串处理函数。 - Weather Vane

@WeatherVane 是的，我知道，我提到字符串只是为了简化，但更一般地，它可以是固定长度的字符数组（不一定是可打印字符）。当涉及到字符串时，我当然会使用空终止符。 - Basj

3

对于10000个元素的二分查找，只需要大约13次数据比较，因此不会有很大的差异。 - Weather Vane

@WeatherVane 注意：数组中的一些字符可能是\0x00，这是二进制数据，因此不是空终止符。 - Basj

显示剩余12条评论

3个回答

1

< p > bsearch函数的签名如下：

   void *bsearch(const void *key, const void *base,
                 size_t nmemb, size_t size,
                 int (*compar)(const void *, const void *));

在这里，key 可以是 t1 或者 t2，base 是 T。 nmemb 和 size 分别为 9 和 4。

compar 是指向回调函数的指针，用于执行比较操作。这个函数可以简单地包装一下 memcmp：

int compare_char4(const void *p1, const void *p2)
{
    return memcmp(p1, p2, 4);
}

接下来您需要使用以下参数调用bsearch函数：

printf("%s\n", bsearch(t1, T, 9, 4, compare_char4) ? "found" : "not found");
printf("%s\n", bsearch(t2, T, 9, 4, compare_char4) ? "found" : "not found");

- dbush

谢谢@dbush！它运行得很好，比我的解决方案（试图重新发明轮子）简单多了！在性能方面，它似乎与我的解决方案相似。 - Basj

0

我认为以下二分查找方法是可行的（感谢@WeatherVane），如果数组排序，则复杂度为O(log n)。我们可以使用memcmp进行比较。

int binarysearch(unsigned char *T, unsigned char *t, int num, int size)  // https://en.wikipedia.org/wiki/Binary_search_algorithm#Algorithm
{
    int a = 0;
    int b = num-1;
    int m, res;
    while (a <= b) {
        m = (a + b) / 2;
        res = memcmp(&T[0] + m*size, &t[0], size);
        if (res < 0) {
            a = m + 1;
        } else if (res > 0) {
            b = m - 1;
        } else { 
            return 1;
        } 
    }
    return 0;
}
int main()
{
    unsigned char T[][4] = { 
        { 0x00, 0x6e, 0xab, 0xcd },
        { 0xea, 0x6e, 0xab, 0xcd },
        { 0xeb, 0x6e, 0xab, 0xcd },
        { 0xec, 0x6e, 0xab, 0xcd },
        { 0xed, 0x6e, 0xab, 0xcd },
        { 0xef, 0x6e, 0xab, 0xcd },
        { 0xfa, 0x6e, 0xab, 0xcd },
        { 0xfb, 0x6e, 0xab, 0xcd },
        { 0xff, 0x6e, 0xab, 0xcd }
    };
    unsigned char t1[4] = { 0x33, 0x6e, 0xab, 0xcd };
    unsigned char t2[4] = { 0xfb, 0x6e, 0xab, 0xcd };
    printf("%s\n", binarysearch(T[0], t1, 9, 4) ? "found" : "not found");
    printf("%s\n", binarysearch(T[0], t2, 9, 4) ? "found" : "not found");
}

结果：

未找到
已找到

- Basj

1

考虑使用标准库中的函数bsearch。 - tstanisl

我不知道它在stdlib中，你能否为了以后的参考而与@tstanisl一起发布一个答案？ - Basj

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Andreas Wenzel · Accepted Answer

关于在数组中存储空结尾字符串和二进制数据的说明

在您的问题中，您指定固定长度的数组可以表示：

空结尾字符串或
二进制数据。

如果您的程序无法知道数组内容代表这两种可能性中的哪一种，则如果数组内容应该表示空结尾字符串，则必须使用 null 字符填充数组的其余部分（例如使用函数 strncpy）。仅写一个空结尾字符是不够的。

这很重要，因为如果您不这样做，程序将无法知道在遇到值为 0 的字节后如何解释剩余的数据。它不会知道是否：

将此字节解释为字符串的空结尾字符并忽略字符串的所有剩余字节，还是
将此字节解释为二进制数据，并将剩余字节也视为二进制数据（即不忽略剩余字节）。

但是，如果您总是使用 null 字符填充终止空字符后的所有剩余字节，则您的程序将不必担心数组内容表示字符串还是二进制数据，因为它可以以相同的方式处理两者。

因此，在我的回答中，我将假设如果数组内容表示字符串，则在字符串结束后的所有剩余字节都将填充值为 0 的字节。这样，我可以假设不应忽略任何字节。

哈希表解决方案

在 Python 中，我会使用 set/hashtable/dict 而不是 list（因为查找速度非常快：O(1)）：

在 C 中，也可以使用哈希表，但您需要自己编写程序或使用已经存在的库。C 标准库不提供任何哈希函数。

以下代码将随机生成一个包含 10,000 个长度为 10 的固定长度（非空结尾）字符串的数组，其中包含字符 a 到 z、A 到 Z 和 0 到 9，但它还会将三个硬编码的单词放置在数组中的不同位置，以便稍后搜索这些单词，以测试搜索功能。然后，程序将把所有单词插入哈希表中，并执行几个硬编码的哈希表查找。

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <stdbool.h>

#define NUM_ARRAYS 10000
#define ARRAY_LENGTH 10
#define HASH_TABLE_SIZE 10000

struct hash_table_entry
{
    struct hash_table_entry *next;
    unsigned char data[ARRAY_LENGTH];
};

//function prototype declarations
void fill_with_random_data( unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] );
size_t hash_array( const unsigned char array[ARRAY_LENGTH] );
void build_hash_table_from_data( unsigned char data[][ARRAY_LENGTH], size_t data_length, struct hash_table_entry *hash_table[HASH_TABLE_SIZE] );
void cleanup_hash_table( struct hash_table_entry *hash_table[HASH_TABLE_SIZE] );
bool lookup( const unsigned char array[ARRAY_LENGTH], struct hash_table_entry *hash_table[HASH_TABLE_SIZE] );
bool verbose_lookup( const unsigned char array[ARRAY_LENGTH], struct hash_table_entry *hash_table[HASH_TABLE_SIZE] );

int main( void )
{
    //declare 2D array for the input data
    static unsigned char data[NUM_ARRAYS][ARRAY_LENGTH];

    //declare hash table and initialize all fields to zero
    static struct hash_table_entry *hash_table[HASH_TABLE_SIZE] = {NULL};

    //seed random number generator
    srand( (unsigned)time( NULL ) );

    //fill array "data" with random data
    printf( "Generating random data..." );
    fflush( stdout );
    fill_with_random_data( data );
    printf( "done\n" );
    fflush( stdout );

    //overwrite a few strings, so that we can search for
    //them later
    _Static_assert( NUM_ARRAYS > 500, "unexpected value" );
    _Static_assert( ARRAY_LENGTH == 10, "unexpected value" );
    memcpy( data[47], "abcdefghij", 10 );
    memcpy( data[218], "klmnopqrst", 10 );
    memcpy( data[419], "uvwxyz0123", 10 );

    //build hash table
    printf( "Building hash table..." );
    fflush( stdout );
    build_hash_table_from_data( data, NUM_ARRAYS, hash_table );
    printf( "done\n" );
    fflush( stdout );

    //perform lookups
    verbose_lookup( (unsigned char *)"uvwxyz0123", hash_table );
    verbose_lookup( (unsigned char *)"hsdkesldhg", hash_table );
    verbose_lookup( (unsigned char *)"abcdefghij", hash_table );
    verbose_lookup( (unsigned char *)"erlsodn3ls", hash_table );
    verbose_lookup( (unsigned char *)"klmnopqrst", hash_table );

    //cleanup
    printf( "Performing cleanup..." );
    fflush( stdout );
    cleanup_hash_table( hash_table );
    printf( "done\n" );
    fflush( stdout );
}

void fill_with_random_data( unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] )
{
    static const unsigned char random_chars[62] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789";

    for ( size_t i = 0; i < NUM_ARRAYS; i++ )
    {
        for ( size_t j = 0; j < ARRAY_LENGTH; j++ )
        {
            data[i][j] = random_chars[rand()%sizeof random_chars];
        }
    }
}

//This is a simple hash function. Depending on the type of data, a
//more complex hashing function may be required, in order to prevent
//hash collisions.
size_t hash_array( const unsigned char array[ARRAY_LENGTH] )
{
    size_t hash = 0;

    for ( size_t i = 0; i < ARRAY_LENGTH; i++ )
    {
        hash = hash * 7 + array[i];
    }

    return hash % HASH_TABLE_SIZE;
}

void build_hash_table_from_data( unsigned char data[][ARRAY_LENGTH], size_t data_length, struct hash_table_entry *hash_table[HASH_TABLE_SIZE] )
{
    for ( size_t i = 0; i < data_length; i++ )
    {
        struct hash_table_entry **pp, *p;

        //hash the array and store the result
        size_t hash = hash_array( data[i] );

        //perform hash table lookup
        pp = &hash_table[hash];

        //make pp point to the last null pointer in the list
        while ( ( p = *pp ) != NULL )
            pp = &p->next;

        //allocate memory for linked list node
        p = malloc( sizeof *p );
        if ( p == NULL )
        {
            fprintf( stderr, "Memory allocation failure!\n" );
            exit( EXIT_FAILURE );
        }

        //fill in data for new node
        p->next = NULL;
        memcpy( p->data, data[i], ARRAY_LENGTH );

        //insert new node into linked list
        *pp = p;
    }
}

void cleanup_hash_table( struct hash_table_entry *hash_table[HASH_TABLE_SIZE] )
{
    for ( size_t i = 0; i < HASH_TABLE_SIZE; i++ )
    {
        struct hash_table_entry *p;

        p = hash_table[i];

        while ( p != NULL )
        {
            struct hash_table_entry *q = p;

            p = p->next;

            free( q );
        }
    }
}

bool lookup( const unsigned char array[ARRAY_LENGTH], struct hash_table_entry *hash_table[HASH_TABLE_SIZE] )
{
    size_t hash;
    struct hash_table_entry *p;

    //find hash
    hash = hash_array( array );

    //index into hash table
    p = hash_table[hash];

    //attempt to find array in linked list
    while ( p != NULL )
    {
        if ( memcmp( array, p->data, ARRAY_LENGTH ) == 0 )
        {
            //array was found
            return true;
        }

        p = p->next;
    }

    //array was not found
    return false;
}

//This function serves as a wrapper function for the
//function "lookup". It prints to "stdout" what it is
//looking for and what the result of the lookup was.
bool verbose_lookup( const unsigned char array[ARRAY_LENGTH], struct hash_table_entry *hash_table[HASH_TABLE_SIZE] )
{
    //print message
    printf( "Searching for %.*s...", ARRAY_LENGTH, array );
    fflush( stdout );

    if ( lookup( array, hash_table ) )
    {
        printf( "found\n" );
        fflush( stdout );
        return true;
    }
    else
    {
        printf( "not found\n" );
        fflush( stdout );
        return false;
    }
}

这个程序的输出如下:

Generating random data...done
Building hash table...done
Searching for uvwxyz0123...found
Searching for hsdkesldhg...not found
Searching for abcdefghij...found
Searching for erlsodn3ls...not found
Searching for klmnopqrst...found
Performing cleanup...done

如您所见，它找到了所有明确插入的3个字符串，并且没有发现其他字符串。

在代码中，我使用unsigned char而不是char，因为在C中，char *通常用于传递以null结尾的字符串。因此，使用unsigned char *传递可以是二进制或不以null结尾的数据似乎更合适。 bsearch解决方案 为了比较，在这里提供了一种解决方案，该方案以相同的方式生成输入数组，但使用bsearch而不是哈希表来搜索：

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <stdbool.h>

#define NUM_ARRAYS 10000
#define ARRAY_LENGTH 10

//function prototype declarations
void fill_with_random_data( unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] );
int compare_func( const void *a, const void *b );
bool verbose_search( const unsigned char array[ARRAY_LENGTH], unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] );

int main( void )
{
    //declare 2D array for the input data
    static unsigned char data[NUM_ARRAYS][ARRAY_LENGTH];

    //seed random number generator
    srand( (unsigned)time( NULL ) );

    //fill array "data" with random data
    printf( "Generating random data..." );
    fflush( stdout );
    fill_with_random_data( data );
    printf( "done\n" );
    fflush( stdout );

    //overwrite a few strings, so that we can search for
    //them later
    _Static_assert( NUM_ARRAYS > 500, "unexpected value" );
    _Static_assert( ARRAY_LENGTH == 10, "unexpected value" );
    memcpy( data[47], "abcdefghij", 10 );
    memcpy( data[218], "klmnopqrst", 10 );
    memcpy( data[419], "uvwxyz0123", 10 );

    //sort the array
    printf( "Sorting array..." );
    fflush( stdout );
    qsort( data, NUM_ARRAYS, ARRAY_LENGTH, compare_func );
    printf( "done\n" );
    fflush( stdout );

    //perform lookups
    verbose_search( (unsigned char *)"uvwxyz0123", data );
    verbose_search( (unsigned char *)"hsdkesldhg", data );
    verbose_search( (unsigned char *)"abcdefghij", data );
    verbose_search( (unsigned char *)"erlsodn3ls", data );
    verbose_search( (unsigned char *)"klmnopqrst", data );
}

void fill_with_random_data( unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] )
{
    static const unsigned char random_chars[62] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789";

    for ( size_t i = 0; i < NUM_ARRAYS; i++ )
    {
        for ( size_t j = 0; j < ARRAY_LENGTH; j++ )
        {
            data[i][j] = random_chars[rand()%sizeof random_chars];
        }
    }
}

int compare_func( const void *a, const void *b )
{
    return memcmp( a, b, ARRAY_LENGTH );
}

//This function serves as a wrapper function for the
//function "bsearch". It prints to "stdout" what it is
//looking for and what the result of the search was.
bool verbose_search( const unsigned char array[ARRAY_LENGTH], unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] )
{
    //print message
    printf( "Searching for %.*s...", ARRAY_LENGTH, array );
    fflush( stdout );

    if ( bsearch( array, data, NUM_ARRAYS, ARRAY_LENGTH, compare_func ) != NULL )
    {
        printf( "found\n" );
        fflush( stdout );
        return true;
    }
    else
    {
        printf( "not found\n" );
        fflush( stdout );
        return false;
    }
}

该程序的输出如下：

Generating random data...done
Sorting array...done
Searching for uvwxyz0123...found
Searching for hsdkesldhg...not found
Searching for abcdefghij...found
Searching for erlsodn3ls...not found
Searching for klmnopqrst...found

哈希表方案可能比二分查找方案更快，前提是为输入选择了一个良好的哈希函数，以使哈希冲突的数量最小化。