What is the best algorithm for this array comparison problem?

Question

14

What is the most speed-efficient algorithm for solving the following problem?

Given 6 arrays, D1, D2, D3, D4, D5 and D6, each containing 6 numbers, like this:

D1[0] = number              D2[0] = number      ......       D6[0] = number
D1[1] = another number      D2[1] = another number           ....
.....                       ....                ......       ....
D1[5] = yet another number  ....                ......       ....

Given a second array ST1, containing 1 number:

ST1[0] = 6

Given a third array ans, containing 6 numbers:

ans[0] = 3, ans[1] = 4, ans[2] = 5, ......ans[5] = 8

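Take the number stored in ST1[0], minus one, as the last index to use in the arrays D1, D2, D3, D4, D5 and D6 (6 in this example, so compare from 0 to 6-1), and compare the ans array against each D array. The result should be 0 if the ans number at some index is not found in any D array, and 1 if every ans number is found in some D array at the same index. That is, return 0 if some ans[i] does not equal any DN[i], and return 1 if every ans[i] equals some DN[i].

My algorithm so far is below.
I tried to keep everything unlooped as much as possible.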

EML  := ST1[0]   //number contained in ST1[0]   
ELM1 := 0        //start index for the arrays D 

While ELM1 < EML
   if D1[ELM1] = ans[0] 
     goto two
   if D2[ELM1] = ans[0] 
     goto two
   if D3[ELM1] = ans[0] 
     goto two
   if D4[ELM1] = ans[0] 
     goto two
   if D5[ELM1] = ans[0] 
     goto two
   if D6[ELM1] = ans[0] 
     goto two

   ELM1 = ELM1 + 1

return 0     // ans[0] was not found in any of D1[0..5], D2[0..5], ..., D6[0..5]: return 0, which rejects the ans[0..5] numbers


two:

ELM1 := 0      // start index for the D arrays
While ELM1 < EML
   if D1[ELM1] = ans[1] 
     goto three
   if D2[ELM1] = ans[1] 
     goto three
   if D3[ELM1] = ans[1] 
     goto three
   if D4[ELM1] = ans[1] 
     goto three
   if D5[ELM1] = ans[1] 
     goto three
   if D6[ELM1] = ans[1] 
     goto three
   ELM1 = ELM1 + 1

return 0    // ans[1] was not found in any of D1[0..5], D2[0..5], ..., D6[0..5]: return 0, which rejects the ans[0..5] numbers

three:

ELM1 := 0      // start index for the D arrays

While ELM1 < EML
   if D1[ELM1] = ans[2] 
     goto four
   if D2[ELM1] = ans[2] 
     goto four
   if D3[ELM1] = ans[2] 
     goto four
   if D4[ELM1] = ans[2] 
     goto four
   if D5[ELM1] = ans[2] 
     goto four
   if D6[ELM1] = ans[2] 
     goto four
   ELM1 = ELM1 + 1

return 0   // ans[2] was not found in any of D1[0..5], D2[0..5], ..., D6[0..5]: return 0, which rejects the ans[0..5] numbers

four:

ELM1 := 0      // start index for the D arrays

While ELM1 < EML
   if D1[ELM1] = ans[3] 
     goto five
   if D2[ELM1] = ans[3] 
     goto five
   if D3[ELM1] = ans[3] 
     goto five
   if D4[ELM1] = ans[3] 
     goto five
   if D5[ELM1] = ans[3] 
     goto five
   if D6[ELM1] = ans[3] 
     goto five
   ELM1 = ELM1 + 1

return 0 // ans[3] was not found in any of D1[0..5], D2[0..5], ..., D6[0..5]: return 0, which rejects the ans[0..5] numbers


five:

ELM1 := 0      // start index for the D arrays

While ELM1 < EML
   if D1[ELM1] = ans[4] 
     goto six
   if D2[ELM1] = ans[4] 
     goto six
   if D3[ELM1] = ans[4] 
     goto six
   if D4[ELM1] = ans[4] 
     goto six
   if D5[ELM1] = ans[4] 
     goto six
   if D6[ELM1] = ans[4] 
     goto six
   ELM1 = ELM1 + 1

return 0  // ans[4] was not found in any of D1[0..5], D2[0..5], ..., D6[0..5]: return 0, which rejects the ans[0..5] numbers

six:

ELM1 := 0      // start index for the D arrays

While ELM1 < EML
   if D1[ELM1] = ans[5] 
     return 1            // ans[5] was found in one of D1..D6: return 1, which accepts the ans[0..5] numbers
   if D2[ELM1] = ans[5] 
     return 1
   if D3[ELM1] = ans[5] 
     return 1
   if D4[ELM1] = ans[5] 
     return 1
   if D5[ELM1] = ans[5] 
     return 1
   if D6[ELM1] = ans[5] 
     return 1
   ELM1 = ELM1 + 1

return 0

The language of choice would be plain C.

- Mark

13

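I think your programming skills are quite basic. Most likely, what you are trying to do can be done much more easily. Please write more about what you want this code to do (what the arrays represent and what information you want to extract from them); that will clarify the question and get you more answers. - schnaader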

21

Oh, come on, people. As a first-time user, he is clearly trying hard to format and word his question as well as he can. +1 - Lieven Keersmaekers

4

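Agree with Lieven... Even with a beginner, we don't want anyone to feel uncomfortable about asking a question, especially a legitimate educational/learning one. Otherwise, how is anyone supposed to become a stronger developer without contact with real-world developers? - DRapp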

9

@mark: I would like to apologize for the way the Stackoverflow police department has treated you. - Tom

2

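Can you explain what you want? In the algorithm you give, I think only the first two loops will ever run, because in every loop either the loop ends and the code returns, or the loop hits "goto two" and jumps to the second loop. Also, when you say "compare each res array with each D array", what should the program do with those comparisons? Do you want to print a series of strings such as "greater" or "less than", exit when equal numbers are met, or something else? - Noah Lavine

4 Answers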

1

I'm a little confused by your question, but I think I understand it well enough to help you get started.

#define ROW 6
#define COL 6

int D[ROW][COL]; // This is all of your D arrays in one 2 dimensional array.

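Next, you should probably use nested for loops, one loop per dimension of D. Remember that the order of the indices matters. The easiest way to keep it straight in C is to remember that D[i] is a valid expression even though D has more than one dimension (it evaluates to a pointer to a row, that is, a sub-array).

If you cannot change the separate D arrays into one multidimensional array, you can easily build an array of pointers whose members point to the head of each of those arrays and achieve the same effect.

You can then use the break statement to leave the inner loop once you have determined that the current D[i] does not match ans.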
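
As a rough illustration of the suggestion above, here is a minimal C sketch that assumes the six separate arrays stay as they are and are reached through an array of pointers. The names `all_found`, `NUM_ARRAYS` and `ANS_LEN` are made up for this example. Like the pseudocode in the question, it accepts a match at any index below the limit.

```c
#include <assert.h>

#define NUM_ARRAYS 6   /* hypothetical name: number of D arrays */
#define ANS_LEN    6   /* hypothetical name: number of values in ans */

/* Returns 1 if every ans[a] occurs somewhere in d[0..5][0..limit-1], else 0. */
int all_found(const int *const d[NUM_ARRAYS], const int ans[ANS_LEN], int limit)
{
    assert(limit >= 0 && limit <= 6);
    for (int a = 0; a < ANS_LEN; ++a) {
        int found = 0;
        for (int p = 0; p < NUM_ARRAYS && !found; ++p) {
            for (int e = 0; e < limit; ++e) {
                if (d[p][e] == ans[a]) {
                    found = 1;
                    break;      /* leave the inner loop on the first match */
                }
            }
        }
        if (!found)
            return 0;           /* ans[a] is in none of the arrays */
    }
    return 1;
}
```

A caller would build the pointer array once, for example `const int *const d[6] = { D1, D2, D3, D4, D5, D6 };`, and then call `all_found(d, ans, ST1[0])`.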

- nategoose

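I don't want to use a 2D array; I need 6 different arrays, with as little looping as possible. - Mark

What I'm interested in is speeding up the code I expressed in algorithm form, and it has to use one-dimensional arrays. - Mark

If you use a compiler that can do loop unrolling and turn that optimization on, it will most likely produce something similar to what you are attempting with the goto statements, without upsetting your instructor. - nategoose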

1

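@mark: besides upsetting the instructor, goto has another effect: it can flush the processor pipeline (this is true of any jump, so in this respect loops are no better than goto). But if you can express the program without any goto/branch/jump, it may run faster even if it executes more instructions than strictly necessary. I believe there is a workable path along those lines (will give it a try). - kriss

@mark: just because you declare a variable does not mean the compiler has to use memory for it. Modern compilers are very smart. Goto is a problem because labels and gotos often make code hard to read. If readers have to keep hunting for labels or gotos to understand the code, the code is bad. I'm not entirely against them, but I'm not the one you need to impress. I'm fairly sure that if you write this in C with loops but compile with high optimization, you will end up with good code. If this is a function, you may want to use the restrict keyword. - nategoose
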
0

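If the range of the numbers is limited, it is probably easier to make a bit array, like this:

#include <assert.h>
#include <stdint.h>

int IsPresent(int arrays[][6], int ans[6], int ST1)
{
    uint32_t bit_mask = 0;
    for(int i = 0; i < 6; ++ i) {
        for(int j = 0; j < ST1; ++ j) {
            assert(arrays[i][j] >= 0 && arrays[i][j] < 32); // range is limited
            bit_mask |= (uint32_t)1 << arrays[i][j]; // unsigned shift: 1 << 31 overflows a signed int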
        }
    }
    // make a "list" of numbers that we have

    for(int i = 0; i < 6; ++ i) {
        if(((bit_mask >> ans[i]) & 1) == 0)
            return 0; // in ans, there is a number that is not present in arrays
    }
    return 1; // all of the numbers were found
}

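This always runs in O(6 * ST1 + 6). Its disadvantage is that it first passes over up to 36 array elements and only then checks the six values. If there is a strong precondition that the numbers will mostly be present, the test can be reversed to provide an early exit: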

int IsPresent(int arrays[][6], int ans[6], int ST1)
{
    uint32_t bit_mask = 0;
    for(int i = 0; i < 6; ++ i) {
        assert(ans[i] >= 0 && ans[i] < 32); // range is limited
        bit_mask |= (uint32_t)1 << ans[i];
    }
    // make a "list" of numbers that we need to find

    for(int i = 0; i < 6; ++ i) {
        for(int j = 0; j < ST1; ++ j)
            bit_mask &= ~((uint32_t)1 << arrays[i][j]); // clear bits of the mask

        if(!bit_mask) // check if we have them all
            return 1; // all of the numbers were found
    }

    assert(bit_mask != 0);
    return 0; // there are some numbers remaining yet to be found
}

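This runs in at most O(6 * ST1 + 6). In the best case, when the first number of the first array covers all of ans (ans being the same number six times), only about O(6 + 1) work is needed, provided the zero test runs after each element; with the test after each array, as written, the first array is always scanned in full. Note that the test for the bit mask being zero can be placed after each array (as it is now) or after each element (that way involves more checks but cuts off earlier once all the numbers are found). In the context of CUDA, the first version of the algorithm would likely be faster, as it involves fewer branches and most of the loops (all except the one over ST1) can be unrolled automatically.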
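
To make the "after each element" placement concrete, here is a small C sketch of that variant. It assumes the same 0..31 value range as the code above; the name `is_present_early` is invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 as soon as every value of ans has been seen in
 * arrays[0..5][0..limit-1], and 0 if some value never shows up. */
int is_present_early(int arrays[][6], const int ans[6], int limit)
{
    uint32_t missing = 0;   /* one bit per value that is still to be found */

    for (int i = 0; i < 6; ++i) {
        assert(ans[i] >= 0 && ans[i] < 32);   /* assumed value range */
        missing |= (uint32_t)1 << ans[i];
    }

    for (int i = 0; i < 6; ++i) {
        for (int j = 0; j < limit; ++j) {
            assert(arrays[i][j] >= 0 && arrays[i][j] < 32);
            missing &= ~((uint32_t)1 << arrays[i][j]);
            if (missing == 0)
                return 1;   /* tested after every element, not every array */
        }
    }
    return 0;
}
```

The extra test per element is the cost mentioned above; whether it pays off depends on how early the values of ans tend to appear.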

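However, if the range of the numbers is unlimited, we can do something else. Since there are at most 7 * 6 = 42 different numbers in ans and all the arrays together, it is possible to map them to 42 distinct indices and use a 64-bit integer as the bit mask. Arguably, though, building that mapping already does enough work to answer the question, so the bit mask test could be skipped altogether.

Another way is to sort the arrays and simply count the coverage of the individual numbers: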
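
One way the mapping idea could look in C is sketched below. The helper `dense_index` and the function `is_present_mapped` are hypothetical names, and the mapping is done by plain linear search over a 42-entry table, which is only reasonable because the table is so small.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the dense index of v in table[0..*count-1], appending v if new. */
static int dense_index(int table[], int *count, int v)
{
    for (int k = 0; k < *count; ++k)
        if (table[k] == v)
            return k;
    table[*count] = v;
    return (*count)++;
}

/* Works for any int values: each distinct value gets its own bit 0..41. */
int is_present_mapped(int arrays[][6], const int ans[6], int limit)
{
    int table[42];          /* at most 6 values from ans + 36 from arrays */
    int count = 0;
    uint64_t wanted = 0, have = 0;

    assert(limit >= 0 && limit <= 6);
    for (int i = 0; i < 6; ++i)
        wanted |= (uint64_t)1 << dense_index(table, &count, ans[i]);
    for (int i = 0; i < 6; ++i)
        for (int j = 0; j < limit; ++j)
            have |= (uint64_t)1 << dense_index(table, &count, arrays[i][j]);

    return (wanted & ~have) == 0;   /* 1 if no wanted bit is missing */
}
```

The linear search inside `dense_index` already compares every array value against the values of ans, which is the point made above: the mapping does most of the work by itself.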

#include <algorithm> // std::sort, std::unique, std::lower_bound
#include <cstring>   // memcpy

int IsPresent(int arrays[][6], int ans[6], int ST1)
{
    int all_numbers[36], n = ST1 * 6;
    for(int i = 0; i < 6; ++ i)
        memcpy(&all_numbers[i * ST1], &arrays[i], ST1 * sizeof(int));
    // copy all of the numbers into a contiguous array

    std::sort(all_numbers, all_numbers + n);
    // or use "C" standard library qsort() or a bitonic sorting network on GPU
    // alternatively, sort each array of 6 separately and then merge the sorted
    // arrays (can also be done in parallel, to some level)

    n = std::unique(all_numbers, all_numbers + n) - all_numbers;
    // this way, we can also remove duplicate numbers, if they are
    // expected to occur frequently, and make the next test faster.
    // std::unique() compacts the unique elements to the front of the
    // range and returns an iterator (a pointer in this case) to one past
    // the last unique element - that gives us the number of
    // unique items.

    for(int i = 0; i < 6; ++ i) {
        int *p = std::lower_bound(all_numbers, all_numbers + n, ans[i]);
        // use binary search to find the number in question
        // or use "C" standard library bsearch()
        // or implement binary search yourself on GPU

        if(p == all_numbers + n)
            return 0; // not found
        // alternately, make all_numbers array of 37 and write
        // all_numbers[n] = -1; before this loop. that will act
        // as a sentinel and will save this one comparison (assuming
        // that there is a value that is guaranteed not to occur in ans)

        if(*p != ans[i])
            return 0; // another number found, not ans[i]
        // std::lower_bound looks for the given number, or for one that
        // is greater than it, so if the number was to be inserted there
        // (before the bigger one), the sequence would remain ordered.
    }

    return 1; // all the numbers were found
}

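This runs in O(n) for the copying, O(n log n) for the sorting (at most 36 log 36), optionally O(n) for unique (where n is 6 * ST1), and O(6 log n) for the six binary searches (n can be smaller than 6 * ST1 if unique is used). The whole algorithm therefore runs in linearithmic time. Note that it involves no dynamic memory allocation, so it is suitable even for GPU platforms (one would have to implement the sort and port std::unique() and std::lower_bound(), but all of those are fairly simple functions).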
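
For readers who want to see how small those two ports are, here is a plain C sketch for int arrays. The names `unique_sorted`, `lower_bound_int` and `contains_sorted` are invented here; they are not library functions, and the sort itself is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Removes adjacent duplicates from sorted a[0..n-1]; returns the new length. */
size_t unique_sorted(int a[], size_t n)
{
    if (n == 0)
        return 0;
    size_t out = 1;
    for (size_t i = 1; i < n; ++i)
        if (a[i] != a[out - 1])
            a[out++] = a[i];
    return out;
}

/* Index of the first element of sorted a[0..n-1] that is >= key, or n. */
size_t lower_bound_int(const int a[], size_t n, int key)
{
    assert(a != NULL || n == 0);
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Membership test built on lower_bound_int, mirroring the two checks above. */
int contains_sorted(const int a[], size_t n, int key)
{
    size_t pos = lower_bound_int(a, n, key);
    return pos < n && a[pos] == key;
}
```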

- the swine

0

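With only 36 values to compare, the most efficient approach is not to use CUDA at all.

Just use a CPU loop.

If you change your inputs, I will change my answer.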

- Danny Varod

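No, that was just an example; there are many more to compare. - Mark

Do you want a single boolean answer, or an array of answers, one per element? - Danny Varod

I gave up on that project and am now working on another one. http://stackoverflow.com/questions/3017591/how-can-i-inprove-this-function-under-cuda - Mark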

- kriss · Accepted Answer

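I did a direct, trivial C implementation of the algorithm provided by the original poster. The code is here.

As others suggested, the first thing to do is to roll up the code. Unrolling does not really help speed, because it leads to code cache misses. I began by rolling up the inner loops and got this. Then I rolled up the outer loop and removed the now useless gotos, which gave the code below.

EDIT: I changed the C code several times because, simple as it is, it seems to cause problems when JIT-compiled or executed with CUDA (and CUDA does not seem to be very verbose about errors). That is why the snippet below uses globals... and this is just a trivial implementation. We are not going for speed yet. It says a lot about premature optimization: why bother making it fast if we cannot even make it work? If I believe the Wikipedia article, CUDA seems to impose many restrictions on the code you can get to work. Also, maybe we should use float instead of int?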

#include <stdio.h>

int D1[6] = {3, 4, 5, 6, 7, 8};
int D2[6] = {3, 4, 5, 6, 7, 8};
int D3[6] = {3, 4, 5, 6, 7, 8};
int D4[6] = {3, 4, 5, 6, 7, 8};
int D5[6] = {3, 4, 5, 6, 7, 8};
int D6[6] = {3, 4, 5, 6, 7, 9};
int ST1[1] = {6};
int ans[6] = {1, 4, 5, 6, 7, 9};
int * D[6] = { D1, D2, D3, D4, D5, D6 };

/* beware D is passed through globals */
int algo(int * ans, int ELM){
    int a, e, p;

    for (a = 0 ; a < 6 ; a++){ 
        for (e = 0 ; e < ELM ; e++){
            for (p = 0 ; p < 6 ; p++){
                if (D[p][e] == ans[a]){
                    goto cont;
                }
            }
        }
        return 0; //bad row of numbers found
    cont:;
    }
    return 1;
}

int main(){
    int res;
    res = algo(ans, ST1[0]);
    printf("algo returned %d\n", res);
}

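Now it gets interesting, because we can understand what the code is doing. By the way, while doing this rolling-up work I corrected several oddities in the original question. I believe they were typos, as they made no sense at all in the global context:

- goto always jumped to two (it should have progressed)
- the last test checked ans[0] instead of ans[5]

Please, Mark, correct me if I am wrong in the above assumptions about what the original code should do and your original algorithm is in fact typo free.

What the code does is check, for each value in ans, whether it is present in the two-dimensional array. If any number is missing it returns 0. If all the numbers are found it returns 1.

To get really fast code I would use another language, such as Python (or C++), where set is a basic data structure provided by the standard library, rather than implementing it in C. I would then build a set from all the values of the arrays (that is O(n)) and check whether the searched numbers are present in the set (that is O(1) on average for a hash-based set). The final implementation should be faster than the existing code, at least from an algorithmic point of view.

A Python example follows, as it is really trivial (it prints True/False instead of 1/0, but you get the idea):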

ans_set = set(ans)
print(len(set(D1+D2+D3+D4+D5+D6).intersection(ans_set)) == len(ans_set))

Here is a possible C++ implementation using sets:

#include <iostream>
#include <set>

int algo(int * D1, int * D2, int * D3, int * D4, int * D5, int * D6, int * ans, int ELM){
    int e, p;
    int * D[6] = { D1, D2, D3, D4, D5, D6 };
    std::set<int> ans_set(ans, ans+6);

    int lg = ans_set.size();

    for (e = 0 ; e < ELM ; e++){
        for (p = 0 ; p < 6 ; p++){
            if (0 == (lg -= ans_set.erase(D[p][e]))){
                // we found all elements of ans_set
                return 1;
            }
        }
    }
    return 0; // some items in ans are missing
}

int main(){
    int D1[6] = {3, 4, 5, 6, 7, 8};
    int D2[6] = {3, 4, 5, 6, 7, 8};
    int D3[6] = {3, 4, 5, 6, 7, 8};
    int D4[6] = {3, 4, 5, 6, 7, 8};
    int D5[6] = {3, 4, 5, 6, 7, 8};
    int D6[6] = {3, 4, 5, 6, 7, 1};

    int ST1[1] = {6};

    int ans[] = {1, 4, 5, 6, 7, 8};

    int res = algo(D1, D2, D3, D4, D5, D6, ans, ST1[0]);
    std::cout << "algo returned " << res << "\n";
}

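We make some performance assumptions: the content of ans should be sorted, or else we should construct it some other way, and we assume that the content of D1..D6 changes between calls to algo. Hence we do not bother building a set for it (set construction is O(n) anyway, so we would gain nothing if D1..D6 keep changing). But if we call algo several times with the same D1..D6 and it is ans that changes, we should do the opposite and transform D1..D6 into one larger set that we keep available.

If I had to stick with C, I could do it as follows: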

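- copy all the numbers in D1..D6 into one single array (using memcpy for each row)
- sort the content of this array
- use a binary search to check whether each number is present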
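
The three steps above can be sketched in standard C with memcpy, qsort and bsearch. This is only an illustration under the assumption that each row holds 6 ints and that the rows are passed as an array of pointers; `cmp_int` and `all_found_sorted` are names chosen for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Comparator for qsort/bsearch; avoids the overflow of a plain x - y. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Returns 1 if every ans[a] occurs in d[0..5][0..limit-1], else 0. */
int all_found_sorted(const int *const d[6], const int ans[6], int limit)
{
    int all[36];
    size_t n = 0;

    assert(limit >= 0 && limit <= 6);
    for (int p = 0; p < 6; ++p) {               /* step 1: one memcpy per row */
        memcpy(all + n, d[p], (size_t)limit * sizeof(int));
        n += (size_t)limit;
    }
    qsort(all, n, sizeof(int), cmp_int);        /* step 2: sort the copy */

    for (int a = 0; a < 6; ++a)                 /* step 3: binary searches */
        if (bsearch(&ans[a], all, n, sizeof(int), cmp_int) == NULL)
            return 0;                           /* ans[a] is missing */
    return 1;
}
```

If the same D1..D6 are reused across many calls, the copy and sort could be done once and only the binary searches repeated.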

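Since the data size is really small, we could also try micro-optimizations. They might pay off better here. Not sure.

EDIT 2: there are hard constraints on the subset of C supported by CUDA. The most restrictive one is that we should not use pointers to main memory. That will have to be taken into account, and it explains why the current code does not work. The simplest change is probably to call it for each array D1..D6 in turn. To keep it short and avoid function call cost, we could use a macro or an inline function. I will post a proposal.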
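
As a guess at what such a per-array version might look like (this is not the proposal promised above, and it is plain C that has not been checked against any CUDA toolchain), the sketch below drops the array of pointers and calls a small inline helper on each array in turn. `row_has` and `algo_rows` are hypothetical names.

```c
#include <assert.h>

/* Returns 1 if v occurs in d[0..n-1]. */
static inline int row_has(const int *d, int n, int v)
{
    int hit = 0;
    for (int e = 0; e < n; ++e)
        hit |= (d[e] == v);     /* accumulate rather than exit early */
    return hit;
}

/* Returns 1 if every ans[a] occurs in one of D1..D6 below index ELM. */
int algo_rows(const int *D1, const int *D2, const int *D3,
              const int *D4, const int *D5, const int *D6,
              const int *ans, int ELM)
{
    assert(ELM >= 0 && ELM <= 6);
    for (int a = 0; a < 6; ++a) {
        int v = ans[a];
        int hit = row_has(D1, ELM, v) | row_has(D2, ELM, v)
                | row_has(D3, ELM, v) | row_has(D4, ELM, v)
                | row_has(D5, ELM, v) | row_has(D6, ELM, v);
        if (!hit)
            return 0;           /* ans[a] is in none of the six arrays */
    }
    return 1;
}
```

Using bitwise `|` instead of `||` means all six arrays are always scanned, trading extra comparisons for fewer data-dependent branches, in the spirit of the branch-free idea raised in the comments.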