Implementing Cannon's algorithm

Question

7

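Background

I need to implement Cannon's algorithm, a parallel matrix multiplication algorithm. The matrices being multiplied must be square, and their dimension must be divisible by the square root of the number of processors. I wrote the code below. It runs without crashing, but it does not actually multiply A x B into the new matrix C correctly. Please help me understand what I am doing wrong. As should be evident, this is homework.

Code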

void shift_left(datatype** mat, int s, int row, int n, int amount) {
    datatype* temp_buffer = malloc(sizeof(datatype) * n);
    for(int col = 0; col < n; col++) {
        datatype temp = mat[row][(col+amount)%s];
        temp_buffer[(col+amount)%s] = mat[row][col];
        temp_buffer[col] = temp;
    }
    memcpy(mat[row], temp_buffer, n);
    free(temp_buffer);
}

void shift_up(datatype** mat, int s, int col, int n, int amount) {
    datatype* temp_buffer = malloc(sizeof(datatype) * n);
    for(int row = 0; row < n; row++) {
        datatype temp = mat[(row+amount)%s][col];
        temp_buffer[(row+amount)%s] = mat[row][col];
        temp_buffer[row] = temp;
    }
    memcpy(&mat[0][col], temp_buffer, n);
    free(temp_buffer);
}

void cannon_mul(int p_sqrt,datatype** a, datatype** b, datatype** c, int n) {
    /* 2D matrices and n^2 sized only!*/
    int i = 0, j = 0, k = 0;
    int s = p_sqrt;
    for(i = 0; i < (s-1); i++) {
        shift_left(a, s, i, s-1, i); // Skew matrix a
    }
    for (i = 0; i < (s-1); i++) {
        shift_up(b, s, i, s-1, i); // Skew matrix b
    }
    for(k = 0; k < (s-1); k++) {
        for(i = 0; i < (s-1); i++) {
            for(j = 0; j < (s-1); j++) {
                c[i][j] += a[i][j]*b[i][j];
                shift_left(a, s, i, s-1, 1);
                shift_up(b, s, i, s-1, 1);  
            }                       
        }
    }  
}

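What I think is wrong

My gut feeling is that the shifting is incorrect, or that I am missing an important part of the algorithm. My original shift functions did not use a temporary buffer, so this time I wanted to use one, but it made no difference. I can show some sample output if it helps, but the result is nowhere near the expected one. The good news is that it runs fast :)

Results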

1.48 0.14 9.47 8.99 8.06 0.06 6.68 1.04 4.44 7.50
7.26 8.87 2.21 6.27 2.12 7.91 0.65 5.24 0.45 4.94
0.47 4.13 1.87 2.25 6.83 1.52 6.41 9.14 9.22 8.91
7.34 2.70 6.78 2.78 3.51 4.95 5.27 0.85 9.51 6.82
0.28 6.73 0.70 8.88 7.14 9.09 2.36 5.38 6.43 9.00
7.13 6.71 6.92 9.81 5.13 9.35 7.50 5.16 4.68 3.62
1.30 6.26 4.55 4.27 0.51 2.23 3.19 8.75 6.57 9.07
7.49 6.41 1.04 7.78 7.16 2.78 2.25 6.23 9.42 0.32
3.21 3.60 2.04 2.93 4.29 3.88 2.78 8.01 4.57 6.47
7.52 3.77 0.63 5.97 7.32 4.90 9.63 4.90 8.46 1.90

Multiplying the matrix above by itself with my code gives:

2.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50.81 0.00 0.00 0.00 0.00 87.51 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

The sequential program produces the following result:

163.41 212.17 144.32 227.10 251.03 205.60 245.63 277.33 368.99 334.11
257.85 230.82 203.60 314.08 246.02 240.12 228.37 197.90 264.38 228.24
234.13 272.10 110.75 294.84 263.16 242.07 209.54 316.13 339.23 260.51
185.33 215.59 192.26 283.31 270.80 208.38 265.08 291.49 312.24 319.73
313.23 301.95 182.04 348.11 283.20 337.49 266.54 284.57 355.28 281.07
293.25 323.29 281.35 393.92 325.24 313.62 313.48 342.95 418.37 401.91
255.88 238.25 122.17 254.52 243.58 204.49 217.69 273.03 314.89 214.45
219.26 239.07 200.18 309.98 262.21 242.68 190.02 245.85 297.96 308.56
209.03 213.11 126.24 266.48 233.88 199.33 193.28 228.92 277.50 202.27
210.31 264.67 227.59 337.79 261.40 250.35 225.77 295.00 331.92 352.17

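Important: I am only showing the relevant parts of my program. If you think more is needed, tell me and I will provide more code. Finally, why is the "homework" tag gone?

Edit

Someone pointed out that the buffer was too small, and the silly mistake of the missing "sizeof" has been corrected. I tried it and got the same result, so evidently the problem is unrelated. Hopefully in two days I can open a bounty to attract someone to at least give me a clue about where the problem is. This is a bug I can't seem to debug, and my understanding of the algorithm is, I must admit, close to zero. I am relying on online resources that have barely improved my understanding.

Edit 2

I tried using calloc to zero-initialize the buffers, but it doesn't change the result. How strange, but thanks for the suggestion; I had forgotten that memory is not automatically zeroed.

Edit 3

I tried this: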

void shift_left(datatype** mat, int s, int row, int n, int amount) {

    datatype* temp_buffer = calloc(n, sizeof(datatype) * n);
    for(int col = 0; col < n; col++) {
        /* temp_buffer[(col+amount)%s] = mat[row][col];
        temp_buffer[col] =  mat[row][(col+amount)%s]; */
        temp_buffer[(col+amount)%s] = 0;
        temp_buffer[col] =  0;
    }
    memcpy(mat[row], temp_buffer, sizeof(datatype) * n);
    //free(temp_buffer); 

}

void shift_up(datatype** mat, int s, int col, int n, int amount) {
    datatype* temp_buffer = calloc(n, sizeof(datatype) * n);
    for(int row = 0; row < n; row++) {
      /* temp_buffer[(row+amount)%s] = mat[row][col];
      temp_buffer[row] = mat[(row+amount)%s][col]; */
      temp_buffer[(row+amount)%s] = 0;
      temp_buffer[row] = 0;
    }
    memcpy(&mat[0][col], temp_buffer, sizeof(datatype) * n);
    free(temp_buffer); 
}

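Surprisingly, the result is the same, even though I commented out the code and replaced it with zeros, which should print all zeros. My guess is that memcpy is not working.

Edit 4

I have confirmed that memcpy is the culprit, but I don't see why; I'm stumped. If it helps, datatype is just an alias for double. The professor wrote it that way for some strange reason, since it doesn't make the code any more readable.

But if I solve this myself, I will be happy to show the solution.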

- Daniel Lopez

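Can someone explain why this question is being closed? It is a unique question, and nothing on StackOverflow answers it. I would appreciate an explanation from those voting to close. - Daniel Lopez

2

@MichaelDorgan Thanks for suggesting Code Review; however, since the code returns incorrect results, it would be off-topic on CR. Once the code works as intended, it may be a good fit there. - Phrancis

1

It doesn't exist, Phrancis. I wanted to add it, but it doesn't exist. I believe they removed it a long time ago. - Daniel Lopez

2

Stack Overflow is definitely the best place for fixing bugs or problems in code. Perhaps providing some comparison of "expected vs. actual (incorrect) results" would help. - Phrancis

2

This is how to ask a question. Given the many poor questions we see here, it is quite refreshing. - Sinkingpoint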

3 Answers

0

From a programming standpoint, your code looks like it has an off-by-one error.

You have these loops:

for(i = 0; i < (s-1); i++)

for (i = 0; i < (s-1); i++)

for(k = 0; k < (s-1); k++) 
    for(i = 0; i < (s-1); i++) 
        for(j = 0; j < (s-1); j++)

However, in what I believe is the relevant source material, http://www.cs.berkeley.edu/~demmel/cs267/lecture11/lecture11.html#link_5, the pseudocode reads:

for all (i=0 to s-1)

for all (i=0 to s-1)

for k=0 to s-1
    for all (i=0 to s-1, j=0 to s-1)

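I believe this means the loop variable should take the value s-1 on the last iteration of the loop, whereas your loop variables never reach s-1.

Try changing your loops to <= (s-1) and see if that helps.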
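
As a quick illustration of the difference between the two bounds, here is a minimal sketch. The helper names count_exclusive and count_inclusive are hypothetical and exist only for this example; they simply count how many times each loop form runs.

```c
#include <assert.h>

/* Counts iterations of a loop written as in the question: i < (s-1). */
int count_exclusive(int s) {
    assert(s >= 1);
    int count = 0;
    for (int i = 0; i < (s - 1); i++)
        count++;
    return count;
}

/* Counts iterations of a loop matching the pseudocode: i runs from 0 to s-1. */
int count_inclusive(int s) {
    assert(s >= 1);
    int count = 0;
    for (int i = 0; i <= (s - 1); i++)
        count++;
    return count;
}
```

With s = 10, the first form visits indices 0 through 8 only, so the last row and column would never be touched, while the second form visits all ten.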

- AakashM

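I think the algorithm is completely wrong. I had already tried that before you posted your answer, but thanks for the help. And yes, that is the link my code is based on. - Daniel Lopez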

0

Here is my solution

void shift_left(int** mat, int i, int n, int amount) {
    int* temp_buffer = (int*)malloc(sizeof(int) * n);
    for (int j = 0; j < n; j++) 
        temp_buffer[j] = mat[i][(j + amount) % n];

    for (int j = 0; j < n; j++)
        mat[i][j] = temp_buffer[j];

    free(temp_buffer);
}

void shift_up(int** mat, int j, int n, int amount) {
    int* temp_buffer = (int*)malloc(sizeof(int) * n);
    for (int i = 0; i < n; i++) 
        temp_buffer[i] = mat[(i + amount) % n][j];

    for (int i = 0; i < n; i++)
        mat[i][j] = temp_buffer[i];

    free(temp_buffer);
}

void cannon_mul(int** mat1, int** mat2, int** rez, int n, int threads) {
    
    int i, j, k, i1, j1;
    for (i = 0; i < n; i++)
        shift_left(mat1, i, n, i);

    for (j = 0; j < n; j++)
        shift_up(mat2, j, n, j);

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) 
            rez[i][j] = 0;   

    

    for (k = 0; k < n; k++) {
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                rez[i][j] += mat1[i][j] * mat2[i][j];

        for (i1 = 0; i1 < n; i1++)
            shift_left(mat1, i1, n, 1);

        for (j1 = 0; j1 < n; j1++)
            shift_up(mat2, j1, n, 1);
    }
      
}
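
One way to sanity-check a routine like the one above is to compare it against a plain triple-loop product. The sketch below is not the code from this answer: it assumes flat row-major int arrays and replaces the physical shifts with index arithmetic. After the initial skew and k single-step shifts, position (i, j) holds a[i][(i+j+k) % n] and b[(i+j+k) % n][j], so the same sum can be formed without moving any data. The names naive_mul, cannon_indexed_mul and same_matrix are hypothetical.

```c
#include <assert.h>

/* Reference: ordinary triple-loop product of two n x n row-major matrices. */
void naive_mul(const int *a, const int *b, int *c, int n) {
    assert(n > 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            int sum = 0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Cannon-style accumulation order, expressed with indices instead of shifts.
 * At step k, cell (i, j) multiplies a[i][m] by b[m][j] with m = (i+j+k) % n,
 * which is what the skew followed by k left/up shifts would bring to (i, j). */
void cannon_indexed_mul(const int *a, const int *b, int *c, int n) {
    assert(n > 0);
    for (int i = 0; i < n * n; i++)
        c[i] = 0;
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                int m = (i + j + k) % n;
                c[i * n + j] += a[i * n + m] * b[m * n + j];
            }
}

/* Returns 1 if the two n x n matrices are element-wise equal, 0 otherwise. */
int same_matrix(const int *x, const int *y, int n) {
    for (int i = 0; i < n * n; i++)
        if (x[i] != y[i])
            return 0;
    return 1;
}
```

Because m takes every value from 0 to n-1 exactly once as k runs over 0 to n-1, each cell accumulates the full dot product, just in a rotated order.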

- Rares

- chux - Reinstate Monica · Accepted Answer

2

Size too small.

// memcpy(mat[row], temp_buffer, n);
memcpy(mat[row], temp_buffer, n * sizeof(datatype) );

The same applies to memcpy(&mat[0][col], temp_buffer, n);.
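
A related point for the column case: memcpy writes one contiguous run of bytes, and with a datatype** matrix (an array of row pointers) the address &mat[0][col] lies inside row 0. Even with the byte count corrected, such a copy would overwrite row 0 from col onward rather than column col. A minimal sketch of an element-wise column rotation follows, assuming datatype is an alias for double as the question states. The function name shift_up_column is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef double datatype; /* assumption: the question describes datatype as an alias for double */

/* Rotates column `col` of an n x n matrix upward by `amount` positions.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int shift_up_column(datatype **mat, int col, int n, int amount) {
    assert(n > 0 && col >= 0 && col < n && amount >= 0);
    datatype *tmp = malloc((size_t)n * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    /* Gather the rotated column into the temporary buffer. */
    for (int row = 0; row < n; row++)
        tmp[row] = mat[(row + amount) % n][col];
    /* A column is not contiguous in memory, so write it back one element
     * at a time instead of using memcpy. */
    for (int row = 0; row < n; row++)
        mat[row][col] = tmp[row];
    free(tmp);
    return 0;
}
```

For a row, the elements are contiguous, so memcpy with n * sizeof(datatype) bytes is fine there.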

- chux - Reinstate Monica

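@Daniel Lopez Yes, the code copies only n bytes. It needs to copy n numbers. - chux - Reinstate Monica

1

It doesn't change the result at all. It looks like the error isn't there. But thanks for noticing that the buffer was too small. I believe it does help make the code more correct. - Daniel Lopez

@Daniel Lopez Minor suggestion: consider the following malloc idiom: datatype* temp_buffer = malloc(n * sizeof *temp_buffer); It is easier to write and maintain. Likewise for memcpy(mat[row], temp_buffer, n * sizeof *(mat[row])); - chux - Reinstate Monica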
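
The idiom in the comment above can be sketched as follows. The helper name copy_row is hypothetical and double is used as a stand-in element type. The point is that sizeof *dst follows the pointer's own type, so the size expression stays correct if the element type changes later.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated copy of the first n elements of src,
 * or NULL if allocation fails. The caller is responsible for free(). */
double *copy_row(const double *src, size_t n) {
    assert(src != NULL && n > 0);
    /* Size is derived from the pointee type, not spelled out by hand. */
    double *dst = malloc(n * sizeof *dst);
    if (dst == NULL)
        return NULL;
    /* memcpy counts bytes, so the element count is scaled the same way. */
    memcpy(dst, src, n * sizeof *dst);
    return dst;
}
```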

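@Daniel Lopez Since the code that populates / prints the data isn't shown, could the problem be there? - chux - Reinstate Monica

Code written by the professor? I doubt the problem is there; it should be assumed to work, otherwise the professor needs to find a new job :) Seriously though, my code is supposed to implement Cannon's algorithm and the rest is done by the professor, meaning that reading the two input matrices and writing the final matrix to a file is handled by the professor's code. The program doesn't print the matrices; I use another program provided by the professor for that. - Daniel Lopez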