MapReduce矩阵乘法复杂度

Question

MapReduce矩阵乘法复杂度

algorithmmapreducecomplexity-theorymatrix-multiplication

4

假设我们有一个大文件，其中包含两个矩阵（A和B）单元格的描述：

+---------------------------------+
|  i  |  j  |  value  |   matrix  |
+---------------------------------+
|  1  |  1  |   10    |     A     |
|  1  |  2  |   20    |     A     |
|     |     |         |           |
| ... | ... |   ...   |    ...    |
|     |     |         |           |
|  1  |  1  |    5    |     B     |
|  1  |  2  |    7    |     B     |
|     |     |         |           |
| ... | ... |   ...   |    ...    |
|     |     |         |           |
+---------------------------------+

我们希望计算这些矩阵的乘积：C = A x B
根据定义：C_i_j = sum( A_i_k * B_k_j )

以下是一个两步MapReduce算法，用于计算这个乘积（我将提供伪代码）：

第一步：

function Map (input is a single row of the file from above):

    i = row[0]
    j = row[1]
    value  = row[2]
    matrix = row[3]

    if(matrix == 'A')
        emit(i, {j, value, 'A'})
    else
        emit(j, {i, value, 'B'})

这个 Map 函数的复杂度是 O(1)。

function Reduce(Key, List of tuples from the Map function):

    Matrix_A_tuples = 
        filter( List of tuples from the Map function, where matrix == 'A' )

    Matrix_B_tuples = 
        filter( List of tuples from the Map function, where matrix == 'B' )

    for each tuple_A from Matrix_A_tuples
        i = tuple_A[0]
        value_A = tuple_A[1]

        for each tuple_B from Matrix_B_tuples
            j = tuple_B[0]
            value_B = tuple_B[1]

            emit({i, j}, {value_A * value_b, 'C'})

这个Reduce函数的复杂度是O(N^2)。

第一步完成后，我们会得到一个类似下面这个文件的东西（包含O(N^3)行）：

+---------------------------------+
|  i  |  j  |  value  |   matrix  |
+---------------------------------+
|  1  |  1  |   50    |     C     |
|  1  |  1  |   45    |     C     |
|     |     |         |           |
| ... | ... |   ...   |    ...    |
|     |     |         |           |
|  2  |  2  |    70   |     C     |
|  2  |  2  |    17   |     C     |
|     |     |         |           |
| ... | ... |   ...   |    ...    |
|     |     |         |           |
+---------------------------------+

因此，我们只需要对包含相同值 i 和 j 的行的值求和。

第二步：

function Map (input is a single row of the file, which produced in first step):
    i = row[0]
    j = row[1]
    value = row[2]
    emit({i, j}, value)

function Reduce(Key, List of values from the Map function)

    i = Key[0]
    j = Key[1]

    result = 0;

    for each Value from List of values from the Map function
        result += Value

    emit({i, j}, result)

在第二步完成后，我们将得到一个包含矩阵单元格 C 的文件。

那么问题来了：

考虑到MapReduce集群中有多个实例，估算所提供算法的复杂度的最正确方式是什么？

首先想到的方法是：

当我们假设MapReduce集群中实例的数量为 K 时。由于第一步生成的文件行数为 O(N^3)，因此总体复杂度可以估算为 O((N^3)/K)。

但是这种估算没有考虑诸多细节，例如MapReduce集群实例之间的网络带宽、数据在距离之间的分配能力 - 并且执行大部分计算等局部操作。

因此，我想知道估计所提供的MapReduce算法效率的最佳方法是什么，并且使用大O表示法来估算MapReduce算法的效率是否有意义？

- stemm

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Dhoha · Accepted Answer

正如你所说，大O表示法估计计算复杂度，但不考虑网络问题（带宽、拥塞、延迟等）。

如果要计算实例之间的通信效率，您需要其他网络指标……

然而，我想告诉您一些事情：如果您的文件不够大，则无法在执行速度方面看到改进。这是因为MapReduce仅在处理大数据时才有效率。此外，您的代码有两个步骤，这意味着有两个作业。从一个作业到另一个作业的MapReduce需要时间来上传文件并重新启动作业。这可能会稍微影响性能。

我认为您可以通过速度和时间来计算效率，因为相比于顺序算法，MapReduce方法在处理大数据时肯定更快。

此外，效率还可以与容错有关。这是因为MapReduce将自行管理故障。因此，程序员无需处理实例故障或网络故障。