使用许多属性和字典查找的Python代码优化

Question

使用许多属性和字典查找的Python代码优化

3

我用Python编写了一个程序，它花费大量时间查找对象属性和字典键值。我想知道是否有任何方法可以优化这些查找时间，可能通过C扩展来减少执行时间，或者是否需要在已编译的语言中重新实现该程序。

该程序使用图形实现一些算法。在我们的数据集上运行速度非常慢，因此我使用一个可完成的缩小数据集使用cProfile对代码进行了剖析。绝大部分时间都在一个函数中消耗，并且特别是两个语句——函数内的生成器表达式。

第202行的生成器表达式为：

    neighbors_in_selected_nodes = (neighbor for neighbor in
            node_neighbors if neighbor in selected_nodes)

而在第204行的生成表达式是

    neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
            neighbor in neighbors_in_selected_nodes)

这个上下文函数的源代码如下。 selected_nodes 是 interaction_graph 的 set 类型节点，它是一个 NetworkX Graph 实例。 node_neighbors 是来自 Graph.neighbors_iter() 的迭代器。 Graph 本身使用字典来存储节点和边。它的 Graph.node 属性是一个字典，其中每个节点都有属于其自己的字典存储节点和它们的属性（例如，'weight'）。

这些查找中的每一个都应该是摊销常数时间（即 O(1)），然而，我仍然为这些查找支付了很大的代价。是否有某种方法可以加速这些查找（例如，通过将部分内容编写成 C 扩展），或者我需要将程序移动到编译语言中？

以下是提供上下文的函数的完整源代码；绝大部分执行时间都在这个函数内。

def calculate_node_z_prime(
        node,
        interaction_graph,
        selected_nodes
    ):
    """Calculates a z'-score for a given node.

    The z'-score is based on the z-scores (weights) of the neighbors of
    the given node, and proportional to the z-score (weight) of the
    given node. Specifically, we find the maximum z-score of all
    neighbors of the given node that are also members of the given set
    of selected nodes, multiply this z-score by the z-score of the given
    node, and return this value as the z'-score for the given node.

    If the given node has no neighbors in the interaction graph, the
    z'-score is defined as zero.

    Returns the z'-score as zero or a positive floating point value.

    :Parameters:
    - `node`: the node for which to compute the z-prime score
    - `interaction_graph`: graph containing the gene-gene or gene
      product-gene product interactions
    - `selected_nodes`: a `set` of nodes fitting some criterion of
      interest (e.g., annotated with a term of interest)

    """
    node_neighbors = interaction_graph.neighbors_iter(node)
    neighbors_in_selected_nodes = (neighbor for neighbor in
            node_neighbors if neighbor in selected_nodes)
    neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
            neighbor in neighbors_in_selected_nodes)
    try:
        max_z_score = max(neighbor_z_scores)
    # max() throws a ValueError if its argument has no elements; in this
    # case, we need to set the max_z_score to zero
    except ValueError, e:
        # Check to make certain max() raised this error
        if 'max()' in e.args[0]:
            max_z_score = 0
        else:
            raise e

    z_prime = interaction_graph.node[node]['weight'] * max_z_score
    return z_prime

这是根据cProfiler排序后的前几个调用，按时间排序。

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
156067701  352.313    0.000  642.072    0.000 bpln_contextual.py:204(<genexpr>)
156067701  289.759    0.000  289.759    0.000 bpln_contextual.py:202(<genexpr>)
 13963893  174.047    0.000  816.119    0.000 {max}
 13963885   69.804    0.000  936.754    0.000 bpln_contextual.py:171(calculate_node_z_prime)
  7116883   61.982    0.000   61.982    0.000 {method 'update' of 'set' objects}

- gotgenes

1

为什么要使用两个循环？neighbors_in_selected_nodes和neighbor_z_scores？为什么不只用一个循环呢？这个两步骤的公式似乎没有引入任何新的东西。为什么要这样做呢？你能否更新问题，解释为什么要使用两个推导而不是一个？ - S.Lott

我想仅考虑节点的权重，条件为：1）是邻居且2）被选中（存在于“selected_nodes”中）。我使用两个生成器表达式来实现此目的：neighbors_in_selected_nodes过滤已选择节点；这个过程链接到neighbor_z_scores以检索权重。链接迭代器应将它们变成一个循环，而不是两个循环；该循环在max()内进行评估。 - gotgenes

这两个生成器表达式确实可以写成一个for循环；第一个表达式表示语句if neighbor in selected_nodes:，第二个表达式表示neighbor_z_scores.append(interaction_graph.node[neighbor]['weight'])。通过使用生成器表达式，我避免了列表的创建和附加操作。从获取邻居到评估max(neighbor_z_scores)，我完全处理迭代器链。 - gotgenes

4个回答

1

如何保持interaction_graph.neighbors_iter(node)的迭代顺序排序（或部分排序使用collections.heapq）？由于您只是想找到最大值，因此可以按降序迭代node_neighbors，第一个在selected_node中的节点必须是selected_node中的最大值。

其次，selected_node会经常更改吗？如果很少更改，您可以通过拥有“interaction_graph.node [neighbor]的列表，以便每次不必重新构建此列表来节省大量迭代。

编辑：回复评论

排序（）需要O（n log n）

不一定，您太注重教科书了。尽管您的教科书上说了什么，但有时候可以通过利用数据的某些结构来打破O（n log n）的限制。如果您一开始就将邻居列表保存在自然排序的数据结构中（例如heapq，二叉树），则无需在每次迭代时重新排序。当然，这是空间和时间的权衡，因为您需要存储冗余的邻居列表，并且在邻居更改时更新邻居列表的代码复杂度。

此外，Python 的 list.sort() 使用 timsort 算法，在几乎排序的数据上非常快（在某些情况下可能平均为 O(n)）。但它仍然无法突破 O(n log n)，这已经被数学证明是不可能的。

在将解决方案驳回为不可能提高性能之前，您需要进行分析。在进行极端优化时，您可能会发现在某些非常特殊的边缘情况下，旧而慢的冒泡排序可能胜过美化后的快速排序或归并排序。

- Lie Ryan

排序是一个有趣的想法。sort() 需要 O(n log n) 的时间复杂度（其中 n 是邻居的数量）；这比线性搜索更昂贵。根据 heapq 文档，heapify() 是 O(n) 的，但我认为每个弹出操作要么是 O(log n) 要么是 O(n)（不确定），这意味着在最坏情况下，它仍然需要两倍于简单的线性循环遍历邻居的操作次数。 - gotgenes

针对第二点，不幸的是，selected_nodes 在每次调用此函数时都会发生变化，因此它不适合缓存。不过想法很好。 - gotgenes

0

尝试直接访问字典并捕获 KeyErrors，根据您的命中/未命中比率可能会更快：

# cache this object
ignode = interaction_graph.node
neighbor_z_scores = []
for neighbor in node_neighbors:
    try:
        neighbor_z_scores.append(ignode[neighbor]['weight'])
    except KeyError:
        pass

或者使用getdefault和列表推导式：

sentinel = object()
# cache this object 
ignode = interaction_graph.node

neighbor_z_scores = (ignode[neighbor]['weight'] for neighbor in node_neighbors)
# using identity testing, it's slightly faster
neighbor_z_scores = (neighbor for neighbor in neighbor_z_scores if neighbor is not sentinel)

- Lie Ryan

首先，不是selected_nodes成员的邻居会比成员多得多，因此我首先根据这个标准进行过滤。这意味着少量的“weight”查找。我可以保证所有的“weight”查找都会成功，所以在那里使用try-except语句没有任何好处。然而，ignode是一个非常好的想法，并且将为该属性节省许多查找。 - gotgenes

0

在不深入研究您的代码的情况下，尝试使用 itertools 来提高一点速度。

在模块级别添加以下内容：

import itertools as it, operator as op
GET_WEIGHT= op.attrgetter('weight')

更改：

neighbors_in_selected_nodes = (neighbor for neighbor in
        node_neighbors if neighbor in selected_nodes)

转换为：

neighbors_in_selected_nodes = it.ifilter(selected_node.__contains__, node_neighbors)

并且：

neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
        neighbor in neighbors_in_selected_nodes)

转化为：

neighbor_z_scores = (
    it.imap(
        GET_WEIGHT,
        it.imap(
            interaction_graph.node.__getitem__,
            neighbors_in_selected_nodes)
    )
)

这些有帮助吗？

- tzot

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Mike DeSimone · Accepted Answer

我不明白为什么你的“weight”查找必须采用["weight"]的形式（节点是字典？）而不是.weight的形式（节点是对象）。

如果你的节点是对象，并且没有很多字段，你可以利用__slots__指令来优化它们的存储：

class Node(object):
    # ... class stuff goes here ...

    __slots__ = ('weight',) # tuple of member names.

编辑：我看了一下你提供的NetworkX链接，有几件事情让我感到不安。首先，在顶部，"dictionary"的定义是"FIXME"。

总的来说，它似乎坚持使用字典而不是可被子类化的类来存储属性。虽然对象上的属性查找可能本质上是一个字典查找，但我不认为使用对象会更糟糕。如果有什么区别，那就是对象属性查找更容易被优化，因为：

对象属性查找非常常见，
对象属性的键空间比字典键的键空间要小得多，因此可以在搜索中使用优化的比较算法，
对象具有__slots__优化，用于这些情况，其中您只有几个字段的对象并且需要对它们进行优化访问。

例如，我经常在表示坐标的类上使用__slots__。对我来说，树节点似乎是另一个明显的用途。

所以当我读到：

节点

一个节点可以是除了None之外的任何可哈希Python对象。

我认为，好的，没问题，但紧接着就是：

节点属性

通过在添加节点时使用关键字/值对或将其分配给指定节点n的G.node[n]属性字典，节点可以具有任意Python对象作为属性。

我想，如果一个节点需要属性，为什么要单独存储它？为什么不把它放在节点里呢？编写一个具有contentString和weight成员的类是否会有害？边缘似乎更疯狂，因为它们被规定为元组而不是您可以子类化的对象。

所以我对NetworkX背后的设计决策感到相当迷失。

如果你被卡住了，我建议将这些字典中的属性移动到实际节点中，或者如果这不是一个选项，使用整数作为键进入您的属性字典，而不是字符串，这样搜索将使用更快的比较算法。

最后，如果你结合你的生成器会怎样呢？

neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
        neighbor in node_neighbors if neighbor in selected_nodes)