Python: create a new list using a "template list"

Question


7

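Suppose I have the following:

x1 = [1, 3, 2, 4]

and also:

x2 = [0, 1, 1, 0]

with the same shape.

Now I want to "lay x2 over x1" and sum up all the numbers in x1 according to the corresponding numbers in x2.

So the end result would be:

end = [1+4, 3+2]  # end[0] is the sum of all numbers of x1 where a 0 was in x2

Here is a naive implementation using lists to further clarify the question.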

store_0 = 0
store_1 = 0
x1 = [1, 3, 2, 4]
x2 = [0, 1, 1, 0]
for value_x1, value_x2 in zip(x1, x2):
    if value_x2 == 0:
        store_0 += value_x1
    elif value_x2 == 1:
        store_1 += value_x1

So my question is: is there a way to do this in NumPy without using loops, or more generally a faster way?

- user15770670

2

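Is it always just a few values? The NumPy expression x2==1 returns an array of True/False values that can be used to filter other operations. So x1[x2==0].sum() and x1[x2==1].sum() perform the two operations you need. - Tim Roberts

Thanks, but the solution needs to handle larger arrays with more values. - user15770670

Not sure why you didn't go with @TimRoberts's solution. I just tested it with 10,000-element arrays and it took less than a second on my laptop. - Brad Day

I meant that the range of values in the x2 array can be larger. - user15770670

My fault, I wasn't clear. - user15770670

You mean x2 might be something like [0 1 0 2 0 3 ...]? - Brad Day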

7 Answers

5

>>> import numpy as np
>>> x1 = np.array([1, 3, 2, 7])
>>> x2 = np.array([0, 1, 1, 0])
>>> for index in np.unique(x2):
...     print(f'{index}: {x1[x2==index].sum()}')
...
0: 8
1: 5
>>> # or in one line
>>> [(index, x1[x2==index].sum()) for index in np.unique(x2)]
[(0, 8), (1, 5)]

- Woodford

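A small suggestion for improvement: I'm not sure, but I'd guess set(x2) is a bit faster than np.unique(x2). - mapf

I forgot to mention that the solution should be able to handle the "store" values; ideally it would be a function that returns an array with the corresponding values. So the range of x2 could also go from 0 to 1000. - user15770670

@mapf Possibly, but that depends on how numpy implements unique. In general, I trust numpy to be efficient. - Woodford

1

@user15770670 This code will handle arrays of any size and any number of indices in x2. Not sure what else you need. - Woodford

With np.vectorize(), this is better than my initial pure-numpy answer! Upvoted. - Pierre D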

3

Would a pandas one-liner work for you?

store_0, store_1 = pd.DataFrame({"x1": x1, "x2": x2}).groupby("x2").x1.sum()

Or as a dictionary, for an arbitrary number of values in x2:

pd.DataFrame({"x1": x1, "x2": x2}).groupby("x2").x1.sum().to_dict()

Output:

{0: 5, 1: 5}

- mcsoini

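I forgot to mention that the solution should be able to handle the "store" values; ideally it would be a function that returns an array with the corresponding values. So the range of x2 could also go from 0 to 1000. - user15770670

Didn't expect that, but pandas outperforms NumPy by a factor of 8!!! - user15770670

One of the fastest approaches, faster than pure numpy. Nice stuff. - Pierre D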

2

Using compress

from itertools import compress
# result[0] sums x1 where x2 is 0; result[1] sums x1 where x2 is 1
result = [sum(compress(x1, (not v for v in x2))), sum(compress(x1, x2))]
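
compress only separates truthy selectors from falsy ones, so it covers the two-value case. For arbitrary labels in plain Python, one possible sketch accumulates the sums in a dictionary. The helper name group_sums is hypothetical and not part of the answer above.

```python
from collections import defaultdict

def group_sums(values, labels):
    """Sum `values` per label (hypothetical helper, for illustration only)."""
    sums = defaultdict(int)  # unseen labels start at 0
    for value, label in zip(values, labels):
        sums[label] += value
    return dict(sums)

print(group_sums([1, 3, 2, 4], [0, 1, 1, 0]))  # {0: 5, 1: 5}
```

Because the keys are whatever labels occur, this also works when x2 skips values.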

- Nk03

1

This extends your loop to more values. I can't think of a numpy one-liner to do this.

sums = [0] * 10000  # one slot per possible value in x2
for vx1, vx2 in zip(x1, x2):
    sums[vx2] += vx1
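
For non-negative integer labels, np.bincount with its weights argument is one candidate for a NumPy one-liner. This is a minimal sketch, assuming x2 holds small non-negative integers; the wrapper name bincount_group_sum is hypothetical. With weights, bincount returns float64, so the sketch casts back to the input dtype, which assumes the sums are exactly representable as floats.

```python
import numpy as np

def bincount_group_sum(values, labels):
    """Sum `values` per non-negative integer label (hypothetical wrapper)."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    # bincount adds values[i] into slot labels[i]; the result is float64
    sums = np.bincount(labels, weights=values)
    return sums.astype(values.dtype)

x1 = [1, 3, 2, 4]
x2 = [0, 1, 1, 0]
print(bincount_group_sum(x1, x2))  # [5 5]
```

The result has max(x2) + 1 entries, and labels that never occur get a sum of 0.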

- Tim Roberts

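I forgot to mention that the solution should be able to handle the "store" values; ideally it would be a function that returns an array with the corresponding values. So the range of x2 could also go from 0 to 1000. - user15770670

I'm so dumb :( I think this will do! - user15770670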

1

After converting the second list to a boolean array, you can use it to index the first one:

import numpy as np

x1 = np.array([1, 3, 2, 4])
x2 = np.array([0, 1, 1, 0], dtype=bool)

end = [np.sum(x1[~x2]), np.sum(x1[x2])]
end

[5, 5]

Edit: if the values in x2 can be greater than 1, you can use a list comprehension:

x1 = np.array([1, 3, 2, 4])
x2 = np.array([0, 1, 1, 0])

end = [np.sum(x1[x2 == i]) for i in range(max(x2) + 1)]

- Arne

1

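This extends the solution Tim Roberts proposed at the start, but accounts for x2 having multiple values, i.e. being non-binary. Here the values are strictly contiguous, because the for loop iterates over range(rng). It could be extended to an x2 with non-contiguous values such as [0 2 2 2 1 4] (no 3), whereas the randint used in this example returns vectors like [0 1 1 3 4 2].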

import numpy as np
rng = 5  # number of distinct values in x2, i.e. [0 1 2 3 4]
x1 = np.random.randint(20, size=10000)  # random vector of size 10k
x2 = np.random.randint(rng, size=10000)  # indexing vector of size 10k with values 0-4

store = []
for i in range(rng):  # sum x1 for each value of x2 and append to the list
    store.append(x1[x2 == i].sum())
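
One way the non-contiguous case might be handled is to map the labels to consecutive indices with np.unique(..., return_inverse=True) and accumulate with np.add.at. This is a sketch under the assumption of 1-D inputs; the function name group_sum_any_labels is hypothetical.

```python
import numpy as np

def group_sum_any_labels(values, labels):
    """Sum `values` per distinct label; labels need not be contiguous."""
    values = np.asarray(values)
    # inverse[i] is the position of labels[i] within the sorted unique labels
    uniq, inverse = np.unique(labels, return_inverse=True)
    sums = np.zeros(len(uniq), dtype=values.dtype)
    np.add.at(sums, inverse, values)  # unbuffered add, repeated indices accumulate
    return uniq, sums

labels, sums = group_sum_any_labels([1, 2, 3, 4, 5, 6], [0, 2, 2, 2, 1, 4])
print(labels)  # [0 1 2 4]
print(sums)    # [1 5 9 6]
```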

- Brad Day

Content provided by Stack Overflow.

- Pierre D · Accepted Answer

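In this example (and in general, for operations such as unique, duplicated, and groupby), pandas is faster than a pure-numpy solution:

A pandas approach using Series (credit: very similar to @mcsoini's answer):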

def pd_group_sum(x1, x2):
    return pd.Series(x1, index=x2).groupby(x2).sum()

A pure-numpy approach, using np.unique and some fancy indexing:

def np_group_sum(a, groups):
    _, ix, rix = np.unique(groups, return_index=True, return_inverse=True)
    return np.where(np.arange(len(ix))[:, None] == rix, a, 0).sum(axis=1)

Note: a better pure-numpy approach, inspired by @Woodford's answer:

def selsum(a, g, e):
    return a[g==e].sum()

vselsum = np.vectorize(selsum, signature='(n),(n),()->()')

def np_group_sum2(a, groups):
    return vselsum(a, groups, np.unique(groups))

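Yet another pure-numpy approach is inspired by @mapf's comment about using argsort(). Since argsort() by itself already takes 45 ms, we can instead try something based on np.argpartition(x2, len(x2)-1), which by itself takes only 7.5 ms in the benchmark below: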

def np_group_sum3(a, groups):
    ix = np.argpartition(groups, len(groups)-1)
    ends = np.nonzero(np.diff(np.r_[groups[ix], groups.max() + 1]))[0]
    return np.diff(np.r_[0, a[ix].cumsum()[ends]])
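
The NumPy documentation for np.argpartition only guarantees the position of the k-th element and leaves the order inside the partitions undefined, so np_group_sum3 may depend on implementation behavior. The same sort-then-segment idea can be sketched with a full argsort and np.add.reduceat. This assumes non-empty 1-D inputs, and the name np_group_sum_sorted is hypothetical.

```python
import numpy as np

def np_group_sum_sorted(a, groups):
    """Sum `a` per group by sorting, then reducing each run of equal labels."""
    a = np.asarray(a)
    groups = np.asarray(groups)
    order = np.argsort(groups, kind="stable")  # full sort: equal labels become adjacent
    sorted_groups = groups[order]
    # index where each run of equal labels starts
    starts = np.flatnonzero(np.r_[True, sorted_groups[1:] != sorted_groups[:-1]])
    return sorted_groups[starts], np.add.reduceat(a[order], starts)

labels, sums = np_group_sum_sorted([1, 3, 2, 4, 8], [0, 1, 1, 0, 7])
print(labels)  # [0 1 7]
print(sums)    # [5 5 8]
```

This pays the full argsort cost, so it is likely slower than the argpartition variant in the timings reported here.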

(Slightly modified) example

x1 = np.array([1, 3, 2, 4, 8])  # I added a group for sake of generality
x2 = np.array([0, 1, 1, 0, 7])

>>> pd_group_sum(x1, x2)
0    5
1    5
7    8

>>> np_group_sum(x1, x2)  # and all the np_group_sum() variants
array([5, 5, 8])

Speed

n = 1_000_000
x1 = np.random.randint(0, 20, n)
x2 = np.random.randint(0, 20, n)

%timeit pd_group_sum(x1, x2)
# 13.9 ms ± 65.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np_group_sum(x1, x2)
# 171 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np_group_sum2(x1, x2)
# 66.7 ms ± 19.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np_group_sum3(x1, x2)
# 25.6 ms ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Going through pandas is faster, in part because of numpy issue 11136.