Binning a numpy array

Question

15

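I have a numpy array containing time series data. I want to split the array into equal bins of a given length (dropping the last bin if it is not the same size) and then compute the mean of each bin.

I suspect numpy, scipy, or pandas can do this.

Example: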

data = [4,2,5,6,7,5,4,3,5,7]

For a bin size of 2:

bin_data = [(4,2),(5,6),(7,5),(4,3),(5,7)]
bin_data_mean = [3,5.5,6,3.5,6]

For a bin size of 3:

bin_data = [(4,2,5),(6,7,5),(4,3,5)]
bin_data_mean = [3.67,6,4]

- deltap

2

If you need overlapping bins, have a look at pandas.rolling_mean: http://pandas.pydata.org/pandas-docs/stable/computation.html#moving-rolling-statistics-moments - Joe Kington
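
To illustrate what overlapping bins look like, here is a minimal numpy-only sketch of a rolling mean. It is not the pandas function named in the comment (in more recent pandas versions the rolling API is spelled `Series.rolling(window).mean()`); the helper name `rolling_mean` and the use of `np.convolve` are assumptions made for this example, and it assumes a 1-D input with `window <= len(data)`.

```python
import numpy as np

def rolling_mean(data, window):
    """Mean of every overlapping window of length `window` (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    # Convolving with a constant kernel of weight 1/window averages each window;
    # mode="valid" keeps only windows that lie fully inside the data.
    kernel = np.ones(window) / window
    return np.convolve(data, kernel, mode="valid")

data = [4, 2, 5, 6, 7, 5, 4, 3, 5, 7]
print(rolling_mean(data, 3))  # 8 values, one per window start
```

Unlike the non-overlapping bins in the question, consecutive windows here share `window - 1` elements, so the result has `len(data) - window + 1` entries.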

4 Answers

6

Since you already have a numpy array, you can avoid a for loop by using reshape and treating the new dimension as the bin:

In [33]: data.reshape(-1, 2)
Out[33]: 
array([[4, 2],
       [5, 6],
       [7, 5],
       [4, 3],
       [5, 7]])

In [34]: data.reshape(-1, 2).mean(1)
Out[34]: array([3. , 5.5, 6. , 3.5, 6. ])

This only works if the size of data is divisible by the bin size n. I'll edit in a fix.

It looks like Joe Kington has an answer that handles this case.

- TomAugspurger

5

Try plain Python (NumPy isn't necessary for this operation). Assuming Python 2.x:

data = [ 4, 2, 5, 6, 7, 5, 4, 3, 5, 7 ]

# example: for n == 2
n=2
partitions = [data[i:i+n] for i in xrange(0, len(data), n)]
partitions = partitions if len(partitions[-1]) == n else partitions[:-1]

# the above produces a list of lists
partitions
=> [[4, 2], [5, 6], [7, 5], [4, 3], [5, 7]]

# now the mean
[sum(x)/float(n) for x in partitions]
=> [3.0, 5.5, 6.0, 3.5, 6.0]
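
For readers on Python 3, the same idea can be sketched as follows. This is an adaptation rather than part of the original answer: `xrange` becomes `range`, and the function name `partition_means` is made up for the example. The incomplete trailing chunk is dropped up front by limiting the range instead of trimming the list afterwards.

```python
def partition_means(data, n):
    """Mean of each complete chunk of n consecutive items (hypothetical helper)."""
    # Only iterate over the part of the list that splits into full chunks.
    usable = len(data) - len(data) % n
    partitions = [data[i:i + n] for i in range(0, usable, n)]
    # In Python 3, / is true division, so no float() cast is needed.
    return [sum(chunk) / n for chunk in partitions]

data = [4, 2, 5, 6, 7, 5, 4, 3, 5, 7]
print(partition_means(data, 2))  # [3.0, 5.5, 6.0, 3.5, 6.0]
print(partition_means(data, 3))  # three means; the trailing 7 is dropped
```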

- Óscar López

2

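I agree that numpy isn't necessary, but this is a small part of a larger machine that uses pandas and numpy, so the data is already stored in a numpy array. I also prefer to keep things concise. - deltap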

5

I just wrote a function that works for arrays of any size or dimension.

data is your array
axis is the axis you want to bin
binstep is the number of points between the starts of consecutive bins (this allows overlapping bins)
binsize is the size of each bin

func is the function you want to apply to each bin (np.max for max pooling, np.mean for an average, ...)

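import numpy as np

def binArray(data, axis, binstep, binsize, func=np.nanmean):
    data = np.array(data)
    dims = np.array(data.shape)
    # Swap the axis to bin with axis 0 so bins can be taken along the first axis.
    argdims = np.arange(data.ndim)
    argdims[0], argdims[axis] = argdims[axis], argdims[0]
    data = data.transpose(argdims)
    # Count only bins that fit completely, so overlapping bins never index past the end.
    nbins = (dims[axis] - binsize) // binstep + 1
    data = [func(np.take(data, np.arange(int(i*binstep), int(i*binstep+binsize)), 0), 0) for i in np.arange(nbins)]
    # Swap the axes back (the swap is its own inverse).
    data = np.array(data).transpose(argdims)
    return data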

In your case, that would be:

data = [4,2,5,6,7,5,4,3,5,7]
bin_data_mean = binArray(data, 0, 2, 2, np.mean)

Or for a bin size of 3:

bin_data_mean = binArray(data, 0, 3, 3, np.mean)
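
When the bins do not overlap (binstep equal to binsize), a similar result along an arbitrary axis can also be sketched with reshape instead of a Python loop. This is an alternative technique, not the function above; the name `bin_along_axis` is hypothetical, and `func` is assumed to accept an `axis` keyword the way `np.mean` and `np.max` do.

```python
import numpy as np

def bin_along_axis(data, axis, width, func=np.mean):
    """Apply func to non-overlapping bins of `width` along `axis` (sketch)."""
    data = np.asarray(data)
    # Put the axis to bin last so the bins can be split off with one reshape.
    moved = np.moveaxis(data, axis, -1)
    nbins = moved.shape[-1] // width
    # Drop the incomplete trailing bin, then split the last axis into (nbins, width).
    trimmed = moved[..., :nbins * width]
    binned = func(trimmed.reshape(moved.shape[:-1] + (nbins, width)), axis=-1)
    # Move the (now shorter) binned axis back to its original position.
    return np.moveaxis(binned, -1, axis)

a = np.arange(12).reshape(2, 6)
print(bin_along_axis(a, 1, 3))          # mean of each group of 3 columns
print(bin_along_axis(a, 1, 3, np.max))  # max pooling over the same groups
```

For overlapping bins this shortcut does not apply, since a reshape can only produce disjoint groups.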

- Alexandre Kempf

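What if we want to bin two ndarrays with different shapes? Here is the catch: the data are coupled, i.e. the value at index N in array A is associated with the value at index N in array B. As long as I use the same bin step and bin size, will I get correct results? - Can H. Tartanoglu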

- Joe Kington · Accepted Answer

Just use reshape, and then mean(axis=1).

As the simplest possible example:

import numpy as np

data = np.array([4,2,5,6,7,5,4,3,5,7])

print(data.reshape(-1, 2).mean(axis=1))

More generally, when the last bin is not complete, we need to do something like this:

import numpy as np

width=3
data = np.array([4,2,5,6,7,5,4,3,5,7])

result = data[:(data.size // width) * width].reshape(-1, width).mean(axis=1)

print(result)
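
The slicing-plus-reshape step above can be wrapped in a small reusable function. The sketch below assumes a 1-D input; the name `bin_means` is made up for this example and is not part of numpy.

```python
import numpy as np

def bin_means(data, width):
    """Mean of each complete bin of `width` consecutive values (1-D sketch)."""
    data = np.asarray(data)
    nbins = data.size // width
    # Slice off the incomplete trailing bin, then give each bin its own row.
    return data[:nbins * width].reshape(nbins, width).mean(axis=1)

data = np.array([4, 2, 5, 6, 7, 5, 4, 3, 5, 7])
print(bin_means(data, 2))
print(bin_means(data, 3))
```

For the data in the question this gives 3, 5.5, 6, 3.5, 6 for a width of 2 and approximately 3.67, 6, 4 for a width of 3, with the trailing 7 dropped in the second case.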