Averaging data in bins

Question

python python-3.x numpy average scientific-computing

9

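I have two lists: one of depths and one of chlorophyll values, which correspond to each other element by element. I want to average the chlorophyll data over every 0.5 m of depth.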

chl  = [0.4,0.1,0.04,0.05,0.4,0.2,0.6,0.09,0.23,0.43,0.65,0.22,0.12,0.2,0.33]
depth = [0.1,0.3,0.31,0.44,0.49,1.1,1.145,1.33,1.49,1.53,1.67,1.79,1.87,2.1,2.3]

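The depth bins do not always contain the same number of samples, and the depths do not always start at 0.0 or on a 0.5 boundary. The chlorophyll data always lines up with the depth data, though. The chlorophyll averages must not be sorted in ascending order either; they need to stay in the correct order according to depth. The depth and chlorophyll lists are very long, so I can't process them one by one.

How can I make 0.5 m depth bins with the averaged chlorophyll data?

Goal: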

depth = [0.5,1.0,1.5,2.0,2.5]
chlorophyll = [avg1,avg2,avg3,avg4,avg5]

For example:

avg1 = np.mean([0.4,0.1,0.04,0.05,0.4])

- Adam

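Are you open to using pandas? - BENY

Is depth = [0.5,1.0,1.5,2.0,2.5] already known, or does it need to be computed? - Divakar

The depths can be created with linspace. I could also use pandas. - Adam

Do you only want a numpy/pandas solution, or is plain Python fine too? - Patrick Artner

Looking for a numpy solution. - Adam

@Adam You said "the depth and chlorophyll lists are very long". Could you time the different approaches posted so far on your actual data, assuming performance is of some interest? With NumPy-, pandas- and scipy-based solutions already posted, it would be interesting to see how they compare. - Divakar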

4 Answers

4

Here's a vectorized NumPy solution that uses np.searchsorted to get the bin-shift indices and np.add.reduceat for the binned summations -

import numpy as np

def bin_data(chl, depth, bin_start=0, bin_length=0.5):
    # depth is assumed to be sorted in ascending order
    # Get number of intervals and hence the bin-length-spaced depth array
    n = int(np.ceil((depth[-1] - bin_start) / bin_length))
    depthl = bin_start + bin_length * np.arange(n + 1)

    # Indices along depth array where the intervaled array would have bin shifts
    idx = np.searchsorted(depth, depthl)

    # Number of elements in each bin (bin-lengths)
    lens = np.diff(idx)

    # Get summations for each bin & divide by bin lengths for binned avg o/p
    # For bins with lengths==0, set them as some invalid specifier, say NaN
    return np.where(lens==0, np.nan, np.add.reduceat(chl, idx[:-1])/lens)

Sample run -

In [83]: chl
Out[83]: 
array([0.4 , 0.1 , 0.04, 0.05, 0.4 , 0.2 , 0.6 , 0.09, 0.23, 0.43, 0.65,
       0.22, 0.12, 0.2 , 0.33])

In [84]: depth
Out[84]: 
array([0.1  , 0.3  , 0.31 , 0.44 , 0.49 , 1.1  , 1.145, 1.33 , 1.49 ,
       1.53 , 1.67 , 1.79 , 1.87 , 2.1  , 2.3  ])

In [85]: bin_data(chl, depth, bin_start=0, bin_length= 0.5)
Out[85]: array([0.198,   nan, 0.28 , 0.355, 0.265])
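
To see what the two NumPy calls contribute, the intermediate arrays can be inspected on the sample data. This is only an illustrative sketch: the `edges` array is written out by hand here (an assumption for the example) rather than derived from the data as in bin_data.

```python
import numpy as np

chl = np.array([0.4, 0.1, 0.04, 0.05, 0.4, 0.2, 0.6, 0.09, 0.23,
                0.43, 0.65, 0.22, 0.12, 0.2, 0.33])
depth = np.array([0.1, 0.3, 0.31, 0.44, 0.49, 1.1, 1.145, 1.33, 1.49,
                  1.53, 1.67, 1.79, 1.87, 2.1, 2.3])

# Hand-written bin edges, for illustration only
edges = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])

# Position in the sorted depth array where each edge would be inserted
idx = np.searchsorted(depth, edges)
# Number of samples between consecutive edges
lens = np.diff(idx)
# reduceat sums chl[idx[i]:idx[i+1]]; for an empty bin it returns chl[idx[i]]
sums = np.add.reduceat(chl, idx[:-1])

print(idx)   # [ 0  5  5  9 13 15]
print(lens)  # [5 0 4 4 2]
```

The second bin is empty (lens is 0 there), so its entry in `sums` is just a stray sample and is not a real sum. That is why the result has to be masked to NaN wherever lens == 0.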

- Divakar

3

One way is to use numpy.digitize to bin your categories.

Then use a dictionary or list comprehension to calculate the results.

import numpy as np

chl  = np.array([0.4,0.1,0.04,0.05,0.4,0.2,0.6,0.09,0.23,0.43,0.65,0.22,0.12,0.2,0.33])
depth = np.array([0.1,0.3,0.31,0.44,0.49,1.1,1.145,1.33,1.49,1.53,1.67,1.79,1.87,2.1,2.3])

bins = np.array([0,0.5,1.0,1.5,2.0,2.5])

A = np.vstack((np.digitize(depth, bins), chl)).T

res = {bins[int(i)]: np.mean(A[A[:, 0] == i, 1]) for i in np.unique(A[:, 0])}

# {0.5: 0.198, 1.5: 0.28, 2.0: 0.355, 2.5: 0.265}

Or, if you need the exact format:

res_lst = [np.mean(A[A[:, 0] == i, 1]) for i in range(len(bins))]

# [nan, 0.198, nan, 0.28, 0.355, 0.265]
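
For very long lists, the per-bin Python loop could be replaced by np.bincount, which accumulates sums and counts per label in one pass. This is a different technique from the comprehension above, sketched under the assumption that the labels come from np.digitize with the same `bins`; the names `labels`, `n`, `sums`, `counts` and `binned` are made up for the example.

```python
import numpy as np

chl = np.array([0.4, 0.1, 0.04, 0.05, 0.4, 0.2, 0.6, 0.09, 0.23,
                0.43, 0.65, 0.22, 0.12, 0.2, 0.33])
depth = np.array([0.1, 0.3, 0.31, 0.44, 0.49, 1.1, 1.145, 1.33, 1.49,
                  1.53, 1.67, 1.79, 1.87, 2.1, 2.3])
bins = np.array([0, 0.5, 1.0, 1.5, 2.0, 2.5])

# digitize returns 1..len(bins)-1 for depths inside [bins[0], bins[-1])
labels = np.digitize(depth, bins)
n = len(bins) + 1  # leaves room for the out-of-range labels 0 and len(bins)

sums = np.bincount(labels, weights=chl, minlength=n)
counts = np.bincount(labels, minlength=n)

# Mean per label; empty bins stay NaN and no division warning is raised
means = np.full(n, np.nan)
np.divide(sums, counts, out=means, where=counts > 0)

# Keep only the bins that lie between the first and last edge
binned = means[1:len(bins)]
```

Here `binned` has one entry per interval between consecutive edges, in depth order, so it lines up with bins[1:] as the upper-edge depth labels.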

- jpp

1

The only change I made was bins = np.arange(0.0,50.0,0.5), since that gives me more control, but other than that this works well. - Adam

3

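Here is one approach using pandas.cut.

import pandas as pd

df=pd.DataFrame({'chl':chl,'depth':depth})
# observed=False keeps empty bins in the result as NaN
df.groupby(pd.cut(df.depth,bins=[0,0.5,1,1.5,2,2.5]), observed=False).chl.mean()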
Out[456]: 
depth
(0.0, 0.5]    0.198
(0.5, 1.0]      NaN
(1.0, 1.5]    0.280
(1.5, 2.0]    0.355
(2.0, 2.5]    0.265
Name: chl, dtype: float64
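
To get the two flat arrays asked for in the question rather than an interval-indexed Series, the bins could be labelled by their upper edge. This is a minimal sketch, assuming the edges are built with np.arange as mentioned in the comments; `edges`, `depth_out` and `chl_out` are names invented for the example.

```python
import numpy as np
import pandas as pd

chl = [0.4, 0.1, 0.04, 0.05, 0.4, 0.2, 0.6, 0.09, 0.23,
       0.43, 0.65, 0.22, 0.12, 0.2, 0.33]
depth = [0.1, 0.3, 0.31, 0.44, 0.49, 1.1, 1.145, 1.33, 1.49,
         1.53, 1.67, 1.79, 1.87, 2.1, 2.3]

edges = np.arange(0.0, 3.0, 0.5)  # 0.0, 0.5, ..., 2.5

df = pd.DataFrame({'chl': chl, 'depth': depth})
# Bins are right-closed by default, so label each one by its upper edge
binned = pd.cut(df['depth'], bins=edges, labels=edges[1:])
# observed=False keeps empty bins, so the result stays aligned with edges[1:]
avg = df.groupby(binned, observed=False)['chl'].mean()

depth_out = edges[1:]      # upper edge of each bin
chl_out = avg.to_numpy()   # mean chlorophyll per bin, NaN where empty
```

Because the categories are ordered, the averages come back in depth order rather than sorted by value.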

- BENY

- miradulo · Accepted Answer

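I'm surprised scipy.stats.binned_statistic hasn't been mentioned yet. You can compute the mean with it directly and specify the bins through its optional parameters.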

from scipy.stats import binned_statistic

mean_stat = binned_statistic(depth, chl, 
                             statistic='mean', 
                             bins=5, 
                             range=(0, 2.5))

mean_stat.statistic
# array([0.198,   nan, 0.28 , 0.355, 0.265])
mean_stat.bin_edges
# array([0. , 0.5, 1. , 1.5, 2. , 2.5])
mean_stat.binnumber
# array([1, 1, 1, ..., 4, 5, 5])
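
When the depth range is not known in advance, the `bins` argument also accepts an explicit sequence of edges. The following is a rough sketch that derives 0.5 m edges from the data; the rounding of the start and stop to multiples of `bin_length` is an assumption of this example, not part of the answer above.

```python
import numpy as np
from scipy.stats import binned_statistic

chl = np.array([0.4, 0.1, 0.04, 0.05, 0.4, 0.2, 0.6, 0.09, 0.23,
                0.43, 0.65, 0.22, 0.12, 0.2, 0.33])
depth = np.array([0.1, 0.3, 0.31, 0.44, 0.49, 1.1, 1.145, 1.33, 1.49,
                  1.53, 1.67, 1.79, 1.87, 2.1, 2.3])

bin_length = 0.5
# Round the shallowest depth down and the deepest depth up to a bin boundary
start = np.floor(depth.min() / bin_length) * bin_length
stop = np.ceil(depth.max() / bin_length) * bin_length
n_bins = int(round((stop - start) / bin_length))
edges = start + bin_length * np.arange(n_bins + 1)

result = binned_statistic(depth, chl, statistic='mean', bins=edges)

depth_out = result.bin_edges[1:]  # upper edge of each bin
chl_out = result.statistic        # mean per bin, NaN for empty bins
```

Note that binned_statistic treats bins as left-closed (except the last one), whereas pandas.cut defaults to right-closed bins, so samples lying exactly on an edge can land in different bins between the two approaches.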