Using np.average while ignoring NaNs?

16

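I have a matrix of shape (64,17), corresponding to time and latitude. I want to take a latitude-weighted average. I know np.average can do this because it accepts a weights argument, whereas np.nanmean, which I used to average over longitude, does not. However, unlike np.nanmean, np.average does not ignore NaN, so the first 5 entries of each row are included in the latitude average and turn the entire time series into NaN.

Is there a way to take a weighted average while excluding the NaNs from the calculation?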

import numpy as np
from netCDF4 import Dataset  # assumed: the original post omits its imports

file = Dataset("sst_aso_1951-2014latlon_seasavgs.nc")
sst = file.variables['sst']
lat = file.variables['lat']

sst_filt = np.asarray(sst)
missing_values_indices = sst_filt < -8000000   #missing values have value -infinity
sst_filt[missing_values_indices] = np.nan      #all missing values set to NaN

weights = np.cos(np.deg2rad(lat))
sst_zonalavg = np.nanmean(sst_filt, axis=2)
print(sst_zonalavg[0,:])
sst_ts = np.average(sst_zonalavg, axis=1, weights=weights)
print(sst_ts[:])

Output:

[ nan nan nan nan nan
 27.08499908 27.33333397 28.1457119 28.32899857 28.34454346
 28.27285767 28.18571472 28.10199928 28.10812378 28.03411865
 28.06411552 28.16529465]

[ nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan]
4 Answers

23

You can create a masked array like this:

import numpy as np

data = np.array([[1,2,3], [4,5,np.nan], [np.nan,6,np.nan], [0,0,0]])
# mask every NaN entry so it is excluded from the average
masked_data = np.ma.masked_array(data, np.isnan(data))
# calculate your weighted average here instead
weights = [1, 1, 1]
average = np.ma.average(masked_data, axis=1, weights=weights)
# this gives you the result, with any fully masked row filled with NaN
result = average.filled(np.nan)
print(result)

This outputs:

[ 2.   4.5  6.   0. ]
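
Applied to data shaped like the question's (time by latitude, with cos-latitude weights), the same idea might look like the sketch below. The array values and latitudes are made up for illustration, and np.ma.masked_invalid is used as a shortcut for building the mask; it is not what the answer above uses.

```python
import numpy as np

# Hypothetical stand-in for the zonally averaged field:
# 3 time steps x 4 latitudes, first latitude missing at every time step.
sst_zonalavg = np.array([
    [np.nan, 27.0, 28.0, 29.0],
    [np.nan, 26.0, 28.0, 30.0],
    [np.nan, 25.0, 27.0, 29.0],
])
lat = np.array([-60.0, -30.0, 0.0, 30.0])  # made-up latitudes in degrees
weights = np.cos(np.deg2rad(lat))

# masked_invalid masks NaN (and inf) entries in one step.
masked = np.ma.masked_invalid(sst_zonalavg)

# Masked entries and their weights are left out of the average.
sst_ts = np.ma.average(masked, axis=1, weights=weights).filled(np.nan)
print(sst_ts)
```

Because the weights at 30S and 30N are equal here, each row's result is simply the value it is symmetric around, which makes the output easy to check by hand.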

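I mentioned that I can't use np.nanmean because it doesn't take weights as an argument. I'm trying to take a weighted average. - Cebbie
I've updated the answer to use a masked array and np.mean. - Alex
I was just about to edit a mention of that into the original post, since dropping the NaNs from the data is also an option when working with a time series, but you beat me to it! - Cebbie
1
Actually, this still doesn't quite work. I still need a weighted average, which np.mean can't do. When I use np.average, it still outputs NaN. - Cebbie
2
I've updated my answer and it should work now; you need to use np.ma.average to handle masked arrays. Note the .ma - Alex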

9
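You can simply multiply the input array by weights and sum along the specified axis with np.nansum, which ignores NaNs. So for your case, assuming the weights apply along axis = 1 of the input array sst_filt, the summation would be -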
np.nansum(sst_filt*weights,axis=1)

To get the average, the weights at the NaN positions must also be left out of the normalizing sum, so we end up with:

def nanaverage(A,weights,axis):
    return np.nansum(A*weights,axis=axis)/((~np.isnan(A))*weights).sum(axis=axis)

Sample run -

In [200]: sst_filt  # 2D array case
Out[200]: 
array([[  0.,   1.],
       [ nan,   3.],
       [  4.,   5.]])

In [201]: weights
Out[201]: array([ 0.25,  0.75])

In [202]: nanaverage(sst_filt,weights=weights,axis=1)
Out[202]: array([0.75, 3.  , 4.75])

Does your solution still work if both arrays are 2D and both contain some NaNs? - user308827
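
On the 2D question in the comment above: as written, nanaverage appears to return NaN wherever weights itself contains NaN, because the denominator is a plain sum. A minimal sketch of one way to handle that case follows. The helper name nanaverage_2d is hypothetical, and it assumes both arrays have the same shape and that a position counts as missing if either array is NaN there.

```python
import numpy as np

def nanaverage_2d(values, weights, axis):
    """Weighted mean that skips positions where either input is NaN.

    Hypothetical helper; assumes `values` and `weights` share a shape.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    valid = ~np.isnan(values) & ~np.isnan(weights)
    # Zero out invalid positions in both arrays so they drop out of both sums.
    v = np.where(valid, values, 0.0)
    w = np.where(valid, weights, 0.0)
    return (v * w).sum(axis=axis) / w.sum(axis=axis)

values = np.array([[0.0, 1.0], [np.nan, 3.0], [4.0, 5.0]])
weights = np.array([[0.25, 0.75], [0.25, 0.75], [np.nan, 0.75]])
result = nanaverage_2d(values, weights, axis=1)
print(result)
```

A row with no valid positions would divide zero by zero and come out as NaN (with a runtime warning), which may or may not be the behaviour you want.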

5

I would select the part of the array that is not NaN, then use the same indices to select the weights.

For example:

import numpy as np
data = np.random.rand(10)
weights = np.random.rand(10)
data[[2, 4, 8]] = np.nan

print(data)
# [ 0.32849204,  0.90310062,         nan,  0.58580299,         nan,
#    0.934721  ,  0.44412978,  0.78804409,         nan,  0.24942098]

ii = ~np.isnan(data)
print(ii)
# [ True  True False  True False  True  True  True False  True]

result = np.average(data[ii], weights = weights[ii])
print(result)
# .6470319

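Edit: I realized this approach won't work for 2D arrays. In that case, I would probably just set both the values and the weights to zero wherever the data is NaN. That gives the same result as excluding those indices from the calculation.

Before running np.average: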

data[np.isnan(data)] = 0;
weights[np.isnan(data)] = 0;
result = np.average(data, weights=weights)

If you want to keep track of which indices were NaN, you can make a copy first.

Why doesn't your original solution work for 2D arrays? - user308827
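
One likely reason, sketched below: indexing a 2D array with a boolean mask returns a flat 1-D array, so the rows can no longer be averaged separately. The sketch then applies the zero-weight idea per row. The array values are made up, and the NaN mask is computed once and reused for both arrays.

```python
import numpy as np

data = np.array([[1.0, np.nan, 3.0],
                 [4.0, 5.0, 6.0]])
weights = np.array([1.0, 2.0, 3.0])  # one weight per column

mask = np.isnan(data)

# Boolean indexing flattens the result, so the row structure is lost.
print(data[~mask].shape)

# Zero both the value and its weight at every NaN position.
w2d = np.where(mask, 0.0, weights)   # weights broadcast across the rows
filled = np.where(mask, 0.0, data)
row_avg = np.average(filled, axis=1, weights=w2d)
print(row_avg)
```

Note that np.average raises ZeroDivisionError if the weights along the axis sum to zero, which is what would happen for a row that is entirely NaN.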

1

@deto

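The first line replaces every NaN in data with 0, so by the time the second line runs there are no NaNs left to find, the weights are never zeroed, and the result is wrong.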

data[np.isnan(data)] = 0;
weights[np.isnan(data)] = 0;
result = np.average(data, weights=weights)

You should make a backup copy before running the first line:

import copy

data_copy = copy.deepcopy(data)      # keeps the NaN positions
data[np.isnan(data_copy)] = 0
weights[np.isnan(data_copy)] = 0
result = np.average(data, weights=weights)
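
To see the ordering problem concretely, here is a small sketch with made-up numbers that compares both orderings against the index-selection approach. Storing the mask in a variable (nan_mask here, a name chosen for illustration) is an alternative to copying the whole array.

```python
import numpy as np

data = np.array([1.0, np.nan, 3.0])
weights = np.array([1.0, 5.0, 1.0])

# Reference: drop the NaN together with its weight.
keep = ~np.isnan(data)
expected = np.average(data[keep], weights=weights[keep])

# Problematic order: the NaNs are gone before the weights are masked.
d1, w1 = data.copy(), weights.copy()
d1[np.isnan(d1)] = 0
w1[np.isnan(d1)] = 0                 # no-op: d1 has no NaNs left
wrong = np.average(d1, weights=w1)   # (1 + 0 + 3) / 7

# Mask computed once, before anything is overwritten.
d2, w2 = data.copy(), weights.copy()
nan_mask = np.isnan(d2)
d2[nan_mask] = 0
w2[nan_mask] = 0
right = np.average(d2, weights=w2)   # (1 + 0 + 3) / 2

print(expected, wrong, right)
```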
