Calculate the average of the n-th elements in a column in pandas

Question

4

I have the following dataframe:

             df1
index   year   week   a     b     c
 -10    2017    10   45    26    19
  -9    2017    11   37    23    14
  -8    2017    12   21    66    19
  -7    2017    13   47    36    92
  -6    2017    14   82    65    18
  -5    2017    15   68    68    19
  -4    2017    16   30    95    24
  -3    2017    17   21    15    94
  -2    2017    18   67    30    16
  -1    2017    19   10    13    13
   0    2017    20   26    22    18
   1    2017    21   NaN   NaN   NaN
   2    2017    22   NaN   NaN   NaN
   3    2017    23   NaN   NaN   NaN
   4    2017    24   NaN   NaN   NaN
   ...
   53   2018    20   NaN   NaN   NaN

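For each empty cell, I need to calculate the average of the previous n values in the same column and write that value into the cell. Here n is the number of index values from zero upward. For example, for the first empty cell in column a, I must calculate the average of the values between index 0 and -10. For the next cell, the average between 1 and -9, and so on. The same goes for columns a, b and c. The calculation always starts at index = 1.

The problem is that the number of columns (such as a, b, c) can vary. But I do know that these columns will always come after the week column. Is it possible to apply these calculations to an indefinite number of columns, given that they are known to come after the week column?

I searched hard but could not find anything suitable. Update: if it helps, the maximum number of rows from index = 0 downward will be 53.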

- yanadm

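When you say "then for the next cell, between 1 and -9, and so on", do you mean a) compute the average between -9 and 0 and ignore the NaN at 1, or b) compute the average between -9 and 1, using the new value computed in the previous "iteration"? - jdehesa

@jdehesa, yes, I need to use the new value in cell 1, as you describe in b). - yanadm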

1

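You can actually use loc with a slice and then drop to get only columns a, b, c (df1.loc[:,'week':].drop('week', axis=1)). I don't think there is a pure pandas solution (unless some pandas wizard comes up with one) for this moving average, because you need to average previously calculated averages, so you will probably need a Python loop. If performance is critical, you could consider Cython or Numba to speed up the loop. - P.Tillmann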
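The column selection described in this comment can be sketched as follows. The small frame below is a hypothetical miniature of df1, with "index" assumed to be set as the DataFrame index; only the loc slice and the drop call come from the comment itself.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df1 (assumption: "index" is the DataFrame index).
df1 = pd.DataFrame(
    {
        "year": [2017, 2017, 2017],
        "week": [19, 20, 21],
        "a": [10.0, 26.0, np.nan],
        "b": [13.0, 22.0, np.nan],
    },
    index=pd.Index([-1, 0, 1], name="index"),
)

# Label slice from 'week' to the last column, then drop 'week' itself,
# so the result holds every value column regardless of how many there are.
value_cols = df1.loc[:, "week":].drop("week", axis=1)
print(list(value_cols.columns))
```

Because the slice is label-based, it does not depend on how many value columns follow week, only on week being present.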

2 Answers

1

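You can achieve something like this with a bit of pandas and numpy. Assuming you know the position of the week column (and even if you don't, a simple search will give you the index), say week is the 3rd column, you can do the following.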

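import numpy as np
import pandas as pd

# data is your dataframe name
column_list = list(data.columns)[3:]  # every column after week
for column_name in column_list:
    # copy the pandas Series into a float NumPy array
    column = data[column_name].to_numpy(dtype=float, copy=True)
    for index in range(column.shape[0]):
        # iterate over the entries in the column
        if np.isnan(column[index]):
            # nanmean skips the NaN at `index`, so this averages the 10 entries before it
            window = np.arange(index - 10, index + 1)
            column[index] = np.nanmean(column.take(window, mode='wrap'))
    # the array is a copy, so write the filled values back explicitly
    data[column_name] = column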

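This is not a very optimized solution, but it should work. It replaces every NaN entry with the mean of the previous 10 entries, wrapping around the ends of the column. If you only want the previous 10 without wrapping, just take however many entries (fewer than 10) are available, for example:
column[index] = np.nanmean(column[max(0,index-10):index+1]) Hope this helps!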
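To make the difference between the wrapping and non-wrapping windows concrete, here is a minimal sketch on a made-up four-element array with a window of 3 instead of 10. The array and the variable names are illustrative assumptions, not part of the answer.

```python
import numpy as np

# Made-up column; the NaN at the end is what a wrapped window can pull in.
column = np.array([1.0, 2.0, 3.0, np.nan])

# Window ending at position 1 and reaching 3 entries back: positions -2..1.
index, width = 1, 3
wrapped = column.take(np.arange(index - width, index + 1), mode="wrap")
clamped = column[max(0, index - width):index + 1]

# mode="wrap" maps -2 and -1 to the tail of the array (3.0 and NaN),
# while the clamped slice simply starts at position 0.
print(wrapped, np.nanmean(wrapped))
print(clamped, np.nanmean(clamped))
```

With the data in the question the first NaN sits 11 rows down, so a 10-entry window never reaches past the start of the column and both variants give the same result there.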

- Rudresh Panchal

- Rayhane Mama · Accepted Answer

This can be done as follows:

n = 11 # in the example of your explanation
df = df1.loc[range(1,df1.index[-1]+1)] # select rows from index 1 above

df should look like this:

       year  week   a   b   c
index                        
1      2017    21 NaN NaN NaN
2      2017    22 NaN NaN NaN
3      2017    23 NaN NaN NaN
4      2017    24 NaN NaN NaN

Then:

for s in list(df.index): # iterate through rows with nan values
    for i in range(2,df.columns.size): # iterate through different cols ('a','b','c' or more)
        df1.loc[s,df.columns[i]] = df1.loc[range(s-n,s),df.columns[i]].sum()/n
print(df1)

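Note that in this case I followed your example and assumed that year is always the first column and week is always the second, so that everything after week is selected. The index column is the DataFrame index.

Output: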
       year  week          a          b          c
index                                             
-10    2017    10  45.000000  26.000000  19.000000
-9     2017    11  37.000000  23.000000  14.000000
-8     2017    12  21.000000  66.000000  19.000000
-7     2017    13  47.000000  36.000000  92.000000
-6     2017    14  82.000000  65.000000  18.000000
-5     2017    15  68.000000  68.000000  19.000000
-4     2017    16  30.000000  95.000000  24.000000
-3     2017    17  21.000000  15.000000  94.000000
-2     2017    18  67.000000  30.000000  16.000000
-1     2017    19  10.000000  13.000000  13.000000
 0     2017    20  26.000000  22.000000  18.000000
 1     2017    21  41.272727  41.727273  31.454545
 2     2017    22  40.933884  43.157025  32.586777
 3     2017    23  41.291510  44.989482  34.276484
 4     2017    24  43.136193  43.079434  35.665255
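For an unknown number of value columns, the same recursive fill could be wrapped in a helper that locates week by name instead of by position. This is only a sketch: the function name fill_with_trailing_mean and its parameters are hypothetical, it assumes the index column is already the DataFrame index, and it only fills NaNs that have at least n rows above them. The sample data is columns a and b from the question, truncated to index 2.

```python
import numpy as np
import pandas as pd

def fill_with_trailing_mean(frame, n, after="week"):
    """Fill NaNs with the mean of the n rows above, for every column after `after`."""
    out = frame.copy()
    start = out.columns.get_loc(after) + 1  # position of the first value column
    for name in out.columns[start:]:
        values = out[name].to_numpy(dtype=float, copy=True)
        for pos in range(n, len(values)):
            if np.isnan(values[pos]):
                # Earlier fills are already stored in `values`,
                # so they take part in this mean (option b in the comments).
                values[pos] = values[pos - n:pos].mean()
        out[name] = values
    return out

# Columns a and b from the question, index -10..2.
a = [45, 37, 21, 47, 82, 68, 30, 21, 67, 10, 26, np.nan, np.nan]
b = [26, 23, 66, 36, 65, 68, 95, 15, 30, 13, 22, np.nan, np.nan]
df1 = pd.DataFrame(
    {"year": 2017, "week": range(10, 23), "a": a, "b": b},
    index=pd.Index(range(-10, 3), name="index"),
)

filled = fill_with_trailing_mean(df1, n=11)
print(filled.loc[1:2, ["a", "b"]].round(6))
```

Working on a NumPy array per column keeps the inner loop free of repeated loc lookups, while the values for index 1 and 2 should agree with the output shown above.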