Vectorizing an extremely slow groupBy

4
I have a DataFrame that is mostly vectorized, but a few columns require a groupby loop. The speed is acceptable for small datasets, but it becomes extremely slow for anything above 50k+ rows.
The basic idea: whenever the column 'unique' has a value (np.isfinite), wait a set number of days (e.g. 4) and then set 'complete' to True. Repeat. Positive hits that fall inside the 4-period (day) waiting window should be ignored.
This is what I have now. It works correctly, but it is very slow, and I would really like to know how to vectorize it.
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

# Two names, 30 daily timestamps each
times = np.arange(datetime(2019, 11, 1), datetime(2019, 12, 1), timedelta(days=1)).astype(datetime)
times = np.concatenate([times, times])
names = np.array(['ALFA'] * 30 + ['BETA'] * 30)

# Random signal: values below 0.7 become NaN
unique = np.random.randn(60)
unique[unique < 0.7] = np.nan

df = pd.DataFrame({'unique': unique, 'complete': np.nan}, index=[names, times])
df.index = df.index.set_names(['Name', 'Date'])

df['num'] = df.groupby('Name').cumcount()
entryNum, posit = len(df.index) + 1, 0

for n, group in df.groupby(level='Name'):
    posit = 0
    for date, col in group.groupby(level='Date'):
        # 4 periods after an accepted entry: mark complete and reset
        if col['num'].iloc[0] - entryNum == 4:
            posit = 0
            df.loc[(n, date), 'complete'] = True
        # not currently waiting and 'unique' has a value: accept a new entry
        if not posit and np.isfinite(col['unique'].iloc[0]):
            posit = 1
            entryNum = col['num'].iloc[0]
rafaelc's solution is excellent, but its results diverge in some cases:

A dataset for testing the unique column:

unique = [0.808154, np.nan, np.nan, 0.976455, np.nan, 1.81917, np.nan, 0.732306, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 0.878656, np.nan, 1.087899, 1.57941, 1.211292, np.nan, 1.431411, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 1.323002, 1.339211, np.nan, np.nan, 1.322755, np.nan, 0.960014, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 1.833514, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 2.3884, np.nan, np.nan, 1.372292, np.nan, np.nan]

Output:

                   unique complete  countnonnull  solution
Name Date
ALFA 2019-11-01  0.808154      NaN           1.0     False
     2019-11-02       NaN      NaN           1.0     False
     2019-11-03       NaN      NaN           1.0     False
     2019-11-04  0.976455      NaN           2.0     False
     2019-11-05       NaN     True           1.0      True
     2019-11-06  1.819170      NaN           2.0     False
     2019-11-07       NaN      NaN           2.0     False
     2019-11-08  0.732306      NaN           2.0     False
     2019-11-09       NaN      NaN           2.0     False
     2019-11-10       NaN     True           1.0     False
     2019-11-11       NaN      NaN           1.0     False
     2019-11-12       NaN      NaN           0.0     False
     2019-11-13       NaN      NaN           0.0     False
     2019-11-14       NaN      NaN           0.0     False
     2019-11-15       NaN      NaN           0.0     False
     2019-11-16       NaN      NaN           0.0     False
     2019-11-17       NaN      NaN           0.0     False
     2019-11-18  0.878656      NaN           1.0     False
     2019-11-19       NaN      NaN           1.0     False
     2019-11-20  1.087899      NaN           2.0     False
     2019-11-21  1.579410      NaN           3.0     False
     2019-11-22  1.211292     True           3.0      True
     2019-11-23       NaN      NaN           3.0     False
     2019-11-24  1.431411      NaN           3.0     False
     2019-11-25       NaN      NaN           2.0     False
     2019-11-26       NaN     True           1.0     False
     2019-11-27       NaN      NaN           1.0     False
     2019-11-28       NaN      NaN           0.0     False
     2019-11-29       NaN      NaN           0.0     False
     2019-11-30       NaN      NaN           0.0     False
BETA 2019-11-01  1.323002      NaN           1.0     False
     2019-11-02  1.339211      NaN           2.0     False
     2019-11-03       NaN      NaN           2.0     False
     2019-11-04       NaN      NaN           2.0     False
     2019-11-05  1.322755     True           2.0      True
     2019-11-06       NaN      NaN           1.0     False
     2019-11-07  0.960014      NaN           2.0     False
     2019-11-08       NaN      NaN           2.0     False
     2019-11-09       NaN     True           1.0     False
     2019-11-10       NaN      NaN           1.0     False
     2019-11-11       NaN      NaN           0.0     False
     2019-11-12       NaN      NaN           0.0     False
     2019-11-13       NaN      NaN           0.0     False
     2019-11-14  1.833514      NaN           1.0     False
     2019-11-15       NaN      NaN           1.0     False
     2019-11-16       NaN      NaN           1.0     False
     2019-11-17       NaN      NaN           1.0     False
     2019-11-18       NaN     True           0.0      True
     2019-11-19       NaN      NaN           0.0     False
     2019-11-20       NaN      NaN           0.0     False
     2019-11-21       NaN      NaN           0.0     False
     2019-11-22       NaN      NaN           0.0     False
     2019-11-23       NaN      NaN           0.0     False
     2019-11-24       NaN      NaN           0.0     False
     2019-11-25  2.388400      NaN           1.0     False
     2019-11-26       NaN      NaN           1.0     False
     2019-11-27       NaN      NaN           1.0     False
     2019-11-28  1.372292      NaN           2.0     False
     2019-11-29       NaN     True           1.0      True
     2019-11-30       NaN      NaN           NaN     False
2 Answers

1
Let's play with shift. My solution output below holds False/True values, but you can easily map those back to NaN to match your expected output exactly (see the one-liner after the output); that's just a detail.
g = df.groupby(level=0)['unique']

# For each row, count the finite 'unique' values in the 4-row window
# ending at that row (the shift(-1) re-aligns the 1..4 lags onto 0..3)
df['countnonnull'] = (pd.DataFrame({f'shift_{i}': g.shift(i) for i in range(1, 5)})
                        .notnull()
                        .sum(1)
                        .shift(-1))

# Keep 'unique' only where the window count is exactly 1 (a fresh
# signal), then flag completion 4 rows later within each group
df['solution'] = (df['unique'].where(df['countnonnull'].eq(1))
                              .groupby(level=0)
                              .shift(4)
                              .notnull())

                   unique expected  solution
Name Date                                   
ALFA 2019-11-01  1.624345      NaN     False
     2019-11-02       NaN      NaN     False
     2019-11-03       NaN      NaN     False
     2019-11-04       NaN      NaN     False
     2019-11-05  0.865408     True      True
     2019-11-06       NaN      NaN     False
     2019-11-07  1.744812      NaN     False
     2019-11-08       NaN      NaN     False
     2019-11-09       NaN     True      True
     2019-11-10       NaN      NaN     False
     2019-11-11  1.462108      NaN     False
     2019-11-12       NaN      NaN     False
     2019-11-13       NaN      NaN     False
     2019-11-14       NaN      NaN     False
     2019-11-15  1.133769     True      True
     2019-11-16       NaN      NaN     False
     2019-11-17       NaN      NaN     False
     2019-11-18       NaN      NaN     False
     2019-11-19       NaN     True      True
     2019-11-20       NaN      NaN     False
     2019-11-21       NaN      NaN     False
     2019-11-22  1.144724      NaN     False
     2019-11-23  0.901591      NaN     False
     2019-11-24       NaN      NaN     False
     2019-11-25  0.900856      NaN     False
     2019-11-26       NaN     True      True
     2019-11-27       NaN      NaN     False
     2019-11-28       NaN      NaN     False
     2019-11-29       NaN      NaN     False
     2019-11-30       NaN      NaN     False
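
As noted above, matching the question's NaN/True format is then a one-line cleanup (a sketch; where keeps the Trues and turns every False into NaN):

# Keep True as-is, replace False with NaN to mirror the 'complete' column
df['solution'] = df['solution'].where(df['solution'])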

1
Your solution does give the correct output. But I still think this kind of cumulative-threshold processing can't be vectorized and has to be done sequentially. I must be wrong somewhere. - Quang Hoang
@misantroop Interesting. Can you provide a reproducible example? - rafaelc
@QuangHoang Yes, that was my impression too. Well, I 'bypassed' the need for a loop by creating shifted columns (N = 4 in this case), which means it can consume a lot of memory.. - rafaelc
@rafael, please try your solution with all 1s in unique. - Quang Hoang
1
I'll take a look as soon as I can! - rafaelc

1
Here's my approach, using only a single groupby:
def update(v, thresh=4):
    # Keep a candidate True only if at least `thresh` rows have passed
    # since the last accepted one; suppress Trues inside the window
    ret = v.copy()
    count = thresh + 1              # start outside any waiting window
    for i in ret.index:
        count += 1
        if ret.loc[i]:
            if count >= thresh:
                count = 0           # accept this True, restart the window
            else:
                ret.loc[i] = np.nan # inside the window: suppress it
    return ret

groups = df.groupby('Name')
# Candidate flags: True wherever 'unique' was finite 4 rows earlier
df['f_complete'] = groups['unique'].shift(4).notnull()
# Sequentially filter the candidates within each group
df['f_complete'] = groups['f_complete'].apply(update)
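
A quick sanity check (a sketch; it assumes the loop from the question has already filled df['complete'] on the same frame):

# Both columns should flag the same rows: True on completion, NaN/False elsewhere
same = df['f_complete'].fillna(False) == df['complete'].fillna(False)
print(same.all())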

Thanks. I'm seeing the correct output, but I'm not sure I should be using a loop. I'll try timing it on a large dataset later. - misantroop
1
Like I said in my comment on Rafael's answer, this dynamic really does require sequential processing, i.e. a loop. With apply you can vectorize across Name, but within each Name I'm afraid you don't have many options. - Quang Hoang
Surprisingly, despite looping, on a 22k-row dataset this is even 2-3% faster than the vectorized approach, and an order of magnitude faster than the original. And compared to my method the gap keeps growing (non-linearly). Unexpected! - misantroop
Which vectorized approach? Besides, looping over 50k rows just once isn't as bad as it sounds. - Quang Hoang
The one by rafaelc. I'm still looking for additional speed, possibly by applying numba if that's feasible. - misantroop
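
For reference, here is a minimal numba sketch of the per-name sequential scan discussed above (it assumes numba is installed and that the frame is sorted by Name, as in the example; complete_flags and wait are illustrative names, not from any answer in this thread):

from numba import njit

@njit
def complete_flags(finite, wait=4):
    # Sequential scan: accept a signal, mark complete `wait` rows later,
    # and ignore signals that fall inside the waiting window
    out = np.zeros(finite.shape[0], dtype=np.bool_)
    entry = -1                       # row index of the accepted signal
    for i in range(finite.shape[0]):
        if entry >= 0 and i - entry == wait:
            out[i] = True            # waiting period is over
            entry = -1
        if entry < 0 and finite[i]:
            entry = i                # accept a new signal
    return out

# Apply per name; groupby preserves the row order inside each group
df['numba_complete'] = np.concatenate([
    complete_flags(np.isfinite(g.to_numpy()))
    for _, g in df.groupby(level='Name')['unique']
])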
