Rolling average with window size and interval based on column values

Question

3

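I'm trying to compute a rolling average over some incomplete data. I want to average the values in column 2 over a window that spans 1.0 in the values of column 1 (miles). I have tried .rolling(), but (from my limited understanding) it only creates windows based on the index, not on column values.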

import pandas as pd
import numpy as np

df = pd.DataFrame([
        [4.5, 10],
        [4.6, 11],
        [4.8, 9],
        [5.5, 6],
        [5.6, 6],
        [8.1, 10],
        [8.2, 13]
    ])

averages = []
for index in range(len(df)):
    nearby = df.loc[np.abs(df[0] - df.loc[index][0]) <= 0.5]
    averages.append(nearby[1].mean())
df['rollingAve'] = averages

This gives the desired output:

     0   1  rollingAve
0  4.5  10        10.0
1  4.6  11        10.0
2  4.8   9        10.0
3  5.5   6         6.0
4  5.6   6         6.0
5  8.1  10        11.5
6  8.2  13        11.5

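However, this slows down considerably on large DataFrames. Is there a way to get .rolling()-style functionality with a varying window size, or something similar?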

- Sergestus

Your code's indentation seems to be off. - piterbarg

@piterbarg Thanks, I've fixed it; it should run now. - Sergestus

2 Answers

0

df.rolling and series.rolling allow value-based windows if the index type is DatetimeIndex or TimedeltaIndex. You can use this to get close to the desired result:

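# Hack: treat miles as seconds; float values passed to TimedeltaIndex are read as nanoseconds
df = df.set_index(pd.TimedeltaIndex(df[0]*1e9))
# a "1s" window therefore corresponds to 1.0 mile
df["rolling_mean"] = df[1].rolling("1s").mean()
df = df.reset_index(drop=True)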

Output:

     0   1  rolling_mean
0  4.5  10     10.000000
1  4.6  11     10.500000
2  4.8   9     10.000000
3  5.5   6      8.666667
4  5.6   6      7.000000
5  8.1  10     10.000000
6  8.2  13     11.500000

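Pros: This is a three-line solution that performs very well, because it relies on the pandas datetime backend.

Cons: This is definitely a hack. It requires casting your miles column to timedelta seconds, and the average is not centered (center is not implemented for datetimelike and offset-based windows).

Overall: if you value performance and can live with a non-centered average, this is a good option. It just needs a comment or two.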
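
To see concretely why these numbers differ from the desired output, here is a minimal NumPy-only sketch of the window that rolling("1s") uses by default: a trailing interval (x - 1, x], closed on the right, rather than a centered one. The helper name trailing_window_mean is made up for this illustration, and it assumes the positions are sorted in ascending order.

```python
import numpy as np

def trailing_window_mean(pos, val, width):
    """Mean of val over the trailing window (pos[i] - width, pos[i]].

    Hypothetical helper for illustration; assumes pos is sorted ascending.
    """
    pos = np.asarray(pos, dtype=float)
    val = np.asarray(val, dtype=float)
    # First index whose position is strictly greater than pos[i] - width.
    start = np.searchsorted(pos, pos - width, side="right")
    # One past the last index whose position is <= pos[i].
    end = np.searchsorted(pos, pos, side="right")
    # Prefix sums let each window sum be read off as a difference.
    csum = np.concatenate(([0.0], np.cumsum(val)))
    return (csum[end] - csum[start]) / (end - start)

miles = [4.5, 4.6, 4.8, 5.5, 5.6, 8.1, 8.2]
values = [10, 11, 9, 6, 6, 10, 13]
print(trailing_window_mean(miles, values, width=1.0))
```

On the sample data this reproduces the rolling_mean column above (10, 10.5, 10, 8.67, 7, 10, 11.5): the row at 5.5 averages the rows at 4.6, 4.8 and 5.5, and never sees the row at 5.6.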

- anon01

- Pierre D · Accepted Answer

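Pandas' BaseIndexer is quite handy, although it takes a bit of head-scratching to get it right.

In the following, I use np.searchsorted to quickly find the (start, end) indices of each window: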

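from pandas.api.indexers import BaseIndexer

class RangeWindow(BaseIndexer):
    def __init__(self, val, width):
        super().__init__()     # initialize the base-class attributes pandas may read
        self.val = val.values  # window positions; must be sorted in ascending order
        self.width = width

    # pandas >= 1.5 expects the `step` parameter in this signature (unused here);
    # drop it on older versions.
    def get_window_bounds(self, num_values, min_periods, center, closed, step=None):
        if min_periods is None: min_periods = 0
        if closed is None: closed = 'left'
        # offsets of the window edges relative to each value
        w = (-self.width/2, self.width/2) if center else (0, self.width)
        # a closed edge includes values that fall exactly on it
        side0 = 'left' if closed in ['left', 'both'] else 'right'
        side1 = 'right' if closed in ['right', 'both'] else 'left'
        ix0 = np.searchsorted(self.val, self.val + w[0], side=side0)
        ix1 = np.searchsorted(self.val, self.val + w[1], side=side1)
        # widen windows to at least min_periods rows, without running past the data
        ix1 = np.minimum(np.maximum(ix1, ix0 + min_periods), num_values)

        return ix0, ix1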

Some deluxe options: min_periods, center, and closed are implemented as specified by DataFrame.rolling.
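
As a rough illustration of how closed translates into the side argument of np.searchsorted, the sketch below computes the window bounds outside of pandas. The function window_bounds is a hypothetical standalone helper (not part of pandas), and it assumes the positions are sorted in ascending order.

```python
import numpy as np

def window_bounds(pos, width, center=True, closed="both"):
    """Return (start, end) so that window i covers pos[start[i]:end[i]].

    Hypothetical standalone helper; assumes pos is sorted ascending.
    """
    pos = np.asarray(pos, dtype=float)
    lo, hi = (-width / 2, width / 2) if center else (0.0, width)
    # A closed left edge keeps values equal to the lower bound (side="left").
    side_lo = "left" if closed in ("left", "both") else "right"
    # A closed right edge keeps values equal to the upper bound (side="right").
    side_hi = "right" if closed in ("right", "both") else "left"
    start = np.searchsorted(pos, pos + lo, side=side_lo)
    end = np.searchsorted(pos, pos + hi, side=side_hi)
    return start, end

pos = [0.0, 0.5, 1.0, 1.5, 2.0]
for closed in ("both", "neither"):
    start, end = window_bounds(pos, width=1.0, closed=closed)
    print(closed, start, end)
```

With closed="both", the window at 1.0 covers indices 1 through 3 (the points 0.5, 1.0 and 1.5). With closed="neither", the points lying exactly on the edges drop out and only 1.0 itself remains.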

Application:

df = pd.DataFrame([
        [4.5, 10],
        [4.6, 11],
        [4.8, 9],
        [5.5, 6],
        [5.6, 6],
        [8.1, 10],
        [8.2, 13]
    ], columns='a b'.split())

df.b.rolling(RangeWindow(df.a, width=1.0), center=True, closed='both').mean()

# gives:
0    10.0
1    10.0
2    10.0
3     6.0
4     6.0
5    11.5
6    11.5
Name: b, dtype: float64

Timing:

df = pd.DataFrame(
    np.random.uniform(0, 1000, size=(1_000_000, 2)),
    columns='a b'.split(),
)
df = df.sort_values('a').reset_index(drop=True)


%%time
avg = df.b.rolling(RangeWindow(df.a, width=1.0)).mean()

CPU times: user 133 ms, sys: 3.58 ms, total: 136 ms
Wall time: 135 ms

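Performance update:

Following @anon01's comment, I wondered whether the rolling computation could be made faster when the window is large. It turns out I should have measured pandas' rolling mean and sum performance first... (premature optimization, anyone?) See the end for why.

Anyway, the idea is to do a single cumsum, then take the difference of the elements dereferenced by the window endpoints: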

# both below working on numpy arrays:
def fast_rolling_sum(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width/2, side='left')
    ix1 = np.searchsorted(a, a + width/2, side='right')
    return z[ix1] - z[ix0]

def fast_rolling_mean(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width/2, side='left')
    ix1 = np.searchsorted(a, a + width/2, side='right')
    return (z[ix1] - z[ix0]) / (ix1 - ix0)

With this (and the 1-million-row df above), I see:

%timeit fast_rolling_mean(df.a.values, df.b.values, width=100.0)
# 93.9 ms ± 335 µs per loop

versus:

%timeit df.rolling(RangeWindow(df.a, width=100.0), min_periods=1).mean()
# 248 ms ± 1.54 ms per loop

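However!!! Pandas is most likely already doing such an optimization (it is a pretty obvious one). The timings do not increase as the window gets larger (which is why I said I should have checked first).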
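
As a sanity check on the cumsum idea, the following sketch compares it with the brute-force loop from the question on the small sample data. The names brute_force_mean and cumsum_window_mean are chosen for this illustration only, and the prefix-sum version assumes a is sorted in ascending order.

```python
import numpy as np

def brute_force_mean(a, b, width):
    """Reference: mean of b over all rows with |a - a[i]| <= width / 2."""
    return np.array([b[np.abs(a - x) <= width / 2].mean() for x in a])

def cumsum_window_mean(a, b, width):
    """Same centered, closed window via prefix sums; a must be sorted."""
    z = np.concatenate(([0.0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width / 2, side="left")
    ix1 = np.searchsorted(a, a + width / 2, side="right")
    return (z[ix1] - z[ix0]) / (ix1 - ix0)

a = np.array([4.5, 4.6, 4.8, 5.5, 5.6, 8.1, 8.2])
b = np.array([10.0, 11.0, 9.0, 6.0, 6.0, 10.0, 13.0])
print(cumsum_window_mean(a, b, width=1.0))
print(np.allclose(cumsum_window_mean(a, b, 1.0), brute_force_mean(a, b, 1.0)))
```

Two caveats follow from relying on a single cumsum: a NaN in b propagates into every later prefix sum, so every window that ends at or after that row becomes NaN, and subtracting two large prefix sums can lose some floating-point precision on long arrays.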