How to speed up computing a rolling weighted average in a pandas DataFrame

Question


4

I have a large DataFrame and I need to compute a rolling weighted average along each row.

I know I can do it like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20000, 50))
weights = [1/9, 2/9, 1/3, 2/9, 1/9]  
rolling_mean = df.rolling(5, axis=1).apply(lambda seq: np.average(seq, weights=weights))

The problem is that this takes about 40 seconds on my computer. Is there a way to speed up this computation?

- younggotti

4 Answers

3

A quick benchmark of the answers posted here:

from timeit import timeit

import numba as nb
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20000, 50))

weights = [1 / 9, 2 / 9, 1 / 3, 2 / 9, 1 / 9]


def rolling_normal(df):
    return df.rolling(5, axis=1).apply(lambda seq: np.average(seq, weights=weights))


def rolling_numba(df):
    return df.rolling(5, axis=1).apply(weighted_mean, engine="numba", raw=True)


def weighted_mean(seq):
    weights = [1 / 9, 2 / 9, 1 / 3, 2 / 9, 1 / 9]
    return np.average(seq, weights=weights)


def rollin_panda_kim(df):
    return sum(df.shift(num, axis=1) * w for num, w in enumerate(weights))


def rolling_safffh(df):
    @nb.njit
    def weighted_average(arr, weights):
        n = len(arr)
        result = np.empty(n - 4)  # The size of the resulting rolling window
        for i in range(n - 4):
            result[i] = np.average(arr[i : i + 5], weights=weights)
        return result

    # Apply the weighted_average function to the DataFrame
    return pd.DataFrame(
        np.apply_along_axis(weighted_average, axis=1, arr=df.values, weights=weights),
        columns=df.columns[4:],  # Adjust the columns to match the rolling window size
    )


# warm-up numba
rolling_numba(df)
rolling_safffh(df)


t1 = timeit("rolling_normal(x)", setup="x=df.copy()", number=1, globals=globals())
t2 = timeit("rolling_numba(x)", setup="x=df.copy()", number=1, globals=globals())
t3 = timeit("rollin_panda_kim(x)", setup="x=df.copy()", number=1, globals=globals())
t4 = timeit("rolling_safffh(x)", setup="x=df.copy()", number=1, globals=globals())

print(t1)
print(t2)
print(t3)
print(t4)

Prints on my machine (AMD 5700X/Python 3.11), in seconds:

21.627028748000157
0.3747533499990823
0.008139017998473719
0.40484421200017096

@PandaKim's solution is the fastest.

- Andrej Kesely

2

Make sure you have numba installed, then pass engine="numba" and raw=True as keyword arguments to apply().

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20000, 50))

def weighted_mean(seq):
    weights = [1/9, 2/9, 1/3, 2/9, 1/9]
    return np.average(seq, weights=weights)

rolling_mean = df.rolling(5, axis=1).apply(weighted_mean, engine="numba", raw=True)

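The pandas documentation page Enhancing performance explains the basics of numba:

Numba (JIT compilation)

An alternative to statically compiling Cython code is to use a dynamic just-in-time (JIT) compiler with Numba.

Numba allows you to write a pure Python function which can be JIT compiled to native machine instructions, similar in performance to C, C++ and Fortran, by decorating your function with @jit.

Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware and is designed to integrate with the Python scientific software stack.

Benchmark

On my machine this solution takes 1.087 seconds. Panda Kim's solution is better, taking only 0.058 seconds.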
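As a side note, raw=True can also be used on its own, without the numba engine. With raw=True pandas hands each window to the function as a plain NumPy array instead of building a Series, which the pandas documentation describes as much faster for NumPy reductions. The sketch below is untimed and only illustrates the call on a small frame; the names small, weighted_mean and raw_result are made up for the example, and the transpose is an assumption used to avoid the axis=1 argument of rolling().

```python
import numpy as np
import pandas as pd

weights = np.array([1 / 9, 2 / 9, 1 / 3, 2 / 9, 1 / 9])
small = pd.DataFrame(np.random.default_rng(2).random((50, 12)))


def weighted_mean(window):
    # With raw=True, `window` is a 1-D ndarray of length 5, not a Series.
    return np.dot(window, weights) / weights.sum()


# Rolling down the transposed frame is the same as rolling across each row.
raw_result = small.T.rolling(len(weights)).apply(weighted_mean, raw=True).T
print(raw_result.shape)
```

The first four columns of raw_result are NaN because no full 5-wide window is available there, which matches the behaviour of the original rolling().apply() call.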

- Xukrao

1

To speed up computing a rolling row-wise weighted average on a large DataFrame, you can use Numba.

import numpy as np
import pandas as pd
import numba as nb

# Sample data
np.random.seed(42)
df = pd.DataFrame(np.random.rand(20000, 50))
weights = [1/9, 2/9, 1/3, 2/9, 1/9]

# Define a Numba JIT-compiled function for the weighted average
@nb.njit
def weighted_average(arr, weights):
    n = len(arr)
    result = np.empty(n - 4)  # The size of the resulting rolling window
    for i in range(n - 4):
        result[i] = np.average(arr[i:i+5], weights=weights)
    return result

# Apply the weighted_average function to the DataFrame
rolling_mean = pd.DataFrame(
    np.apply_along_axis(weighted_average, axis=1, arr=df.values, weights=weights),
    columns=df.columns[4:],  # Adjust the columns to match the rolling window size
)

print(rolling_mean)
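For comparison, the same windowed layout (one output column per full window, so four fewer columns than the input) can be produced without Numba. The following is a sketch of a different technique, not the code from this answer: it assumes NumPy 1.20 or later for sliding_window_view, and the names small, windows and windowed_mean are invented for the example. No timing is claimed for it.

```python
import numpy as np
import pandas as pd

weights = np.array([1 / 9, 2 / 9, 1 / 3, 2 / 9, 1 / 9])
small = pd.DataFrame(np.random.default_rng(1).random((50, 12)))

# Read-only view of shape (rows, cols - 4, 5): one 5-wide window per output cell.
windows = np.lib.stride_tricks.sliding_window_view(
    small.to_numpy(), len(weights), axis=1
)

# Dot every window with the weights; divide by their sum as np.average does.
values = windows @ weights / weights.sum()

# Label each result with the last column of its window, as in the answer above.
windowed_mean = pd.DataFrame(
    values, index=small.index, columns=small.columns[len(weights) - 1:]
)
print(windowed_mean.shape)
```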

- Safffh

1

Benchmark result for this solution: 0.393 seconds (on my machine).

Content provided by Stack Overflow.

- Panda Kim · Accepted Answer

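Code

Multiply df by weights[0] to create a new DataFrame, then shift df by one position and multiply by weights[1], shift it by two positions and multiply by weights[2], and so on. Summing all of the resulting DataFrames gives the rolling weighted average and speeds up the process.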

sum([df.shift(num, axis=1) * w for num, w in enumerate(weights)])

It took 0.05986 seconds.
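Two details are worth checking before reusing the shift-and-sum idea with other weights. First, shift(k, axis=1) pairs weights[k] with the value k columns to the left, so weights[0] lands on the newest value in the window, whereas np.average applies weights[0] to the oldest. The weights in the question are symmetric, so both orderings agree. Second, np.average divides by the sum of the weights, which is 1 here. The sketch below handles both cases and compares against the slow version on a small frame; rolling_shift and rolling_reference are hypothetical helper names, and the transpose in the reference is an assumption used to avoid the axis=1 argument of rolling().

```python
import numpy as np
import pandas as pd

weights = [1 / 9, 2 / 9, 1 / 3, 2 / 9, 1 / 9]
small = pd.DataFrame(np.random.default_rng(0).random((50, 12)))


def rolling_shift(frame, w):
    """Shift-and-sum weighted mean along each row (hypothetical helper)."""
    # At column j, shift(k, axis=1) exposes column j - k. Reversing w pairs
    # w[0] with the oldest value in the window, the ordering np.average uses.
    total = sum(frame.shift(k, axis=1) * wk for k, wk in enumerate(reversed(w)))
    return total / sum(w)  # normalise in case the weights do not sum to 1


def rolling_reference(frame, w):
    """Slow reference: np.average over every 5-wide window of each row."""
    return (
        frame.T.rolling(len(w))
        .apply(lambda s: np.average(s, weights=w), raw=True)
        .T
    )


fast = rolling_shift(small, weights)
slow = rolling_reference(small, weights)
print(np.allclose(fast.to_numpy(), slow.to_numpy(), equal_nan=True))
```

As with rolling(), the first four columns of the result are NaN because the shifted frames have no data there.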