Why is Swifter slower than vanilla df.apply?

Question

4

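I have a DataFrame with 1 million rows and a single function (which cannot be vectorized) that has to be applied to every row. I tried Swifter to speed up the computation, since it promises to use multiple processes. On an 8-core machine, however, it performs worse than plain apply: about 79 seconds with Swifter versus 46 seconds without it.

Any ideas why?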

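# Assumption: Feature and Point come from the geojson package;
# the original post does not show its imports.
from time import time

import swifter  # registers the df.swifter accessor
from geojson import Feature, Point

def parse_row(n_print=None):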
    def f(row):
        if n_print is not None and row.name % n_print == 0:
            print(row.name, end="\r")
        return Feature(
            geometry=Point((float(row["longitude"]), float(row["latitude"]))),
            properties={
                "water_level": float(row["water_level"]),
                "return_period": float(row["return_period"])
            }
        )
    return f

In [12]: df["feature"] = df.swifter.apply(parse_row(), axis=1)
Dask Apply: 100%|████████████████████████████████████████| 48/48 [01:19<00:00,  1.65s/it]

In [13]: t = time(); df["feature"] = df.apply(parse_row(), axis=1); print(int(time() - t))
46

- ted

3

It looks like the speed depends on the number of rows. When size(df) < 10**8, df.swifter.apply(lambda x: 1 if x>5 else 0) is slower than a plain apply. Try pandarallel, which seems to work well. https://github.com/nalepae/pandarallel - notilas

1 Answer

- msarafzadeh · Accepted Answer

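It mostly depends on the processing power involved and on whether vectorization, parallel processing, or other optimization can actually improve this particular problem. Sometimes none of them helps. Also keep in mind that Swifter spends time estimating how long the work will take; sometimes df.apply is faster simply because it skips that estimate, and the optimization may not pay off anyway.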
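Since neither approach is guaranteed to win, one option is to time both on a sample before running the full job. The following is a minimal sketch of that idea using only the standard library; the helper names `time_call` and `pick_faster` are hypothetical and do not come from Swifter or pandas.

```python
import time


def time_call(fn, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls to fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def pick_faster(candidates, sample, repeats=3):
    """Time each candidate on `sample` and return (fastest_name, timings).

    `candidates` maps a label to a callable that takes the sample as its
    only argument. Both helpers are illustrative, not part of any library.
    """
    timings = {
        name: time_call(fn, sample, repeats=repeats)
        for name, fn in candidates.items()
    }
    fastest = min(timings, key=timings.get)
    return fastest, timings
```

With the DataFrame from the question, this could be called as `pick_faster({"apply": lambda d: d.apply(parse_row(), axis=1), "swifter": lambda d: d.swifter.apply(parse_row(), axis=1)}, df.head(10_000))`. Timings on a sample may not extrapolate to the full 1 million rows, because fixed startup costs are amortized differently at larger sizes, so treat the result as a hint rather than a verdict.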