Initializing a pandas Series from a list of DataFrames is slow

Question

Initializing a pandas Series object from a list of DataFrames turns out to be very slow. Take the following code, for example:

import pandas as pd
import numpy as np

# Create a large (~8 GB) list of DataFrames.
l = [pd.DataFrame(np.zeros((1000, 1000))) for i in range(1000)]

# This line runs extremely slowly and uses roughly an extra 10 GB of memory. Why?
# It is much, much slower than constructing the original list `l`.
s = pd.Series(l)

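At first I thought the Series initialization was accidentally deep-copying the DataFrames, which would explain the slowdown, but it turns out that it only copies references, just as a plain = assignment does in Python.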

On the other hand, if I create an empty Series and shallow-copy the elements into it manually (in a for loop), it is fast:

# This for loop is faster. Why?
s1 = pd.Series(data=None, index=range(1000), dtype=object)
for i in range(1000):
    s1[i] = l[i]

What is happening here?

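Real-world use case: I have a table loader that reads something from disk and returns a pandas DataFrame (a table). To speed up reading, I use a parallel tool (from this answer) to run multiple reads (each read covering one date, for example), which returns a list of tables. I now want to turn this list into a pandas Series object with a proper index (for example, the dates or file locations used in the reads), but constructing the Series takes an extremely long time (as the sample code above shows). I could of course write a for loop to work around it, but that would be ugly. Besides, I want to know what is actually taking the time here. Any insights?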
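One way to hide the loop is to wrap it in a small helper. The sketch below is only an illustration: `frames_to_series` is a hypothetical name, the date labels are made up, and it fills a NumPy object array one slot at a time before handing it to `pd.Series`, rather than preallocating the Series as in the loop above. Whether this is as fast as the loop on a given pandas version is something to measure, not something this sketch guarantees.

```python
import numpy as np
import pandas as pd


def frames_to_series(frames, index=None):
    """Store each DataFrame by reference in an object-dtype Series.

    Hypothetical helper, not part of pandas.
    """
    if index is not None and len(index) != len(frames):
        raise ValueError("index and frames must have the same length")
    # Fill an object array slot by slot so that each DataFrame is
    # stored as a single element (a reference, not a copy).
    slots = np.empty(len(frames), dtype=object)
    for pos, frame in enumerate(frames):
        slots[pos] = frame
    return pd.Series(slots, index=index)


# Small stand-in for the list returned by the parallel loader.
frames = [pd.DataFrame(np.zeros((2, 2))) for _ in range(3)]
labels = ["2021-01-01", "2021-01-02", "2021-01-03"]  # made-up dates
s = frames_to_series(frames, index=labels)
```

Looking up `s["2021-01-02"]` then returns the second DataFrame from the list, without copying it.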

- Fei Liu

I can't reproduce pd.Series(l) taking too long on Google Colab with i in range(600) (reduced because of limited RAM). Maybe your machine is starting to use swap memory? - hilberts_drinking_problem

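@hilberts_drinking_problem: That is indeed strange. I tried it on Google Colab as well and got the same result as you: pd.Series(l) seems fine there. I also suspected it might be related to the pandas version (mine is 1.3.5, while the Google Colab runtime I tried uses 1.1.5). However, even after switching to 1.1.5 (the version Google Colab uses), the problem persists on my MacBook Pro. This is very odd. Also, I don't think it is related to swap memory: even when I shrink it to i in range(500), I still see it happen (with plenty of memory available). - Fei Liu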

1 Answer

- SultanOrazbayev · Answer 1

This is not a direct answer to the OP's question (i.e. what causes the slowdown when constructing a Series from a list of DataFrames):

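I may be missing an important advantage of using pd.Series to store a list of DataFrames, but if that is not critical for downstream processing, a better option may be to store them as a dictionary of DataFrames or to concatenate them into a single DataFrame.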

For a dictionary of DataFrames, something like the following can be used:

d = {n: df for n, df in enumerate(l)}
# can change the key to something more useful in downstream processes
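As an illustration of a more useful key, the sketch below pairs each DataFrame with a label via `zip`. The date strings and the name `by_date` are made up for the example, and small frames stand in for the real tables.

```python
import numpy as np
import pandas as pd

# Small stand-ins for the loaded tables.
frames = [pd.DataFrame(np.full((2, 2), i)) for i in range(3)]
labels = ["2021-01-01", "2021-01-02", "2021-01-03"]  # made-up dates

# Key each DataFrame by its label instead of its position.
by_date = dict(zip(labels, frames))

# Lookup returns the original object; nothing is copied.
second = by_date["2021-01-02"]

# Iterating over the values works like iterating over the list.
total_rows = sum(len(df) for df in by_date.values())
```

The dictionary only stores references, so building it costs next to nothing compared with loading the tables.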

For concatenation:

w = pd.concat(l, axis=1)
# note that when using with the snippet in this question
# the column names will be duplicated (because they have
# the same names) but if your actual list of dataframes
# contains unique column names, then the concatenated
# dataframe will act as a normal dataframe with unique
# column names
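If the frames do share column names, as in the question's snippet, one possible way around the duplicates is the `keys` argument of `pd.concat`, which adds an outer column level. This is a sketch with made-up keys and small frames, not something the question requires:

```python
import numpy as np
import pandas as pd

# Three small frames with identical column names (0 and 1).
frames = [pd.DataFrame(np.full((2, 2), float(i))) for i in range(3)]
labels = ["a", "b", "c"]  # made-up keys, one per frame

# `keys` adds an outer column level, so (label, column) pairs are unique.
wide = pd.concat(frames, axis=1, keys=labels)

# Selecting an outer key gives back that frame's columns.
block_b = wide["b"]
```

Note that, unlike the dictionary, concatenation copies the data into one new DataFrame, which matters for memory with tables as large as those in the question.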