Pandas - excessive memory consumption

Question

6

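After loading a DataFrame from a pickle file containing about 15 million rows (roughly 250 MB), I run some search operations on it and delete some of its rows. During these operations, memory usage shoots up to around 5 GB or 7 GB, which is annoying because swapping kicks in (my laptop has only 8 GB of RAM).

The problem is that this memory is not released once the operations finish (that is, once the last two lines of the code below have executed). As a result, the Python process still holds up to 7 GB of memory.

I have no idea why this happens. Any ideas? I am using Pandas 0.20.3.

Below is a minimal example. In reality, the "data" variable would have about 15 million rows, but I don't know how to post that here.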

import datetime
import pandas as pd

data = {'Time':['2013-10-29 00:00:00', '2013-10-29 00:00:08', '2013-11-14 00:00:00'], 'Watts': [0, 48, 0]}
df = pd.DataFrame(data, columns = ['Time', 'Watts'])
# Convert string to datetime
df['Time'] = pd.to_datetime(df['Time'])
# Make column Time as the index of the dataframe
df.index = df['Time']
# Delete the column time
df = df.drop('Time', axis=1)

# Get the difference in time between two consecutive data points
differences = df.index.to_series().diff()
# Keep only the differences > 60 mins
differences = differences[differences > datetime.timedelta(minutes=60)]
# Get the string of the day of the data points when the data gathering resumed
toRemove = [datetime.datetime.strftime(date, '%Y-%m-%d') for date in differences.index.date]

# Remove data points belonging to the day where the differences was > 60 mins
for dataPoint in toRemove:
    df.drop(df.loc[dataPoint].index, inplace=True)

- RiccB

4

Please provide a minimal example that reproduces your situation. - zimmerrol

https://stackoverflow.com/help/mcve - Fabian S.

2

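I second @FlashTek. That said, have you considered using Dask? - rpanai

I'd like to second Dask. It is designed specifically for handling large datasets. - tsabsch

Edited the original post; hopefully it is clearer now. - RiccB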

1 Answer

- Ryan Stout · Answer 1

0

You might want to try calling the garbage collector: gc.collect(). For more information, see "How can I explicitly free memory in Python?"
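As a rough illustration of what an explicit gc.collect() call can do, here is a minimal pure-Python sketch. It is not taken from the thread and makes no claim about what pandas does internally: the Node class and make_cycle helper are hypothetical names used only to build a reference cycle, which is one kind of garbage that reference counting alone cannot free.

```python
import gc
import weakref


class Node:
    """Hypothetical container, used only to build a reference cycle."""

    def __init__(self):
        self.payload = bytearray(1_000_000)  # about 1 MB per object
        self.partner = None


def make_cycle():
    """Create two objects that point at each other; return a weak reference to one."""
    a, b = Node(), Node()
    a.partner, b.partner = b, a
    return weakref.ref(a)


gc.disable()                 # keep the automatic collector out of the demo
probe = make_cycle()         # the cycle is unreachable now, but not yet freed
alive_before = probe() is not None

unreachable = gc.collect()   # force a full collection of all generations
alive_after = probe() is not None
gc.enable()

print("alive before collect:", alive_before)
print("alive after collect: ", alive_after)
```

The automatic collector would normally pick such cycles up on its own schedule; calling gc.collect() only makes that happen at a point you choose.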

- Ryan Stout

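It actually freed the memory. So the root of my problem is that the garbage collector isn't fast enough, and I need to call it manually to free the memory? - RiccB

What freed what memory? (I don't know what "it" refers to in your comment.) If you were freeing memory, you wouldn't see 7 GB of memory consumption. Just because you did something like df.drop doesn't mean the memory has been reclaimed. - Ryan Stout

Sorry, I meant that gc.collect() frees my memory. After calling it, memory consumption drops to about 290 MB. Given that the "data" variable itself takes up about 250 MB, that is acceptable. - RiccB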
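The difference between peak usage during an operation and what is still held afterwards can be made visible with the standard-library tracemalloc module. The sketch below is an illustration under assumptions, not a reproduction of the measurements above: the list of floats is a stand-in for the loaded data, and tracemalloc reports allocations traced inside Python rather than the process size shown by the operating system.

```python
import tracemalloc

tracemalloc.start()

rows = [float(i) for i in range(200_000)]   # stand-in for the loaded data
baseline, _ = tracemalloc.get_traced_memory()

# An operation that builds large temporaries and then discards them
temporary_copies = [list(rows) for _ in range(5)]
del temporary_copies

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"baseline: {baseline / 1e6:.1f} MB")
print(f"after:    {current / 1e6:.1f} MB")
print(f"peak:     {peak / 1e6:.1f} MB")
```

Here the temporaries are freed as soon as the last reference goes away, so "after" falls back near the baseline while "peak" records the high-water mark.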

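So, do you need anything else before accepting the answer? - Ryan Stout

Can you confirm the suspicion I raised in my first comment? Also, is it normal for operations on a 250 MB file to use this much memory? - RiccB

Ah, I see. I think I misunderstood your first question. Unfortunately, I don't know much about when Python's garbage collector gets invoked. I have seen Pandas use a lot of memory in the past. But the kinds of problems I usually work on use numpy arrays directly, so I'm not very familiar with pandas and can't offer useful information on how to keep pandas from using so much memory or how to trigger the gc automatically. - Ryan Stout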
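On the open point of using less memory in the first place, one option that nobody in the thread proposed is to replace the per-day, in-place drop loop from the question with a single boolean mask. The sketch below assumes a reasonably recent pandas; the names gaps, bad_days, keep and filtered are illustrative, and whether this lowers peak memory on the real 15-million-row data has not been measured here.

```python
import pandas as pd

data = {
    "Time": ["2013-10-29 00:00:00", "2013-10-29 00:00:08", "2013-11-14 00:00:00"],
    "Watts": [0, 48, 0],
}
df = pd.DataFrame(data)
df["Time"] = pd.to_datetime(df["Time"])
df = df.set_index("Time")

# Time elapsed between each row and the previous one
gaps = df.index.to_series().diff()

# Days (as midnight timestamps) on which recording resumed after a gap > 60 min
bad_days = gaps[gaps > pd.Timedelta(minutes=60)].index.normalize().unique()

# One mask over all rows instead of one drop per day
keep = ~df.index.normalize().isin(bad_days)
filtered = df[keep]

print(filtered)
```

With the three sample rows, only the two readings from 2013-10-29 remain, which matches what the original loop removes.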