Fastest way to remove specific dates from a pandas DataFrame

Question

python datetime pandas indexing data-science

3

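I'm working with a large dataset, and I'm struggling to find an efficient way to drop every measurement taken on a specific date. To be clear, I want to remove all measurements from particular days.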

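Pandas has a great feature that lets you write:

df.loc['2016-04-22']

and get back every row from that day. But what if I want to drop all rows from '2016-04-22' instead?

I was hoping for something like:

df.loc[~'2016-04-22']

but that doesn't work.

Also, what if I want to remove a range of dates?

Currently, my solution is the following: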

import numpy as np
import pandas as pd
from numpy import random

###Create a sample data frame

dates = [pd.Timestamp('2016-04-25 06:48:33'), pd.Timestamp('2016-04-27 15:33:23'), pd.Timestamp('2016-04-23 11:23:41'), pd.Timestamp('2016-04-28 12:08:20'), pd.Timestamp('2016-04-21 15:03:49'), pd.Timestamp('2016-04-23 08:13:42'), pd.Timestamp('2016-04-27 21:18:22'), pd.Timestamp('2016-04-27 18:08:23'), pd.Timestamp('2016-04-27 20:48:22'), pd.Timestamp('2016-04-23 14:08:41'), pd.Timestamp('2016-04-27 02:53:26'), pd.Timestamp('2016-04-25 21:48:31'), pd.Timestamp('2016-04-22 12:13:47'), pd.Timestamp('2016-04-27 01:58:26'), pd.Timestamp('2016-04-24 11:48:37'), pd.Timestamp('2016-04-22 08:38:46'), pd.Timestamp('2016-04-26 13:58:28'), pd.Timestamp('2016-04-24 15:23:36'), pd.Timestamp('2016-04-22 07:53:46'), pd.Timestamp('2016-04-27 23:13:22')]

values = random.normal(20, 20, 20)

df = pd.DataFrame(index=dates, data=values, columns=['values']).sort_index()

### This is the list of dates I want to remove

removelist = ['2016-04-22', '2016-04-24']

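This for loop essentially gets the index entries for the dates I want to remove, drops them from the main DataFrame's index, and then positively selects the remaining (i.e. good) dates from the DataFrame.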

for r in removelist:
    elimlist = df.loc[r].index.tolist()
    ind = df.index.tolist()
    culind = [i for i in ind if i not in elimlist]
    df = df.loc[culind]

Is there anything better?

I've also tried indexing by the rounded date plus one day, something like this:

df[~((df['Timestamp'] < r+pd.Timedelta("1 day")) & (df['Timestamp'] > r))]

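But this gets really cumbersome, and (in the end) I still need a for loop when I have to eliminate n specific dates.

There has to be a better way! Right? Maybe?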

- Reid

2 Answers

3

Same idea as @Alexander, but using properties of the DatetimeIndex and numpy.in1d:

mask = ~np.in1d(df.index.date, pd.to_datetime(removelist).date)
df = df.loc[mask, :]

Timings:

%timeit df.loc[~np.in1d(df.index.date, pd.to_datetime(removelist).date), :]
1000 loops, best of 3: 1.42 ms per loop

%timeit df[[d.date() not in pd.to_datetime(removelist) for d in df.index]]
100 loops, best of 3: 3.25 ms per loop
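
For readers on a recent NumPy, whose documentation recommends np.isin over np.in1d for new code, the same masking idea might be sketched as follows. This is only an illustration under assumptions: the five-row frame and the names row_dates, drop_dates, and kept are made up for the example and are not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: five timestamps spread over four days.
idx = pd.DatetimeIndex([
    "2016-04-21 15:03:49",
    "2016-04-22 07:53:46",
    "2016-04-22 12:13:47",
    "2016-04-23 08:13:42",
    "2016-04-24 11:48:37",
])
df = pd.DataFrame({"values": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)
removelist = ["2016-04-22", "2016-04-24"]

# Compare calendar dates only, so the time of day is ignored.
row_dates = df.index.date                     # array of datetime.date, one per row
drop_dates = pd.to_datetime(removelist).date  # array of datetime.date to drop

# True for rows whose date is NOT in the removal list.
mask = ~np.isin(row_dates, drop_dates)
kept = df.loc[mask]
```

Here kept should hold only the rows from 2016-04-21 and 2016-04-23, since every timestamp on the two listed days is masked out in a single vectorized pass.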

- root

Awesome! Works perfectly! Thank you so much for the reply! - Reid

- Alexander · Accepted Answer

You can create a boolean mask with a list comprehension.

>>> df[[d.date() not in pd.to_datetime(removelist) for d in df.index]]
                        values
2016-04-21 15:03:49  28.059520
2016-04-23 08:13:42 -22.376577
2016-04-23 11:23:41  40.350252
2016-04-23 14:08:41  14.557856
2016-04-25 06:48:33  -0.271976
2016-04-25 21:48:31  20.156240
2016-04-26 13:58:28  -3.225795
2016-04-27 01:58:26  51.991293
2016-04-27 02:53:26  -0.867753
2016-04-27 15:33:23  31.585201
2016-04-27 18:08:23  11.639641
2016-04-27 20:48:22  42.968156
2016-04-27 21:18:22  27.335995
2016-04-27 23:13:22  13.120088
2016-04-28 12:08:20  53.730511
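
A pandas-only variant of the same boolean-mask idea, which also covers the question's follow-up about removing a range of dates, could look like the sketch below. It is a hedged illustration rather than either answerer's code: the six-row frame and the names on_removed_day, in_range, kept_days, and kept_outside_range are assumptions introduced for the example.

```python
import pandas as pd

# Hypothetical sample data: six timestamps spread over five days.
idx = pd.DatetimeIndex([
    "2016-04-21 15:03:49",
    "2016-04-22 07:53:46",
    "2016-04-22 12:13:47",
    "2016-04-23 08:13:42",
    "2016-04-24 11:48:37",
    "2016-04-25 06:48:33",
])
df = pd.DataFrame({"values": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)
removelist = ["2016-04-22", "2016-04-24"]

# Individual days: normalize() resets each timestamp to midnight, so the
# index can be compared directly against the midnight timestamps to drop.
on_removed_day = df.index.normalize().isin(pd.to_datetime(removelist))
kept_days = df[~on_removed_day]

# A range of days: mask the half-open interval [start, end + 1 day),
# which covers every timestamp from start through the end of the last day.
start = pd.Timestamp("2016-04-22")
end = pd.Timestamp("2016-04-24")
in_range = (df.index >= start) & (df.index < end + pd.Timedelta(days=1))
kept_outside_range = df[~in_range]
```

With this data, kept_days should keep the rows from 2016-04-21, 2016-04-23, and 2016-04-25, while kept_outside_range should keep only 2016-04-21 and 2016-04-25. Neither mask needs a Python-level loop over the dates to remove.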