Pandas: group by hour and convert to a dictionary

Question

I have a pandas DataFrame like this:

date                | Item   | count
------------------------------------
2016-12-06 10:45:08 |  Item1 |  60
2016-12-06 10:45:08 |  Item2 |  145
2016-12-06 09:45:00 |  Item1 |  60
2016-12-06 09:44:54 |  Item3 |  600
2016-12-06 09:44:48 |  Item4 |  15
2016-12-06 11:45:08 |  Item1 |  60
2016-12-06 10:45:08 |  Item2 |  14
2016-11-06 09:45:00 |  Item1 |  62
2016-11-06 09:44:54 |  Item3 |  6
2016-11-06 09:44:48 |  Item4 |  15

I am trying to group the items by hour of the day (or, later, by day) to get statistics such as the list of items sold in each period, for example:

On 2016-12-06, from 09:00:00 to 10:00:00, Item1, Item3 and Item4 were sold; and so on.
On 2016-12-06, Item1, Item2, Item3 and Item4 (the unique items) were sold.

I am still a long way from these statistics, because I am stuck on grouping by time. Initially, print df.dtypes showed:

date    object
Item    object
count   int64
dtype: object

So I converted the date column to pandas datetime objects with:

df['date'] = pd.to_datetime(df['date'])

Now print df.dtypes shows:

date    datetime64[ns]
Item    object
count   int64
dtype: object

However, when I try to group on the date column with TimeGrouper by running the following lines:

from pandas.tseries.resample import TimeGrouper 
print df.groupby([df['date'],pd.TimeGrouper(freq='Min')])

I get the following TypeError. According to the suggestions given here or here, converting with pd.to_datetime should have resolved this issue.

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'

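I don't know how to resolve this so that I can move on to the statistics I need. Any hints on fixing this error and using TimeGrouper to get the statistics, preferably in dictionary format (or any more sensible format), would be much appreciated.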

- kingmakerking

2 Answers

sold = df.set_index('date').Item.resample('H').agg({'Sold': 'unique'})
sold[sold.Sold.str.len() > 0]

                                      Sold
date                                      
2016-11-06 09:00:00  [Item4, Item3, Item1]
2016-12-06 09:00:00  [Item4, Item3, Item1]
2016-12-06 10:00:00         [Item1, Item2]
2016-12-06 11:00:00                [Item1]
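
The set_index('date') step is what supplies the DatetimeIndex that the TypeError in the question asks for: a frequency-based grouper with no key bins on the index, and converting the column with pd.to_datetime leaves the index as a RangeIndex. As a rough sketch of an alternative (not part of the original answer), assuming a pandas version that provides pd.Grouper, the grouper can be pointed at the column with key='date'. The DataFrame is rebuilt from the question's sample, and error_message and per_hour are illustrative names only.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2016-12-06 10:45:08", "2016-12-06 10:45:08", "2016-12-06 09:45:00",
        "2016-12-06 09:44:54", "2016-12-06 09:44:48", "2016-12-06 11:45:08",
        "2016-12-06 10:45:08", "2016-11-06 09:45:00", "2016-11-06 09:44:54",
        "2016-11-06 09:44:48",
    ]),
    "Item": ["Item1", "Item2", "Item1", "Item3", "Item4",
             "Item1", "Item2", "Item1", "Item3", "Item4"],
    "count": [60, 145, 60, 600, 15, 60, 14, 62, 6, 15],
})

# Without key=..., the grouper bins on the index, which is still a RangeIndex,
# so pandas raises the TypeError reported in the question
try:
    df.groupby(pd.Grouper(freq="h")).size()
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# key="date" points the grouper at the datetime column instead
per_hour = df.groupby(pd.Grouper(key="date", freq="h"))["count"].sum()

# Hours with no rows sum to 0 in this sample; keep only hours that had sales
per_hour = per_hour[per_hour > 0]
print(per_hour.to_dict())
```

Summing count is used here only because it is a simple aggregation; the unique-item lists the question asks for are covered by the answers themselves.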

- piRSquared

- jezrael · Accepted Answer

You can group by a numpy array of the datetimes with the minutes and seconds removed:

print (df['date'].values.astype('<M8[h]'))
['2016-12-06T10' '2016-12-06T10' '2016-12-06T09' '2016-12-06T09'
 '2016-12-06T09' '2016-12-06T11' '2016-12-06T10' '2016-11-06T09'
 '2016-11-06T09' '2016-11-06T09']
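
To make the cast easier to read: 'M8' is NumPy's type code for datetime64, '<' marks little-endian byte order, and the unit in brackets sets the resolution, so casting to a coarser unit drops the finer part of each timestamp. A minimal standalone sketch, using three timestamps taken from the question (the variable names are illustrative):

```python
import numpy as np

# Three timestamps from the question, stored at nanosecond resolution
stamps = np.array(
    ["2016-12-06T10:45:08", "2016-12-06T09:44:54", "2016-11-06T09:45:00"],
    dtype="datetime64[ns]",
)

# Casting to hour resolution discards minutes and seconds
hours = stamps.astype("<M8[h]")

# Casting to day resolution discards the time of day entirely
days = stamps.astype("<M8[D]")

print(hours)
print(days)
```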

print (df.groupby(df['date'].values.astype('<M8[h]')).Item.unique())
2016-11-06 09:00:00    [Item1, Item3, Item4]
2016-12-06 09:00:00    [Item1, Item3, Item4]
2016-12-06 10:00:00           [Item1, Item2]
2016-12-06 11:00:00                  [Item1]
Name: Item, dtype: object

print (df.groupby(df['date'].values.astype('<M8[h]')).Item
         .apply(lambda x: x.unique().tolist()).to_dict())
{Timestamp('2016-11-06 09:00:00'): ['Item1', 'Item3', 'Item4'], 
 Timestamp('2016-12-06 09:00:00'): ['Item1', 'Item3', 'Item4'], 
 Timestamp('2016-12-06 10:00:00'): ['Item1', 'Item2'], 
 Timestamp('2016-12-06 11:00:00'): ['Item1']}

print (df.groupby(df['date'].values.astype('<M8[D]')).Item
         .apply(lambda x: x.unique().tolist()).to_dict())
{Timestamp('2016-11-06 00:00:00'): ['Item1', 'Item3', 'Item4'], 
 Timestamp('2016-12-06 00:00:00'): ['Item1', 'Item2', 'Item3', 'Item4']}

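Thanks to Jeff for suggesting round. Note that round goes to the nearest hour rather than truncating, so the 09:44 and 09:45 sales land in the 10:00 bucket and the hourly keys differ from the results above: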

print (df.groupby(df['date'].dt.round('h')).Item
         .apply(lambda x: x.unique().tolist()).to_dict())

{Timestamp('2016-11-06 10:00:00'): ['Item1', 'Item3', 'Item4'], 
 Timestamp('2016-12-06 12:00:00'): ['Item1'], 
 Timestamp('2016-12-06 10:00:00'): ['Item1', 'Item3', 'Item4'], 
 Timestamp('2016-12-06 11:00:00'): ['Item1', 'Item2']}

print (df.groupby(df['date'].dt.round('d')).Item
         .apply(lambda x: x.unique().tolist()).to_dict())
{Timestamp('2016-11-06 00:00:00'): ['Item1', 'Item3', 'Item4'], 
 Timestamp('2016-12-06 00:00:00'): ['Item1', 'Item2', 'Item3', 'Item4']}
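
If the buckets should match the truncated results (09:00 covering 09:00:00 up to 10:00:00), dt.floor can be used in place of dt.round. The sketch below is an addition rather than part of the original answer; it rebuilds the question's sample data, and items_by is a hypothetical helper name used only to compare the two.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2016-12-06 10:45:08", "2016-12-06 10:45:08", "2016-12-06 09:45:00",
        "2016-12-06 09:44:54", "2016-12-06 09:44:48", "2016-12-06 11:45:08",
        "2016-12-06 10:45:08", "2016-11-06 09:45:00", "2016-11-06 09:44:54",
        "2016-11-06 09:44:48",
    ]),
    "Item": ["Item1", "Item2", "Item1", "Item3", "Item4",
             "Item1", "Item2", "Item1", "Item3", "Item4"],
    "count": [60, 145, 60, 600, 15, 60, 14, 62, 6, 15],
})


def items_by(keys):
    """Map each group key to the list of unique items sold in that group."""
    return (df.groupby(keys)["Item"]
              .apply(lambda s: s.unique().tolist())
              .to_dict())


# floor truncates to the start of the hour: 09:45:00 -> 09:00:00
by_floor = items_by(df["date"].dt.floor("h"))

# round goes to the nearest hour: 09:45:00 -> 10:00:00
by_round = items_by(df["date"].dt.round("h"))

print(by_floor)
print(by_round)
```

With this sample, by_floor has the same keys as the '<M8[h]' result, while by_round reproduces the shifted keys shown for round('h').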