How to handle duplicate timestamps in time series data with pandas?

Question

4

I get the following from an API call, as part of a larger dataset:

{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1, tzinfo=tzutc()), 'Price': '0.052600'}
{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1, tzinfo=tzutc()), 'Price': '0.052500'}

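Ideally, I would like to use the timestamp as the index of a pandas DataFrame. However, converting to JSON then fails because of the duplicate timestamps.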

new_df = df.set_index(pd.to_datetime(df['Time']))
print(new_df.to_json(orient='index'))

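ValueError: DataFrame index must be unique for orient='index'.

Any guidance on how to handle this? Should I drop one of the data points? The time has no finer granularity than one second, and the price clearly changed within that second.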
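To make the failure concrete, here is a small sketch that reproduces it. The two-row frame is made up to mirror the API sample, and `is_unique` and `index.duplicated()` are used only to show where the clash sits.

```python
import pandas as pd

# Hypothetical two-row sample mirroring the API payload
df = pd.DataFrame({
    "Time": pd.to_datetime(
        ["2017-05-21 18:18:01", "2017-05-21 18:18:01"], utc=True
    ),
    "Price": ["0.052600", "0.052500"],
})
new_df = df.set_index("Time")

# orient='index' needs a unique index; this one is not
print(new_df.index.is_unique)

# Boolean mask flagging the second and later occurrences of each timestamp
print(new_df.index.duplicated())

try:
    new_df.to_json(orient="index")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```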

- user7186882

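You need to tell us how you want to handle multiple price events at the same time: keep the first, the last, or all of them? Keep the first price? The average price? The highest and lowest prices? ...? It depends on what you ultimately want to do with the data, so please give us more context. - smci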
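If a single value per timestamp is acceptable, those options might be sketched with a groupby aggregation. The sample frame below is invented for illustration, and the prices are cast to float because the API returns them as strings.

```python
import pandas as pd

# Hypothetical sample: two trades share a timestamp, a third does not
df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2017-05-21 18:18:01",
        "2017-05-21 18:18:01",
        "2017-05-21 18:18:02",
    ]),
    "Price": ["0.052600", "0.052500", "0.052700"],
})

# Convert the string prices so that mean/min/max are numeric
df["Price"] = df["Price"].astype(float)

# One row per timestamp, with several summaries of the prices seen in it
summary = df.groupby("Time")["Price"].agg(["first", "last", "mean", "min", "max"])
print(summary)
```

The resulting index is unique, so any one of these columns could be exported with orient='index'. Which column is appropriate depends on what the data is used for.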

3 Answers

0

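Just to expand on the accepted answer: adding a loop helps deal with any new duplicates introduced by the first pass.

The isnull check is important because it catches any NaT in the data: any timedelta added to NaT is still NaT, so those rows can never be made unique.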

import logging

import pandas

LOGGER = logging.getLogger(__name__)


def deduplicate_start_times(frame, col='start_time', max_iterations=10):
    """
    Fuzz duplicate start times from a frame so we can stack and unstack
    this frame.
    """

    for _ in range(max_iterations):
        # Flag repeated times, ignoring NaT rows (NaT + timedelta is NaT)
        dups = frame.duplicated(subset=col) & ~pandas.isnull(frame[col])

        if not dups.any():
            break

        LOGGER.debug("Removing %i duplicates", dups.sum())

        # Shift the n-th occurrence of each time by n milliseconds.
        # A shifted time may collide with an existing one, hence the loop.
        frame[col] += pandas.to_timedelta(frame.groupby(col).cumcount(),
                                          unit='ms')

    else:
        # Reached only when the loop never hit `break`
        LOGGER.error("Exceeded max iterations removing duplicates. "
                     "%i duplicates remain", dups.sum())

    return frame
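As a rough illustration of why a single pass may not be enough, consider a made-up series where one timestamp already sits 1 ms after a duplicated pair. The helper `shift_once` is hypothetical and only restates the one-line offset from the function above.

```python
import pandas as pd

t = pd.Timestamp("2017-05-21 18:18:01")
# Hypothetical input: two identical times plus one already 1 ms later
times = pd.Series([t, t, t + pd.Timedelta(milliseconds=1)])


def shift_once(s):
    # Add n ms to the n-th occurrence of each repeated value
    return s + pd.to_timedelta(s.groupby(s).cumcount(), unit="ms")


first_pass = shift_once(times)
# The shifted second row now collides with the third row
print(first_pass.duplicated().any())

second_pass = shift_once(first_pass)
# A second pass separates them again
print(second_pass.is_unique)

# NaT absorbs any offset, which is why NaT rows are excluded from the check
print(pd.isnull(pd.NaT + pd.Timedelta(milliseconds=1)))
```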

- Danielle Madeley

0

You can use .duplicated to keep either the first or the last record. See pandas.DataFrame.duplicated.

- ardms
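A minimal sketch of that suggestion, assuming a frame shaped like the question's sample (the frame itself is made up for illustration):

```python
import pandas as pd

# Hypothetical sample with one repeated timestamp
df = pd.DataFrame({
    "Time": pd.to_datetime(["2017-05-21 18:18:01", "2017-05-21 18:18:01"]),
    "Price": ["0.052600", "0.052500"],
})

# keep='first' flags every repeat except the first row for each time
first_only = df[~df.duplicated(subset="Time", keep="first")]

# keep='last' flags every repeat except the last row for each time
last_only = df[~df.duplicated(subset="Time", keep="last")]

print(first_only)
print(last_only)
```

Either way, one of the two prices recorded in that second is discarded.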

1

No, because the price has changed, and it is lower. - jezrael

- jezrael · Accepted Answer

I think you can change the duplicated datetimes by adding milliseconds to them with cumcount and to_timedelta:

import datetime

import pandas as pd

d = [{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1), 'Price': '0.052600'},
     {'Time': datetime.datetime(2017, 5, 21, 18, 18, 1), 'Price': '0.052500'}]
df = pd.DataFrame(d)
print(df)
      Price                Time
0  0.052600 2017-05-21 18:18:01
1  0.052500 2017-05-21 18:18:01

print(pd.to_timedelta(df.groupby('Time').cumcount(), unit='ms'))
0          00:00:00
1   00:00:00.001000
dtype: timedelta64[ns]

# Shift the n-th occurrence of each timestamp by n milliseconds
df['Time'] = df['Time'] + pd.to_timedelta(df.groupby('Time').cumcount(), unit='ms')
print(df)
      Price                    Time
0  0.052600 2017-05-21 18:18:01.000
1  0.052500 2017-05-21 18:18:01.001

new_df = df.set_index('Time')
print(new_df.to_json(orient='index'))
{"1495390681000":{"Price":"0.052600"},"1495390681001":{"Price":"0.052500"}}
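The question's timestamps carry tzinfo=tzutc(), so here is a hedged sketch of the same idea on timezone-aware data. The helper name `spread_duplicates` is hypothetical, the frame is made up, and `date_format="iso"` is a choice made here for readable keys rather than the default used in the output above.

```python
import json

import pandas as pd


def spread_duplicates(df, col="Time"):
    # Hypothetical helper: works on a copy so the input frame is untouched
    out = df.copy()
    offsets = pd.to_timedelta(out.groupby(col).cumcount(), unit="ms")
    out[col] = out[col] + offsets
    return out


# Made-up sample with UTC-aware timestamps, as in the question
df = pd.DataFrame({
    "Time": pd.to_datetime(
        ["2017-05-21 18:18:01", "2017-05-21 18:18:01"], utc=True
    ),
    "Price": ["0.052600", "0.052500"],
})

new_df = spread_duplicates(df).set_index("Time")
print(new_df.index.is_unique)

# Both rows survive the export because the keys now differ by 1 ms
payload = json.loads(new_df.to_json(orient="index", date_format="iso"))
print(len(payload))
```

If the data can already contain millisecond-level timestamps, a single pass might create a new collision, so it may be worth checking `index.is_unique` afterwards and repeating the shift if needed.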