Python pandas DataFrame: time difference between specific alternating values

3
I have an app-usage DataFrame with 4 columns that looks like this:
Id  Timestamp                App_Name   Event_Type
1   2018/01/16 06:01:05     Instagram   Opened
2   2018/01/16 06:01:06     Instagram   Closed
3   2018/01/16 06:01:07     Instagram   Opened
4   2018/01/16 06:01:08     Instagram   Interaction
5   2018/01/16 06:01:09     Instagram   Interaction
6   2018/01/16 06:02:08     Instagram   Closed
7   2018/01/16 06:01:08     Instagram   Opened
8   2018/01/16 06:01:08     Instagram   Opened
9   2018/01/16 06:01:09     Instagram   Opened
10  2018/01/16 06:01:09     Instagram   Closed
11  2018/01/16 06:03:44     Instagram   Opened
12  2018/01/16 06:03:44     Instagram   Closed
13  2018/01/16 06:03:45     Instagram   Closed
14  2018/01/16 06:03:45     Instagram   Closed
15  2018/01/16 06:03:47     Instagram   Opened
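
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the table above as a DataFrame. The variable names `times` and `events` are just illustrative, and the explicit `format` string is an assumption about how the timestamps should be parsed.

```python
import pandas as pd

# Time-of-day and event type for Id 1..15, copied from the table above
times = ["06:01:05", "06:01:06", "06:01:07", "06:01:08", "06:01:09",
         "06:02:08", "06:01:08", "06:01:08", "06:01:09", "06:01:09",
         "06:03:44", "06:03:44", "06:03:45", "06:03:45", "06:03:47"]
events = ["Opened", "Closed", "Opened", "Interaction", "Interaction",
          "Closed", "Opened", "Opened", "Opened", "Closed",
          "Opened", "Closed", "Closed", "Closed", "Opened"]

df = pd.DataFrame({
    "Id": range(1, 16),
    "Timestamp": ["2018/01/16 " + t for t in times],
    "App_Name": "Instagram",  # scalar is broadcast to every row
    "Event_Type": events,
})

# Parse the strings into real datetimes so they can be subtracted later
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%Y/%m/%d %H:%M:%S")

print(df)
```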

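I want the time difference between each pair of "Opened" and "Closed" rows, regardless of whether other Event_Type values occur between them. Erroneous runs of several consecutive opens or closes can occur; in that case I only want the difference between the last open and the first close. So for this data I want the time difference between the following rows:
- row 2 and row 1
- row 6 and row 3
- row 10 and row 9
- row 12 and row 11
How can I do this?
Thanks!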

3
Please post the expected output DataFrame.
3 Answers

2
Here is another, somewhat more involved approach: put the matching records side by side, then subtract the timestamp columns.
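import pandas as pd

# Parse the timestamps so they can be subtracted
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

df = df.set_index('Id')
# Start a new group at every "Opened" row
df['g'] = (df['Event_Type'] == 'Opened').cumsum()

# First "Opened" and first "Closed" row of each group
df_open = df.query('Event_Type == "Opened"').groupby('g').head(1)
df_close = df.query('Event_Type == "Closed"').groupby('g').head(1)

# The inner merge keeps only the groups that have both an open and a close
df_result = df_open.merge(df_close, on='g', suffixes=('_Opened', '_Closed'))
df_result['Timedelta'] = df_result['Timestamp_Closed'] - df_result['Timestamp_Opened']

df_result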

Output:

     Timestamp_Opened App_Name_Opened Event_Type_Opened  g    Timestamp_Closed App_Name_Closed Event_Type_Closed       Timedelta
0 2018-01-16 06:01:05       Instagram            Opened  1 2018-01-16 06:01:06       Instagram            Closed 0 days 00:00:01
1 2018-01-16 06:01:07       Instagram            Opened  2 2018-01-16 06:02:08       Instagram            Closed 0 days 00:01:01
2 2018-01-16 06:01:09       Instagram            Opened  5 2018-01-16 06:01:09       Instagram            Closed 0 days 00:00:00
3 2018-01-16 06:03:44       Instagram            Opened  6 2018-01-16 06:03:44       Instagram            Closed 0 days 00:00:00
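
Note that the merged result above no longer shows which Ids were paired, because the Id index is dropped by the merge. Below is a sketch of a variant that keeps Id as a regular column so the pairs can be checked against the question. The names `opened`, `closed`, `pairs` and `result` are illustrative, and the sample frame is rebuilt so the snippet runs on its own.

```python
import pandas as pd

times = ["06:01:05", "06:01:06", "06:01:07", "06:01:08", "06:01:09",
         "06:02:08", "06:01:08", "06:01:08", "06:01:09", "06:01:09",
         "06:03:44", "06:03:44", "06:03:45", "06:03:45", "06:03:47"]
events = ["Opened", "Closed", "Opened", "Interaction", "Interaction",
          "Closed", "Opened", "Opened", "Opened", "Closed",
          "Opened", "Closed", "Closed", "Closed", "Opened"]
df = pd.DataFrame({"Id": range(1, 16),
                   "Timestamp": ["2018/01/16 " + t for t in times],
                   "App_Name": "Instagram",
                   "Event_Type": events})
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%Y/%m/%d %H:%M:%S")

# A new group starts at every "Opened" row, so each group holds exactly one open
df["g"] = (df["Event_Type"] == "Opened").cumsum()

opened = df[df["Event_Type"] == "Opened"]
# Only the first "Closed" row after each open is of interest
closed = df[df["Event_Type"] == "Closed"].groupby("g").head(1)

# Inner merge: opens that are never closed (and closes with no open) drop out
pairs = opened.merge(closed, on="g", suffixes=("_Opened", "_Closed"))
pairs["Timedelta"] = pairs["Timestamp_Closed"] - pairs["Timestamp_Opened"]

result = pairs[["Id_Opened", "Id_Closed", "Timedelta"]]
print(result)
```

Because Id stays a column here, the merge adds the `_Opened` and `_Closed` suffixes to it as well, which gives `Id_Opened` and `Id_Closed`.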

1

Try:

out, state = [], None
for i, e in zip(df["Id"], df["Event_Type"]):
    if e == "Opened":
        state = i
    elif e == "Closed" and state is not None:
        out.append([state, i])
        state = None

print(out)

Output:

[[1, 2], [3, 6], [9, 10], [11, 12]]

To get the time differences:
df["Timestamp"] = pd.to_datetime(df["Timestamp"])

out, state = [], None
for i, e in zip(df.index, df["Event_Type"]):
    if e == "Opened":
        state = i
    elif e == "Closed" and state is not None:
        out.append(df.loc[i, "Timestamp"] - df.loc[state, "Timestamp"])
        state = None

print(out)

Output:

[Timedelta('0 days 00:00:01'), Timedelta('0 days 00:01:01'), Timedelta('0 days 00:00:00'), Timedelta('0 days 00:00:00')]
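
If you want both pieces of information together, the two loops can be folded into one helper that returns a DataFrame. This is only a sketch: the function name `open_close_durations` and the output column names are made up for illustration, and the sample frame is rebuilt so the snippet is self-contained.

```python
import pandas as pd

times = ["06:01:05", "06:01:06", "06:01:07", "06:01:08", "06:01:09",
         "06:02:08", "06:01:08", "06:01:08", "06:01:09", "06:01:09",
         "06:03:44", "06:03:44", "06:03:45", "06:03:45", "06:03:47"]
events = ["Opened", "Closed", "Opened", "Interaction", "Interaction",
          "Closed", "Opened", "Opened", "Opened", "Closed",
          "Opened", "Closed", "Closed", "Closed", "Opened"]
df = pd.DataFrame({"Id": range(1, 16),
                   "Timestamp": ["2018/01/16 " + t for t in times],
                   "App_Name": "Instagram",
                   "Event_Type": events})
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%Y/%m/%d %H:%M:%S")


def open_close_durations(frame):
    """Pair the last 'Opened' with the first following 'Closed' (hypothetical helper)."""
    rows, last_open = [], None
    for idx, event in zip(frame.index, frame["Event_Type"]):
        if event == "Opened":
            # A later open overrides an earlier unmatched one
            last_open = idx
        elif event == "Closed" and last_open is not None:
            rows.append({
                "Opened_Id": frame.at[last_open, "Id"],
                "Closed_Id": frame.at[idx, "Id"],
                "Timedelta": frame.at[idx, "Timestamp"] - frame.at[last_open, "Timestamp"],
            })
            # Further closes are ignored until the next open
            last_open = None
    return pd.DataFrame(rows, columns=["Opened_Id", "Closed_Id", "Timedelta"])


result = open_close_durations(df)
print(result)
```

A plain loop like this is easy to reason about, at the cost of being slower than a vectorized solution on very large frames.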

0

Step 1

Convert the Timestamp column to a datetime dtype.

df['Timestamp'] = pd.to_datetime(df['Timestamp'])

Step 2

Create groups that each end at a "Closed" row. Call the resulting group labels grp.

cond = df['Event_Type'].eq('Closed')
grp = cond.cumsum() - cond
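
To make the grouping concrete, here is a small sketch that prints grp next to Event_Type for the sample data from the question. Subtracting cond shifts the counter so that a "Closed" row stays in the group it terminates instead of opening the next one. The `preview` name is illustrative only.

```python
import pandas as pd

events = ["Opened", "Closed", "Opened", "Interaction", "Interaction",
          "Closed", "Opened", "Opened", "Opened", "Closed",
          "Opened", "Closed", "Closed", "Closed", "Opened"]
df = pd.DataFrame({"Id": range(1, 16), "Event_Type": events})

cond = df["Event_Type"].eq("Closed")
# cumsum() increments on the "Closed" row itself; subtracting cond
# moves that increment to the row after it
grp = cond.cumsum() - cond

preview = df.assign(grp=grp)
print(preview)
```

On the sample data the labels come out as 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 6, so the stray closes with Id 13 and 14 each end up alone in their own group.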

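Step 3

First, drop the rows containing "Interaction" from df. Then compute diff(1) of the Timestamp column within each grp group. Finally, keep only the rows whose Event_Type is "Closed" and drop the NaN values.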

(df[df['Event_Type'].ne('Interaction')]
  .groupby(grp)['Timestamp'].diff(1)[cond]
  .dropna())

Output:

1    0 days 00:00:01
5    0 days 00:01:01
9    0 days 00:00:00
11   0 days 00:00:00
Name: Timestamp, dtype: timedelta64[ns]
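
A related variant, not the method above but a sketch in the same spirit: keep only the "Opened" and "Closed" rows, then select the "Closed" rows whose immediately preceding row is "Opened". This assumes any event type other than those two should be ignored; the names `oc`, `is_pair_end` and `durations` are illustrative.

```python
import pandas as pd

times = ["06:01:05", "06:01:06", "06:01:07", "06:01:08", "06:01:09",
         "06:02:08", "06:01:08", "06:01:08", "06:01:09", "06:01:09",
         "06:03:44", "06:03:44", "06:03:45", "06:03:45", "06:03:47"]
events = ["Opened", "Closed", "Opened", "Interaction", "Interaction",
          "Closed", "Opened", "Opened", "Opened", "Closed",
          "Opened", "Closed", "Closed", "Closed", "Opened"]
df = pd.DataFrame({"Id": range(1, 16),
                   "Timestamp": ["2018/01/16 " + t for t in times],
                   "App_Name": "Instagram",
                   "Event_Type": events})
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%Y/%m/%d %H:%M:%S")

# Ignore everything that is neither an open nor a close
oc = df[df["Event_Type"].isin(["Opened", "Closed"])]

# A close directly preceded by an open is "first close after last open"
is_pair_end = oc["Event_Type"].eq("Closed") & oc["Event_Type"].shift().eq("Opened")

# diff() is taken against the previous remaining row, which is that open
durations = oc["Timestamp"].diff()[is_pair_end]
print(durations)
```

The result keeps the original row index of each "Closed" row, so it can be assigned back to df if needed.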
