Calculating a time difference with Python pandas and writing it to a CSV file

  completed             deadline
15-07-2013 23:10    15-07-2013 23:15
16-07-2013 00:20    16-07-2013 00:15
16-07-2013 00:20    16-07-2013 00:15
16-07-2013 21:04    16-07-2013 21:30
16-07-2013 21:58    16-07-2013 22:00
16-07-2013 23:21    16-07-2013 23:15
16-07-2013 23:21    16-07-2013 23:15
17-07-2013 00:19    17-07-2013 00:15
17-07-2013 00:19    17-07-2013 00:15
17-07-2013 21:18    17-07-2013 21:30
17-07-2013 22:07    17-07-2013 22:00

When I write data['completed'] - data['deadline'], I get the following result:
-1 day, 23:55:00 # on time
         0:05:00
         0:05:00
-1 day, 23:34:00 # on time
-1 day, 23:58:00 # on time
         0:06:00
         0:06:00
         0:04:00
         0:04:00
-1 day, 23:48:00 # on time
         0:07:00

But when I run data['time_delay'] = data['completed'] - data['deadline'] and then print data['time_delay'], I get:

-300000000000
300000000000
300000000000
-1560000000000
-120000000000
360000000000
360000000000
240000000000
240000000000
-720000000000
420000000000

I get the same result when the output is written to CSV.

How do I:

  1. Handle this output?

  2. Write the output to the CSV in minutes?

  3. Handle the "on time" output?

2 Answers

>>> import sys
>>> import pandas as pd
>>> data = pd.read_csv('1.csv', parse_dates=[0,1])
>>> data['time_delay'] = data['completed'] - data['deadline']
>>> print(data['time_delay'])
0   -00:05:00
1    00:05:00
2    00:05:00
3   -00:26:00
4   -00:02:00
Name: time_delay, dtype: timedelta64[ns]
>>> data.to_csv(sys.stdout)
,completed,deadline,time_delay
0,2013-07-15 23:10:00,2013-07-15 23:15:00,-300000000000
1,2013-07-16 00:20:00,2013-07-16 00:15:00,300000000000
2,2013-07-16 00:20:00,2013-07-16 00:15:00,300000000000
3,2013-07-16 21:04:00,2013-07-16 21:30:00,-1560000000000
4,2013-07-16 21:58:00,2013-07-16 22:00:00,-120000000000
>>> data['time_delay'] = data['time_delay'].apply(pd.lib.repr_timedelta64)
>>> data.to_csv(sys.stdout)
,completed,deadline,time_delay
0,2013-07-15 23:10:00,2013-07-15 23:15:00,-00:05:00
1,2013-07-16 00:20:00,2013-07-16 00:15:00,00:05:00
2,2013-07-16 00:20:00,2013-07-16 00:15:00,00:05:00
3,2013-07-16 21:04:00,2013-07-16 21:30:00,-00:26:00
4,2013-07-16 21:58:00,2013-07-16 22:00:00,-00:02:00

pandas.lib.repr_timedelta64 is undocumented, so this code may break in future versions. (I used pandas 0.11.0.)
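
A way to avoid the undocumented helper, offered as a sketch rather than as the answerer's method: convert the timedelta column to a plain number of minutes before writing. The integers in the CSV output above are nanoseconds, so -300000000000 corresponds to -5 minutes. The column name delay_minutes and the two sample rows are illustrative assumptions, and the sketch assumes a pandas version that provides the .dt accessor.

```python
import io

import pandas as pd

# Two sample rows taken from the question, already parsed as datetimes.
data = pd.DataFrame({
    "completed": pd.to_datetime(["2013-07-15 23:10", "2013-07-16 00:20"]),
    "deadline": pd.to_datetime(["2013-07-15 23:15", "2013-07-16 00:15"]),
})

delay = data["completed"] - data["deadline"]  # timedelta column

# Signed delay in minutes: negative means finished before the deadline.
# "delay_minutes" is a hypothetical column name chosen for this sketch.
data["delay_minutes"] = delay.dt.total_seconds() / 60

# Write to an in-memory buffer so the result can be inspected;
# a file path would work the same way.
buffer = io.StringIO()
data.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the column is numeric, reading the CSV back needs no special timedelta parsing.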


FYI, the reverse operation (reading a timedelta column back in) is not implemented yet; the column will be read as object dtype. - Jeff
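
For the read-back case, a minimal sketch assuming a pandas version that provides pd.to_timedelta: the column still arrives as strings, but it can be converted explicitly after loading. The CSV text below is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content with delays stored as "HH:MM:SS" strings.
csv_text = "time_delay\n00:05:00\n00:26:00\n"

data = pd.read_csv(io.StringIO(csv_text))
# At this point the column holds plain strings, not timedeltas.

# Convert the strings explicitly, then express them in minutes.
data["time_delay"] = pd.to_timedelta(data["time_delay"])
minutes = (data["time_delay"].dt.total_seconds() / 60).tolist()
print(minutes)
```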


Try this:

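def func(x, y):
    # x is the completion time, y is the deadline (both datetime-like).
    if x > y:
        # total_seconds() keeps hours and days, which .seconds % 60 would drop.
        return 'delayed by ' + str(int((x - y).total_seconds() // 60)) + ' minutes'
    else:
        return 'on time by ' + str(int((y - x).total_seconds() // 60)) + ' minutes'


data["ontime"] = data.apply(lambda row: func(row["completed"], row["deadline"]), axis=1)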

This gives:

    completed                   deadline              ontime
0   2013-07-15 23:10:00    2013-07-15 23:15:00     on time by 5 minutes
1   2013-07-16 00:20:00    2013-07-16 00:15:00     delayed by 5 minutes
2   2013-07-16 00:20:00    2013-07-16 00:15:00     delayed by 5 minutes
3   2013-07-16 21:04:00    2013-07-16 21:30:00     on time by 26 minutes
4   2013-07-16 21:58:00    2013-07-16 22:00:00     on time by 2 minutes
5   2013-07-16 23:21:00    2013-07-16 23:15:00     delayed by 6 minutes
6   2013-07-16 23:21:00    2013-07-16 23:15:00     delayed by 6 minutes
7   2013-07-17 00:19:00    2013-07-17 00:15:00     delayed by 4 minutes
8   2013-07-17 00:19:00    2013-07-17 00:15:00     delayed by 4 minutes
9   2013-07-17 21:18:00    2013-07-17 21:30:00     on time by 12 minutes
10  2013-07-17 22:07:00    2013-07-17 22:00:00     delayed by 7 minutes
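
The same labels can also be built column-wise instead of row by row. This is a sketch of an alternative, not the answer's own approach; it assumes both columns are already datetimes and uses only two of the rows above as sample data.

```python
import pandas as pd

data = pd.DataFrame({
    "completed": pd.to_datetime(["2013-07-15 23:10", "2013-07-16 00:20"]),
    "deadline": pd.to_datetime(["2013-07-15 23:15", "2013-07-16 00:15"]),
})

# Signed difference in seconds; positive means the deadline was missed.
seconds = (data["completed"] - data["deadline"]).dt.total_seconds()

# Whole minutes of the absolute difference, as integers.
abs_minutes = (seconds.abs() // 60).astype(int)

# Pick the prefix per row, then assemble the label with string concatenation.
prefix = seconds.gt(0).map({True: "delayed by ", False: "on time by "})
data["ontime"] = prefix + abs_minutes.astype(str) + " minutes"
print(data["ontime"].tolist())
```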

I had not seen the accepted answer; it offers a better solution. - nom-mon-ir
Nice answer, but when I tried your code I got this error: "unsupported operand type(s) for -: 'str' and 'str'", u'occurred at index 0' - richie
I tried this and it works: data["ontime"] = data.apply(lambda row: func(pd.Timestamp(row["completed"]), pd.Timestamp(row["deadline"])), axis=1) - richie
Ah, I just saw your comment. After pd.read_csv I had to run data['completed'] = pd.to_datetime(data['completed']), which converts the str values to datetime, so I never hit the str vs datetime error you ran into. - nom-mon-ir
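
The str vs datetime error discussed in these comments can be avoided by converting both columns right after loading. The sketch below assumes a comma-separated file and passes an explicit format string, since the question's dates are day-first (15-07-2013); the inline CSV text stands in for the real file.

```python
import io

import pandas as pd

# Two rows in the question's day-first layout; the comma separator is an assumption.
csv_text = (
    "completed,deadline\n"
    "15-07-2013 23:10,15-07-2013 23:15\n"
    "16-07-2013 00:20,16-07-2013 00:15\n"
)

# Without parse_dates, both columns are read as strings.
data = pd.read_csv(io.StringIO(csv_text))

# An explicit format removes any day/month ambiguity.
for col in ("completed", "deadline"):
    data[col] = pd.to_datetime(data[col], format="%d-%m-%Y %H:%M")

# Subtraction now works because both columns are datetimes.
delay = data["completed"] - data["deadline"]
print(delay)
```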
