为 pandas DataFrame 计算活动间隔。

Question

为 pandas DataFrame 计算活动间隔。

3

在Python中，假设有以下pandas的DataFrame数据：

date                  ID_bulb  switch   using_time          error
2022-02-05 14:30:21+00:00   5   OFF       NaT               INIT
2022-02-27 15:30:21+00:00   5   ON        NaT               ERROR2
2022-02-27 17:05:21+00:00   5   OFF   0 days 01:35:00       OK
2022-04-07 15:30:21+00:00   5   ON        NaT               OK
2022-04-07 15:30:21+00:00   5   OFF   0 days 00:00:00       OK
2022-04-07 17:05:21+00:00   5   OFF       NaT               ERROR2
2022-04-06 15:30:21+00:00   4   ON        NaT               INIT
2022-04-06 15:35:21+00:00   4   OFF       NaT               ERROR1
2022-04-06 16:10:21+00:00   4   ON        NaT               ERROR2
2022-04-07 15:30:21+00:00   4   OFF   0 days 23:20:00       OK
2022-04-07 17:05:21+00:00   4   ON        NaT               ERROR2
2022-01-01 19:40:21+00:00   3   ON        NaT               INIT
2022-02-03 22:40:21+00:00   3   ON        NaT               ERROR2
2022-02-03 23:20:21+00:00   3   OFF   0 days 00:40:00       OK
2022-02-04 00:20:21+00:00   3   ON        NaT               OK
2022-02-04 14:30:21+00:00   3   ON        NaT               ERROR2
2022-02-04 15:30:21+00:00   3   ON        NaT               ERROR2
2022-02-04 15:35:21+00:00   3   OFF   0 days 00:05:00       OK
2022-02-04 15:40:21+00:00   3   OFF       NaT               ERROR2
2022-02-04 19:40:21+00:00   3   ON        NaT               OK
2022-02-06 15:35:21+00:00   3   OFF   1 days 19:55:00       OK
2022-02-28 18:40:21+00:00   3   ON        NaT               ERROR1
2022-10-12 18:40:21+00:00   3   OFF   226 days 00:00:00     OK
2022-02-04 09:10:21+00:00   2   ON        NaT               OK
2022-02-04 14:10:21+00:00   2   ON        NaT               ERROR2

需要解决的问题: 我想添加一个名为cost_days的新列。该列只包括变量using_time与NaT不同的行。在start_time到end_time定义的夜间周期中，至少连续n小时亮着的灯泡次数信息。

基于@keramat的想法提出的解决方案.

def rounder(x):
      # Fixed parameters, to be at least 5 hours in the interval from 22:00 to 07:00
      n = 5
      start_date = "22:00"
      end_date = "07:00"
      # assert (n+1) < time_slot
      time_1 = datetime.strptime(start_date,"%H:%M")
      time_2 = datetime.strptime(end_date,"%H:%M")
      time_slot = (time_2 - time_1).seconds // 3600
      v = pd.date_range(list(x)[-2], list(x)[-1], freq='1h')
      temp = pd.Series(v, index = v).between_time(start_date, end_date)
      temp = len(temp)/time_slot
      return np.floor(temp) if np.mod(temp, 1.0) < (n+1)/time_slot else np.ceil(temp)/time_slot

    g = (df['using_time'].notna()).sort_index(ascending=False).cumsum()
    g = (g-max(g)).abs()

    temp = df.groupby(g)['date'].apply(lambda x: rounder(x))
    
    #Up to this point, it runs perfectly.
    
    
    df.loc[df[df['using_time'].notna()].index, 'cost_days']=temp.values
    
    # ValueError: shape mismatch: value array of shape (8,) could not be broadcast to indexing result of shape (7,)
 
    df['cost_days'] = df['cost_days'].fillna(0)
    
    print(df)

我需要修复错误以获得想要的结果。

生成的数据框应如下所示：

date                  ID_bulb  switch   using_time          error    cost_days
2022-02-05 14:30:21+00:00   5   OFF       NaT               INIT         0
2022-02-27 15:30:21+00:00   5   ON        NaT               ERROR2       0
2022-02-27 17:05:21+00:00   5   OFF   0 days 01:35:00       OK           0
2022-04-07 15:30:21+00:00   5   ON        NaT               OK           0
2022-04-07 15:30:21+00:00   5   OFF   0 days 00:00:00       OK           0
2022-04-07 17:05:21+00:00   5   OFF       NaT               ERROR2       0
2022-04-06 15:30:21+00:00   4   ON        NaT               INIT         0
2022-04-06 15:35:21+00:00   4   OFF       NaT               ERROR1       0
2022-04-06 16:10:21+00:00   4   ON        NaT               ERROR2       0
2022-04-07 15:30:21+00:00   4   OFF   0 days 23:20:00       OK           1
2022-04-07 17:05:21+00:00   4   ON        NaT               ERROR2       0
2022-01-01 19:40:21+00:00   3   ON        NaT               INIT         0
2022-02-03 22:40:21+00:00   3   ON        NaT               ERROR2       0
2022-02-03 23:20:21+00:00   3   OFF   0 days 00:40:00       OK           0
2022-02-04 00:20:21+00:00   3   ON        NaT               OK           0
2022-02-04 14:30:21+00:00   3   ON        NaT               ERROR2       0
2022-02-04 15:30:21+00:00   3   ON        NaT               ERROR2       0
2022-02-04 15:35:21+00:00   3   OFF   0 days 00:05:00       OK           0
2022-02-04 15:40:21+00:00   3   OFF       NaT               ERROR2       0
2022-02-04 19:40:21+00:00   3   ON        NaT               OK           0
2022-02-06 15:35:21+00:00   3   OFF   1 days 19:55:00       OK           2
2022-02-28 18:40:21+00:00   3   ON        NaT               ERROR1       0
2022-10-12 18:40:21+00:00   3   OFF   226 days 00:00:00     OK           226
2022-02-04 09:10:21+00:00   2   ON        NaT               OK           0
2022-02-04 14:10:21+00:00   2   ON        NaT               ERROR2       0

编辑: 我认为问题在于输入数据集必须以非空的using_time值结束，以上代码才能正常工作。如何通过修复这个问题来获得我想要的性能？

- Carola

你好，能否展示一下最终的数据框应该是什么样子？ - Laurent

ID_bulb 也不应该有影响吗？例如，如果您有一个序列：[(1, 'ON'), (2, 'ON'), (1, 'OFF')]，我想 'OFF' 应该与 (1, 'ON') 行匹配，对吗？ - Pierre D

'ERROR1'和'ERROR2'是什么意思？例如，在2022-02-04 15:35:21，used_time为05:00，这是与上一行的时间差，尽管该行具有'ERROR2'条件。在稍微上面有一个('ON', 'OK')行，时间是2022-02-04 00:20:21；难道它不应该是时间差的起点吗？只是检查所有逻辑。 - Pierre D

@PierreD ID_bulb 很重要，但默认情况下它已经排序。error 属性对此函数的操作不重要。 - Carola

3个回答

1

使用：

if df['using_time'].iloc[-1]=='NaT':
    g = g[g!=g.max()]

请注意，由于我使用了您的字符串创建了df，因此使用时间是字符串类型。在您的情况下，可能需要执行以下操作：

if df['using_time'].iloc[-1].isna():
    g = g[g!=g.max()]

演示：

string = """date   ID_bulb   switch   using_time   error
2022-02-05 14:30:21+00:00   5   OFF   NaT   INIT
2022-02-27 15:30:21+00:00   5   ON   NaT   ERROR2
2022-02-27 17:05:21+00:00   5   OFF   0 days 01:35:00   OK
2022-04-07 15:30:21+00:00   5   ON   NaT   OK
2022-04-07 15:30:21+00:00   5   OFF   0 days 00:00:00   OK
2022-04-07 17:05:21+00:00   5   OFF   NaT   ERROR2
2022-04-06 15:30:21+00:00   4   ON   NaT   INIT
2022-04-06 15:35:21+00:00   4   OFF   NaT   ERROR1
2022-04-06 16:10:21+00:00   4   ON   NaT   ERROR2
2022-04-07 15:30:21+00:00   4   OFF   0 days 23:20:00   OK
2022-04-07 17:05:21+00:00   4   ON   NaT   ERROR2
2022-01-01 19:40:21+00:00   3   ON   NaT   INIT
2022-02-03 22:40:21+00:00   3   ON   NaT   ERROR2
2022-02-03 23:20:21+00:00   3   OFF   0 days 00:40:00   OK
2022-02-04 00:20:21+00:00   3   ON   NaT   OK
2022-02-04 14:30:21+00:00   3   ON   NaT   ERROR2
2022-02-04 15:30:21+00:00   3   ON   NaT   ERROR2
2022-02-04 15:35:21+00:00   3   OFF   0 days 00:05:00   OK
2022-02-04 15:40:21+00:00   3   OFF   NaT   ERROR2
2022-02-04 19:40:21+00:00   3   ON   NaT   OK
2022-02-06 15:35:21+00:00   3   OFF   1 days 19:55:00   OK
2022-02-28 18:40:21+00:00   3   ON   NaT   ERROR1
2022-10-12 18:40:21+00:00   3   OFF   226 days 00:00:00   OK
2022-02-04 09:10:21+00:00   2   ON   NaT   OK
2022-02-04 14:10:21+00:00   2   ON   NaT   ERROR2"""

data = [x.split('   ') for x in string.split('\n')]
df = pd.DataFrame(data[1:], columns = data[0])

from datetime import datetime
import numpy as np

def rounder(x):
  # Fixed parameters, to be at least 5 hours in the interval from 22:00 to 07:00
  n = 5
  start_date = "22:00"
  end_date = "07:00"
  # assert (n+1) < time_slot
  time_1 = datetime.strptime(start_date,"%H:%M")
  time_2 = datetime.strptime(end_date,"%H:%M")
  time_slot = (time_2 - time_1).seconds // 3600
  v = pd.date_range(list(x)[-2], list(x)[-1], freq='1h')
  temp = pd.Series(v, index = v).between_time(start_date, end_date)
  temp = len(temp)/time_slot
  return np.floor(temp) if np.mod(temp, 1.0) < (n+1)/time_slot else np.ceil(temp)/time_slot

g = (df['using_time']!='NaT').sort_index(ascending=False).cumsum()
g = (g-max(g)).abs()
if df['using_time'].iloc[-1]=='NaT':
    g = g[g!=g.max()]
temp = df.groupby(g)['date'].apply(lambda x: rounder(x))  

df.loc[df[df['using_time']!='NaT'].index, 'cost_days']=temp.values

# ValueError: shape mismatch: value array of shape (8,) could not be broadcast to indexing result of shape (7,)

df['cost_days'] = df['cost_days'].fillna(0)

df

输出：

date    ID_bulb switch  using_time  error   cost_days
0   2022-02-05 14:30:21+00:00   5   OFF NaT INIT    0.0
1   2022-02-27 15:30:21+00:00   5   ON  NaT ERROR2  0.0
2   2022-02-27 17:05:21+00:00   5   OFF 0 days 01:35:00 OK  0.0
3   2022-04-07 15:30:21+00:00   5   ON  NaT OK  0.0
4   2022-04-07 15:30:21+00:00   5   OFF 0 days 00:00:00 OK  0.0
5   2022-04-07 17:05:21+00:00   5   OFF NaT ERROR2  0.0
6   2022-04-06 15:30:21+00:00   4   ON  NaT INIT    0.0
7   2022-04-06 15:35:21+00:00   4   OFF NaT ERROR1  0.0
8   2022-04-06 16:10:21+00:00   4   ON  NaT ERROR2  0.0
9   2022-04-07 15:30:21+00:00   4   OFF 0 days 23:20:00 OK  1.0
10  2022-04-07 17:05:21+00:00   4   ON  NaT ERROR2  0.0
11  2022-01-01 19:40:21+00:00   3   ON  NaT INIT    0.0
12  2022-02-03 22:40:21+00:00   3   ON  NaT ERROR2  0.0
13  2022-02-03 23:20:21+00:00   3   OFF 0 days 00:40:00 OK  0.0
14  2022-02-04 00:20:21+00:00   3   ON  NaT OK  0.0
15  2022-02-04 14:30:21+00:00   3   ON  NaT ERROR2  0.0
16  2022-02-04 15:30:21+00:00   3   ON  NaT ERROR2  0.0
17  2022-02-04 15:35:21+00:00   3   OFF 0 days 00:05:00 OK  0.0
18  2022-02-04 15:40:21+00:00   3   OFF NaT ERROR2  0.0
19  2022-02-04 19:40:21+00:00   3   ON  NaT OK  0.0
20  2022-02-06 15:35:21+00:00   3   OFF 1 days 19:55:00 OK  2.0
21  2022-02-28 18:40:21+00:00   3   ON  NaT ERROR1  0.0
22  2022-10-12 18:40:21+00:00   3   OFF 226 days 00:00:00   OK  226.0
23  2022-02-04 09:10:21+00:00   2   ON  NaT OK  0.0
24  2022-02-04 14:10:21+00:00   2   ON  NaT ERROR2  0.0

- keramat

1

也许你可以使用transform代替apply以保持与输入相同的形状。然后，使用where仅选择using_time不为空的行：

# temp = df.groupby(g)['date'].apply(lambda x: rounder(x))  
df['cost_day'] = (df.groupby(g)['date'].transform(lambda x: rounder(x))
                    .where(df['using_time'].notna(), other=0).astype(int))

输出：

>>> df
                        date  ID_bulb switch        using_time   error  cost_day
0  2022-02-05 14:30:21+00:00        5    OFF               NaT    INIT         0
1  2022-02-27 15:30:21+00:00        5     ON               NaT  ERROR2         0
2  2022-02-27 17:05:21+00:00        5    OFF   0 days 01:35:00      OK         0
3  2022-04-07 15:30:21+00:00        5     ON               NaT      OK         0
4  2022-04-07 15:30:21+00:00        5    OFF   0 days 00:00:00      OK         0
5  2022-04-07 17:05:21+00:00        5    OFF               NaT  ERROR2         0
6  2022-04-06 15:30:21+00:00        4     ON               NaT    INIT         0
7  2022-04-06 15:35:21+00:00        4    OFF               NaT  ERROR1         0
8  2022-04-06 16:10:21+00:00        4     ON               NaT  ERROR2         0
9  2022-04-07 15:30:21+00:00        4    OFF   0 days 23:20:00      OK         1
10 2022-04-07 17:05:21+00:00        4     ON               NaT  ERROR2         0
11 2022-01-01 19:40:21+00:00        3     ON               NaT    INIT         0
12 2022-02-03 22:40:21+00:00        3     ON               NaT  ERROR2         0
13 2022-02-03 23:20:21+00:00        3    OFF   0 days 00:40:00      OK         0
14 2022-02-04 00:20:21+00:00        3     ON               NaT      OK         0
15 2022-02-04 14:30:21+00:00        3     ON               NaT  ERROR2         0
16 2022-02-04 15:30:21+00:00        3     ON               NaT  ERROR2         0
17 2022-02-04 15:35:21+00:00        3    OFF   0 days 00:05:00      OK         0
18 2022-02-04 15:40:21+00:00        3    OFF               NaT  ERROR2         0
19 2022-02-04 19:40:21+00:00        3     ON               NaT      OK         0
20 2022-02-06 15:35:21+00:00        3    OFF   1 days 19:55:00      OK         2
21 2022-02-28 18:40:21+00:00        3     ON               NaT  ERROR1         0
22 2022-10-12 18:40:21+00:00        3    OFF 226 days 00:00:00      OK       226
23 2022-02-04 09:10:21+00:00        2     ON               NaT      OK         0
24 2022-02-04 14:10:21+00:00        2     ON               NaT  ERROR2         0

- Corralien

@Carola。我的答案有什么问题吗？ - Corralien

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- cottontail · Accepted Answer

将used_time添加到日期中，在原始日期上强制将7:00和22:00之外的时间转换为22:00，并计算时间差。

n = 5
start_date = "22:00"
end_date = "07:00"

dates = pd.to_datetime(df['date'])
df['using_time'] = df['using_time'].replace('NaT', np.nan)
# replace NaT to 0 and subtract using_time from dates
starting_times = dates - pd.to_timedelta(df['using_time'].fillna(0))


# boolean mask that flags the times outside the hours of interest from starting times
hours_of_interest = starting_times.dt.time.between((pd.Timestamp(end_date) - pd.to_timedelta(n, unit='H')).time(), 
                                                   pd.Timestamp(start_date).time(), inclusive='neither')
# replace any time outside the hours of interest to start_date
starting_times = starting_times.mask(hours_of_interest, starting_times.dt.strftime(f"%Y-%m-%d {start_date}:00"))


# find the difference between the dates
diff = dates - starting_times
# separate the difference into days and hours and 1 to the number of days if the number of hours is greater than n 0 otherwise
cost_days = diff.dt.days + (diff - pd.to_timedelta(diff.dt.days, unit='D') >= pd.to_timedelta(n, unit='H'))
# replace cost_days to zero if using_time is NaT
df['cost_days'] = cost_days.mask(df['using_time'].isna(), 0)
df