Creating a column with an incrementing counter in pandas to identify sets of duplicates

Question

I have a large DataFrame in which a subset of the columns holds identical values across rows, dup_columns = ['id', 'subject','topic', 'lesson', 'time'], while another column is unique: ['timestamps'].

   id  subj  topic  lesson  timestamp   time   dup_ind  dup_group  time_diff
1   1  math  add    a       timestamp1  45sec  True     1          timestamp1-timestamp2
2   1  math  add    a       timestamp2  45sec  True     1          timestamp1-timestamp2
3   1  math  add    a       timestamp2  30sec  False    NaN
4   1  math  add    a       timestamp3  15sec  False    NaN
5   1  math  add    b       timestamp1  0sec   True     2          timestamp1-timestamp4
6   1  math  add    b       timestamp4  0sec   True     2          timestamp1-timestamp4
7   1  math  add    b       timestamp1  45sec  True     3          timestamp1-timestamp2
8   1  math  add    b       timestamp2  45sec  True     3          timestamp1-timestamp2

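I have a column ['is_duplicate'] that flags duplicates based on dup_columns. I need to create another column, ['dup_group'], that uniquely identifies each set of duplicated rows by assigning a unique group value (1, 2, 3, ...). Ultimately, I need to use dup_group to compare the timestamp values within each duplicate group (I am using the .diff() method for this).

Here is the code I wrote: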

df2= df1.loc[df1['is_duplicated']==True]
def dup_counter():
    for name, group in df11.groupby(dup_columns):
        df[name, df['dupsetnew']]+=1
    return df['dupsetnew']

df11.groupby(dup_columns).apply(dup_counter)

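Problem 1: the function gives me an error (I am new to Python and to programming).

To compute the difference between timestamps, I have the following code: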

df['time_diff'] = df.loc[df.dup_indicator == 1,'event_time'].diff()

Problem 2: is .diff the right method for what I need?

- at_ca

1 Answer

- jpp · Answer 1

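Here is one way. Note that I have changed df['timestamp'] to an integer series to demonstrate the principle, but this can be adapted for datetime objects.

The idea is to use pd.factorize on a list of tuples to label the groups. Then apply a forward and a backward groupby.diff to get the desired result: the forward diff fills the later rows of each group, and the negated backward diff fills the first row, which the forward diff leaves as NaN.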

df['timestamp'] = [1, 2, 2, 3, 1, 4, 1, 2]

df['dup_group'] = pd.factorize(list(zip(df['id'], df['subj'], df['topic'], 
                                        df['lesson'], df['time'])))[0] + 1

df['time_diff'] = df.groupby('dup_group')['timestamp'].transform(pd.Series.diff)

df['time_diff'] = df['time_diff'].fillna(-df.groupby('dup_group')['timestamp']\
                                            .transform(pd.Series.diff, periods=-1))

#    id  subj topic lesson  timestamp   time  dup_ind  dup_group  time_diff
# 1   1  math   add      a          1  45sec     True          1        1.0
# 2   1  math   add      a          2  45sec     True          1        1.0
# 3   1  math   add      a          2  30sec    False          2        NaN
# 4   1  math   add      a          3  15sec    False          3        NaN
# 5   1  math   add      b          1   0sec     True          4        3.0
# 6   1  math   add      b          4   0sec     True          4        3.0
# 7   1  math   add      b          1  45sec     True          5        1.0
# 8   1  math   add      b          2  45sec     True          5        1.0
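
The output above numbers every distinct key combination, including the rows that are not duplicated (groups 2 and 3). If those rows should stay NaN, as in the question's desired output, one possible variation is sketched below. It is not the approach used above: it swaps pd.factorize for DataFrame.duplicated(keep=False) plus groupby(...).ngroup(), and it computes the gap as max minus min per group, which matches the pairwise difference only when each duplicate set has exactly two rows. The column names are assumed to follow the sample table.

```python
import pandas as pd

# Column names assumed from the sample table in the question
dup_columns = ["id", "subj", "topic", "lesson", "time"]

df = pd.DataFrame({
    "id": [1] * 8,
    "subj": ["math"] * 8,
    "topic": ["add"] * 8,
    "lesson": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "time": ["45sec", "45sec", "30sec", "15sec", "0sec", "0sec", "45sec", "45sec"],
    "timestamp": [1, 2, 2, 3, 1, 4, 1, 2],  # integers stand in for real timestamps
}, index=range(1, 9))

# True for every row whose key combination occurs more than once
is_dup = df.duplicated(subset=dup_columns, keep=False)

# Group only the duplicated rows; sort=False keeps first-appearance order
grouped = df[is_dup].groupby(dup_columns, sort=False)

# Assignment aligns on the index, so non-duplicated rows are left as NaN
df["dup_group"] = grouped.ngroup() + 1

# Spread of the timestamps within each duplicate set (max minus min)
ts = grouped["timestamp"]
df["time_diff"] = ts.transform("max") - ts.transform("min")

print(df[["lesson", "time", "timestamp", "dup_group", "time_diff"]])
```

Because both new columns are built from the filtered frame and assigned back by index, no separate masking step is needed for the rows that have no duplicate.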

Source data

import pandas as pd
from numpy import nan

df = pd.DataFrame({'dup_group': {1: 1.0, 2: 1.0, 3: nan, 4: nan, 5: 2.0, 6: 2.0, 7: 3.0, 8: 3.0},
 'dup_ind': {1: True, 2: True, 3: False, 4: False, 5: True, 6: True, 7: True, 8: True},
 'id': {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
 'lesson': {1: 'a', 2: 'a', 3: 'a', 4: 'a', 5: 'b', 6: 'b', 7: 'b', 8: 'b'},
 'subj': {1: 'math', 2: 'math', 3: 'math', 4: 'math', 5: 'math', 6: 'math', 7: 'math', 8: 'math'},
 'time': {1: '45sec', 2: '45sec', 3: '30sec', 4: '15sec', 5: '0sec', 6: '0sec', 7: '45sec', 8: '45sec'},
 'time_diff': {1: 'timestamp1-timestamp2', 2: 'timestamp1-timestamp2', 3: nan, 4: nan, 5: 'timestamp1-timestamp4', 6: 'timestamp1-timestamp4', 7: 'timestamp1-timestamp2', 8: 'timestamp1-timestamp2'},
 'timestamp': {1: 'timestamp1', 2: 'timestamp2', 3: 'timestamp2', 4: 'timestamp3', 5: 'timestamp1', 6: 'timestamp4', 7: 'timestamp1', 8: 'timestamp2'},
 'topic': {1: 'add', 2: 'add', 3: 'add', 4: 'add', 5: 'add', 6: 'add', 7: 'add', 8: 'add'}})
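
The timestamps in the source data are placeholder strings, so they cannot be subtracted directly. As a rough sketch of how the same forward-and-backward diff idea might carry over to real datetime values, the example below uses hypothetical event times and a reduced set of key columns (both are assumptions, not data from the question). Subtracting datetime values yields Timedelta results.

```python
import pandas as pd

df = pd.DataFrame({
    "lesson": ["a", "a", "b", "b"],
    "time": ["45sec", "45sec", "0sec", "0sec"],
    # Hypothetical event times; the question only shows placeholders
    "timestamp": pd.to_datetime([
        "2018-01-01 10:00:00", "2018-01-01 10:00:30",
        "2018-01-01 11:00:00", "2018-01-01 11:02:00",
    ]),
})

grouped = df.groupby(["lesson", "time"], sort=False)["timestamp"]

# Current row minus the previous row of the same group (NaT on the first row)
forward = grouped.diff()

# Next row minus the current row of the same group (NaT on the last row)
backward = -grouped.diff(periods=-1)

# Fill the first row of each group from the backward difference
df["time_diff"] = forward.fillna(backward)

print(df)
```

If the real column holds strings, converting it with pd.to_datetime first would be the likely starting point; the exact format argument depends on how the timestamps are stored.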