Conditionally merge consecutive rows of a pandas DataFrame

3

I have an input dataframe that looks like this:

NAME    TEXT
Tim     Tim Wagner is a teacher.
Tim     He is from Cleveland, Ohio.
Frank   Frank is a musician.
Tim     He likes to travel with his family
Frank   He is a performing artist who plays the cello.
Frank   He performed at the Carnegie Hall last year.
Frank   It was fantastic listening to him.

I want to concatenate the TEXT values whenever consecutive rows have the same value in the NAME column.

Output dataframe:

NAME    TEXT
Tim     Tim Wagner is a teacher. He is from Cleveland, Ohio.
Frank   Frank is a musician.
Tim     He likes to travel with his family
Frank   He is a performing artist who plays the cello. He performed at the Carnegie Hall last year. It was fantastic listening to him.

Is the pandas shift function the best way to do this? Any help is appreciated.

Thanks

2 Answers

5

Try:

grp = (df['NAME'] != df['NAME'].shift()).cumsum().rename('group')
df.groupby(['NAME', grp], sort=False)['TEXT']\
  .agg(' '.join).reset_index().drop('group', axis=1)

Output:

    NAME                                               TEXT
0    Tim  Tim Wagner is a teacher. He is from Cleveland,...
1  Frank                                Frank is a musician
2    Tim                 He likes to travel with his family
3  Frank  He is a performing artist who plays the cello....
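
To see why this works, here is a minimal sketch that prints the intermediate values. The dataframe is rebuilt from the question's sample data, and the names `starts` and `steps` are only for illustration. The point is that `cumsum()` never touches the text: it runs on the boolean Series produced by the comparison.

```python
import pandas as pd

df = pd.DataFrame({
    'NAME': ['Tim', 'Tim', 'Frank', 'Tim', 'Frank', 'Frank', 'Frank'],
    'TEXT': ['Tim Wagner is a teacher.',
             'He is from Cleveland, Ohio.',
             'Frank is a musician.',
             'He likes to travel with his family',
             'He is a performing artist who plays the cello.',
             'He performed at the Carnegie Hall last year.',
             'It was fantastic listening to him.'],
})

# True wherever NAME differs from the row above. The first row is compared
# with the missing value that shift() introduces, so it is also True.
starts = df['NAME'] != df['NAME'].shift()

# cumsum() counts the True values seen so far, so every run of
# consecutive equal names ends up with its own integer id.
grp = starts.cumsum().rename('group')

# Side-by-side view of the intermediate steps (illustrative only).
steps = pd.DataFrame({'NAME': df['NAME'], 'starts': starts, 'group': grp})
print(steps)
```

Grouping by both NAME and this run id is what keeps the two separate Tim runs (and the two Frank runs) from being merged into one.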

1
Not sure how cumsum() works on a text column (NAME). Also, is grp the dataframe grouped from df['NAME'].shift().cumsum().rename('group')? - user14262559
1
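Thanks @scott-boston, this works well. I have a follow-up question. I also want to aggregate two more columns (START, END). The new START should come from the first row of each group and the new END from the last row. For example: NAME TEXT START END Tim Tim Wagner is a teacher. 10 20.5 Tim He is from Cleveland, Ohio. 20.5 40 should be aggregated to: NAME TEXT START END Tim Tim Wagner is a teacher. He is from Cleveland, Ohio. 10 40 - user14262559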
1
63886474 is the new one. @scott-benson, thank you very much for your help. - user14262559
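Thank you so much! This code solved my problem. Just adding a note on the second line of code: @Scott-Boston's original version did not work in my script, but the following modification does, and it also drops N/A values when merging: df = df.groupby([0, grp], sort=False).agg(lambda x: ','.join(x.dropna().astype(str))).reset_index().drop('group', axis=1) - LearnAWK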
1
Great answer. Saved me a lot of time!! - Jeru Luke
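
For the START/END follow-up raised in the comments, one possible approach is pandas named aggregation with `'first'` and `'last'`. This is only a sketch, not the answerer's own solution: the START/END values for the Tim rows come from the comment, while the Frank row's values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'NAME': ['Tim', 'Tim', 'Frank'],
    'TEXT': ['Tim Wagner is a teacher.',
             'He is from Cleveland, Ohio.',
             'Frank is a musician.'],
    'START': [10.0, 20.5, 40.0],   # Frank's START is a hypothetical value
    'END': [20.5, 40.0, 55.0],     # Frank's END is a hypothetical value
})

# Same run id as in the answer: a new id starts whenever NAME changes.
grp = (df['NAME'] != df['NAME'].shift()).cumsum().rename('group')

out = (
    df.groupby(['NAME', grp], sort=False)
      .agg(
          TEXT=('TEXT', lambda s: ' '.join(s)),  # concatenate the texts of the run
          START=('START', 'first'),              # START of the first row in the run
          END=('END', 'last'),                   # END of the last row in the run
      )
      .reset_index()
      .drop('group', axis=1)
)
print(out)
```

Note that `'first'` and `'last'` skip missing values by default, which may or may not be what you want if START or END can be NaN.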

0

I iterated over the rows and built a new DataFrame.


import pandas as pd

df = pd.DataFrame([['Tim', 'Tim Wagner is a teacher.'],
['Tim', 'He is from Cleveland, Ohio.'],
['Frank', 'Frank is a musician'],
['Tim', 'He likes to travel with his family'],
['Frank', 'He is a performing artist who plays the cello.'],
['Frank', 'He performed at the Carnegie Hall last year'],
['Frank', 'It was fantastic listening to him']], columns=['NAME', 'TEXT'])

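col = None   # NAME of the run currently being accumulated
txt = ""     # concatenated TEXT of that run
arr = []     # merged [NAME, TEXT] rows
for _, row in df.iterrows():
    if col == row['NAME']:
        # Same NAME as the previous row: extend the current run.
        txt += ' ' + row['TEXT']
    else:
        # NAME changed: store the finished run (if any) and start a new one.
        if col is not None:
            arr.append([col, txt])
        col = row['NAME']
        txt = row['TEXT']

# Store the last run, which the loop itself never flushes.
if col is not None:
    arr.append([col, txt])

print(pd.DataFrame(arr, columns=['NAME', 'TEXT']))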


Thanks, this works. Since the dataframe is large, I was looking for a non-looping approach. - user14262559
