Conditionally merge consecutive rows of a Pandas DataFrame

Question

3

I have an input data frame that looks like this:

NAME    TEXT
Tim     Tim Wagner is a teacher.
Tim     He is from Cleveland, Ohio.
Frank   Frank is a musician.
Tim     He likes to travel with his family
Frank   He is a performing artist who plays the cello.
Frank   He performed at the Carnegie Hall last year.
Frank   It was fantastic listening to him.

If consecutive rows have the same value in the NAME column, I want to concatenate their TEXT values.

Output data frame:

NAME    TEXT
Tim     Tim Wagner is a teacher. He is from Cleveland, Ohio.
Frank   Frank is a musician.
Tim     He likes to travel with his family
Frank   He is a performing artist who plays the cello. He performed at the Carnegie Hall last year. It was fantastic listening to him.

Is using pandas' shift function the best way to do this? Any help is appreciated.

Thanks

- user14262559

2 Answers

0

I iterated over the rows and built a new DataFrame.


import pandas as pd

df = pd.DataFrame([['Tim', 'Tim Wagner is a teacher.'],
                   ['Tim', 'He is from Cleveland, Ohio.'],
                   ['Frank', 'Frank is a musician'],
                   ['Tim', 'He likes to travel with his family'],
                   ['Frank', 'He is a performing artist who plays the cello.'],
                   ['Frank', 'He performed at the Carnegie Hall last year'],
                   ['Frank', 'It was fantastic listening to him']],
                  columns=['NAME', 'TEXT'])

arr = []      # merged [NAME, TEXT] rows
col = None    # NAME of the run currently being built
txt = ""      # TEXT accumulated for the current run
for _, row in df.iterrows():
    if row['NAME'] == col:
        # Same NAME as the previous row: extend the current run.
        txt += ' ' + row['TEXT']
    else:
        # NAME changed: store the finished run (if any) and start a new one.
        if col is not None:
            arr.append([col, txt])
        col = row['NAME']
        txt = row['TEXT']

# Store the last run, which the loop itself never appends.
if col is not None:
    arr.append([col, txt])

print(pd.DataFrame(arr, columns=['NAME', 'TEXT']))

- joesph nguyen

Thanks, this works. Since the data frame is large, I am looking for a non-loop approach. - user14262559

- Scott Boston · Accepted Answer

5

Try this:

# Start a new group number every time NAME differs from the previous row.
grp = (df['NAME'] != df['NAME'].shift()).cumsum().rename('group')
df.groupby(['NAME', grp], sort=False)['TEXT']\
  .agg(' '.join).reset_index().drop('group', axis=1)

Output:

    NAME                                               TEXT
0    Tim  Tim Wagner is a teacher. He is from Cleveland,...
1  Frank                                Frank is a musician
2    Tim                 He likes to travel with his family
3  Frank  He is a performing artist who plays the cello....

- Scott Boston

1

I am not sure how cumsum() works on a text column (NAME). Also, is grp a grouped data frame built from df['NAME'].shift().cumsum().rename('group')? - user14262559
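On the cumsum() question above: cumsum() is never applied to the text column itself. The comparison `df['NAME'] != df['NAME'].shift()` yields a boolean Series, and summing booleans (True counts as 1) gives a counter that increases at every change of name. So grp is a plain Series of integer labels, not a grouped data frame. A minimal sketch of the intermediate steps, reusing the sample data from this thread (the names `changed` and `steps` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["Tim", "Tim Wagner is a teacher."],
        ["Tim", "He is from Cleveland, Ohio."],
        ["Frank", "Frank is a musician"],
        ["Tim", "He likes to travel with his family"],
        ["Frank", "He is a performing artist who plays the cello."],
        ["Frank", "He performed at the Carnegie Hall last year"],
        ["Frank", "It was fantastic listening to him"],
    ],
    columns=["NAME", "TEXT"],
)

# True wherever NAME differs from the row above. The first row is compared
# against NaN (nothing above it), so it is always True.
changed = df["NAME"] != df["NAME"].shift()

# Cumulative sum of the booleans: every run of equal names gets one number.
grp = changed.cumsum().rename("group")

# Side-by-side view of the intermediate values (illustrative only).
steps = pd.DataFrame({"NAME": df["NAME"], "changed": changed, "group": grp})
print(steps)
```

Grouping by both NAME and this label keeps the two separate runs of "Tim" apart, which grouping by NAME alone would not do.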

1

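Thanks @scott-boston, this works well. I have a follow-up question. I also want to aggregate two more columns (START, END). The new START should be taken from the first row of each group and the new END from the last row. For example, with columns NAME, TEXT, START, END, the rows "Tim | Tim Wagner is a teacher. | 10 | 20.5" and "Tim | He is from Cleveland, Ohio. | 20.5 | 40" should be aggregated to "Tim | Tim Wagner is a teacher. He is from Cleveland, Ohio. | 10 | 40". - user14262559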
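One possible way to handle this follow-up, offered as a sketch and not as the answer given in the linked question, is pandas named aggregation: join TEXT, take "first" for START and "last" for END. The Frank row and its numbers below are made up for illustration; only the two Tim rows come from the comment.

```python
import pandas as pd

# The two Tim rows are from the comment above; the Frank row is hypothetical.
df = pd.DataFrame(
    {
        "NAME": ["Tim", "Tim", "Frank"],
        "TEXT": [
            "Tim Wagner is a teacher.",
            "He is from Cleveland, Ohio.",
            "Frank is a musician.",
        ],
        "START": [10.0, 20.5, 40.0],
        "END": [20.5, 40.0, 55.0],
    }
)

# Same run label as in the accepted answer.
grp = (df["NAME"] != df["NAME"].shift()).cumsum().rename("group")

out = (
    df.groupby(["NAME", grp], sort=False)
    .agg(
        TEXT=("TEXT", " ".join),   # concatenate the text of the run
        START=("START", "first"),  # START of the first row in the run
        END=("END", "last"),       # END of the last row in the run
    )
    .reset_index()
    .drop(columns="group")
)
print(out)
```

Note that "first" and "last" skip missing values within a group, so this assumes START and END are populated on every row.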

1

63886474 is the new one. @scott-benson, thank you very much for your help. - user14262559

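Thank you very much! This code solved my problem. I just want to add a note on the second line of code. The original version by @Scott-Boston did not work in my script, but the following modification does, and it also drops N/A values when merging: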

df = df.groupby([0, grp], sort=False).agg(lambda x: ','.join(x.dropna().astype(str))).reset_index().drop('group', axis=1)

- LearnAWK
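The comment does not say why the original version failed. One plausible cause, stated here as an assumption, is missing values: `str.join` accepts only strings, and a missing value in an object column is a float NaN. A small sketch of the dropna() variant on hypothetical data, using a space separator and the NAME and TEXT column names from the question (`join_non_missing` is an illustrative helper, not part of pandas):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing TEXT value.
df = pd.DataFrame(
    {
        "NAME": ["Tim", "Tim", "Frank"],
        "TEXT": ["Tim Wagner is a teacher.", np.nan, "Frank is a musician"],
    }
)

grp = (df["NAME"] != df["NAME"].shift()).cumsum().rename("group")


def join_non_missing(values):
    """Join the non-missing items of a Series, converting each to str first."""
    return " ".join(values.dropna().astype(str))


out = (
    df.groupby(["NAME", grp], sort=False)["TEXT"]
    .agg(join_non_missing)
    .reset_index()
    .drop(columns="group")
)
print(out)
```

The `astype(str)` call also lets the same helper work on numeric columns, at the cost of turning every value into text.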

1

Great answer. Saved me a lot of time!! - Jeru Luke
