Group by on a DataFrame using Python and Pandas

Question

3

Suppose I have a DataFrame df like the following:

ID	name_x	st	string
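1	xx	us	Being unacquainted with the chief raccoon was harming his chances of promotion.
2	xy	us1	The overpass went under the highway and into a secret world.
3	xz	us	He was 100% into fasting with her until he understood that meant he couldn't eat.
4	xu	us2	Random words in front of other random words create a random sentence.
5	xi	us1	All you need to do is pick up the pen and begin.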

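Using Python and Pandas, I want to group by the st column, count the name_x values in each group, and extract the top 3 keywords from the string column.

For example, like this:

st	name_x_count	top1_word	top2_word	top3_word
us	2	word1	word2	word3
us1	2	word1	word2	word3
us2	1	word1	word2	word3

Is there any way to do this?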

- dd99

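What do you mean by "top 3 keywords"? By frequency, TF-IDF? And how would you decide in case of a tie? - Celius Stingher

Yes, exactly... the top three words by frequency. In case of a tie, I would decide based on string length. - dd99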
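
The rule described in this comment (most frequent first, ties decided by length) differs from what `Counter.most_common` does by itself, since it keeps tied words in first-seen order. Below is a minimal sketch of one way to apply the rule. It assumes that "string length" means the length of the word and that the longer word wins, which the comment does not state. The helper name `top_words` is hypothetical.

```python
from collections import Counter

def top_words(text, n=3):
    """Return the n most frequent words; ties go to the longer word (assumed rule)."""
    counts = Counter(text.lower().split())
    # Sort by descending count, then descending word length, then alphabetically
    # so that the result is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], -len(w), w))
    return ranked[:n]

print(top_words("a bb a bb ccc dddd"))  # ['bb', 'a', 'dddd']
```

Here "a" and "bb" both occur twice, so the longer "bb" is ranked first. Among the words that occur once, "dddd" beats "ccc" for the same reason.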

2 Answers

1

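First, I added a space to the end of each string, because the sentences will be concatenated when grouping. Then I grouped by the st column and combined the sentences of each group.

import pandas as pd

df['string'] = df['string'] + ' '  # sum concatenates the strings, so add a trailing space as a separator

# group by st: count the name_x values and concatenate the strings
dfx = df.groupby('st').agg(name_x_count=('name_x', 'count'), string=('string', 'sum'))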

Then split each combined string into words, count how often each word occurs, and take the three most frequent.

from collections import Counter

# list of the 3 most common words for each group
mask = dfx['string'].apply(lambda x: [word for word, _ in Counter(x.split()).most_common(3)])
print(mask)

'''
st  string
us  ['with', 'was', 'he']
us1 ['the', 'and', 'The']
us2 ['words', 'random', 'Random']


'''

Finally, add these top three words as new columns.

dfx[['top1_word','top2_word','top3_word']] = pd.DataFrame(mask.tolist(), index=mask.index)
dfx = dfx.drop(columns='string')  # the concatenated text is no longer needed

dfx

st  name_x_count    top1_word   top2_word   top3_word
us  2               with        was         he
us1 2               the         and         The
us2 1               words       random      Random

- Bushmaster

- Celius Stingher · Accepted Answer

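I would first use groupby() to concatenate the strings, then use Counter from collections with most_common, and finally assign the result back to the DataFrame. I use x.lower() because otherwise "He" and "he" would be counted as different words (but if that is intended, you can simply remove it):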

import collections
import pandas as pd

output = df.groupby('st').agg(
    name_x_count = pd.NamedAgg('name_x','count'),
    string = pd.NamedAgg('string',' '.join))

After grouping, we use collections.Counter() to create the columns:

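top_words = output['string'].map(lambda x: [w for w, _ in collections.Counter(x.lower().split()).most_common(3)])
# Expand the per-group lists into one column per word, aligned on the st index
output[['top1_word','top2_word','top3_word']] = pd.DataFrame(top_words.tolist(), index=output.index)
output = output.drop(columns='string')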

Output:

     name_x_count top1_word top2_word top3_word
st                                             
us              2        he      with       was
us1             2       the       and  overpass
us2             1    random     words        in
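
One case the sample data does not exercise is a group whose text contains fewer than three distinct words, where the list of words would be too short to fill three columns. Below is a hedged sketch of a helper that always returns exactly n entries, padding with None. The name `top_words_padded` is hypothetical and not part of either answer.

```python
from collections import Counter

def top_words_padded(text, n=3):
    """Return exactly n entries: the most frequent lowercase words, then None padding."""
    words = [word for word, _ in Counter(text.lower().split()).most_common(n)]
    # Pad so that every group yields the same number of values.
    return words + [None] * (n - len(words))

print(top_words_padded("random words random"))  # ['random', 'words', None]
```

Passing a function like this to `output['string'].map(...)` in place of the lambda should keep every row at three values, so that the expansion into three columns does not depend on the data.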