How to apply pos_tag_sents() to a pandas DataFrame efficiently?

15

When you need to POS-tag a text column stored in a pandas DataFrame with one sentence per row, most implementations on SO use the apply method:

dfData['POSTags'] = dfData['SourceText'].apply(
                 lambda row: pos_tag(word_tokenize(row)))

The NLTK documentation recommends using pos_tag_sents() for efficient tagging of more than one sentence.

Does that apply to this example, and if so, is the change as simple as replacing pos_tag with pos_tag_sents, or does NLTK mean text sources of paragraphs?

As mentioned in the comments, pos_tag_sents() aims to reduce the overhead of loading the perceptron each time, but the question is how to do this and still produce a column in a pandas DataFrame?

Link to sample dataset (20k rows)


How many rows of data do you have? - alvas
It will be 20,000 rows. - mobcdi
That's not a problem. Just extract the column as a list of strings, process it, and then add it back as a column in the DataFrame. - alvas
Could you provide a code example? - mobcdi
Could you provide a data sample? Just dump your dataframe.head() to a csv file ;P - alvas
A link to a sample CSV file has now been added. - mobcdi
3 Answers

16

Input

$ cat test.csv 
ID,Task,label,Text
1,Collect Information,no response,cozily married practical athletics Mr. Brown flat
2,New Credit,no response,active married expensive soccer Mr. Chang flat
3,Collect Information,response,healthy single expensive badminton Mrs. Green flat
4,Collect Information,response,cozily married practical soccer Mr. Brown hierachical
5,Collect Information,response,cozily single practical badminton Mr. Brown flat

In short:

>>> from nltk import word_tokenize, pos_tag, pos_tag_sents
>>> import pandas as pd
>>> df = pd.read_csv('test.csv', sep=',')
>>> df['Text']
0    cozily married practical athletics Mr. Brown flat
1       active married expensive soccer Mr. Chang flat
2    healthy single expensive badminton Mrs. Green ...
3    cozily married practical soccer Mr. Brown hier...
4     cozily single practical badminton Mr. Brown flat
Name: Text, dtype: object
>>> texts = df['Text'].tolist()
>>> tagged_texts = pos_tag_sents(map(word_tokenize, texts))
>>> tagged_texts
[[('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('athletics', 'NNS'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')], [('active', 'JJ'), ('married', 'VBD'), ('expensive', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Chang', 'NNP'), ('flat', 'JJ')], [('healthy', 'JJ'), ('single', 'JJ'), ('expensive', 'JJ'), ('badminton', 'NN'), ('Mrs.', 'NNP'), ('Green', 'NNP'), ('flat', 'JJ')], [('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('hierachical', 'JJ')], [('cozily', 'RB'), ('single', 'JJ'), ('practical', 'JJ'), ('badminton', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')]]

>>> df['POS'] = tagged_texts
>>> df
   ID                 Task        label  \
0   1  Collect Information  no response   
1   2           New Credit  no response   
2   3  Collect Information     response   
3   4  Collect Information     response   
4   5  Collect Information     response   

                                                Text  \
0  cozily married practical athletics Mr. Brown flat   
1     active married expensive soccer Mr. Chang flat   
2  healthy single expensive badminton Mrs. Green ...   
3  cozily married practical soccer Mr. Brown hier...   
4   cozily single practical badminton Mr. Brown flat   

                                                 POS  
0  [(cozily, RB), (married, JJ), (practical, JJ),...  
1  [(active, JJ), (married, VBD), (expensive, JJ)...  
2  [(healthy, JJ), (single, JJ), (expensive, JJ),...  
3  [(cozily, RB), (married, JJ), (practical, JJ),...  
4  [(cozily, RB), (single, JJ), (practical, JJ), ... 

In long:

First, you can extract the Text column to a list of strings:

texts = df['Text'].tolist()

Then you can apply the word_tokenize function to each sentence:

map(word_tokenize, texts)

Note that @Boud's suggestion is nearly the same, using df.apply:
df['Text'].apply(word_tokenize)

Then you dump the tokenized text into a list of lists of strings:
df['Text'].apply(word_tokenize).tolist()

Then you can use pos_tag_sents:
pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )

Then you add the column back into the DataFrame:
df['POS'] = pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )
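Putting those steps together, here's a self-contained sketch of the same round trip (extract to a list, tag the whole batch in one call, assign back). To keep it runnable without NLTK's model data, str.split and a hypothetical toy_tag_sents stand in for word_tokenize and pos_tag_sents; the real functions slot in the same way:

```python
import pandas as pd

def toy_tag_sents(sentences):
    # Stand-in for nltk.pos_tag_sents: takes a list of token lists
    # and returns a list of (token, tag) lists. Here every token
    # just gets the dummy tag 'X'.
    return [[(tok, 'X') for tok in sent] for sent in sentences]

df = pd.DataFrame({'Text': ['cozily married', 'active single']})

# 1. Tokenize each row (str.split stands in for word_tokenize)
tokenized = df['Text'].apply(str.split).tolist()

# 2. Tag the whole batch in a single call
tagged = toy_tag_sents(tokenized)

# 3. Add the result back as a new column
df['POS'] = tagged
```

The key point is that step 2 happens once for the whole column, not once per row.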

3
Your "in short" is longer than your "in long" :) - Louis Yang
@Louis Yang -- funny! It's even longer than the "Input" section. But I just double-checked it and it runs perfectly. - hrokr

3
By applying pos_tag to each row, the perceptron model will be loaded every time (an expensive operation, since it reads a pickle file from disk).
If you instead get all the rows and send them to pos_tag_sents (which takes list(list(str))), the model is loaded once and used for everything.
See the source.
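The cost difference can be illustrated without NLTK at all. In this minimal sketch, a hypothetical Tagger class's constructor stands in for unpickling the perceptron model, and a counter shows how often that "load" happens under each approach:

```python
class Tagger:
    loads = 0  # counts how many times the "model" was loaded

    def __init__(self):
        Tagger.loads += 1  # expensive in real NLTK: reads a pickle from disk

    def tag(self, tokens):
        # Dummy tagging: every token gets the tag 'X'
        return [(tok, 'X') for tok in tokens]

rows = [['a', 'b'], ['c'], ['d', 'e']]

# Per-row tagging, like calling pos_tag inside df.apply:
# the model is loaded once per row.
per_row = [Tagger().tag(r) for r in rows]
assert Tagger.loads == len(rows)

# Batch tagging, like pos_tag_sents: load once, reuse for all rows.
Tagger.loads = 0
tagger = Tagger()
batch = [tagger.tag(r) for r in rows]
assert Tagger.loads == 1
```

Both approaches produce identical tags; only the number of model loads differs, which is exactly the overhead pos_tag_sents avoids.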

Could you provide an example that uses pos_tag_sents() with a pandas DataFrame column as both source and target, so the sentences and the tagged output end up on the same row? - mobcdi
I'd be guessing blindly, since I'm not very familiar with pandas. Maybe try something like: pos_tag_sents(map(word_tokenize, dfData['SourceText'].values())) - Iulius Curt

2

Assign this to your new column:

dfData['POSTags'] = pos_tag_sents(dfData['SourceText'].apply(word_tokenize).tolist())

Content provided by Stack Overflow.