输入
$ cat test.csv
ID,Task,label,Text
1,Collect Information,no response,cozily married practical athletics Mr. Brown flat
2,New Credit,no response,active married expensive soccer Mr. Chang flat
3,Collect Information,response,healthy single expensive badminton Mrs. Green flat
4,Collect Information,response,cozily married practical soccer Mr. Brown hierachical
5,Collect Information,response,cozily single practical badminton Mr. Brown flat
简而言之:
>>> from nltk import word_tokenize, pos_tag, pos_tag_sents
>>> import pandas as pd
>>> df = pd.read_csv('test.csv', sep=',')
>>> df['Text']
0 cozily married practical athletics Mr. Brown flat
1 active married expensive soccer Mr. Chang flat
2 healthy single expensive badminton Mrs. Green ...
3 cozily married practical soccer Mr. Brown hier...
4 cozily single practical badminton Mr. Brown flat
Name: Text, dtype: object
>>> texts = df['Text'].tolist()
>>> tagged_texts = pos_tag_sents(map(word_tokenize, texts))
>>> tagged_texts
[[('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('athletics', 'NNS'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')], [('active', 'JJ'), ('married', 'VBD'), ('expensive', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Chang', 'NNP'), ('flat', 'JJ')], [('healthy', 'JJ'), ('single', 'JJ'), ('expensive', 'JJ'), ('badminton', 'NN'), ('Mrs.', 'NNP'), ('Green', 'NNP'), ('flat', 'JJ')], [('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('hierachical', 'JJ')], [('cozily', 'RB'), ('single', 'JJ'), ('practical', 'JJ'), ('badminton', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')]]
>>> df['POS'] = tagged_texts
>>> df
ID Task label \
0 1 Collect Information no response
1 2 New Credit no response
2 3 Collect Information response
3 4 Collect Information response
4 5 Collect Information response
Text \
0 cozily married practical athletics Mr. Brown flat
1 active married expensive soccer Mr. Chang flat
2 healthy single expensive badminton Mrs. Green ...
3 cozily married practical soccer Mr. Brown hier...
4 cozily single practical badminton Mr. Brown flat
POS
0 [(cozily, RB), (married, JJ), (practical, JJ),...
1 [(active, JJ), (married, VBD), (expensive, JJ)...
2 [(healthy, JJ), (single, JJ), (expensive, JJ),...
3 [(cozily, RB), (married, JJ), (practical, JJ),...
4 [(cozily, RB), (single, JJ), (practical, JJ), ...
简述:
首先,您可以将Text
列提取为字符串列表:
texts = df['Text'].tolist()
然后您可以应用word_tokenize
函数:
map(word_tokenize, texts)
请注意,@Boud的建议几乎相同,使用
df.apply
:
df['Text'].apply(word_tokenize)
然后您将分词后的文本转储到一个字符串列表的列表中:
df['Text'].apply(word_tokenize).tolist()
那么你可以使用
pos_tag_sents
:
pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )
然后您将该列重新添加到DataFrame中:
df['POS'] = pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )