Removing words from a list of text strings

3

I am trying to remove certain words (in addition to using stopwords) from a list of text strings, but for some reason it does not work.

documents = ["Human machine interface for lab abc computer applications",
         "A survey of user opinion of computer system response time",
         "The EPS user interface management system",
         "System and human system engineering testing of EPS",
         "Relation of user perceived response time to error measurement",
         "The generation of random binary unordered trees",
         "The intersection graph of paths in trees",
         "Graph minors IV Widths of trees and well quasi ordering",
         "Graph minors A survey"]

exclude = ['am', 'there','here', 'for', 'of', 'user']

new_doc = [word for word in documents if word not in exclude]

print new_doc

Output

['Human machine interface for lab abc computer applications', 'A survey of user opinion of computer system response time', 'The EPS user interface management system', 'System and human system engineering testing of EPS', 'Relation of user perceived response time to error measurement', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors IV Widths of trees and well quasi ordering', 'Graph minors A survey']

As you can see, the words in exclude are not removed from documents ("for" is a good example).

It does work with this statement:

new_doc = [word for word in str(documents).split() if word not in exclude]

But how do I get the original elements (now "cleaned") back into documents after the cleaning?

Thanks a lot for your help!


1
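word is not a word, it is a whole line (e.g. "Human machine interface for lab abc computer applications"), so it will never be in exclude. - jonrsharpe
@jonrsharpe - I just added a correction, but the problem remains (slightly different). - Toly
2 Answers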
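
The point made in the first comment can be checked directly. The snippet below is a minimal sketch using two of the question's strings; the variable names unchanged, whole_line_excluded, word_excluded and flat are introduced here for illustration only. It also shows why the str(documents).split() attempt loses the document boundaries: the list is flattened into one run of tokens, and the list's own brackets and quotes end up glued to the first and last words.

```python
documents = ["Human machine interface for lab abc computer applications",
             "Graph minors A survey"]
exclude = ['am', 'there', 'here', 'for', 'of', 'user']

# Each item is a whole sentence, so it never equals a single excluded word
# and nothing is filtered out.
unchanged = [item for item in documents if item not in exclude]

# `in` on a list tests equality with whole elements, not substring containment.
whole_line_excluded = documents[0] in exclude   # False
word_excluded = "for" in exclude                # True

# str(documents) is the printed form of the whole list, so splitting it
# yields one flat token list that still carries brackets and quotes.
flat = str(documents).split()
print(flat[0], flat[-1])
```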

3

You should split each line into words before filtering the text:

new_doc = [' '.join([word for word in line.split() if word not in exclude]) for line in documents]
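
As a possible variation on the line above (not part of the original answer), the sketch below stores the excluded words in a set, which gives constant-time membership tests on average, and lowercases each word before the test so that "For" is treated like "for". The helper name remove_words and the third sample sentence are assumptions made for illustration.

```python
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "For the user"]  # hypothetical sentence, not from the question

# A set makes `in` a hash lookup instead of a scan over a list.
exclude = {'am', 'there', 'here', 'for', 'of', 'user'}

def remove_words(line, excluded):
    """Return line without the words found in excluded (case-insensitive)."""
    return ' '.join(word for word in line.split()
                    if word.lower() not in excluded)

new_doc = [remove_words(line, exclude) for line in documents]
print(new_doc)
```

Words with attached punctuation (for example "user,") would still slip through, because split() only separates on whitespace.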

I just made a correction, but the problem remains (slightly different). Got it! Thanks! - Toly

1
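You are iterating over the sentences, not the words. To filter words, you need to split each sentence, use a nested loop to iterate over its words and filter them, and then join the result.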
>>> new_doc = [' '.join([word for word in sent.split() if word not in exclude]) for sent in documents]
>>> 
>>> new_doc
['Human machine interface lab abc computer applications', 'A survey opinion computer system response time', 'The EPS interface management system', 'System and human system engineering testing EPS', 'Relation perceived response time to error measurement', 'The generation random binary unordered trees', 'The intersection graph paths in trees', 'Graph minors IV Widths trees and well quasi ordering', 'Graph minors A survey']
>>> 

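Alternatively, instead of a nested list comprehension with splitting and filtering, you can use a regex to replace the exclude words with an empty string, using the re.sub function: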

>>> import re
>>> 
>>> new_doc = [re.sub(r'|'.join(exclude),'',sent) for sent in documents]
>>> new_doc
['Human machine interface  lab abc computer applications', 'A survey   opinion  computer system response time', 'The EPS  interface management system', 'System and human system engineering testing  EPS', 'Relation   perceived response time to error measurement', 'The generation  random binary unordered trees', 'The intersection graph  paths in trees', 'Graph minors IV Widths  trees and well quasi ordering', 'Graph minors A survey']
>>> 

r'|'.join(exclude) joins the words with a pipe, which means logical OR in a regular expression.
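
One caveat with a plain alternation is that it matches anywhere in the string, including inside longer words, and it leaves doubled spaces where a word was removed. The sketch below is one way to tighten it, not the answerer's code: it wraps the alternation in \b word boundaries, escapes each word with re.escape, and collapses the leftover whitespace. The names pattern, remove_words and naive, and the third sample sentence, are assumptions for illustration.

```python
import re

exclude = ['am', 'there', 'here', 'for', 'of', 'user']
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The program performs well for the user"]  # hypothetical sentence

# \b ... \b restricts matches to whole words; re.escape guards against
# excluded words that contain regex metacharacters.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, exclude)) + r')\b')

def remove_words(sentence):
    """Delete excluded whole words, then collapse the leftover whitespace."""
    stripped = pattern.sub('', sentence)
    return ' '.join(stripped.split())

new_doc = [remove_words(sent) for sent in documents]

# For comparison: without boundaries, "am" and "for" are also removed
# from inside "program" and "performs".
naive = re.sub('|'.join(exclude), '', documents[2])

print(new_doc)
print(repr(naive))
```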


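Great! In your opinion, which method is more efficient for large texts? - Toly
@Toly Yes, I think so too. - Mazdak
Would you use regex or a nested comprehension when working with large text files? - Toly
@Toly Using regex is more efficient than splitting, looping and filtering. - Mazdak
@Toly You can use the timeit module to benchmark the two approaches. - Mazdak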
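
A benchmark along the lines suggested in the last comment could look like the sketch below. It is only a starting point under stated assumptions: the function names with_split and with_regex, the repetition factor of 1000, and number=20 are arbitrary choices made here, and the regex uses word boundaries rather than the plain alternation from the answer. Which approach is faster may depend on the Python version, the length of the exclude list and the texts themselves, so it is worth measuring on your own data.

```python
import re
import timeit

exclude = ['am', 'there', 'here', 'for', 'of', 'user']
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time"] * 1000

exclude_set = set(exclude)
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, exclude)) + r')\b')

def with_split(docs):
    """Split each document, drop excluded words, and join the rest."""
    return [' '.join(w for w in d.split() if w not in exclude_set)
            for d in docs]

def with_regex(docs):
    """Delete excluded whole words with one precompiled pattern."""
    return [pattern.sub('', d) for d in docs]

# timeit.timeit runs the callable `number` times and returns total seconds.
split_time = timeit.timeit(lambda: with_split(documents), number=20)
regex_time = timeit.timeit(lambda: with_regex(documents), number=20)

print("split/filter: %.4f s" % split_time)
print("regex:        %.4f s" % regex_time)
```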
