Cleaning a .txt file and counting the most common words

Question


python string python-2.7 word-count

3

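I need to:

1) Clean a .txt file using a list of stop words, which I have in a separate .txt file.

2) After that, count the 25 most common words.

This is the solution I came up with for the first part: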

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-

import re
from collections import Counter

f=open("text_to_be_cleaned.txt")
txt=f.read()
with open("stopwords.txt") as f:
    stopwords = f.readlines()
stopwords = [x.strip() for x in stopwords]

querywords = txt.split()
resultwords  = [word for word in querywords if word.lower() not in stopwords]
cleantxt = ' '.join(resultwords)

For the second part, I am using the following code:

words = re.findall(r'\w+', cleantxt)
lower_words = [word.lower() for word in words]
word_counts = Counter(lower_words).most_common(25)
top25 = word_counts[:25]

print top25

The source file to be cleaned looks like this:

(b)

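in the first sentence of the second paragraph, insert after the words "and to the High Representative"; in the second sentence, replace "It shall hold an annual debate" with "Twice a year it shall hold a debate", and insert "including the common security and defence policy" at the end.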

The stop word list looks like this: this thises they thee the then thence thenest thener them

When I run everything, the output still contains words from the stop word list:
[('article', 911), ('european', 586), ('the', 586), ('council', 569), ('union', 530), ('member', 377), ('states', 282), ('parliament', 244), ('commission', 230), ('accordance', 217), ('treaty', 187), ('in', 174), ('procedure', 161), ('policy', 137), ('cooperation', 136), ('legislative', 136), ('acting', 130), ('act', 125), ('amended', 125), ('state', 123), ('provisions', 115), ('security', 113), ('measures', 111), ('adopt', 109), ('common', 108)]

As you may have noticed, I have only just started learning Python, so I would really appreciate explanations that are easy to follow! :)

The files used can be found here:

Stop word list

File to be cleaned

Edit: Added examples of the source file, the stop word file, and the output. Provided the source files.

- Cold2Breath

1

By the way: I don't think you need the [x.strip() ... comprehension. It is redundant. - sshashank124

1

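@sshashank124 Without x.strip, each line would still end with a whitespace character such as \n, right? - Cold2Breath

Your code looks correct. Are the stop words already lowercase? - James

@James Yes, the stop words are all lowercase. It is essentially a list of words, one per line, in a separate .txt file. - Cold2Breath

@tobias_k To be honest, that is something I found on Stack Overflow. - Cold2Breath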


2 Answers

1

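Your code is almost there. The main mistake is that you run the regular expression \w+ to pick out words only after the text has already been tokenized with str.split and filtered. That does not work, because punctuation is still attached to the tokens that str.split produces, so they never match the stop words. Try the following code instead.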

import re
from collections import Counter

with open('treaty_of_lisbon.txt', encoding='utf8') as f:
    target_text = f.read()

with open('terrier-stopwords.txt', encoding='utf8') as f:
    stop_word_lines = f.readlines()

target_words = re.findall(r'[\w-]+', target_text.lower())
stop_words = set(map(str.strip, stop_word_lines))

interesting_words = [w for w in target_words if w not in stop_words]
interesting_word_counts = Counter(interesting_words)

print(interesting_word_counts.most_common(25))
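
As a rough illustration of why tokenizing with the regular expression before filtering matters, here is a small sketch. The sample sentence and the three stop words are made up for the example and stand in for the real files.

```python
import re
from collections import Counter

# Hypothetical sample text and stop words, standing in for the real files.
sample_text = "Article 5 (the Treaty) and the co-operation of the Member States."
stop_words = {"the", "and", "of"}

# str.split leaves punctuation attached, so "(the" is a token
# and would not match the stop word "the".
split_tokens = sample_text.lower().split()

# [\w-]+ keeps only word characters and hyphens: punctuation is dropped,
# while a hyphenated word such as "co-operation" stays in one piece.
regex_tokens = re.findall(r"[\w-]+", sample_text.lower())

counts = Counter(w for w in regex_tokens if w not in stop_words)

print(split_tokens)
print(regex_tokens)
print(counts)
```

With str.split, the token "(the" slips past the filter, and a later \w+ pass turns it back into "the", which is presumably how stop words ended up in the original output.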

- Jared Goguen

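Thank you very much, this works! If I may, I'd like to ask a couple of questions so that I can understand what you did: 1) Is the encoding strictly necessary? It raises an error, and the code seems to work fine without it. 2) What are you doing here: set(map(str.strip, stop_word_lines))? - Cold2Breath

(1) No, I only added it so that the code runs on Python 3.X. (2) It is equivalent to set([word.strip() for word in stop_word_lines]). - Jared Goguen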
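
The equivalence mentioned in point (2) can be checked with a short sketch. The list below is a made-up stand-in for what f.readlines() returns, with the trailing newline still attached to each line.

```python
# Hypothetical lines as f.readlines() would return them, newline included.
stop_word_lines = ["this\n", "they\n", "the\n", "the\n"]

# map applies str.strip to every line; set removes duplicates.
via_map = set(map(str.strip, stop_word_lines))

# The same result written as a list comprehension.
via_comprehension = set([word.strip() for word in stop_word_lines])

print(via_map == via_comprehension)
print(sorted(via_map))
```

Both forms remove the newline that readlines leaves on each line, which is why the strip step is not redundant.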

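Well, this is basically exactly what I did. I don't know why my code works here but not for the OP… - tobias_k

@Tobias To be honest, it is probably because this was given as a single block of code rather than as separate lines broken up by explanations. The OP may have made a mistake when trying to apply your comments to their code. - Jared Goguen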


- tobias_k · Accepted Answer

This is just a guess, but I think the problem is here:

querywords = txt.split()

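You only split the text on whitespace, which means some stop words may still be stuck to punctuation and therefore will not be filtered out in the next step.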

>>> text = "Text containing stop words like a, the, and similar"
>>> stopwords = ["a", "the", "and"]
>>> querywords = text.split()
>>> cleantxt = ' '.join(w for w in querywords if w not in stopwords)
>>> cleantxt
'Text containing stop words like a, the, similar'

Instead, you could use re.findall, as you do later in your code:

>>> querywords = re.findall(r"\w+", text)
>>> cleantxt = ' '.join(w for w in querywords if w not in stopwords)
>>> cleantxt
'Text containing stop words like similar'

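Note that this will split compound words such as "re-arranged" into "re" and "arranged". If that is not what you want, you can instead split on whitespace only and then strip (some) punctuation characters (there may be more of them in the text).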

querywords = [w.strip(" ,.-!?") for w in txt.split()]

Changing just that one line seems to fix the problem for the input files you provided.
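
To make the trade-off between the two tokenizing approaches concrete, here is a small side-by-side sketch. The sentence and stop words are invented for the example, and the parentheses added to the strip characters are an assumption about which punctuation the text contains.

```python
import re

# Hypothetical sample sentence and stop words.
text = "The clause was re-arranged, amended and (finally) adopted."
stopwords = {"the", "was", "and"}

# Option 1: \w+ drops punctuation but splits "re-arranged" in two.
regex_words = [w for w in re.findall(r"\w+", text.lower()) if w not in stopwords]

# Option 2: split on whitespace, then trim punctuation from both ends.
# The set of characters to strip is an assumption; real text may need more.
stripped_words = [w.strip(" ,.-!?()") for w in text.lower().split()]
stripped_words = [w for w in stripped_words if w and w not in stopwords]

print(regex_words)
print(stripped_words)
```

The second option keeps "re-arranged" whole, but only handles the punctuation characters you list explicitly.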

The rest looks okay, though there are a few minor points:

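- You should convert stopwords to a set, so that lookup is O(1) instead of O(n).
- Make sure to lower the stop words, if they are not already.
- There is no need to ' '.join the cleaned text if you intend to split it again afterwards.
- top25 = word_counts[:25] is redundant; the list already has at most 25 elements.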
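
Putting these minor points together, one possible shape for the whole task is sketched below. The function name top_words and the inline sample data are hypothetical; in practice the text and the stop word lines would come from the two files. The sketch uses Python 3 syntax, whereas the question uses Python 2.7.

```python
import re
from collections import Counter


def top_words(text, stopword_lines, n=25):
    """Return the n most common words in text that are not stop words."""
    # A set gives O(1) membership tests; lower() guards against mixed case.
    stopwords = {line.strip().lower() for line in stopword_lines}
    # Tokenize once and filter the list directly, with no ' '.join round trip.
    words = re.findall(r"\w+", text.lower())
    kept = [w for w in words if w not in stopwords]
    # most_common(n) already returns at most n entries, so no slicing is needed.
    return Counter(kept).most_common(n)


# Hypothetical sample data standing in for the two files.
sample = "The Council and the Commission. The Council, acting with the Parliament."
sample_stopwords = ["the\n", "and\n", "with\n"]

print(top_words(sample, sample_stopwords, n=1))
```

As noted above, \w+ splits hyphenated words, so swap in a different tokenizer if compounds should stay intact.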