如何使用Python在文件中找到一组最频繁出现的词对？

Question

如何使用Python在文件中找到一组最频繁出现的词对？

5

我有一个数据集如下：

"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Statistics", "Estimation"
"633","MachineLearning","AI","I like Cars, but I also like bikes"
"717","I like Sheen","MachineLearning", "regression", "AI"
"136","MachineLearning","AI","TopGear"

等等之类的

我想要找出最常出现的词对，例如：

(Statistics,Estimation:2)
(Statistics,Narnia:2)
(Narnia,Statistics)
(MachineLearning,AI:3)

这两个单词可以以任何顺序出现，并且可以相距任意远。

有人能在Python中提供可能的解决方案吗？这是一个非常大的数据集。

非常感谢任何建议。

所以，这是我在@275365的建议下尝试的：

@275365 我尝试了从文件读取输入的以下方法：

    def collect_pairs(file):
        pair_counter = Counter()
        for line in open(file):
            unique_tokens = sorted(set(line))  
            combos = combinations(unique_tokens, 2)
            pair_counter += Counter(combos)
            print pair_counter

    file = ('myfileComb.txt')
    p=collect_pairs(file)

文本文件与原始文件具有相同数量的行，但每行仅具有唯一标记。我不知道我错在哪里，因为当我运行此文件时，它会将单词拆分为字母，而不是按预期输出单词组合。当我运行此文件时，它输出了拆分的字母，而不是单词组合，我不知道我犯了什么错误。

- ajaanbaahu

'two and a half men' 是一个 token 还是五个 token？ - roippi

它是一个标记，任何在“”中的内容都是一个单一的标记。 - ajaanbaahu

1

你尝试过什么？ - RyPeck

使用暴力解决方案是一种选项，其中我计算唯一令牌的可能排列数量，并运行整个数据集。但即使数据集稍微大一点，这种方法也会失败。 - ajaanbaahu

“could be...at any distance from each other”是什么意思？在你的示例数据集中，(Statistics,TopGear)是一对吗？或者我们可以假设只有同一行中的单词可以成对出现？ - steveha

显示剩余2条评论

2个回答

0

除了计算所有成对的词语，你没有太多可以做的。

明显的优化方法是尽早删除重复单词和同义词，执行词干提取（任何减少不同标记数量的方法都是好的！），并且仅计算 (a,b) 中的成对词语，其中 a<b （在你的例子中，只计算 statistics,narnia 或 narnia,statistics 中的一个，而不是两个都计算！）。

如果内存不足，请执行两次遍历。在第一次遍历中，使用一个或多个哈希函数获取候选过滤器。在第二次遍历中，仅计算通过此过滤器的单词（MinHash / LSH 样式过滤）。

这是一个天真的并行问题，因此也很容易分配到多个线程或计算机上。

- Has QUIT--Anony-Mousse

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Justin O Barber · Accepted Answer

你可以根据你的语料库大小从以下几个方面开始：

>>> from itertools import combinations
>>> from collections import Counter

>>> def collect_pairs(lines):
    pair_counter = Counter()
    for line in lines:
        unique_tokens = sorted(set(line))  # exclude duplicates in same line and sort to ensure one word is always before other
        combos = combinations(unique_tokens, 2)
        pair_counter += Counter(combos)
    return pair_counter

结果如下：

>>> t2 = [['485', 'AlterNet', 'Statistics', 'Estimation', 'Narnia', 'Two and half men'], ['717', 'I like Sheen', 'Narnia', 'Statistics', 'Estimation'], ['633', 'MachineLearning', 'AI', 'I like Cars, but I also like bikes'], ['717', 'I like Sheen', 'MachineLearning', 'regression', 'AI'], ['136', 'MachineLearning', 'AI', 'TopGear']]
>>> pairs = collect_pairs(t2)
>>> pairs.most_common(3)
[(('MachineLearning', 'AI'), 3), (('717', 'I like Sheen'), 2), (('Statistics', 'Estimation'), 2)]

你是否希望这些组合中包含数字？由于你没有明确提到要排除它们，因此我在此处已经包含了它们。

编辑：使用文件对象

你在上面发布的函数非常接近工作状态。唯一需要做的是将每行（字符串）更改为元组或列表。假设你的数据看起来与上面发布的数据完全相同（每个术语周围都有引号，逗号分隔术语），我建议一个简单的解决方法：你可以使用 ast.literal_eval。（否则，你可能需要使用某种正则表达式。）请参见下面的修改版本，其中包含 ast.literal_eval:

from itertools import combinations
from collections import Counter
import ast

def collect_pairs(file_name):
    pair_counter = Counter()
    for line in open(file_name):  # these lines are each simply one long string; you need a list or tuple
        unique_tokens = sorted(set(ast.literal_eval(line)))  # eval will convert each line into a tuple before converting the tuple to a set
        combos = combinations(unique_tokens, 2)
        pair_counter += Counter(combos)
    return pair_counter  # return the actual Counter object

现在您可以像这样测试它：

file_name = 'myfileComb.txt'
p = collect_pairs(file_name)
print p.most_common(10)  # for example