Efficient 1-5 gram extraction in Python


I have a huge file of 3,000,000 lines, and each line has 20-40 words. I have to extract 1 to 5 ngrams from this corpus. My input files are tokenized plain text, e.g.:

This is a foo bar sentence .
There is a comma , in this sentence .
Such is an example text .

Currently, I am doing it as below, but this doesn't seem to be an efficient way to extract the 1-5 grams:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import io, os
from collections import Counter
import sys; reload(sys); sys.setdefaultencoding('utf-8')

with io.open('train-1.tok.en', 'r', encoding='utf8') as srcfin, \
io.open('train-1.tok.jp', 'r', encoding='utf8') as trgfin:
    # Extract words from file. 
    src_words = ['<s>'] + srcfin.read().replace('\n', ' </s> <s> ').split()
    del src_words[-1] # Removes the final '<s>'
    trg_words = ['<s>'] + trgfin.read().replace('\n', ' </s> <s> ').split()
    del trg_words[-1] # Removes the final '<s>'

    # Unigrams count.
    src_unigrams = Counter(src_words) 
    trg_unigrams = Counter(trg_words) 
    # Sum of unigram counts.
    src_sum_unigrams = sum(src_unigrams.values())
    trg_sum_unigrams = sum(trg_unigrams.values())

    # Bigrams count.
    src_bigrams = Counter(zip(src_words,src_words[1:]))
    trg_bigrams = Counter(zip(trg_words,trg_words[1:]))
    # Sum of bigram counts.
    src_sum_bigrams = sum(src_bigrams.values())
    trg_sum_bigrams = sum(trg_bigrams.values())

    # Trigrams count.
    src_trigrams = Counter(zip(src_words,src_words[1:], src_words[2:]))
    trg_trigrams = Counter(zip(trg_words,trg_words[1:], trg_words[2:]))
    # Sum of trigram counts.
    src_sum_trigrams = sum(src_trigrams.values())
    trg_sum_trigrams = sum(trg_trigrams.values())

Are there any other more efficient ways to do this?

How do I optimally extract n-grams of different lengths simultaneously?

From Fast/Optimize N-gram implementations in python, essentially it is this:

zip(*[words[i:] for i in range(n)])

which, when hard-coded for bigrams, n=2, is:

zip(src_words,src_words[1:])

and this for trigrams, n=3:

zip(src_words,src_words[1:],src_words[2:])
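
For illustration, that general form can be wrapped in a small helper that builds the Counters for all orders 1-5 in a single call over a token list (a minimal sketch; the name ngram_counts is just illustrative):

from collections import Counter

def ngram_counts(words, max_n=5):
    # One Counter per order 1..max_n, built with the zip idiom above.
    return {n: Counter(zip(*[words[i:] for i in range(n)]))
            for n in range(1, max_n + 1)}

# e.g. counts = ngram_counts(src_words); counts[3] is the trigram Counter.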

What is the format of the input file? - mbatchkarov
3 Answers

If you are interested only in the most frequent n-grams (which I suppose is your case), you can reuse the central idea of the Apriori algorithm. Given s_min, a minimal support understood as the number of lines that a given n-gram is contained in, the algorithm efficiently searches for all such n-grams.
The idea is as follows: write a query function which takes an n-gram and tests how many times it is contained in the corpus. After you have such a function prepared (it may be optimized as discussed later), scan the whole corpus, collect all the 1-grams, i.e. bare tokens, and select those that are contained at least s_min times. This gives you the subset F1 of frequent 1-grams. Then test all the possible 2-grams by combining the 1-grams from F1. Again, select those that satisfy the s_min criterion and you get F2. By combining all the 2-grams from F2 and selecting the frequent 3-grams, you get F3. Repeat for as long as Fn is non-empty.
Many optimizations can be made here. When combining n-grams from Fn, you can exploit the fact that n-grams x and y may only be combined to form an (n+1)-gram if x[1:] == y[:-1] (which can be checked in constant time for any n if proper hashing is used). Moreover, if you have enough RAM (for your corpus, many GBs), you can speed up the query function dramatically: for each 1-gram, store a hash-set of the indices of the lines containing that 1-gram. When combining two n-grams into an (n+1)-gram, use the intersection of the two corresponding sets to obtain the set of lines where the (n+1)-gram may be contained.
The time complexity grows as s_min decreases. The beauty is that infrequent (and hence uninteresting) n-grams are completely filtered out while the algorithm runs, so computation time is spent only on the frequent ones.
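
A minimal sketch of that idea, assuming the whole corpus fits in memory as a list of lines (the function name frequent_ngrams and the in-memory line-index sets are illustrative choices, not part of the answer):

from collections import defaultdict

def frequent_ngrams(lines, s_min, max_n=5):
    # Returns every n-gram (n <= max_n) contained in at least s_min lines,
    # mapped to the set of line indices that contain it.
    tokenized = [line.split() for line in lines]

    # F1: frequent 1-grams, each with the set of lines containing it.
    occurrences = defaultdict(set)
    for idx, tokens in enumerate(tokenized):
        for tok in tokens:
            occurrences[(tok,)].add(idx)
    frequent = {g: rows for g, rows in occurrences.items() if len(rows) >= s_min}
    result = dict(frequent)

    for n in range(2, max_n + 1):
        grams = list(frequent)
        next_frequent = {}
        for x in grams:
            for y in grams:
                # x and y can only combine into an n-gram if they overlap.
                if x[1:] != y[:-1]:
                    continue
                cand = x + y[-1:]
                rows = frequent[x] & frequent[y]   # lines that *might* contain cand
                if len(rows) < s_min:
                    continue
                # Verify the candidate against the actual text of those lines.
                true_rows = {i for i in rows
                             if any(tuple(tokenized[i][j:j + n]) == cand
                                    for j in range(len(tokenized[i]) - n + 1))}
                if len(true_rows) >= s_min:
                    next_frequent[cand] = true_rows
        if not next_frequent:
            break
        frequent = next_frequent
        result.update(frequent)
    return result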

Assuming you don't want to count ngrams across lines, and assuming naive tokenization:
import collections

def ngrams(n, f):
    deque = collections.deque(maxlen=n)
    for line in f:
        deque.clear()
        words = ["<s>"] + line.split() + ["</s>"]
        deque.extend(words[:n-1]) # pre-seed so 5-gram counter doesn't count incomplete 5-grams
        for word in words[n-1:]:
            deque.append(word)
            yield tuple(str(w) for w in deque) # n-gram tokenization

counters = [collections.Counter(ngrams(i, open('somefile.txt'))) for i in range(1, 6)]

Edit: added beginning/end-of-line tokens.

I also think the resulting data object should be as sparse as possible. 3 million lines of 40 words is roughly 120 million tokens. With about 1 million words in English (though not all of them in common use), you will probably get a rather long tail. If you can imagine your data to be exchangeable/iid, then you can add some pruning in the middle:

def ngrams(n, f, prune_after=10000):
    counter = collections.Counter()
    deque = collections.deque(maxlen=n)
    for i, line in enumerate(f):
        deque.clear()
        words = ["<s>"] + line.split() + ["</s>"]
        deque.extend(words[:n-1])
        for word in words[n-1:]:
            deque.append(word)
            ngram = tuple(str(w) for w in deque)
            # after the first prune_after lines, only count ngrams already seen
            if i < prune_after or ngram in counter:
                counter[ngram] += 1
    return counter

Relaxing the exchangeability assumption would require something like Tregoreg's answer for efficient pruning, but in most cases exchangeability should hold.

As far as raw speed is concerned, I think zip (like the original code) vs. deque is the fundamental question. zip removes the innermost loop, so it is likely already very fast. deque requires the innermost loop, but it also consumes the data iteratively, so its working memory footprint should be much smaller. Which is better will likely depend on your machine, but I'd imagine that for large machines/small data zip would be faster, whereas once you start running out of memory (especially if you start talking about pruning), deque gains more advantages.
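
For comparison, a per-line zip variant keeps the same streaming behaviour as the deque version above while dropping the innermost Python loop (just a sketch; ngrams_zip is an illustrative name and it assumes the same <s>/</s> convention):

import collections

def ngrams_zip(n, f):
    # Yields the same n-grams as ngrams() above, one line at a time,
    # so memory stays bounded by the longest line.
    for line in f:
        words = ["<s>"] + line.split() + ["</s>"]
        for gram in zip(*[words[i:] for i in range(n)]):
            yield gram

# counter = collections.Counter(ngrams_zip(3, open('somefile.txt')))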


He added those tokens because it is useful to have extra "words" at the beginning and end of a sentence. This is very useful for language models, for example. If you want to know more, see the section on "language modeling" in this Coursera course. - HerrKaputt
That makes sense; I'll think of it as a modified n-gram then. - metaperture

I am going to give you a bunch of pointers about the general problems you are trying to solve. One or more of these should be useful for you and help you figure this out.
For what you are currently doing (I'm guessing some kind of machine translation experiment), you don't really need to load the two files srcfin and trgfin into memory at the same time (at least not for the code sample you have provided). Processing them separately will be cheaper in terms of the amount of data you need to hold in memory at a given time.
You are reading a lot of data into memory, processing it (which takes even more memory), and then holding the results in some in-memory data structures. Instead of doing that, you should strive to be lazier. Learn about Python generators and write a generator that streams out all the ngrams from a given text without needing to hold the entire text in memory at any given point in time. The itertools Python package will probably come in handy while writing this.
Beyond a point, it will no longer be feasible for you to hold all this data in memory. You should consider using map-reduce to help you break this down. Check out the mrjob Python package, which lets you write MapReduce jobs in Python: in the mapper step you break the text into its ngrams, and in the reducer step you count the number of times you see each ngram to get its overall count. mrjob can also be run locally, which obviously won't give you any parallelization benefits, but is nice because mrjob will still do a lot of the heavy lifting for you.
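
A rough sketch of what such a job could look like with mrjob (the class name MRNgramCount is made up for illustration, and the mapper reuses the zip idiom from the question):

from mrjob.job import MRJob

class MRNgramCount(MRJob):

    def mapper(self, _, line):
        # Emit every 1-5 gram of the line with a count of 1.
        words = ['<s>'] + line.split() + ['</s>']
        for n in range(1, 6):
            for gram in zip(*[words[i:] for i in range(n)]):
                yield ' '.join(gram), 1

    def reducer(self, ngram, counts):
        # Sum the per-line counts to get the overall count of each ngram.
        yield ngram, sum(counts)

if __name__ == '__main__':
    MRNgramCount.run()
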
If you are forced to hold all the counts in memory at the same time (for a massive amount of text), then either implement some pruning strategy to prune out very rare ngrams, or consider using a persistent file-backed lookup table such as sqlite to hold all the data for you.
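
As a sketch of the sqlite option (the table layout and the helper persist_counts are just one possible design; the upsert syntax needs SQLite >= 3.24):

import sqlite3

def persist_counts(counter, db_path='ngrams.db'):
    # Flush an in-memory Counter of n-gram tuples into a sqlite table,
    # accumulating counts across repeated flushes.
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS ngrams (gram TEXT PRIMARY KEY, count INTEGER)')
    conn.executemany(
        'INSERT INTO ngrams (gram, count) VALUES (?, ?) '
        'ON CONFLICT(gram) DO UPDATE SET count = count + excluded.count',
        ((' '.join(gram), c) for gram, c in counter.items()))
    conn.commit()
    conn.close()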
