Using a generator with TfidfVectorizer on a large corpus

I have a large corpus split into 5K files, and I'm trying to build an IDF-based vocabulary using the TF-IDF transformation.

Here is the code: basically I have an iterator that loops over the .tsv files in a directory, reads each file, and yields its contents.

import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import pandas as pd
import numpy as np
import os
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def make_corpus():
    inputFeatureFiles = [x for x in os.listdir('C:\Folder') if x.endswith("*.tsv")]
    for file in inputFeatureFiles:
        filePath= 'C:\\' + os.path.splitext(file)[0] + ".tsv"
        with open(filePath, 'rb') as infile:
            content = infile.read()
            yield content 

corpus = make_corpus()
vectorizer = TfidfVectorizer(stop_words='english',use_idf=True, max_df=0.7, smooth_idf=True)

vectorizer.fit_transform(corpus)

This produces the following error:
c:\python27\lib\site-packages\sklearn\feature_extraction\text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
    809             vocabulary = dict(vocabulary)
    810             if not vocabulary:
--> 811                 raise ValueError("empty vocabulary; perhaps the documents only"
    812                                  " contain stop words")
    813 

ValueError: empty vocabulary; perhaps the documents only contain stop words

I also tried this:
corpusGenerator = [open(os.path.join('C:\\CorpusFiles', f)) for f in os.listdir('C:\\CorpusFiles')]
vectorizer = TfidfVectorizer(stop_words='english', use_idf=True, smooth_idf=True, sublinear_tf=True, input="file", min_df=1)
feat = vectorizer.fit_transform(corpusGenerator)

and got this error:

[Errno 24] Too many open files: 'C:\CorpusFiles\file1.tsv'

What is the best way to use TfidfVectorizer on a large corpus? I also tried appending a constant string to each yielded string to avoid the first error, but that didn't solve it either. Any help is appreciated!


It works when I step through TfidfVectorizer in the debugger, but when I call it as a function it throws the same exception. - Umer
1 Answer

Hey, I recently looked into the same problem. In my experience, you could try the following demo code:

import glob
from sklearn.feature_extraction.text import TfidfVectorizer

# glob pattern matching your data files, e.g. r"C:\CorpusFiles\*.tsv"
all_files_path = glob.glob(path_to_the_dir_of_your_data_files)

def fit_iterator():
    for file_path in all_files_path:
        with open(file_path, "r", encoding="utf-8") as file:
            for line in file:
                yield line  # make sure each line is a str
                            # representing a single sample

corpus = fit_iterator()
tfidf = TfidfVectorizer()
tfidf.fit(corpus)
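
The iterator above treats every line of every file as its own document. If each .tsv file should count as a single document instead, you can also sidestep both errors from the question by passing file paths rather than contents or open file objects: with input="filename", scikit-learn opens and reads each file itself, one at a time, so nothing is opened eagerly the way the list comprehension in the second attempt did. A minimal sketch, assuming each file is one UTF-8 document and using C:\CorpusFiles as a placeholder directory:

import glob
from sklearn.feature_extraction.text import TfidfVectorizer

# Collect paths only; with input="filename", scikit-learn opens and
# reads each file itself, one at a time, so the open-file limit is
# never hit and the corpus is never held in memory all at once.
file_paths = glob.glob(r"C:\CorpusFiles\*.tsv")  # placeholder location

tfidf = TfidfVectorizer(input="filename", encoding="utf-8",
                        stop_words="english", max_df=0.7, smooth_idf=True)
matrix = tfidf.fit_transform(file_paths)  # one row per file

Each row of matrix then corresponds to one file, and the fitted vocabulary is available via tfidf.vocabulary_.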

Good luck!

