TfidfVectorizer in scikit-learn: ValueError: np.nan is an invalid document

Question

python pandas machine-learning scikit-learn tf-idf

68

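I am using TfidfVectorizer from scikit-learn to extract features from text data. I have a CSV file with a Score (either +1 or -1) and a Review (text). I loaded this data into a DataFrame so that I can run the vectorizer.

Here is my code: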

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("train_new.csv",
             names = ['Score', 'Review'], sep=',')

# x = df['Review'] == np.nan
#
# print x.to_csv(path='FindNaN.csv', sep=',', na_rep = 'string', index=True)
#
# print df.isnull().values.any()

v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
x = v.fit_transform(df['Review'])

This is the traceback I get:

Traceback (most recent call last):
  File "/home/PycharmProjects/Review/src/feature_extraction.py", line 16, in <module>
    x = v.fit_transform(df['Review'])
  File "/home/b/hw1/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1305, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
    for feature in analyze(doc):
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 118, in decode
    raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode string.

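I checked the CSV file and the DataFrame for anything being read as NaN, but found nothing. There are 18000 rows, and none of them return True for isnan.

This is what df['Review'].head() looks like: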

  0    This book is such a life saver.  It has been s...
  1    I bought this a few times for my older son and...
  2    This is great for basics, but I wish the space...
  3    This book is perfect!  I'm a first time new mo...
  4    During your postpartum stay at the hospital th...
  Name: Review, dtype: object

- boltthrower

1

Could you show the head of df['Review']? This is related to the encoding of the text in the DataFrame and nothing else. - Nickil Maveli

Sure, I just edited my post. - boltthrower

And type(df['Review'].iloc[0])? - Nickil Maveli

type(df['Review'].iloc[0]) gives me <type 'str'>. - boltthrower
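
A note on the NaN check described in the question: comparing a column with == np.nan never matches, because NaN is not equal to itself, and pandas reads empty CSV fields as NaN by default. The sketch below uses a small made-up CSV (not the asker's train_new.csv) to show both effects. Whether this is what happened in the original data is an assumption.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row CSV; the second row has an empty Review field.
csv_text = "1,Great book\n-1,\n1,Would buy again\n"
df = pd.read_csv(io.StringIO(csv_text), names=["Score", "Review"], sep=",")

# Equality against np.nan is False for every row, because NaN != NaN.
eq_hits = int((df["Review"] == np.nan).sum())

# isna() is the reliable way to find missing values.
na_mask = df["Review"].isna()
na_hits = int(na_mask.sum())

print(eq_hits, na_hits)
print(df[na_mask])  # shows the offending row(s)
```

Checking only the first element with type(df['Review'].iloc[0]) would not reveal a missing value further down the column, so a column-wide isna() is the safer test.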

3 Answers

24

I found a more efficient way to solve this problem.

x = v.fit_transform(df['Review'].apply(lambda x: np.str_(x)))

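Of course, you can use df['Review'].values.astype('U') to convert the entire Series. But I found that this consumes much more memory when the Series is very large. (I tested it on a Series with 800k rows, and astype('U') consumed about 96GB of memory.)

If you instead use the lambda expression to convert only the data in the Series from str to numpy.str_, the result is also accepted by fit_transform. This is faster and does not increase memory usage.

I'm not sure why this works, because the TFIDF Vectorizer documentation page says: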

fit_transform(raw_documents, y=None)

Parameters: raw_documents : iterable

an iterable which yields either str, unicode or file objects

But in practice this iterable apparently has to yield np.str_ rather than str.

- Andy Ma

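Thank you very much for this solution. Could you explain in more detail the iterable part about np.str_ versus str? I'm still a beginner and a bit confused; I thought str was iterable? Thanks. - ML33M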
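
One plausible reading of why the np.str_ conversion helps: the traceback shows the error is raised when a document is np.nan, and np.str_(x) turns a float NaN into the string 'nan', so no NaN reaches the vectorizer. Under that reading the key point is the string conversion rather than np.str_ specifically, and plain str(x) would behave the same way. The sample Series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical reviews with one missing value.
reviews = pd.Series(["Great book", np.nan, "Would buy again"], dtype=object)

# Element-wise conversion: strings stay as they are, NaN becomes 'nan'.
converted = reviews.apply(lambda x: np.str_(x))
plain = reviews.apply(str)

print(converted.tolist())
print(bool(converted.isna().any()))  # no missing values remain
```

Note that the missing review is now the literal word "nan", which the vectorizer will treat as an ordinary token.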

10

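I was still getting a MemoryError after using .values.astype('U') on the reviews in my dataset.

So I tried .astype('U').values instead, and it worked.

This answer is from: Python: how to avoid MemoryError when transforming text data into Unicode using astype('U')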

- ashish
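
A sketch of where the memory goes with .values.astype('U'): NumPy's 'U' dtype is fixed-width, so every element is allocated as many characters as the longest string, at 4 bytes per character. The array below is made up for illustration; the sizes in a real dataset depend on its longest review.

```python
import numpy as np

# Hypothetical data: 999 short reviews and one long one.
reviews = np.array(["ok"] * 999 + ["x" * 10_000], dtype=object)

# The cast picks a fixed width large enough for the longest string.
fixed = reviews.astype("U")

print(fixed.dtype)           # width equals the longest string
print(fixed.dtype.itemsize)  # bytes reserved per element
print(fixed.nbytes)          # total bytes for the whole array
```

A single very long review therefore inflates the storage for every row. Calling .astype('U') on the pandas Series first appears to keep the values as ordinary Python strings in the pandas versions where this was reported, which would explain why .astype('U').values avoided the MemoryError; that is an inference, not something the answers here confirm.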

- Nickil Maveli · Accepted Answer

You need to convert the object dtype to unicode strings, as the traceback clearly states.

x = v.fit_transform(df['Review'].values.astype('U'))  ## Even astype(str) would work

From the TFIDF Vectorizer documentation page:

fit_transform(raw_documents, y=None)

Parameters: raw_documents : iterable
an iterable which yields either str, unicode or file objects
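
One side effect of the cast is worth knowing: a missing review becomes the string "nan", which then shows up as a token in the vocabulary. The sketch below compares the cast with an alternative that is not from this thread, namely replacing missing values with an empty string via fillna(''). The three-row Series is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical reviews with one missing value.
reviews = pd.Series(["good book", np.nan, "bad book"], dtype=object)

# Casting to unicode turns the missing value into the literal token "nan".
cast_vocab = TfidfVectorizer().fit(reviews.values.astype("U")).vocabulary_

# Filling with an empty string keeps the row but contributes no token.
filled_vocab = TfidfVectorizer().fit(reviews.fillna("")).vocabulary_

print(sorted(cast_vocab))
print(sorted(filled_vocab))
```

If rows without text are not needed at all, dropping them with dropna(subset=['Review']) before vectorizing is another option, at the cost of having to drop the matching Score values too.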