How do I pass a preprocessor to TfidfVectorizer? - sklearn - python

Question

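How do I pass a preprocessor to TfidfVectorizer? I wrote a function that takes a string and returns the preprocessed string, then set the preprocessor parameter to that function ("preprocessor = preprocess"), but it doesn't work. I have searched many times but couldn't find a single example, as if nobody uses it.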

I have another question. Does this (the preprocessor parameter) override the stop-word removal and lowercasing that the stop_words and lowercase parameters perform?

- eman

1 Answer

- David · Accepted Answer

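You simply define a function that takes a string as input and returns the preprocessed string. For example, a trivial function that converts a string to uppercase looks like this: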

def preProcess(s):
    return s.upper()

Once the function is defined, just pass it to the TfidfVectorizer object. For example:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
     'This is the first document.',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?'
     ]

X = TfidfVectorizer(preprocessor=preProcess)
X.fit(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
X.get_feature_names_out().tolist()

The result is:

['AND', 'DOCUMENT', 'FIRST', 'IS', 'ONE', 'SECOND', 'THE', 'THIRD', 'THIS']

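Even though lowercase is True (its default), the preprocessing function's conversion to uppercase overrides it, which indirectly answers your follow-up question. The documentation also mentions this: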

preprocessor : callable or None (default) Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
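To make the follow-up question concrete, here is a minimal sketch of how the two parameters behave when a preprocessor is supplied. It relies on my understanding that lowercasing belongs to the preprocessing stage (so a custom preprocessor replaces it), while stop_words filters the tokens produced afterwards and therefore still applies. The function name lower_and_strip_digits, the shortened corpus, and the stop-word lists are made up for illustration and are not from the original answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'And the third one.',
]

def lower_and_strip_digits(text):
    # Hypothetical preprocessor: it lowercases explicitly, because the
    # vectorizer's own lowercasing is skipped once preprocessor is set.
    return ''.join(ch for ch in text.lower() if not ch.isdigit())

# stop_words is applied to the tokens after preprocessing, so it still works.
vectorizer = TfidfVectorizer(preprocessor=lower_and_strip_digits,
                             stop_words=['the', 'is', 'and', 'this'])
vectorizer.fit(corpus)
features = sorted(vectorizer.vocabulary_)
print(features)

# With an uppercasing preprocessor, the lowercase stop word 'the' no longer
# matches the token 'THE', so the token survives (scikit-learn may emit a
# warning that the stop words are inconsistent with the preprocessing).
upper_vectorizer = TfidfVectorizer(preprocessor=str.upper, stop_words=['the'])
upper_vectorizer.fit(corpus)
upper_features = sorted(upper_vectorizer.vocabulary_)
print(upper_features)
```

In short, if you pass your own preprocessor and still want lowercasing, do it inside that function, and make sure your stop-word list matches the case of the tokens it produces.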