Using Word2Vec in a scikit-learn pipeline


I am trying to run Word2Vec on this data sample:

Statement              Label
Says the Annies List political group supports third-trimester abortions on demand.       FALSE
When did the decline of coal start? It started when natural gas took off that started to begin in (President George W.) Bushs administration.         TRUE
"Hillary Clinton agrees with John McCain ""by voting to give George Bush the benefit of the doubt on Iran."""     TRUE
Health care reform legislation is likely to mandate free sex change surgeries.    FALSE
The economic turnaround started at the end of my term.     TRUE
The Chicago Bears have had more starting quarterbacks in the last 10 years than the total number of tenured (UW) faculty fired during the last two decades.    TRUE
Jim Dunnam has not lived in the district he represents for years now.    FALSE

Using the code provided in this GitHub repository (FeatureSelection.py):

https://github.com/nishitpatel01/Fake_News_Detection

I want to include word2vec features in a Naive Bayes model. First, I set up X and y and used train_test_split:

import pandas as pd
from sklearn.model_selection import train_test_split

X = df['Statement']
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

dataset = pd.concat([X_train, y_train], axis=1)

Here is the code I am currently using:

# Using Word2Vec
import numpy as np
import gensim

with open("glove.6B.50d.txt", "rb") as lines:
    w2v = {line.split()[0]: np.array(map(float, line.split()[1:]))
           for line in lines}

training_sentences = DataPrep.train_news['Statement']

model = gensim.models.Word2Vec(training_sentences, size=100)  # should be tokenized text
w2v = dict(zip(model.wv.index2word, model.wv.syn0))


class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(word2vec.itervalues().next())

    def fit(self, X, y): # what are X and y?
        return self

    def transform(self, X): # should it be training_sentences?
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])


"""
class TfidfEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.word2weight = None
        self.dim = len(word2vec.itervalues().next())
    def fit(self, X, y):
        tfidf = TfidfVectorizer(analyzer=lambda x: x)
        tfidf.fit(X)
        # if a word was never seen - it must be at least as infrequent
        # as any of the known words - so the default idf is the max of 
        # known idf's
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(
            lambda: max_idf,
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])
        return self
    def transform(self, X):
        return np.array([
                np.mean([self.word2vec[w] * self.word2weight[w]
                         for w in words if w in self.word2vec] or
                        [np.zeros(self.dim)], axis=0)
                for words in X
            ])
"""

In classifier.py, I am running:
nb_pipeline = Pipeline([
        ('NBCV',FeaturesSelection.w2v),
        ('nb_clf',MultinomialNB())])

However, this didn't work, and I got the following error:
TypeError                                 Traceback (most recent call last)
<ipython-input-14-07045943a69c> in <module>
      2 nb_pipeline = Pipeline([
      3         ('NBCV',FeaturesSelection.w2v),
----> 4         ('nb_clf',MultinomialNB())])

/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     71                           FutureWarning)
     72         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73         return f(**kwargs)
     74     return inner_f
     75 

/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in __init__(self, steps, memory, verbose)
    112         self.memory = memory
    113         self.verbose = verbose
--> 114         self._validate_steps()
    115 
    116     def get_params(self, deep=True):

/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _validate_steps(self)
    160                                 "transformers and implement fit and transform "
    161                                 "or be the string 'passthrough' "
--> 162                                 "'%s' (type %s) doesn't" % (t, type(t)))
    163 
    164         # We allow last estimator to be None as an identity transformation

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '{' ': array([-0.17019527,  0.32363772, -0.0770281 , -0.0278154 , -0.05182227, ....

I am using all the scripts from that repository, so the code should be reproducible if you use them as well.

It would be great if you could explain how to fix this and what other changes are needed in the code. My goal is to compare models (Naive Bayes, Random Forest, etc.) using BoW, TF-IDF, and Word2Vec features.

Update:

Following the answer below (from Ismail), I updated the code as follows:

class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec, size=100):
        self.word2vec = word2vec
        self.dim = size

# building Linear SVM classifier
svm_pipeline = Pipeline([
        ('svmCV',FeaturesSelection_W2V.MeanEmbeddingVectorizer(FeaturesSelection_W2V.w2v)),
        ('svm_clf',svm.LinearSVC())
        ])

svm_pipeline.fit(DataPrep.train_news['Statement'], DataPrep.train_news['Label'])
predicted_svm = svm_pipeline.predict(DataPrep.test_news['Statement'])
np.mean(predicted_svm == DataPrep.test_news['Label'])

However, I still get an error.
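One detail worth checking, given the transform shown above: it loops for w in words over every element of X, so it expects each document to already be a list of tokens, whereas DataPrep.train_news['Statement'] is a Series of raw strings (iterating a string yields single characters). A minimal sketch of tokenizing first; the whitespace .split() tokenizer is an assumption, not something taken from the repository:

# Assumption: simple whitespace tokenization is enough for a quick test;
# the repository may use a different tokenizer (e.g. nltk).
train_tokens = [s.split() for s in DataPrep.train_news['Statement']]
test_tokens = [s.split() for s in DataPrep.test_news['Statement']]

svm_pipeline.fit(train_tokens, DataPrep.train_news['Label'])
predicted_svm = svm_pipeline.predict(test_tokens)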


Can you come up with a [reprex] that runs from start to finish? As for whether replacing sparse tf-idf with dense word2vec in the FS script is the right approach, it is certainly doable, but if your ultimate goal is to identify fake news, it won't bring you much closer to it. For that you would need to extract facts and compare them against what you believe to be the truth. - Sergey Bushmanov
When I uncomment the model and w2v in the linked script, I get the following error: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough'. So I think some step is missing, and I hope someone can explain and show what is missing and how to fix it. - LdM
For others to help you, your error needs to be reproducible (by the way, you might consider updating your question with the error message). And it should be minimal, which it currently is not, because there is too much code. Please consider [ask] and [reprex]. Also, your sklearn transformer should inherit from BaseEstimator and TransformerMixin to run in an sklearn pipeline, but I don't know whether that is enough for your program to execute, because I don't know how to run it. - Sergey Bushmanov
Please see the update. I don't know how to improve the code further. Everything is in the link (to provide a reproducible example). Thank you very much. - LdM
The error is quite clear (but it's the least of your problems): every step of a Pipeline must implement fit and transform. Instead, you are passing a dict. And it's the least of your problems because the code defining w2v makes no sense to begin with. First you load pretrained vectors into a dictionary. Then you define a Word2Vec model. Then you create a dict zipping the words from the GloVe file with that model's untrained vectors. (.-.) - janluke
Unfortunately, I haven't found a way to handle this yet, so I will keep the question open in case you or someone else can help me get it working without errors. - LdM
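A minimal sketch of what the two comments above suggest, under a couple of assumptions: the vectorizer subclasses BaseEstimator and TransformerMixin so that scikit-learn's Pipeline accepts it, and w2v is built from a single gensim Word2Vec model trained on tokenized statements rather than by zipping GloVe words with another model's vectors. The variable tokenized_statements is illustrative, not from the repository, and the gensim calls assume the pre-4.0 API (size, index2word):

import numpy as np
import gensim
from sklearn.base import BaseEstimator, TransformerMixin

# Assumption: tokenized_statements is a list of token lists, e.g.
# [s.split() for s in DataPrep.train_news['Statement']].
model = gensim.models.Word2Vec(tokenized_statements, size=100)
w2v = dict(zip(model.wv.index2word, model.wv.vectors))

class MeanEmbeddingVectorizer(BaseEstimator, TransformerMixin):
    """Averages the word vectors of each tokenized document."""
    def __init__(self, word2vec, size=100):
        self.word2vec = word2vec
        self.dim = size

    def fit(self, X, y=None):
        return self  # nothing to learn; the vectors are precomputed

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])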
1 Answer

Step 1. FeaturesSelection.w2v is a dict and has no fit or fit_transform function. In addition, MultinomialNB requires non-negative values, so it cannot be used directly on these embeddings. Therefore, I decided to add a preprocessing stage to rescale the negative values.
from sklearn.preprocessing import MinMaxScaler

nb_pipeline = Pipeline([
        ('NBCV',MeanEmbeddingVectorizer(FeatureSelection.w2v)),
        ('nb_norm', MinMaxScaler()),
        ('nb_clf',MultinomialNB())
    ])

... instead of

nb_pipeline = Pipeline([
        ('NBCV',FeatureSelection.w2v),
        ('nb_clf',MultinomialNB())
    ])

Step 2. I got an error on word2vec.itervalues().next(). So I decided to set the dimension with a predefined value equal to the Word2Vec size instead.
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec, size=100):
        self.word2vec = word2vec
        self.dim = size

... instead of

class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.dim = len(word2vec.itervalues().next())
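Putting the answer's two steps together, a hedged end-to-end sketch could look like the following. It assumes the MeanEmbeddingVectorizer from Step 2 keeps the original fit/transform methods, that FeatureSelection.w2v is the word-to-vector dict built in the repository's FeatureSelection.py, and that the statements are tokenized before being passed to the pipeline (the whitespace split is an assumption):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import MultinomialNB

nb_pipeline = Pipeline([
    ('NBCV', MeanEmbeddingVectorizer(FeatureSelection.w2v, size=100)),  # Step 2: pass the size explicitly
    ('nb_norm', MinMaxScaler()),  # Step 1: scale embeddings into [0, 1] so MultinomialNB gets non-negative input
    ('nb_clf', MultinomialNB()),
])

# Assumption: simple whitespace tokenization; the repository may differ.
train_tokens = [s.split() for s in DataPrep.train_news['Statement']]
test_tokens = [s.split() for s in DataPrep.test_news['Statement']]

nb_pipeline.fit(train_tokens, DataPrep.train_news['Label'])
predicted_nb = nb_pipeline.predict(test_tokens)
print(np.mean(predicted_nb == DataPrep.test_news['Label']))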

Thanks Ismail. It doesn't work for me because I get this error: AttributeError: 'dict' object has no attribute 'itervalues', which is what self.dim relies on. I also have a question about word2vec in the code: does it work for you with X, i.e. model = gensim.models.Word2Vec(X, size=100), or do you need to use training_sentences instead of X? - LdM
@LdM I have tested it and it works. However, the accuracy is between 54% and 57%. I think you should add more stages to improve the accuracy. - Ismail Durmaz
Sorry Ismail, but I still get the error AttributeError: 'dict' object has no attribute 'itervalues'. How did you fix/handle it? - LdM
@LdM I have added that step, please see Step 2. - Ismail Durmaz
Please see the updated question. I have tried it with the other classifiers as well. I used the parts you mentioned in your answer. - LdM
