UserWarning: Label not :NUMBER: is present in all training examples

Question

python scikit-learn classification text-classification multilabel-classification

16

I am doing multilabel classification, where I try to predict the correct tags for each document. Here is my code:

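from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# dataframe is assumed to be a pandas DataFrame with a 'body' column (text)
# and a 'tag' column (a list of tags per document)
mlb = MultiLabelBinarizer()
X = dataframe['body'].values
y = mlb.fit_transform(dataframe['tag'].values)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True,
                                   stop_words='english',
                                   max_df=0.8,
                                   min_df=10)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

predicted = cross_val_predict(classifier, X, y)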

When I run my code, I get multiple warnings:

UserWarning: Label not :NUMBER: is present in all training examples.

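When I print out the predicted and true labels, about half of all documents have empty label predictions. Why is this happening? Is it related to the warnings printed during training? How can I avoid these empty predictions?

EDIT01: This also happens with estimators other than LinearSVC(). I tried RandomForestClassifier() and it gives empty predictions as well. The strange thing is that when I use cross_val_predict(classifier, X, y, method='predict_proba') to predict a probability for each label instead of a binary 0/1 decision, every predicted set contains at least one label with probability > 0 for the given document. So why is this label not chosen by the binary decision? Or is the binary decision evaluated differently from the probabilities?

EDIT02: I have found an old post where the OP was dealing with a similar problem. Is this the same case?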

- PeterB

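You should share the complete code you use for fitting and predicting. - Vivek Kumar

cross_val_predict calls those methods implicitly, and I am using the Pipeline approach. This is the complete code. I only used MultiLabelBinarizer to turn the y labels into binary vectors before feeding them into cross_val_predict. - PeterB

Oh, yes. I overlooked that you are using cross_val_predict. Show some samples of X and y. - Vivek Kumar

@VivekKumar Sure, the question should be complete now. - PeterB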

2

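It looks like the assumption in your second edit is correct. The developers were very clear that returning nothing is the desired behavior if your data suffers from a class imbalance problem. Can you put a logging statement in the decision_function you are using to see whether your data is simply a poor fit for your classifier? If so, you may need to tune your decision function to control the level of fit you want. - karnesJ.R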


2 Answers

2

I ran into the same error. I then encoded the labels with LabelEncoder() instead of MultiLabelBinarizer().

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y = le.fit_transform(Labels)

I no longer get that error.
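For context, a small sketch of what the two encoders produce may help. LabelEncoder maps exactly one label per sample to an integer, while MultiLabelBinarizer builds one indicator column per tag, so the two are not interchangeable when documents carry several tags. The sample tags below are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

# Hypothetical single-tag data: one string per document.
single_tags = ['python', 'oop', 'python']
y_single = LabelEncoder().fit_transform(single_tags)
print(y_single)   # 1-D array, one integer class per document

# Hypothetical multi-tag data: a list of tags per document.
multi_tags = [['python'], ['python', 'decorator'], ['oop']]
y_multi = MultiLabelBinarizer().fit_transform(multi_tags)
print(y_multi)    # 2-D indicator matrix, one column per distinct tag
```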

- Vidya P V

- Tonechas · Accepted Answer

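Why is this happening, is it related to the warnings printed while training is running?

The issue is likely that some tags occur in only a few documents (see this thread for details). When you split the dataset into training and test sets to validate your model, some tags may end up missing from the training data. Let train_indices be an array with the indices of the training samples. If a particular tag (with index k) does not occur in any training sample, all the elements in the k-th column of the indicator matrix y[train_indices] are zero.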
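One quick way to see which tags are at risk is to count how many documents carry each tag. The snippet below is only a sketch: the tag lists and the min_count cut-off are hypothetical, chosen for illustration.

```python
from collections import Counter

# Hypothetical tag lists, one list per document.
tags_per_document = [
    ['python'], ['oop'], ['python'], ['python', 'decorator'],
    ['matlab', 'naming-conventions'], ['matlab'],
]

# Number of documents in which each tag appears.
tag_counts = Counter(tag for tags in tags_per_document for tag in tags)

# Illustrative cut-off: tags seen in fewer than min_count documents
# can easily be absent from a training split.
min_count = 2
rare_tags = sorted(tag for tag, n in tag_counts.items() if n < min_count)
print(rare_tags)
```

A tag that appears in a single document is necessarily missing from the training data whenever that document lands in the test split.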

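How can I avoid those empty predictions?

In the scenario above, the classifier will not be able to reliably predict the k-th tag in the test documents (more on this in the next paragraph). Therefore you cannot trust the predictions made by clf.predict, and you need to implement your own prediction function, for example by using the decision values returned by clf.decision_function, as suggested in this answer.

So I don't know why this label is not chosen by the binary decision? Or is the binary decision evaluated differently from the probabilities?

In datasets containing many labels, the occurrence frequency of most of them tends to be rather low. If these low values are fed to a binary classifier (i.e. a classifier that makes a 0-1 prediction), it is highly likely that the classifier will pick 0 for all tags on all documents.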
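To illustrate why a label with non-zero probability can still be left out, here is a minimal sketch. It assumes that each label is decided independently by thresholding its own score (0.5 for probabilities, 0 for decision_function values). The probabilities are made up.

```python
import numpy as np

# Made-up per-label probabilities for 3 documents and 4 labels.
proba = np.array([
    [0.10, 0.40, 0.05, 0.30],
    [0.70, 0.20, 0.10, 0.05],
    [0.45, 0.45, 0.02, 0.01],
])

# Each label is compared with the assumed 0.5 cut-off on its own;
# nothing forces at least one label per row to pass.
binary = (proba > 0.5).astype(int)

# Rows where no label passed the cut-off, i.e. empty predictions.
empty = ~binary.any(axis=1)
print(binary)
print(empty)
```

Under this assumption, a document whose best label only reaches 0.4 or 0.45 gets no tag at all, even though every probability is greater than zero.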

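I have found an old post where the OP was dealing with a similar problem. Is this the same case?

Yes, absolutely. That user is facing exactly the same problem as you, and their code is very similar to yours.

To explain the problem further, I have put together a simple toy example using mock data.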

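import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Mock data: question title -> list of tags
Q = {'What does the "yield" keyword do in Python?': ['python'],
     'What is a metaclass in Python?': ['oop'],
     'How do I check whether a file exists using Python?': ['python'],
     'How to make a chain of function decorators?': ['python', 'decorator'],
     'Using i and j as variables in Matlab': ['matlab', 'naming-conventions'],
     'MATLAB: get variable type': ['matlab'],
     'Why is MATLAB so fast in matrix multiplication?': ['performance'],
     'Is MATLAB OOP slow or am I doing something wrong?': ['matlab-oop'],
    }
# list(...) is needed on Python 3, where keys() and values() return views
dataframe = pd.DataFrame({'body': list(Q.keys()), 'tag': list(Q.values())})

mlb = MultiLabelBinarizer()
X = dataframe['body'].values
y = mlb.fit_transform(dataframe['tag'].values)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True,
                                   stop_words='english',
                                   max_df=0.8,
                                   min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])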

Note that I have set min_df=1 (the minimum document frequency) because my dataset is much smaller than yours. When I run the following statement:

predicted = cross_val_predict(classifier, X, y)

I get a bunch of warnings:

C:\...\multiclass.py:76: UserWarning: Label not 4 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 0 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 3 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 5 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 2 is present in all training examples.
  str(classes[c]))

and the following prediction:

In [5]: np.set_printoptions(precision=2, threshold=1000)    

In [6]: predicted
Out[6]: 
array([[0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0]])

Rows in which all entries are 0 indicate that no tag was predicted for the corresponding document.
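As a small sketch, the empty predictions can be located programmatically rather than by eye. The array below is copied from the output shown above.

```python
import numpy as np

# Prediction matrix copied from the output above.
predicted = np.array([[0, 0, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 1],
                      [0, 0, 0, 0, 0, 0, 1],
                      [0, 0, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 0],
                      [0, 1, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 0]])

# Indices of documents for which no tag was predicted.
empty_rows = np.flatnonzero(~predicted.any(axis=1))
print(empty_rows)

# Share of documents with an empty prediction.
empty_fraction = len(empty_rows) / len(predicted)
print(empty_fraction)
```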

Workaround

For the sake of completeness, let us validate the model manually rather than through cross_val_predict.

import warnings
from sklearn.model_selection import ShuffleSplit

# A single random 50/50 split into training and test indices
rs = ShuffleSplit(n_splits=1, test_size=.5, random_state=0)
train_indices, test_indices = next(rs.split(X))

# Record every warning raised while fitting and predicting
with warnings.catch_warnings(record=True) as received_warnings:
    warnings.simplefilter("always")
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    classifier.fit(X_train, y_train)
    predicted_test = classifier.predict(X_test)
    for w in received_warnings:
        print(w.message)

Running the snippet above issues two warnings (I used a context manager to make sure the warnings are caught):

Label not 2 is present in all training examples.
Label not 4 is present in all training examples.

This is consistent with the fact that the labels with indices 2 and 4 are missing from the training samples:

In [40]: y_train
Out[40]: 
array([[0, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 1]])
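The missing labels can also be read off directly from the indicator matrix. The following is a minimal sketch using the y_train values shown above. A column that sums to zero corresponds to a label that never occurs in the training samples.

```python
import numpy as np

# Training indicator matrix copied from the output above.
y_train = np.array([[0, 0, 0, 0, 0, 1, 0],
                    [0, 1, 0, 0, 0, 0, 0],
                    [0, 1, 0, 1, 0, 0, 0],
                    [1, 0, 0, 0, 0, 0, 1]])

# Indices of labels that do not occur in any training sample.
missing_labels = np.flatnonzero(y_train.sum(axis=0) == 0)
print(missing_labels)
```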

For some documents the prediction is empty (these documents correspond to the all-zero rows in predicted_test):

In [42]: predicted_test
Out[42]: 
array([[0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0]])

To overcome this issue, you could implement your own prediction function like this:

import numpy as np

def get_best_tags(clf, X, lb, n_tags=3):
    # Rank the labels by decision value and keep the n_tags best ones per document
    decfun = clf.decision_function(X)
    best_tags = np.argsort(decfun)[:, :-(n_tags+1): -1]
    return lb.classes_[best_tags]
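The slicing expression in get_best_tags may look cryptic, so here is a small sketch of what it does on made-up decision values. np.argsort sorts in ascending order, and the reversed slice keeps the last n_tags column indices, highest score first.

```python
import numpy as np

# Made-up decision values for 2 documents and 4 labels.
decfun = np.array([[0.1, -0.5, 0.9, 0.3],
                   [-1.2, -0.3, -0.8, -0.1]])
n_tags = 3

# Column indices sorted by ascending decision value.
order = np.argsort(decfun)

# Walk each row backwards and stop after n_tags entries.
best = order[:, :-(n_tags + 1):-1]
print(best)
```

Note that every row receives exactly n_tags indices here, even when all of its decision values are negative.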

With get_best_tags, each document is always assigned the n_tags tags with the highest confidence scores:

In [59]: mlb.inverse_transform(predicted_test)
Out[59]: [('matlab',), (), (), ('matlab', 'naming-conventions')]

In [60]: get_best_tags(classifier, X_test, mlb)
Out[60]: 
array([['matlab', 'oop', 'matlab-oop'],
       ['oop', 'matlab-oop', 'matlab'],
       ['oop', 'matlab-oop', 'matlab'],
       ['matlab', 'naming-conventions', 'oop']], dtype=object)
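If a fixed number of tags per document is more than you want, one possible variation (not part of the approach above, only a sketch) is to keep the usual thresholded prediction and fall back to the single best-scoring label for rows that would otherwise be empty. The function name fill_empty_predictions and the threshold of 0 on decision values are assumptions made for illustration.

```python
import numpy as np

def fill_empty_predictions(decision_values, threshold=0.0):
    """Threshold the scores, then give empty rows their top-scoring label.

    Hypothetical helper: decision_values is an array of shape
    (n_documents, n_labels), e.g. the output of decision_function.
    """
    decision_values = np.asarray(decision_values, dtype=float)
    predicted = (decision_values > threshold).astype(int)

    # Rows for which no label passed the threshold.
    empty_rows = ~predicted.any(axis=1)

    # For those rows only, switch on the label with the highest score.
    best = decision_values[empty_rows].argmax(axis=1)
    predicted[np.flatnonzero(empty_rows), best] = 1
    return predicted

# Made-up decision values: the second document has no positive score.
scores = np.array([[-0.2, 0.5, -1.0],
                   [-0.9, -0.1, -0.4]])
filled = fill_empty_predictions(scores)
print(filled)
```

With real data, the scores could come from classifier.decision_function(X_test), and mlb.inverse_transform could map the resulting matrix back to tag names.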