Use scikit-learn to classify into multiple categories

Question

82

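I want to use one of scikit-learn's supervised learning methods to classify pieces of text into one or more categories. The predict function of every algorithm I tried returns only a single match.

For example, I have a piece of text:

"Theaters in New York compared to those in London"

I have trained the algorithm to pick a place for every text snippet I feed it.

In the example above, I would want it to return New York and London, but it only returns New York.

Is it possible to use scikit-learn to return multiple results? Or even to return the label with the next-highest probability?

Thanks for your help.

---Update

I tried using OneVsRestClassifier, but I still get only one option back per piece of text. Below is the sample code I am using: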

y_train = ('New York','London')


train_set = ("new york nyc big apple", "london uk great britain")
vocab = {'new york' :0,'nyc':1,'big apple':2,'london' : 3, 'uk': 4, 'great britain' : 5}
count = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=2),vocabulary=vocab)
test_set = ('nice day in nyc','london town','hello welcome to the big apple. enjoy it here and london too')

X_vectorized = count.transform(train_set).todense()
smatrix2  = count.transform(test_set).todense()


base_clf = MultinomialNB(alpha=1)

clf = OneVsRestClassifier(base_clf).fit(X_vectorized, y_train)
Y_pred = clf.predict(smatrix2)
print Y_pred

Result: ['New York' 'London' 'London']

- CodeMonkeyB

5 Answers

61

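Edit: updated for Python 3 and scikit-learn 0.18.1, using MultiLabelBinarizer as suggested.

I have been working on this as well, and made a slight enhancement to mwv's excellent answer that may be useful. It takes text labels as input rather than binary labels and encodes them using MultiLabelBinarizer.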

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                ["new york"],["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","london"],["new york","london"]]

X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'london is rainy',
                   'it is raining in britian',
                   'it is raining in britian and the big apple',
                   'it is raining in britian and nyc',
                   'hello welcome to new york. enjoy it here and london too'])
target_names = ['New York', 'London']

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = mlb.inverse_transform(predicted)

for item, labels in zip(X_test, all_labels):
    print('{0} => {1}'.format(item, ', '.join(labels)))

This gives me the following output:

nice day in nyc => new york
welcome to london => london
london is rainy => london
it is raining in britian => london
it is raining in britian and the big apple => new york
it is raining in britian and nyc => london, new york
hello welcome to new york. enjoy it here and london too => london, new york

- J Maurer

13

LabelBinarizer is outdated; use lb = preprocessing.MultiLabelBinarizer() instead. - Roman

2

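According to the scikit-learn documentation, One-Vs-All is supported by all linear models except sklearn.svm.SVC, while multilabel is supported by decision trees, random forests, and nearest neighbors. So I would not use LinearSVC() for this type of task (i.e. multilabel classification, which I assume is what you want). - PeterB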

2

It is worth noting that the One-Vs-All mentioned by @mindstorm corresponds to the scikit-learn class "OneVsRestClassifier" (note "Rest", not "All"). The scikit-learn help page clarifies this. - lucid_dreamer

1

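As @mindstorm mentioned, the documentation on this page does say: "One-Vs-All: all linear models except sklearn.svm.SVC". However, another multilabel example in the scikit-learn documentation contains the line classif = OneVsRestClassifier(SVC(kernel='linear')). I am confused. - lucid_dreamer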

1

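Is there a way to have it assign no label at all? For example, if I add the test example "I want cookies", it gets labeled as both "New York" and "London". - Omar Meky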
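
One way to look into this, as a rough sketch rather than a definitive fix: LinearSVC exposes a signed score per label through decision_function, so you can inspect the scores for an unrelated sentence and apply your own cutoff instead of relying on predict. The training sentences below are a shortened version of the ones in the answer, and the CUTOFF value is a made-up number for illustration only, not a recommended setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

texts = ["new york is a hell of a town",
         "nyc is nice",
         "the big apple is great",
         "london is in the uk",
         "it rains a lot in london",
         "london hosts the british museum",
         "new york is great and so is london"]
labels = [["new york"], ["new york"], ["new york"],
          ["london"], ["london"], ["london"],
          ["new york", "london"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(texts, Y)

# One signed score per label; predict() marks a label when its score is > 0.
scores = model.decision_function(["i want cookies"])

CUTOFF = 0.5  # hypothetical threshold, chosen only for illustration
for name, score in zip(mlb.classes_, scores[0]):
    print("{0}: score={1:.3f}, keep={2}".format(name, score, score > CUTOFF))
```

Since none of the words in the test sentence occur in the training data, the scores here come from the intercepts alone, which is why such inputs can end up with arbitrary labels. A suitable cutoff would have to be tuned on held-out data.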

8

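I ran into this as well. The problem for me was that my y_Train was a sequence of strings rather than a sequence of sequences of strings. Apparently, OneVsRestClassifier decides whether to use multi-class or multi-label based on the format of the input labels. So change: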

y_train = ('New York','London')

to

y_train = (['New York'],['London'])

Apparently this will go away in the future, since it breaks when all the labels are the same: https://github.com/scikit-learn/scikit-learn/pull/1987
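
To see how the target format gets interpreted, a small sketch using type_of_target from sklearn.utils.multiclass may help. It assumes a reasonably recent scikit-learn, where, as far as I know, multi-label targets are passed as a binary indicator matrix rather than as the nested sequences shown above. The example arrays are made up for illustration.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# One label per sample: treated as single-label (multi-class) classification.
single = np.array(["New York", "London", "Paris"])
print(type_of_target(single))

# One row per sample, one 0/1 column per label: treated as multi-label.
indicator = np.array([[1, 0],
                      [0, 1],
                      [1, 1]])
print(type_of_target(indicator))
```

If the second call does not report a multi-label type for your own targets, the classifier will most likely fall back to predicting a single label per sample.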

- user2824135

8

Change this line to make it work in newer versions of Python

# lb = preprocessing.LabelBinarizer()
lb = preprocessing.MultiLabelBinarizer()
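
For reference, a minimal sketch of what MultiLabelBinarizer does with text labels. The label lists are made up for illustration; the point is the column order and the round trip back to label names.

```python
from sklearn.preprocessing import MultiLabelBinarizer

y_text = [["new york"], ["london"], ["new york", "london"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_text)

print(mlb.classes_)               # column order: labels in sorted order
print(Y)                          # one row per sample, one 0/1 column per label
print(mlb.inverse_transform(Y))   # back to tuples of label strings
```

Because the columns follow the sorted class names, "london" comes before "new york" regardless of the order in which the labels were first seen.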

- Srini Sydney

2

Here are some multi-class examples:

Example 1:

import numpy as np
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()

# 1-D array of integer class labels (14 distinct values; 1 appears twice)
arr2d = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1])
transformed_label = encoder.fit_transform(arr2d)
print(transformed_label)

The output is:

[[1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0]]

Example 2:

import numpy as np
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()

# 1-D array of string class labels ('Lion' appears twice)
arr2d = np.array(['Leopard', 'Lion', 'Tiger', 'Lion'])
transformed_label = encoder.fit_transform(arr2d)
print(transformed_label)

The output is:

[[1 0 0]
 [0 1 0]
 [0 0 1]
 [0 1 0]]
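
One thing the two examples do not show, added here as a small sketch with made-up labels: with only two distinct classes, LabelBinarizer returns a single 0/1 column instead of one column per class, and inverse_transform maps it back to the original labels.

```python
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
two_class = encoder.fit_transform(["Lion", "Tiger", "Tiger", "Lion"])

print(encoder.classes_)                       # the two class names, sorted
print(two_class)                              # a single 0/1 column
print(encoder.inverse_transform(two_class))   # back to the original labels
```

Note that LabelBinarizer encodes exactly one label per sample, so it covers the multi-class case only; several labels per sample need a different encoding.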

- Goyal Vicky

- mwv · Accepted Answer

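What you want is called multi-label classification. scikit-learn can do that. See here: http://scikit-learn.org/dev/modules/multiclass.html. I am not sure what is going wrong in your example; my version of sklearn apparently does not have WordNGramAnalyzer. Perhaps it is a matter of using more training examples or trying a different classifier? Note, though, that the multi-label classifier expects the target to be a list of tuples/lists of labels. The following works for me: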

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
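# The list-of-lists targets described above were later removed from scikit-learn,
# so the same labels are written here as a binary indicator matrix with one
# column per entry in target_names: [New York, London].
y_train = np.array([[1, 0]] * 6 + [[0, 1]] * 6 + [[1, 1]] * 2)
X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'hello welcome to new york. enjoy it here and london too'])
target_names = ['New York', 'London']

classifier = Pipeline([
    # ngram_range=(1, 2) replaces the old min_n=1, max_n=2 arguments
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
for item, row in zip(X_test, predicted):
    # Each row holds one 0/1 flag per label
    labels = [name for name, flag in zip(target_names, row) if flag]
    print('%s => %s' % (item, ', '.join(labels)))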

For me, this produces the following output:

nice day in nyc => New York
welcome to london => London
hello welcome to new york. enjoy it here and london too => New York, London
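
On the "next-highest probability" part of the question, here is a rough sketch under a few assumptions: it swaps LinearSVC for MultinomialNB, because OneVsRestClassifier only offers predict_proba when the base estimator does, and it uses a shortened, made-up subset of the training sentences. The values are per-label probabilities from independent binary classifiers, so they need not sum to 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["new york is a hell of a town",
         "nyc is nice",
         "the big apple is great",
         "london is in the uk",
         "it rains a lot in london",
         "london hosts the british museum",
         "new york is great and so is london"]
labels = [["new york"], ["new york"], ["new york"],
          ["london"], ["london"], ["london"],
          ["new york", "london"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(MultinomialNB()))
model.fit(texts, Y)

# One probability per label for each input text.
proba = model.predict_proba(
    ["hello welcome to new york. enjoy it here and london too"])

# Rank the labels from most to least probable for the first (only) input.
ranked = sorted(zip(mlb.classes_, proba[0]), key=lambda p: p[1], reverse=True)
for name, p in ranked:
    print("{0}: {1:.3f}".format(name, p))
```

Taking the top entries of ranked gives the best label plus the runner-up, which can be handy when you want a fixed number of suggestions instead of a hard yes/no per label.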