How do I get the predicted probabilities of a classification model?

Question

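I am trying out different classification models with a binary dependent variable (occupied/not occupied). The models I am interested in are logistic regression, decision tree, and Gaussian naive Bayes.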

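My input data is a csv file with a datetime index (e.g. 2019-01-07 14:00), three variable columns ("R", "P", "C", containing numeric values), and the dependent variable column ("value", containing the binary values).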

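Training the models is not the problem; everything runs fine. All models output their binary predictions (which of course should be the final result), but I would also like to see the predicted probabilities that led them to decide on those binary values. Is there a way to get these values as well?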

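I have tried all the classification visualizers that come with the yellowbrick package (ClassBalance, ROCAUC, ClassificationReport, ClassPredictionError). But none of them gives me a plot showing the probabilities the model computed for the dataset.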

import pandas as pd
import numpy as np
data = pd.read_csv('testrooms_data.csv', parse_dates=['timestamp'])


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## split dataset into test and training set
X = data.drop("value", axis=1) # X contains all the features
y = data["value"] # y contains only the label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state = 1)

###model training
###Logistic Regression###
clf_lr = LogisticRegression()

# fit the dataset into LogisticRegression Classifier

clf_lr.fit(X_train, y_train)
#predict on the unseen data
pred_lr = clf_lr.predict(X_test)

###Decision Tree###

from sklearn.tree import DecisionTreeClassifier

clf_dt = DecisionTreeClassifier()
pred_dt = clf_dt.fit(X_train, y_train).predict(X_test)

###Bayes###
from sklearn.naive_bayes import GaussianNB

bayes = GaussianNB()
pred_bayes = bayes.fit(X_train, y_train).predict(X_test)


###visualization for e.g. LogReg
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ClassPredictionError
from yellowbrick.classifier import ROCAUC

#classificationreport
visualizer = ClassificationReport(clf_lr, support=True)

visualizer.fit(X_train, y_train)  # Fit the visualizer and the model
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
g = visualizer.poof()             # Draw/show/poof the data

#classprediction report
visualizer2 = ClassPredictionError(LogisticRegression())

visualizer2.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer2.score(X_test, y_test) # Evaluate the model on the test data
g2 = visualizer2.poof() # Draw visualization

#(ROC)
visualizer3 = ROCAUC(LogisticRegression())

visualizer3.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer3.score(X_test, y_test)  # Evaluate the model on the test data
g3 = visualizer3.poof()             # Draw/show/poof the data

It would be great to have an array similar to pred_lr that contains the probability computed for each row of the csv file. Is that possible? If so, how can I get it?

- joey11235

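I'm voting to close this question because the answer can be found directly in the relevant documentation, e.g. for logistic regression, decision trees, etc. - desertnaut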

1 Answer

- BCJuan · Accepted Answer

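Most scikit-learn classifiers, including the three you mention, provide a method for getting the probabilities behind the predicted class, either as probabilities (predict_proba) or as log-probabilities (predict_log_proba).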
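To make the relationship between the two forms concrete, here is a minimal sketch. The data is synthetic (generated with make_classification) and the names X_demo, y_demo and clf are only placeholders for illustration, not part of the question's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with three features (assumption: stands in for R, P, C)
X_demo, y_demo = make_classification(
    n_samples=100, n_features=3, n_informative=2, n_redundant=0, random_state=0
)
clf = LogisticRegression().fit(X_demo, y_demo)

# One row per sample, one column per class; column order follows clf.classes_
proba = clf.predict_proba(X_demo)
log_proba = clf.predict_log_proba(X_demo)

print(clf.classes_)   # the class label that each column refers to
print(proba[:5])      # probabilities for the first five samples

# Each row of probabilities sums to 1, and the log form is just log(proba)
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)
log_matches = np.allclose(np.exp(log_proba), proba)
```

In a binary problem the second column is the probability of the class listed second in clf.classes_, which is usually the one you want to plot or threshold.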

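For example, if you have a Naive Bayes classifier and want the probabilities rather than the class labels themselves, you can do the following (I used the same names as in your code):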

from sklearn.naive_bayes import GaussianNB

bayes = GaussianNB()
pred_bayes = bayes.fit(X_train, y_train).predict(X_test)

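# class probabilities, shape (n_samples, n_classes); columns follow bayes.classes_
proba_bayes = bayes.predict_proba(X_test)
# the same values as log-probabilities
log_proba_bayes = bayes.predict_log_proba(X_test)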
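The same call works for the other two models in the question. Below is a rough sketch that collects the probability of the positive class from all three models into one table aligned with the test rows. Since the original csv is not available, the data is a synthetic stand-in with the same column names ("R", "P", "C", "value"); the dictionary keys and the proba DataFrame are names made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for testrooms_data.csv (assumption: numeric R, P, C and binary value)
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(200, 3)), columns=["R", "P", "C"])
noise = rng.normal(scale=0.5, size=200)
data["value"] = (data["R"] + data["P"] + noise > 0).astype(int)

X = data.drop("value", axis=1)
y = data["value"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1
)

models = {
    "lr": LogisticRegression(),
    "dt": DecisionTreeClassifier(random_state=1),
    "bayes": GaussianNB(),
}

# One column per model, indexed like X_test so rows can be traced back to the input
proba = pd.DataFrame(index=X_test.index)
for name, model in models.items():
    model.fit(X_train, y_train)
    # Look up the column of the positive class (label 1) instead of assuming its position
    pos_col = list(model.classes_).index(1)
    proba[name] = model.predict_proba(X_test)[:, pos_col]

print(proba.head())
```

Note that a fully grown decision tree usually ends in pure leaves, so its probabilities tend to be exactly 0.0 or 1.0; limiting the tree (for example with max_depth or min_samples_leaf) gives more graded values.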

Hope this helps.