How to find the logistic / sigmoid function parameters in logistic regression

Question

python machine-learning scikit-learn logistic-regression

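I want to estimate the best parameters (named at the end: slope and intercept) of the sigmoid/logistic function used in logistic regression on medical data. Here is what I have done in Python: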

import numpy as np
import pandas as pd
from scipy.io import loadmat
from sklearn import preprocessing, svm, neighbors, utils
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

I have an Apache.mat file that contains 4 columns: Apache score (0-72), number of patients, number of deaths, and proportion (the ratio of deaths to patients).

datamat = loadmat('Apache.mat')
data = pd.DataFrame(np.hstack((datamat['apacheII'], datamat['NoPatients'],
                               datamat['NoDeaths'], datamat['proportion'])))

data.columns = ['apacheII', 'NoPatients', 'NoDeaths', 'proportion']

I have created the data frame to work with.

x = np.array(data.drop(columns=['NoPatients', 'NoDeaths', 'proportion']))

I have dropped the columns I don't need, so only the ApacheII scores remain in 'x'.

# standardize the data (zero mean, unit variance)
x = preprocessing.scale(x)

y = np.array(data['proportion'])

Now I have encoded 'y' with LabelEncoder() so that it is compatible with LogisticRegression().

lab_enc = preprocessing.LabelEncoder()
encoded = np.array(lab_enc.fit_transform(y))

clf = LogisticRegression()
clf.fit(x, encoded)
print(clf.coef_)
print(clf.intercept_)

The output is as follows:

[[-0.49124107]
 [-0.23528893]
 [-0.19035795]
 [-0.30312848]
 [-0.25783808]
 [-0.37161079]
 [-0.12332468]
 [-0.16797195]
 [-0.05660718]
 [-0.21279785]
 [-0.22142453]
 [-0.10105617]
 [-0.14562868]
 [ 0.00991192]
 [-0.012247  ]
 [ 0.03206243]
 [ 0.07635461]
 [ 0.20951544]
 [ 0.12067417]
 [-0.03441851]
 [ 0.16504852]
 [ 0.09850035]
 [ 0.23179558]
 [ 0.05420914]
 [ 1.47513463]]
[-1.79691975 -2.35677113 -2.35090141 -2.3679202  -2.36017388 -2.38191049
 -2.34441678 -2.34843121 -2.34070389 -2.35368047 -1.57944984 -2.3428732
 -2.3462668  -2.33974088 -2.33975687 -2.34002906 -2.34151792 -2.35329447
 -2.34422478 -2.34007746 -2.34814388 -2.34271603 -2.35632459 -2.34062229
 -1.72511457]

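I want to know the parameters of the sigmoid function that is generally used in logistic regression. How can I find the sigmoid parameters (i.e. the intercept and slope)?

Here is the sigmoid function, for reference: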

def sigmoid(x, x0, k):
    y = 1 / (1 + np.exp(-k*(x-x0)))
    return y

- NAMAN SHUKLA

If "proportion" is a continuous variable, I think you should be looking at ridge regression rather than logistic regression for this problem. - Gerges

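Yes, @GergesDib, you are right. Thanks. But for now I just want to find the parameters of the logistic function, even if it is not the best regression model. Any help would be appreciated. - NAMAN SHUKLA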

I think you have already found them: they are lr.coef_ and lr.intercept_. What is the problem? - Gerges
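For a binary fit, coef_ and intercept_ map directly onto the x0 and k of the sigmoid in the question: the model computes 1 / (1 + exp(-(w*x + b))), so k = w and x0 = -b / w. The following is a minimal sketch on simulated per-patient outcomes. The data, the "true" values 20 and 0.25, and the large C (used to weaken scikit-learn's default L2 regularization) are assumptions for illustration, not the Apache data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(x, x0, k):
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

# Simulated stand-in data: one row per patient, outcome 1 = died, 0 = survived.
rng = np.random.default_rng(0)
true_x0, true_k = 20.0, 0.25          # hypothetical ground truth
scores = rng.uniform(0, 40, size=2000)
died = (rng.random(2000) < sigmoid(scores, true_x0, true_k)).astype(int)

# Binary labels give a single coefficient and a single intercept.
clf = LogisticRegression(C=1e6, max_iter=1000)
clf.fit(scores.reshape(-1, 1), died)

w = clf.coef_[0, 0]
b = clf.intercept_[0]

# Rewrite w*x + b as k*(x - x0).
k = w
x0 = -b / w

# The fitted probabilities coincide with the question's sigmoid(x, x0, k).
same_curve = np.allclose(
    clf.predict_proba(scores.reshape(-1, 1))[:, 1],
    sigmoid(scores, x0, k),
)
print(k, x0, same_curve)
```

If x is standardized first, as in the question, x0 and k come out in the standardized units and would need to be converted back to the original score scale.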

That's strange! If I run print(x.shape, y.shape, encoded.shape), I get (38, 1) (38,) (38,). I'm not sure how to interpret this result. - NAMAN SHUKLA

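I see. encoded carries class labels, and since y is continuous you have 38 unique labels (one per observation), hence 38 coefficients (one per class). If you replace encoded with something like np.concatenate([np.ones(19), np.zeros(19)]), so that it looks like there are 2 classes, you will get a single coefficient and intercept. - Gerges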
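The effect described in this comment can be reproduced on a small scale. This sketch uses made-up data (10 scores and 10 distinct proportions, not the Apache file) and only compares the shapes of coef_ and intercept_ for label-encoded continuous targets versus two-class targets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Made-up stand-in for the question's data: 10 scores, 10 distinct proportions.
x = np.arange(10, dtype=float).reshape(-1, 1)
y = np.linspace(0.05, 0.95, 10)

# LabelEncoder turns every distinct proportion into its own class: labels 0..9.
encoded = LabelEncoder().fit_transform(y)
n_classes = len(np.unique(encoded))

multi = LogisticRegression(max_iter=1000).fit(x, encoded)
print(multi.coef_.shape, multi.intercept_.shape)    # one row per class

# With only two classes there is one coefficient and one intercept.
two_class = np.concatenate([np.zeros(5), np.ones(5)])
binary = LogisticRegression(max_iter=1000).fit(x, two_class)
print(binary.coef_.shape, binary.intercept_.shape)
```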

1 Answer

- Anton Alekseev · Accepted Answer

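This is normal behavior for logistic regression when it is solving a multiclass problem. See the documentation:

In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme. intercept_ is of shape (1,) when the problem is binary.

Example: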

>>> clf = LogisticRegression()
>>> clf.fit([[1,2], [1,3], [0, 1]], [[0],[1],[0]])
>>> clf.coef_
array([[ 0.02917282,  0.12584457]])
>>> clf.intercept_
array([-0.40218649])
>>> clf.fit([[1,2], [1,3], [0, 1]], [[0],[1],[2]])
>>> clf.coef_
array([[ 0.25096507, -0.24586515],
       [ 0.02917282,  0.12584457],
       [-0.41626058, -0.43503612]])
>>> clf.intercept_
array([-0.15108918, -0.40218649,  0.1536541 ])

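In effect, there are several models, each solving a different binary problem. You can pair the i-th coefficient with the i-th intercept to get the model that solves the i-th binary problem, and so on to the end of the list.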
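The pairing of the i-th coefficient with the i-th intercept can be checked by hand. This sketch wraps LogisticRegression in OneVsRestClassifier so that the one-vs-rest scheme is explicit regardless of scikit-learn version (an assumption made here, since the multiclass default of LogisticRegression has changed over time). The toy data is made up for illustration.

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Made-up toy data with three classes.
X = np.array([[1, 2], [1, 3], [0, 1], [2, 1], [2, 3], [0, 2]], dtype=float)
y = np.array([0, 1, 2, 0, 1, 2])

# One binary logistic regression per class: "class i" versus "everything else".
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

matches = []
for i, est in enumerate(ovr.estimators_):
    w = est.coef_[0]          # coefficients of the i-th binary model
    b = est.intercept_[0]     # intercept of the i-th binary model
    manual = expit(X @ w + b)             # sigmoid applied by hand
    fitted = est.predict_proba(X)[:, 1]   # probability of "class i"
    matches.append(np.allclose(manual, fitted))
    print(i, w, b, matches[-1])
```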