How do I merge predicted values back into the original DataFrame in Pandas and sklearn?

Question


3

This is my first time using sklearn and pandas, so apologies if this is a basic question. Here is my code:

import pandas as pd
from sklearn.linear_model import LogisticRegression

X = df[predictors]
y = df['Plc']

X_train = X[:int(X.shape[0]*0.7)]
X_test = X[int(X.shape[0]*0.7):]
y_train = y[:int(X.shape[0]*0.7)]
y_test = y[int(X.shape[0]*0.7):]


model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
result = model.score(X_test, y_test)
print("Accuracy: %.3f%%" % (result*100.0))

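What I would like to do now is put the predicted values back into the original df, so that I can compare the actual df['Plc'] column with the predictions for the test rows. I tried the following, but it feels like it is probably not the best approach, and the index numbers do not line up as expected.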

y_pred = pd.DataFrame()
y_pred['preds'] = model.predict(X_test)
y_test = pd.DataFrame(y_test)
y_test['index1'] = y_test.index
y_test = y_test.reset_index()
y_test = pd.concat([y_test,y_pred],axis=1)
y_test.set_index('index1')
df = df.reset_index()
df_out = pd.merge(df,y_test,how = 'inner',left_index = True, right_index = True)

Any other suggestions? Thanks!

- SOK

3 Answers

2

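I believe you want X_test, y_test, and y_pred in the same DataFrame (X_train is not used here). scikit-learn's train_test_split keeps the pandas index when it is given pandas objects, which makes this easy (there is also a way to do it with numpy; see "Scikit-learn train_test_split with indices"). I will use the iris dataset as toy data here, but you should get the idea.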

from sklearn.datasets import load_iris
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X, y = load_iris(return_X_y=True)
X = pd.DataFrame(X)
y = pd.Series(y)
# Pass shuffle=False instead of random_state if you need an ordered (non-random) split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
df = X_test.copy()
df['Plc']= y_test
df.reset_index(inplace=True)
df['pred'] = model.predict(X_test)

# Then print df; the 'index' column holds the original row labels and can be dropped if not needed

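If you do want to include X_train and y_train as well, with NaN in the pred column for those rows, build a second DataFrame from X_train and y_train in the same way, then combine the two with pd.concat.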

df2 = X_train.copy()
df2['Plc'] = y_train
df2.reset_index(inplace=True)
pd.concat([df,df2])

index   0   1   2   3   Plc pred
0   73  6.1 2.8 4.7 1.2 1   1.0
1   18  5.7 3.8 1.7 0.3 0   0.0
2   118 7.7 2.6 6.9 2.3 2   2.0
3   78  6.0 2.9 4.5 1.5 1   1.0
4   76  6.8 2.8 4.8 1.4 1   1.0
... ... ... ... ... ... ... ...
100 71  6.1 2.8 4.0 1.3 1   NaN
101 106 4.9 2.5 4.5 1.7 2   NaN
102 14  5.8 4.0 1.2 0.2 0   NaN
103 92  5.8 2.6 4.0 1.2 1   NaN
104 102 7.1 3.0 5.9 2.1 2   NaN
150 rows × 7 columns
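
A minimal sketch of an index-aligned alternative to the concat approach above, assuming the goal is to keep every original row in its original order. The names `preds` and `df_out` and the use of `load_iris(as_frame=True)` are illustrative choices, not part of the original answer. The predictions are wrapped in a Series that carries the test rows' index labels, so `join` can line them up even after a shuffled split.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data: iris as a DataFrame, with the target renamed to 'Plc' (illustrative)
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={'target': 'Plc'})
predictors = iris.feature_names

# train_test_split keeps the original index labels on the pandas pieces
X_train, X_test, y_train, y_test = train_test_split(
    df[predictors], df['Plc'], test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Attach the test rows' index labels to the predictions
preds = pd.Series(model.predict(X_test), index=X_test.index, name='pred')

# join aligns on the index: test rows get a prediction, training rows get NaN
df_out = df.join(preds)
```

Because the alignment is done by index label, this does not depend on the test rows being the last rows of df.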

- porra

Thanks for the suggestion, @porra. I ended up using FBruzzesi's solution, but I understand yours as well. Much appreciated! - SOK

1

Since your X_test is X_test = X[int(X.shape[0]*0.7):], i.e. the last 30% of your samples, you can add the predictions to the last 30% of the rows in the original DataFrame:

Z=model.predict(X_test)
df.loc[int(X.shape[0]*0.7):,'predictions']=Z

This adds a new column named "predictions" to df. If your DataFrame looks like this:

import numpy as np
import pandas as pd

df=pd.DataFrame({'predictor1':[0.1,0.3,0.3,0.3,0.5,0.9,0.02,0.8,0.8,0.75],
                 'predictor2':[0.1,0.4,0.4,0.5,0.5,0.9,0.02,0.8,0.8,0.75],
                 'Plc':np.array([0,1,1,1,1,1,1,0,1,1])})
predictors=['predictor1','predictor2']

you get the following result:

   predictor1  predictor2  Plc  predictions
0        0.10        0.10    0          NaN
1        0.30        0.40    1          NaN
2        0.30        0.40    1          NaN
3        0.30        0.50    1          NaN
4        0.50        0.50    1          NaN
5        0.90        0.90    1          NaN
6        0.02        0.02    1          NaN
7        0.80        0.80    0          1.0
8        0.80        0.80    1          1.0
9        0.75        0.75    1          1.0

Z=[1,1,1] has been added to the last 3 samples.
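
One caveat, offered as a hedged sketch rather than a diagnosis: `df.loc[int(X.shape[0]*0.7):, ...]` slices by index label, so it only selects the last 30% of rows when df has the default 0..n-1 index. If the index is something else (for example, left over from filtering), the slice can select a different number of rows than there are predictions, and the assignment can then fail with a length mismatch. The frame and the stand-in array `Z` below are hypothetical; the idea is to pick the target rows by position first and then assign by their labels.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index is NOT 0..n-1 (e.g. left over after filtering)
df = pd.DataFrame({'Plc': [0, 1, 1, 1, 1, 1, 1, 0, 1, 1]},
                  index=range(100, 110))

train_size = int(df.shape[0] * 0.7)   # 7 training rows, 3 test rows
Z = np.array([1, 1, 1])               # stand-in for model.predict(X_test)

# Create the column first so every row starts as NaN
df['predictions'] = np.nan

# Pick the test rows by position, then assign using their actual labels
test_labels = df.index[train_size:]
df.loc[test_labels, 'predictions'] = Z
```

Calling `df.reset_index(drop=True)` before splitting is another way to get back to the default index, at the cost of discarding the original labels.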

- tianlinhe

1

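Thanks a lot! I actually tried FBruzzesi's suggestion first and it did what I wanted, but this one also works and includes only the predictions. Much appreciated! - SOK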

Hi @tianlinhe, I just tried running your code again to fill specific rows, but I keep getting this error: "ValueError: Must have equal len keys and value when setting with an iterable", raised on this line: df.loc[int(X.shape[0]*0.7):,'predictions']=Z. Any ideas? Thanks! - SOK

- FBruzzesi · Accepted Answer

You can define the 'preds' column in df on the fly, without creating any additional DataFrames:

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

# Generate fake data
df = pd.DataFrame(np.random.rand(1000, 4),
                  columns = list('abcd'))
df['Plc'] = np.random.randint(0,2,1000)

# Split X and y
predictors = list('abcd')
X = df[predictors]
y = df['Plc']

# Split train and test
train_size = int(X.shape[0]*0.7)
X_train = X[:train_size]
X_test = X[train_size:]
y_train = y[:train_size]
y_test = y[train_size:]

# Train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict train and test
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

You now have at least two options:

Stack the predictions and create the column from the stacked array:

df['preds'] = np.hstack([y_pred_train, y_pred_test])

Initialize the column, then assign the predictions:

df['preds'] = np.nan
df.loc[:train_size-1, 'preds'] = y_pred_train
df.loc[train_size:, 'preds'] = y_pred_test

Both produce the same result.
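
A small self-contained check of that claim, as a sketch on made-up data. The seeded generator, the 200-row size, and the names `stacked`, `assigned`, and `same` are assumptions for illustration. It builds the predictions both ways on a default 0..n-1 index and compares them element by element.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Fake data with a default RangeIndex (seeded so the run is repeatable)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 4)), columns=list('abcd'))
df['Plc'] = rng.integers(0, 2, 200)

predictors = list('abcd')
train_size = int(df.shape[0] * 0.7)
X_train = df[predictors][:train_size]
X_test = df[predictors][train_size:]
y_train = df['Plc'][:train_size]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Option 1: stack the two prediction arrays in row order
stacked = pd.Series(np.hstack([y_pred_train, y_pred_test]), index=df.index)

# Option 2: start from NaN and fill the two label ranges
# (.loc slices are inclusive, hence train_size - 1)
assigned = pd.Series(np.nan, index=df.index)
assigned.loc[:train_size - 1] = y_pred_train
assigned.loc[train_size:] = y_pred_test

same = bool((stacked == assigned).all())
```

The values match, although the second option yields a float column because it starts from NaN, while the stacked array keeps the integer dtype of the predictions.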