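I am using pandas' get_dummies to convert categorical variables into dummy/indicator variables, which introduces new features into the dataset. We then fit/train a model on this dataset.
Because X_train and X_test are split from the same encoded frame, they have the same columns, so the model works fine when we predict on the test data X_test.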
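Now suppose we have test data in another csv file (with unknown outputs). When we convert this test set with get_dummies, the resulting dataset may not have the same number of features as the one the model was trained on. When we later run the model on this dataset, it fails because the number of features in the test set does not match the number of features the model expects.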
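To make the mismatch concrete, here is a minimal sketch on toy data. The frames, the column name "Embarked", and its values are made up for illustration and are not taken from train.csv or test.csv. It only shows that get_dummies creates one column per category it actually sees, so a file with fewer (or different) categories yields fewer (or different) columns.

```python
import pandas as pd

# Hypothetical toy frames, for illustration only.
train = pd.DataFrame({"Age": [22, 38, 26], "Embarked": ["S", "C", "Q"]})
new = pd.DataFrame({"Age": [35, 54], "Embarked": ["S", "S"]})

# Each call encodes only the categories present in the frame it is given.
train_dummies = pd.get_dummies(train)
new_dummies = pd.get_dummies(new)

print(train_dummies.columns.tolist())  # includes Embarked_C, Embarked_Q, Embarked_S
print(new_dummies.columns.tolist())    # only Embarked_S is created
print(train_dummies.shape[1], new_dummies.shape[1])
```

In this toy case the training frame ends up with four columns and the new frame with two, which is the kind of shape difference described above.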
Any ideas on how to handle this?
Code:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Load the dataset
in_file = 'train.csv'
full_data = pd.read_csv(in_file)
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis = 1)
features = pd.get_dummies(features_raw)
features = features.fillna(0.0)
X_train, X_test, y_train, y_test = train_test_split(features, outcomes,
                                                    test_size=0.2, random_state=42)
model = DecisionTreeClassifier(max_depth=50, min_samples_leaf=6, min_samples_split=2)
model.fit(X_train,y_train)
y_train_pred = model.predict(X_train)
#print (X_train.shape)
y_test_pred = model.predict(X_test)
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)
# Doing the same again to test another set of data
test_data = 'test.csv'
test_data1 = pd.read_csv(test_data)
test_data2 = pd.get_dummies(test_data1)
test_data3 = test_data2.fillna(0.0)
print(test_data2.shape)
# This call fails when test_data3's columns differ from the training columns
print(model.predict(test_data3))
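One possible way to handle the mismatch, sketched under assumptions rather than offered as the definitive fix: after calling get_dummies on the new data, reindex its columns to the columns the model was trained on. DataFrame.reindex with fill_value=0 adds any training column missing from the new data as all zeros and drops any column the model never saw. The toy frames, the labels in y, and the unseen category "X" below are hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, for illustration only.
train = pd.DataFrame({"Age": [22, 38, 26, 35], "Embarked": ["S", "C", "Q", "S"]})
y = pd.Series([0, 1, 1, 0])
new = pd.DataFrame({"Age": [35, 54], "Embarked": ["S", "X"]})  # "X" never seen in training

# Encode the training data and fit, as in the original code.
X_train = pd.get_dummies(train).fillna(0.0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y)

# Encode the new data, then force it onto the training columns:
# missing dummy columns are added as 0, unseen ones (Embarked_X) are dropped.
X_new = (
    pd.get_dummies(new)
    .reindex(columns=X_train.columns, fill_value=0)
    .fillna(0.0)
)

preds = model.predict(X_new)
print(X_new.columns.tolist())
print(preds.shape)
```

Applied to the code above, this would amount to reindexing test_data2 with columns=X_train.columns before calling model.predict. An alternative worth considering is scikit-learn's OneHotEncoder with handle_unknown="ignore", fitted on the training data only, which keeps the encoding and the model's expected columns in sync.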