Keep the same dummy variables in training and testing data

Question

python dataframe scikit-learn prediction dummy-variable

53

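I am building a prediction model in Python with two separate training and testing datasets. The training data contains numeric-type categorical variables, e.g. zip code [91521, 23151, 12355, ...], and string-type categorical variables, e.g. city ['Chicago', 'New York', 'Los Angeles', ...].

To train the model, I first use 'pd.get_dummies' to get dummy variables for these columns, and then fit the model on the transformed training data.

I apply the same transformation to my test data and predict with the trained model. However, I get this error: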

ValueError: Number of features of the model must match the input. Model n_features is 1487 and input n_features is 1345

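The reason is that the test data has fewer dummy variables, because it contains fewer distinct 'city' and 'zipcode' values. How can I solve this problem? For instance, 'OneHotEncoder' will only encode numeric-type categorical variables, and 'DictVectorizer()' will only encode string-type categorical variables. I searched online and saw a few similar questions, but none of them really answers my question.

Handling categorical features using scikit-learn

If the training dataset has more variables than the test dataset, what does one do?

What is the best way to do binary one-hot / one-of-K coding in Python?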

- nimning

7 Answers

27

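Assuming the training and test sets have the same feature names, you can concatenate the two, get dummy variables from the concatenated dataset, and then split it back into training and test sets.

You can do it this way: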

import pandas as pd
train = pd.DataFrame(data = [['a', 123, 'ab'], ['b', 234, 'bc']],
                     columns=['col1', 'col2', 'col3'])
test = pd.DataFrame(data = [['c', 345, 'ab'], ['b', 456, 'ab']],
                     columns=['col1', 'col2', 'col3'])
train_objs_num = len(train)
dataset = pd.concat(objs=[train, test], axis=0)
dataset_preprocessed = pd.get_dummies(dataset)
train_preprocessed = dataset_preprocessed[:train_objs_num]
test_preprocessed = dataset_preprocessed[train_objs_num:]

As a result, you have the same number of features in the training and test sets.

- Eduard Ilyasov

24

What about unseen test data? Concatenate and retrain the model? That does not seem like a viable option. - randomSampling

1

@randomSampling, did you find a solution? If so, could you please take a look at this question? - R overflow

22

train2,test2 = train.align(test, join='outer', axis=1, fill_value=0)

train2 and test2 have the same columns. fill_value is the value used to fill the missing columns.
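A minimal sketch of how `align` behaves on dummy-encoded frames, added for illustration; the toy `city` data and the variable names are assumptions, not part of the original answer. Note that `join='outer'` keeps the union of columns, so categories seen only in the test set become all-zero columns in the training frame, while `join='left'` keeps only the training columns.

```python
import pandas as pd

# Toy data (assumed for illustration): "Boston" appears only in the test set.
train = pd.DataFrame({"city": ["Chicago", "New York", "Los Angeles"]})
test = pd.DataFrame({"city": ["New York", "Boston"]})

# Encode each set separately; dtype=int gives 0/1 instead of booleans.
train_d = pd.get_dummies(train, dtype=int)
test_d = pd.get_dummies(test, dtype=int)

# join="outer": both frames get the union of all dummy columns.
train_o, test_o = train_d.align(test_d, join="outer", axis=1, fill_value=0)

# join="left": both frames get exactly the training columns;
# test-only columns such as city_Boston are dropped.
train_l, test_l = train_d.align(test_d, join="left", axis=1, fill_value=0)

print(list(test_o.columns))
print(list(test_l.columns))
```

If the model was already fitted on `train_d`, the `join='left'` variant is probably the one that matches its expected feature set, since the outer join would add columns the model has never seen.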

- user1482030

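In the training data, the column 'Marital_Status' becomes 'Marital_Status_Single, Marital_Status_Married, Marital_Status_Divorced', but in the test data it is still 'Marital_Status' with the value 'Single'. How do I fill exactly the column 'Marital_Status_Single' with 1 and the other two with 0? - hanzgs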

1

@hanzgs, this is late, but for the benefit of others: one-hot encode the test data with "pd.get_dummies(test)" before aligning the training and test sets. - rmswrp

6

I have done this in the past, after running get_dummies on both the training and test sets:

X_test = X_test.reindex(columns = X_train.columns, fill_value=0)

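Obviously, some tweaking is needed for individual cases. As written, it discards values that are new in the test set and fills values that are missing from the test set with zeros.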
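The reindex approach can be sketched end to end as follows. The data mirrors the zip/city example from the question; `cat_cols` and the `_d` suffixes are names chosen here for illustration. Passing `columns=` to `get_dummies` is what makes it encode the numeric zip codes, which it would otherwise leave untouched.

```python
import pandas as pd

X_train = pd.DataFrame({"zip": [23151, 12355],
                        "city": ["New York", "Los Angeles"]})
X_test = pd.DataFrame({"zip": [91521, 23151],
                       "city": ["Chicago", "New York"]})

# Name the categorical columns explicitly so the integer zip codes
# are dummy-encoded too.
cat_cols = ["zip", "city"]
X_train_d = pd.get_dummies(X_train, columns=cat_cols, dtype=int)
X_test_d = pd.get_dummies(X_test, columns=cat_cols, dtype=int)

# Force the test frame onto the training columns: unseen categories
# (zip_91521, city_Chicago) are dropped, missing ones are filled with 0.
X_test_d = X_test_d.reindex(columns=X_train_d.columns, fill_value=0)

print(X_test_d)
```

The first test row (91521, Chicago) ends up all zeros, because neither value was seen during training.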

- demongolem

4

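This is a rather old question, but if you want to use the scikit-learn API, you can use the DummyEncoder class: https://gist.github.com/psinger/ef4592492dc8edf101130f0bf32f5ff9. It uses the category dtype to specify which dummy variables to create, which is also explained in detail here: Dummy creation in pipeline with different levels in train and test set.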
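The category-dtype idea can be illustrated without the linked class. The sketch below is not the DummyEncoder from the gist; it only shows the underlying pandas behavior, with toy data and names assumed for illustration. When a column has a categorical dtype, `get_dummies` creates one column per declared category, whether or not that category occurs in the data.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

train = pd.DataFrame({"city": ["Chicago", "New York", "Los Angeles"]})
test = pd.DataFrame({"city": ["New York", "Boston"]})

# Fix the set of levels from the training data only.
city_dtype = CategoricalDtype(categories=sorted(train["city"].unique()))

# Casting the test column turns unseen values ("Boston") into NaN,
# which get_dummies encodes as an all-zero row by default.
train_d = pd.get_dummies(train.astype({"city": city_dtype}), dtype=int)
test_d = pd.get_dummies(test.astype({"city": city_dtype}), dtype=int)

print(list(test_d.columns))
```

Both frames now share the same columns in the same order, and the dtype object can be stored alongside the model and reused on later batches.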

- fsociety

1

For sklearn >= 0.20, OneHotEncoder can now encode string data.

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

X_train = pd.DataFrame({
    'zip' : [23151, 12355],
    'city' : ['New York', 'Los Angeles']
})

X_test = pd.DataFrame({
    'zip' : [91521, 23151],
    'city' : ['Chicago', 'New York']
})

ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False) # New in version 1.2: sparse was renamed to sparse_output
X_train_ohe = ohe.fit_transform(X_train)
X_test_ohe = ohe.transform(X_test)

To get a clean DataFrame with the corresponding column names (similar to pd.get_dummies), you can do the following:

cols_ohe = ohe.get_feature_names_out()
X_train_ohe = pd.DataFrame(X_train_ohe, columns=cols_ohe)
X_test_ohe = pd.DataFrame(X_test_ohe, columns=cols_ohe)

>>> X_train_ohe 
zip_12355   zip_23151   city_Los Angeles    city_New York
0.0         1.0         0.0                 1.0
1.0         0.0         1.0                 0.0

>>> X_test_ohe 
zip_12355   zip_23151   city_Los Angeles    city_New York
0.0         0.0         0.0                 0.0
0.0         1.0         0.0                 1.0

- Mattravel

0

Convert the zip code to str.

Use fit_transform() of OneHotEncoder on the training data and transform() on the test data.

- Gokul Patel

2

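Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. - Community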

- Thibault Clement · Accepted Answer

You can also just get the missing columns and add them to the test dataset:

# Get the columns that are in the training set but missing from the test set
missing_cols = set( train.columns ) - set( test.columns )
# Add each missing column to the test set with a default value of 0
for c in missing_cols:
    test[c] = 0
# Ensure the columns in the test set are in the same order as in the training set
test = test[train.columns]

This code also ensures that columns produced by categories that appear in the test dataset but not in the training dataset are removed.
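For a self-contained run of the same steps, here is a sketch that wraps them in a helper. The function name `add_missing_dummy_columns` and the toy data are hypothetical, chosen for illustration; the logic follows the snippet above, except that it works on a copy instead of modifying the test frame in place.

```python
import pandas as pd

def add_missing_dummy_columns(train_d, test_d):
    """Return a copy of test_d with exactly the columns of train_d, in order."""
    test_d = test_d.copy()
    # Columns the model was trained on but the test set lacks.
    missing_cols = set(train_d.columns) - set(test_d.columns)
    for c in missing_cols:
        test_d[c] = 0
    # Selecting by train_d.columns reorders the columns and also drops
    # any test-only columns.
    return test_d[train_d.columns]

train = pd.DataFrame({"col1": ["a", "b"], "col3": ["ab", "bc"]})
test = pd.DataFrame({"col1": ["c", "b"], "col3": ["ab", "ab"]})

train_d = pd.get_dummies(train, dtype=int)
test_d = add_missing_dummy_columns(train_d, pd.get_dummies(test, dtype=int))

print(test_d)
```

Here `col1_c` exists only in the test set and is dropped, while `col1_a` and `col3_bc` are added as zero columns.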