Can we make a machine learning model (pickle file) more robust by accepting (or ignoring) new features?

Question

python pandas machine-learning scikit-learn pickle

7

I have trained a machine learning model and stored it in a pickle file.
In my new script, I am reading new "real-world data" on which I want to make predictions.

However, I am struggling. I have a column (containing string values) like this:

Sex       
Male       
Female
# This is just an example; in reality the column has many more unique values

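Now comes the issue. I received a new (unique) value, and now I cannot make predictions anymore (e.g. 'Neutral' was added).

Since I am transforming the 'Sex' column into dummies, my model no longer accepts the input:

Number of features of the model must match the input. Model n_features is 2 and input n_features is 3

So my question: is there a way to make my model more robust, so that it simply ignores this class and still makes a prediction, just without that specific information?

What I have tried: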

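import pickle

import pandas as pd

df = pd.read_csv('dataset_that_i_want_to_predict.csv')
# Load the trained model; the context manager closes the file afterwards
with open("model_trained.sav", 'rb') as f:
    model = pickle.load(f)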

# I have an 'example_df' containing just 1 row of training data (this is exactly what the model needs)
example_df = pd.read_csv('reading_one_row_of_trainings_data.csv')

# Checking for missing columns, and adding that to the new dataset 
missing_cols = set(example_df.columns) - set(df.columns)
for column in missing_cols:
    df[column] = 0 #adding the missing columns, with 0 values (Which is ok. since everything is dummy)

# make sure that we have the same order 
df = df[example_df.columns] 

# The prediction will lead to an error!
results = model.predict(df)

# ValueError: Number of features of the model must match the input. Model n_features is X and n_features is Y

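Note: I searched, but could not find any helpful solution (not here, here or here).

Update: I also found this article, but it has the same issue. We can build the test set with the same columns as the training set, but what about new real-world data (e.g. the new value 'Neutral')?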

- R overflow

If you filter out the entries with "Neutral", do the other entries produce predictions without errors? - rickhg12hs

Hi Rick, yes. Since the column is converted into dummies, we have columns named 'Sex_Male' and 'Sex_Female'. It looks like the model accepts a row where both values are 0. - R overflow
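
The behaviour discussed in these two comments can be sketched as follows. This is only an illustration under assumptions: the tiny data frames are made up, and it assumes the model was trained on the output of `pd.get_dummies`. Reindexing the new dummies to the training columns drops the unseen 'Sex_Neutral' column and leaves an all-zero row for that entry.

```python
import pandas as pd

# Hypothetical training data; only the 'Sex' column matters for this sketch.
train = pd.DataFrame({"Sex": ["Male", "Female", "Female"]})
train_dummies = pd.get_dummies(train, columns=["Sex"], dtype=int)
train_columns = train_dummies.columns  # Sex_Female, Sex_Male

# New data containing a category that was never seen during training.
new = pd.DataFrame({"Sex": ["Male", "Neutral"]})
new_dummies = pd.get_dummies(new, columns=["Sex"], dtype=int)  # Sex_Male, Sex_Neutral

# Align to the training layout: unseen dummy columns are dropped,
# missing ones are added with 0, and the column order matches training.
aligned = new_dummies.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

The 'Neutral' row ends up with 0 in both 'Sex_Male' and 'Sex_Female', so the feature count matches what the model expects. In practice the training column list would have to be saved alongside the model.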

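A quick fix (though not really recommended) is to create another category, "Other", in your training data, possibly generating some artificial data for the other features from your dataset. Whenever you get anything other than "Male" or "Female" in the "Sex" feature, you can preprocess it to "Other" and feed it to the model. However, this is not a good approach, because it will not capture the intended signal well and may hurt model performance. The simpler and more reliable approach is to keep such nominal features fixed and not accept "Other" values for "Sex". - null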
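
The preprocessing step described in this comment might look like the sketch below. The helper `bucket_unknown` and the list `KNOWN_SEX_VALUES` are hypothetical names introduced for illustration, and the approach only makes sense if the model was actually trained with an "Other" category, as the comment points out.

```python
import pandas as pd

# Hypothetical: the categories that were present at training time.
KNOWN_SEX_VALUES = ["Male", "Female"]


def bucket_unknown(series, known, other_label="Other"):
    """Replace every value that is not in `known` with `other_label`."""
    return series.where(series.isin(known), other_label)


new = pd.DataFrame({"Sex": ["Male", "Neutral", "Female", "Unknown"]})
new["Sex"] = bucket_unknown(new["Sex"], KNOWN_SEX_VALUES)
print(new["Sex"].tolist())
```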

1 Answer


- Venkatachalam · Accepted Answer

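Yes, you cannot include a new category or feature in the dataset (that is, update the model) once training is done. OneHotEncoder, however, can handle new categories that appear inside a feature of the test data. It takes care of keeping the columns consistent between your training and test data with respect to categorical variables.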

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
from sklearn import set_config
set_config(print_changed_only=True)
df = pd.DataFrame({'feature_1': np.random.rand(20),
                   'feature_2': np.random.choice(['male', 'female'], (20,))})
target = pd.Series(np.random.choice(['yes', 'no'], (20,)))

model = Pipeline([('preprocess',
                   ColumnTransformer([('ohe',
                                       OneHotEncoder(handle_unknown='ignore'), [1])],
                                       remainder='passthrough')),
                  ('lr', LogisticRegression())])

model.fit(df, target)

# let us introduce new categories in feature_2 in test data
test_df = pd.DataFrame({'feature_1': np.random.rand(20),
                        'feature_2': np.random.choice(['male', 'female', 'neutral', 'unknown'], (20,))})
model.predict(test_df)
# Example output (varies between runs, because features and target are random):
# array(['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#       'yes', 'yes'], dtype=object)
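
Since the question is about a pickled model, one possible way to connect the two is to pickle the whole pipeline, so that the encoder travels with the estimator. The sketch below is an illustration only: the data are made up, the pickle round-trip happens in memory rather than through a file, and the column is selected by name instead of by position. It also shows that an unseen category is encoded as all zeros in the one-hot columns.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training data for illustration.
rng = np.random.default_rng(0)
train_df = pd.DataFrame({
    "feature_1": rng.random(20),
    "feature_2": ["male", "female"] * 10,
})
target = pd.Series(["yes", "no"] * 10)

model = Pipeline([
    ("preprocess", ColumnTransformer(
        [("ohe", OneHotEncoder(handle_unknown="ignore"), ["feature_2"])],
        remainder="passthrough")),
    ("lr", LogisticRegression()),
])
model.fit(train_df, target)

# Round-trip through pickle: the fitted encoder is stored with the model.
restored = pickle.loads(pickle.dumps(model))

# 'neutral' was never seen during training.
new_df = pd.DataFrame({"feature_1": [0.5, 0.5],
                       "feature_2": ["male", "neutral"]})
encoded = restored.named_steps["preprocess"].transform(new_df)
if hasattr(encoded, "toarray"):  # the transformer may return a sparse matrix
    encoded = encoded.toarray()
predictions = restored.predict(new_df)

print(encoded)      # one-hot columns (female, male) followed by feature_1
print(predictions)
```

With this layout the prediction script only needs the raw columns; no separate dummy-encoding step has to be kept in sync with the training code.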