Machine learning - test set has fewer features than the training set

Question


python machine-learning

3

Hi everyone. I'm developing a machine learning model and have a question. Suppose my training data looks like this:

ID | Animal | Age | Habitat
0 | Fish | 2 | Ocean
1 | Eagle | 1 | Mountain
2 | Fish | 3 | Ocean
3 | Snake | 4 | Forest

If I apply one-hot encoding, it produces the following matrix:

ID | Animal_Fish | Animal_Eagle | Animal_Snake | Age | ...
0 | 1 | 0 | 0 | 2 | ...
1 | 0 | 1 | 0 | 1 | ...
2 | 1 | 0 | 0 | 3 | ...
3 | 0 | 0 | 1 | 4 | ...

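This is neat and works in most cases. But what if my test set contains fewer (or more) features than the training set? What if my test set doesn't contain "Fish"? One-hot encoding it would then produce one fewer category column than the training set has.

Can you help me solve this problem?

Thanks.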

- Paulo Henrique Vasconcellos

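The Naive Bayes algorithm has a solution for this. You simply ignore the "missing values", and when you have extra data, you don't model it either. - Adam

Then the Fish feature should just have zeros all the way down. - blacksite

0 is not always equivalent to a missing value. - Adam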


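The test set should usually be a subset of the input dataset and have the same features. The same goes for live data that you want the algorithm to predict on. Some algorithms are more tolerant of missing features (e.g. random forests), but depending on how important the feature is, predictive performance will suffer. - miraculixx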

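@Adam Good point, but I don't think it matters in this case. If your training set has 3 categories for some discrete variable (e.g. "fish", "dog", and "cat") and your test set has only two of them (e.g. "fish" and "dog"), wouldn't it be easiest to add the "cat" feature to the test set and fill it with zeros? If you didn't "see" a cat in the test set but did see one in the training set, the cat feature should still be accounted for. A group being absent from one dataset doesn't rule out that the feature for that group exists. - blacksite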


2 Answers

0

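The training set determines which features you can use for classification. If you're lucky, your classifier will simply ignore unknown features (I believe NaiveBayes does this); otherwise you'll get an error. So save the set of feature names created during training, and use them during testing/classification.

Some classifiers treat a missing binary feature as a zero value. I believe that's what NLTK's NaiveBayesClassifier does, but other engines may have different semantics. So for binary present/absent features, I would write my feature extraction function so that it always puts the same keys in the feature dictionary.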
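A minimal sketch of that idea in plain Python, with no particular classifier library assumed. The helper names `fit_feature_names` and `extract_features` and the `animal_` key prefix are hypothetical, chosen for illustration only:

```python
def fit_feature_names(train_rows):
    """Collect the binary feature names seen during training (sorted for stability)."""
    return sorted({f"animal_{row['animal']}" for row in train_rows})


def extract_features(row, feature_names):
    """Build a feature dict that always contains every training-time key."""
    features = {name: 0 for name in feature_names}
    key = f"animal_{row['animal']}"
    if key in features:  # categories never seen in training are silently dropped
        features[key] = 1
    features["age"] = row["age"]
    return features


train_rows = [
    {"animal": "fish", "age": 2},
    {"animal": "eagle", "age": 1},
    {"animal": "fish", "age": 3},
    {"animal": "snake", "age": 4},
]
feature_names = fit_feature_names(train_rows)

# A row without "fish" still yields the same keys as any training row.
test_features = extract_features({"animal": "eagle", "age": 5}, feature_names)

# A category unseen in training ("bear") yields all-zero animal features.
unseen_features = extract_features({"animal": "bear", "age": 7}, feature_names)
```

Whether dropping an unseen category is acceptable depends on the model; raising an error instead is an equally reasonable choice.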

- alexis


- blacksite · Accepted Answer

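It sounds like you've kept your training and test sets completely separate. Here's a minimal example of how you might automatically add the "missing" features to a given dataset: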

import numpy as np
import pandas as pd

# Made-up training dataset
train = pd.DataFrame({'animal': ['cat', 'cat', 'dog', 'dog', 'fish', 'fish', 'bear'],
                      'age': [12, 13, 31, 12, 12, 32, 90]})

# Made-up test dataset (notice how two classes from train are missing entirely)
test = pd.DataFrame({'animal': ['fish', 'fish', 'dog'],
                     'age': [15, 62, 1]})

# Discrete column to be one-hot-encoded
col = 'animal'

# Create dummy variables for each level of `col`
train_animal_dummies = pd.get_dummies(train[col], prefix=col)
train = train.join(train_animal_dummies)

test_animal_dummies = pd.get_dummies(test[col], prefix=col)
test = test.join(test_animal_dummies)

# Find the columns that are present in `train` but missing from `test`.
# This works in the trivial case; to limit the comparison to the dummy
# columns of just one feature, use this instead:
#   f = lambda c: col in c
#   feature_difference = set(filter(f, train)) - set(filter(f, test))
feature_difference = set(train) - set(test)

# Create a zero-filled frame with as many rows as `test` and one column
# per category missing from `test` (i.e. the set difference between the
# relevant `train` and `test` columns)
feature_difference_df = pd.DataFrame(data=np.zeros((test.shape[0], len(feature_difference))),
                                     columns=list(feature_difference))

# Add the "missing" features back to `test`
test = test.join(feature_difference_df)

`test` goes from this:

   age animal  animal_dog  animal_fish
0   15   fish         0.0          1.0
1   62   fish         0.0          1.0
2    1    dog         1.0          0.0

to this:

   age animal  animal_dog  animal_fish  animal_cat  animal_bear
0   15   fish         0.0          1.0         0.0          0.0
1   62   fish         0.0          1.0         0.0          0.0
2    1    dog         1.0          0.0         0.0          0.0
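The same alignment can be sketched more compactly with `DataFrame.reindex`, which adds any columns missing from `test` and fills them with a chosen value. This is an alternative to the set-difference approach above, not part of it. It assumes the training columns are the reference, and it passes `dtype=int` so the dummies are integers regardless of pandas version. Note that `reindex` also drops any column that appears only in `test`:

```python
import pandas as pd

train = pd.DataFrame({'animal': ['cat', 'cat', 'dog', 'dog', 'fish', 'fish', 'bear'],
                      'age': [12, 13, 31, 12, 12, 32, 90]})
test = pd.DataFrame({'animal': ['fish', 'fish', 'dog'],
                     'age': [15, 62, 1]})

# One-hot encode the `animal` column of each frame independently.
train_encoded = pd.get_dummies(train, columns=['animal'], dtype=int)
test_encoded = pd.get_dummies(test, columns=['animal'], dtype=int)

# Force `test` onto the training columns: missing dummies become 0,
# and the column order matches `train_encoded`.
aligned_test = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
```

Reusing the training column order also matters for models that identify features by position rather than by name.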

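Assuming each row (each animal) can only be one animal, it's fine to add an `animal_bear` feature (a sort of "is-a-bear" test/feature), because if `test` had contained any bears, that information would have shown up in the `animal` column.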

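As a rule of thumb when modeling/training, account for all possible features (e.g. all possible values of `animal`) wherever you can. As mentioned in the comments, some methods handle missing data better than others, but if you can do this from the outset, it's probably a good idea. It does become difficult if you accept free-text input, since the number of possible inputs is then unbounded.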
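One way to fix the set of possible values up front is a pandas categorical dtype with an explicit category list, so that `pd.get_dummies` emits a column for every category even when a dataset doesn't contain it. This is a sketch under assumptions: the `ALL_ANIMALS` vocabulary and the `encode_animals` helper are hypothetical, and values outside the vocabulary are encoded as all zeros:

```python
import pandas as pd

# Assumed, hand-maintained vocabulary of every animal the model should know.
ALL_ANIMALS = ['bear', 'cat', 'dog', 'fish']


def encode_animals(df, categories=ALL_ANIMALS):
    """One-hot encode `animal` against a fixed category list."""
    # Blank out values that are not in the vocabulary so they encode as all zeros.
    known = df['animal'].where(df['animal'].isin(categories))
    animal = known.astype(pd.CategoricalDtype(categories=categories))
    # Unused categories still get their own (all-zero) column.
    dummies = pd.get_dummies(animal, prefix='animal', dtype=int)
    return df.drop(columns='animal').join(dummies)


test = pd.DataFrame({'animal': ['fish', 'fish', 'dog'],
                     'age': [15, 62, 1]})
encoded_test = encode_animals(test)

# "snake" is not in the vocabulary, so its row has no animal flag set.
encoded_unseen = encode_animals(pd.DataFrame({'animal': ['fish', 'snake'],
                                              'age': [3, 4]}))
```

Because the column set depends only on the vocabulary, training and test frames encoded this way line up without a separate alignment step.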