How to handle levels not trained on at prediction time?

Question

I have a set of CSV files with a result column for training, and a set of test CSV files without the result column.

library(h2o)
h2o.init()

train <- read.csv(train_file, header = TRUE)
train.h2o <- as.h2o(train)
y <- "Result"
x <- setdiff(names(train.h2o), y)

model <- h2o.deeplearning(x = x,
                          y = y,
                          training_frame = train.h2o,
                          model_id = "my_model",
                          epochs = 5000,
                          hidden = c(50),
                          stopping_rounds=5,
                          stopping_metric="misclassification", 
                          stopping_tolerance=0.001,
                          seed = 1)



test <- read.csv(test_file, header = TRUE)
test.h2o <- as.h2o(test)

pred <- h2o.predict(model,test.h2o)

When I try to predict results for the test data, I get a bunch of messages like this:

1: In doTryCatch(return(expr), name, parentenv, handler) :
Test/Validation dataset column 'ColumnName' has levels not trained on: [ABCD, BCDE]

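H2O used to be able to handle levels that appear in the test data but not in the training data. I found some posts online where people say this works for them, but it does not work for me.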

How can I avoid these errors and predict values for the test data?

- user7792598

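Can you turn this into a reproducible example using a publicly available dataset? H2O should ignore new levels, so I don't know what is going on here (I have not seen this error before). If this is a bug, we would like to be able to reproduce it so that we can fix it. Thanks. - Erin LeDell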

1 Answer

- Sixiang.Hu · Accepted Answer

There are two approaches you can try:

Use `factor` instead of `character`

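Before feeding the data into the machine learning function, you can combine the training and test data and convert the character variables to factors.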

That way, even after you split the combined data again, every unique value is still recorded in the factor's level information.

library(h2o)

h2o.init()

#using dummy data as combined training and testing data
prostatePath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = h2o.importFile(path = prostatePath, destination_frame = "prostate.hex")

#assuming GLEASON is the character variable, and transform it to factor
prostate.hex$GLEASON <- h2o.asfactor(prostate.hex$GLEASON)

#split data such that 0,4,5,8 only in test set, and not in train set.
h2o.test <- prostate.hex[prostate.hex$GLEASON %in% c("0","4","5","8"),]
h2o.train <- prostate.hex[!prostate.hex$GLEASON %in% c("0","4","5","8"),]

#train model
model <- h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS","GLEASON"), training_frame = h2o.train,
       family = "binomial", nfolds = 0)

#predict without error
pred <- predict(model,h2o.test)

Use `one-hot-encoding` explicitly

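As far as I know, the h2o machine learning functions offer built-in encoding methods (through a parameter), including one called one-hot encoding, which turns a character variable into a set of 1/0 integer variables.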

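Instead of relying on this technique implicitly, you can apply it explicitly. Levels that do not exist in the training data then never enter the model, and new levels in the test data are simply not used for prediction.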
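A minimal sketch of explicit one-hot encoding in base R, to illustrate the idea. The toy `color` column and the `one_hot()` helper are assumptions made up for this example and are not part of h2o. The dummy columns are built from the training levels only, so a test row with an unseen level ends up with zeros in every dummy column.

```r
# Hypothetical toy data: "purple" appears in test but never in train.
train <- data.frame(color = c("red", "green", "red", "blue"),
                    stringsAsFactors = FALSE)
test  <- data.frame(color = c("green", "purple", "blue"),
                    stringsAsFactors = FALSE)

# Levels seen in training; these alone define the dummy columns.
train_levels <- sort(unique(train$color))

# one_hot() is a hypothetical helper written for this sketch.
one_hot <- function(values, levels, prefix) {
  # Compare every value against every training level (logical matrix).
  m <- outer(values, levels, FUN = "==")
  # Turn TRUE/FALSE into 1/0 integers while keeping the matrix shape.
  storage.mode(m) <- "integer"
  colnames(m) <- paste(prefix, levels, sep = "_")
  as.data.frame(m)
}

train_enc <- one_hot(train$color, train_levels, "color")
test_enc  <- one_hot(test$color,  train_levels, "color")

print(test_enc)
```

In a real workflow you would bind these 0/1 columns to the remaining predictors, drop the original character column, and pass the result to `as.h2o()`. Because the encoded columns are numeric, H2O should have no factor levels to compare between the training and test frames.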