fastai tabular model - how to get predictions for new data?

Question

6

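I am using the Kaggle House Prices dataset, which is split into a train set and a test set.

I built a fastai tabular model using the train set.
How do I predict values for the test set?

I know this sounds simple, and most other libraries would do it with something like model.predict(test), but that is not the case here. I searched the fastai forums, SO, and the docs. There are quite a few threads on this issue, and most of them either have no answer or describe outdated workarounds (fastai2 was released recently and is now just called fastai).

a. model.predict only works on a single row, and looping over the test set is not a good option. It is very slow.

b. model.get_preds gives results for the data you trained on.

Please suggest how to predict a new df of tabular data with a trained learner.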

- Oleg Peregudov

2 Answers

2

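I found the problem. For future readers: why can't you get get_preds to work on a new df?

(Tested on Kaggle's House Prices: Advanced Regression Techniques.)

The root of the problem was categorical NaNs. If you train your model with one set of cat features (say color = red, green, blue) and your new df has colors red, green, blue, black, it will throw an error because it does not know what to do with the new class (black). Not to mention that you need the same columns everywhere, which can be tricky: if you use the FillMissing proc like I did, it creates new columns for cat values (missing or not). So you need to triple-check for NaNs in those cats.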
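
To see which categorical columns are affected before building anything, a quick pandas check can list the test values that never occur in train. This is only a sketch: `unseen_categories` is a hypothetical helper name, and the toy `color` frames stand in for the real train/test data.

```python
import pandas as pd


def unseen_categories(train, test, cat_cols):
    """Return {column: sorted list of test values that never appear in train}."""
    report = {}
    for col in cat_cols:
        seen = set(train[col].dropna().unique())
        new = set(test[col].dropna().unique()) - seen
        if new:
            report[col] = sorted(new)
    return report


# Toy data mirroring the color example above (not the real dataset).
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "black", None]})

report = unseen_categories(train, test, ["color"])
print(report)
```

Columns that show up in the report are the ones where test contains classes the model never saw during training.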

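I really wanted to make it work in fastai from start to finish:

The train/test columns are identical, except that train has 1 extra column, the target. At this point some cat cols contain different classes. I decided to combine them (just to make it work), but does that introduce leakage?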

import numpy as np
import pandas as pd
from fastai.tabular.all import *

combined = pd.concat([train, test]) # test will have nans at target, but we don't care
cont_cols, cat_cols = cont_cat_split(combined, max_card=50)
combined = combined[cat_cols]

A few tweaks along the way.

train[cont_cols] = train[cont_cols].astype('float') # if target is not float, there will be an error later
test[cont_cols[:-1]] = test[cont_cols[:-1]].astype('float') # slice target off (I had mine at the end of cont_cols)

Now it goes into TabularPandas successfully:

procs = [Categorify, FillMissing]

to = TabularPandas(combined,
                   procs = procs,
                   cat_names = cat_cols)

train_to_cat = to.items.iloc[:train.shape[0], :] # transformed cat for train
test_to_cat = to.items.iloc[train.shape[0]:, :] # transformed cat for test. Need to separate them

to.items gives us the transformed categorical columns. After that, we need to put everything back together.

train_imp = pd.concat([train_to_cat, train[cont_cols]], axis=1) # assemble new cat and old cont together
test_imp = pd.concat([test_to_cat, test[cont_cols[:-1]]], axis=1) # exclude SalePrice

train_imp['SalePrice'] = np.log(train_imp['SalePrice']) # metric for kaggle
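
The log transform is there because the competition scores the RMSE between the logarithm of the predicted price and the logarithm of the observed price, so training on log(SalePrice) with an MSE loss targets that metric, and np.exp undoes the transform before submission. A minimal sketch of that round trip, where `rmse_of_logs` is a hypothetical helper and the prices are made up:

```python
import numpy as np


def rmse_of_logs(y_true, y_pred):
    """RMSE between log(prediction) and log(observed value)."""
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))


prices = np.array([100000.0, 200000.0])  # made-up sale prices
log_prices = np.log(prices)              # what the model is trained on
restored = np.exp(log_prices)            # what goes into the submission

print(rmse_of_logs(prices, restored))
```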

After that, we follow the fastai tutorial.

dep_var = 'SalePrice'
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter(valid_pct=0.2)(range_of(train_imp))

to = TabularPandas(train_imp, 
                   procs = procs,
                   cat_names = cat_cols,
                   cont_names = cont_cols[:-1], # we need to exclude target
                   y_names = 'SalePrice',
                   splits=splits)

dls = to.dataloaders(bs=64)

learn = tabular_learner(dls, n_out=1, loss_func=F.mse_loss)
learn.lr_find()

learn.fit_one_cycle(20, slice(1e-2, 1e-1), cbs=[ShowGraphCallback()])

At this point we have a learner but still cannot predict. I thought that after we do:

dl = learn.dls.test_dl(test_imp, bs=64)
preds, _ = learn.get_preds(dl=dl) # get prediction

it would just work (preprocessing of the cont values and prediction), but it does not fill the NaNs. So just find and fill the NaNs in test:

missing = test_imp.isnull().sum().sort_values(ascending=False).head(12).index.tolist()
for c in missing:
    test_imp[c] = test_imp[c].fillna(test_imp[c].median())
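
The loop above fills with medians computed on the test set and only touches the 12 columns with the most NaNs. A possible alternative, sketched here under the assumption that reusing training statistics is preferable, is to fill every continuous column in test with the median from train. `fill_from_train` is a hypothetical helper, and the two columns and their values are toy data:

```python
import numpy as np
import pandas as pd


def fill_from_train(train, test, cont_cols):
    """Fill NaNs in test's continuous columns with medians computed on train."""
    medians = train[cont_cols].median()  # one median per column
    filled = test.copy()                 # leave the caller's frame untouched
    filled[cont_cols] = filled[cont_cols].fillna(medians)
    return filled


train = pd.DataFrame({"LotArea": [100.0, 200.0, 300.0],
                      "GarageArea": [10.0, 20.0, 60.0]})
test = pd.DataFrame({"LotArea": [np.nan, 150.0],
                     "GarageArea": [5.0, np.nan]})

filled = fill_from_train(train, test, ["LotArea", "GarageArea"])
print(filled)
```

Applied to the data in this answer, the call would presumably look like fill_from_train(train_imp, test_imp, cont_cols[:-1]).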

After that we can finally predict:

dl = learn.dls.test_dl(test_imp, bs=64)
preds, _ = learn.get_preds(dl=dl) # get prediction

final_preds = np.exp(preds.flatten()).tolist()

sub = pd.read_csv('../input/house-prices-advanced-regression-techniques/sample_submission.csv')
sub.SalePrice = final_preds

filename = 'submission.csv'
sub.to_csv(filename, index=False)

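Sorry for the long narrative, but I am relatively new to coding and this problem was hard to pin down. There is very little information online about how to solve it. In short, it was a pain.

Unfortunately, this is still a workaround. If the number of classes in any feature differs in test, it will break. It is also strange that the NaN values were not filled when fitting the test set to dls.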
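
One possible explanation for the unfilled values, as far as I can tell from the fastai source, is that FillMissing only learns fill values for columns that had missing values in the training data, so a column that is complete in train but has NaNs in test is left alone. The sketch below lists such columns with plain pandas; `nan_only_in_test` is a hypothetical helper and the frames are toy data:

```python
import numpy as np
import pandas as pd


def nan_only_in_test(train, test, cols):
    """Columns that contain NaN in test but have no NaN at all in train."""
    train_has_nan = train[cols].isnull().any()
    test_has_nan = test[cols].isnull().any()
    return [c for c in cols if test_has_nan[c] and not train_has_nan[c]]


train = pd.DataFrame({"a": [1.0, np.nan], "b": [1.0, 2.0]})
test = pd.DataFrame({"a": [np.nan, 1.0], "b": [np.nan, 3.0]})

problem_cols = nan_only_in_test(train, test, ["a", "b"])
print(problem_cols)
```

Any column returned here would need to be filled by hand before calling test_dl.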

If you have any suggestions you are willing to share, please let me know.

- Oleg Peregudov


- manju-dev · Accepted Answer

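model.get_preds is used to get batch predictions on unseen data. You only need to apply the same transformations to the new data as you did to the training data.

dl = model.dls.test_dl(test_data, bs=64) # apply transforms
preds, _ = model.get_preds(dl=dl) # get prediction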

The fastai forums are quite active and you may get a response from the library developers, so try asking there too in the future.