How can I predict on test data with a GAM that uses an MRF smooth and a neighborhood structure?


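I am having trouble using the predict() function on a (trained) mgcv::gam model, because I have included an MRF smooth to account for the spatial nature of my data.

I create my GAM model with the following call: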

## Run GAM with MRF
m <- gam(crime ~ s(district,k=nrow(traindata),
                 bs ='mrf',xt=list(nb=nbtrain)), #define MRF smooth
     data = traindata,
     method = 'REML', 
     family = scat(), #fit scaled t distribution
     gamma = 1.4
)

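I predict the dependent variable crime using the neighborhood structure, which is passed to the model through the xt argument of the smooth term. The neighborhood structure is supplied as an nb object, which I create with the poly2nb() function.

Now, if I want to use predict() on a new test dataset, I do not know how to pass the corresponding neighborhood structure to the call. Supplying only the new data is not enough: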

pred <- predict.gam(m,newdata=testdata)

throws the following error:

Error in predict.gam(m, newdata = testdata) :
7, 16, 20, 28, 35, 36, 37, 43 not in original fit
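
The numbers in the message appear to be district levels that occur in testdata but were not present in the data the model was fitted on. A minimal base-R sketch of that check, using made-up district IDs (train_district and test_district are illustrative stand-ins, not the Columbus data):

```r
# Hypothetical district IDs standing in for traindata$district and testdata$district
train_district <- factor(c(0, 1, 2, 3, 5, 6))
test_district  <- factor(c(4, 7, 2))

# Levels present in the test sample but unknown to the fitted model
missing_levels <- setdiff(levels(test_district), levels(train_district))
print(missing_levels)
```

Running the same setdiff() on the real traindata$district and testdata$district should list the districts named in the error.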

Here is a complete reproduction of the error, using the Columbus dataset, that can be run directly in R:
#ERROR REPRODUCTION

## Load packages
require(mgcv)
require(spdep)
require(dplyr)

## Load Columbus Ohio crime data (see ?columbus for details and credits)
data(columb.polys) #Columbus district shapes list
columb.polys <- lapply(columb.polys,na.omit) #omit NAs (unfortunate problem with the Columbus sample data)
data(columb) #Columbus data frame

df <- data.frame(district=numeric(0),x=numeric(0),y= numeric(0)) #Create empty df to store x, y and IDs for each polygon

## Extract x and y coordinates from each polygon and assign district ID
for (i in 1:length(columb.polys)) {
  district <- i-1
  x <- columb.polys[[i]][,1]
  y <- columb.polys[[i]][,2]
  df <- rbind(df,cbind(district,x,y)) #Save in df data.frame
}

## Convert df into SpatialPolygons
sp <- df %>%
       group_by(district) %>%
       do(poly=select(., x, y) %>%Polygon()) %>%
       rowwise() %>%
       do(polys=Polygons(list(.$poly),.$district)) %>%
       {SpatialPolygons(.$polys)}

## Merge SpatialPolygons with data
spdf <- SpatialPolygonsDataFrame(sp,columb)

## Split into training and test sample (80/20 ratio)
splt <- sample(1:2,size=nrow(spdf),replace=TRUE,prob=c(0.8,0.2))
train <- spdf[splt==1,] 
test <- spdf[splt==2,]

## Prepare both samples and create NB objects
traindata <- train@data #Extract data from SpatialPolygonsDataFrame
testdata <- test@data
traindata <- droplevels(as(train, 'data.frame')) #Drop levels
testdata <- droplevels(as(test, 'data.frame'))
traindata$district <- as.factor(traindata$district) #Factorize
testdata$district <- as.factor(testdata$district)
nbtrain <- poly2nb(train, row.names=train$Precinct, queen=FALSE) #Create NB objects for training and test sample
nbtest <- poly2nb(test, row.names=test$Precinct, queen=FALSE)
names(nbtrain) <- attr(nbtrain, "region.id") #Set region.id
names(nbtest) <- attr(nbtest, "region.id")

## Run GAM with MRF
m <- gam(crime ~ s(district, k=nrow(traindata), bs = 'mrf',xt = list(nb = nbtrain)), # define MRF smooth
         data = traindata,
         method = 'REML', # fast version of REML smoothness selection; alternatively 'GCV.Cp'
         family = scat(), #fit scaled t distribution
         gamma = 1.4
)

## Run prediction using new testing data
pred <- predict.gam(m,newdata=testdata)

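I haven't looked into this in depth; according to this discussion on R-Help from a few years ago, Simon Wood suggested fitting the model with the full data but giving the training observations zero weight. - Gavin Simpson
When I said the training observations should have zero weight, I meant that the test observations should have zero weight. Sorry! - Gavin Simpson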
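Note that I'm not even sure that building the penalty matrix by hand would work. Try adding a vector "ind" to the data that indicates whether a row belongs to the training or the test sample, then create a weight vector "wt <- ind / mean(ind)", and pass "weights = wt" in your "gam()" call. This normalizes the weights so that they do not change the log-likelihood of the model. - Gavin Simpson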
@GavinSimpson First of all, thanks for your help! I followed your solution with the normalized weights, but it produces warnings: Warning messages: 1: In gam.fit4(x, y, sp, Eb, UrS = UrS, weights = weights,...:Non-finite coefficients at iteration 2 13: In newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS,... : Fitting terminated with step failure - check results carefully. The result also jumps to 100% explained variance. However, if I use the non-normalized weights ind, it works. Do we really need to normalize here? - Konstantin Klemmer
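@KonstantinKlemmer Ah, right; I think in this case you do need the non-normalized weights, because you want the locations with no data to have 0 weight. (I normalize by default because I have recently been fitting GAMs with weights, and without normalization I get standard errors that are too small.) Glad you have a working version! - Gavin Simpson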
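
The two weighting schemes discussed in these comments can be compared with a small base-R sketch. The indicator ind below is made up for illustration (1 = training row, 0 = test row); the snippet only shows the arithmetic and does not fit a model:

```r
# Hypothetical indicator: 1 = training row, 0 = test row
ind <- c(1, 1, 0, 1, 0, 1, 1, 1, 0, 1)

# Normalized weights as first suggested: they average to 1
wt <- ind / mean(ind)

# Compare the raw indicator with the rescaled weights, row by row
print(rbind(raw = ind, normalized = round(wt, 3)))
print(mean(wt))
```

Either vector could be passed as weights = in gam(); in this thread the non-normalized ind was the version that fitted without warnings.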
2 Answers


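Solution:

I finally took the time to update this post and provide a solution. Thanks to everyone for the help. Below is the code that implements k-fold CV using random train/test splits. Here data is the full data frame and nb is the neighborhood structure built from all districts: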

#Apply k-fold cross validation
mses <- data.frame() #Create empty df to store CV squared error values
scores <- data.frame() #Create empty df to store CV R2 values
set.seed(42) #Set seed for reproducibility
k <- 10 #Define number of folds
for (i in 1:k) {
  # Create weighting column
  data$weight <- sample(c(0,1),size=nrow(data),replace=TRUE,prob=c(0.2,0.8)) #0 Indicates testing sample, 1 training sample

  #Run GAM with MRF
  ctrl <- gam.control(nthreads = 6) #Set controls
  m <- gam(crime ~ s(district, k=nrow(data), bs = 'mrf',xt = list(nb = nb)), #define MRF smooth
            data = data,
            weights = data$weight, #Use only weight==1 observations (training)
            method = 'REML', 
            control = ctrl,
            family = scat(), 
            gamma = 1.4
           )
  #Generate test dataset
  testdata <- data[data$weight==0,] #Select test data by weight
  #Predict test data
  pred <- predict(m,newdata=testdata)
  #Extract MSES
  mses[i,1] <- mean((data$crime[data$weight==0] - pred)^2) #Compare against the modelled response (crime)
  scores[i,1] <- summary(m)$r.sq
}
av.mse.GMRF <- mean(mses$V1)
av.r2.GMRF <- mean(scores$V1)
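
The loop above draws a fresh random 80/20 split on every iteration, so a given row can land in the test sample several times or never. If disjoint folds are preferred, one possible way to build the weight column is sketched below. The names n, fold, and weight_for_fold are illustrative and not part of the original answer, and n = 49 is assumed to be the number of rows in the data:

```r
set.seed(42)
n <- 49   # assumed number of districts (rows of the data frame)
k <- 10   # number of folds

# Assign every row to exactly one of k folds, in random order
fold <- sample(rep(seq_len(k), length.out = n))

# Weight vector for fold i: 0 = held-out test row, 1 = training row
weight_for_fold <- function(i) as.numeric(fold != i)

print(table(fold))
print(weight_for_fold(1))
```

Inside the loop, data$weight <- weight_for_fold(i) would then take the place of the sample() call.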

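I have one criticism of the current solution: the entire dataset is used to "train" the model, which means the predictions will be biased because the test data are used in training.
This can be fixed with a few small adjustments: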
#Apply k-fold cross validation
mses <- data.frame() #Create empty df to store CV squared error values
scores <- data.frame() #Create empty df to store CV R2 values
set.seed(42) #Set seed for reproducibility
k <- 10 #Define number of folds

#For loop for each fold
for (i in 1:k) {

  # Create weighting column
  data$weight <- sample(c(0,1),size=nrow(data),replace=TRUE,prob=c(0.2,0.8)) #0 Indicates testing sample, 1 training sample

  #Generate training dataset
  trainingdata <- data[data$weight == 1, ] #Select training data by weight

  #Generate test dataset
  testdata <- data[data$weight == 0, ] #Select test data by weight


  #Run GAM with MRF
  ctrl <- gam.control(nthreads = 6) #Set controls
  m <- gam(crime ~ s(district, k=nrow(data), bs = 'mrf',xt = list(nb = nb)), #define MRF smooth
            data    = trainingdata,
            weights = trainingdata$weight, #Must have one entry per row of trainingdata (all 1 here)
            method  = 'REML', 
            control = ctrl,
            family  = scat(), 
            gamma   = 1.4
           )

  #Predict test data
  pred <- predict(m,newdata = testdata)

  #Extract MSES
  mses[i,1] <- mean((testdata$crime - pred)^2) #Compare against the modelled response (crime)
  scores[i,1] <- summary(m)$r.sq
}

#Get average scores from each k-fold test
av.mse.GMRF <- mean(mses$V1)
av.r2.GMRF <- mean(scores$V1)


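Yes, the whole dataset is used for training, but we set the "weight" of all data not in the training set to 0. Or am I missing something? - Konstantin Klemmer
Yes, I did miss something. My mistake, and thanks for the reply, @adam! - Konstantin Klemmer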
