How do I impute missing values in MICE only when a specific condition is met?

4
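I'm using MICE on a dataset and have run into trouble. One variable clearly depends on another, and I can't work out how to get MICE to impute only some of the missing values in a variable (and leave the rest as truly missing).
For example, I have a dataset with sex, pregnancy status and outcome. Only females can be pregnant, so when "preg" is missing but the subject is male, I don't want a value imputed there.
But I do want a value imputed when pregnancy status is missing for a female. All variables (including sex and outcome) have some missing values.
I've read the advice in the question "R, mice - missing variable imputation - how to only do one column in sparse matrix" and tried the 'where' option in MICE. But with 'where', not all of the sex values seem to get imputed?
For example: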
library(mice)
library(tidyverse)
library(haven)
library(janitor)

# create some data
sex <- c("m","f","m","f","m",NA,NA,"f","f","m","f","f","m","m","f","m")
preg <- c(NA,"not_preg",NA,NA,NA,NA,"preg","not_preg",NA,"not_preg","preg",NA,NA,NA,NA) 
outcome <- c(1,0,1,0,0,NA,NA,0,0,1,0,1,1,0,0)
df <- cbind(sex,preg,outcome) %>% as_tibble() %>% mutate(sex=as_factor(sex)) %>% mutate(preg=as_factor(preg))

# look at what's missing
md.pattern(df)
df %>% tabyl(sex,preg)
df %>% tabyl(preg)

# Try to impute over everything to show mice working
mice_a <- mice(df, m=2, maxit=2, seed=3,method="pmm")
df_imp_a <- complete(mice_a, action="long", include = FALSE)

df_imp_a %>% filter(.imp==1) %>% tabyl(sex,preg)  # this has imputed that some men are pregnant (understandably, but not what I want!)
df_imp_a %>% filter(.imp==1) %>% tabyl(sex) #but everyone has a sex imputed
df_imp_a %>% filter(.imp==1) %>% tabyl(preg)

# Try to use the 'where' option

# b. Using it with a 'blank' where as proof of principle

grid_b <- is.na(df) #this is just default
mice_b <- mice(df, m=2, maxit=2, seed=3,method="pmm",where=grid_b)
df_imp_b <- complete(mice_b, action="long", include = FALSE)
df_imp_b %>% filter(.imp==1) %>% tabyl(sex,preg) #same problem of pregnant men (obviously, haven't changed anything yet)
df_imp_b %>% filter(.imp==1) %>% tabyl(sex) # but at least everyone has a sex imputed
df_imp_b %>% filter(.imp==1) %>% tabyl(preg)

# c. Making a proper grid of data that I do and don't want imputed

grid_c <- df %>%
  mutate(preg=case_when(
    sex=="f" & is.na(preg)==TRUE ~ TRUE,
    TRUE ~ FALSE
  )) %>%
  mutate(sex=is.na(sex)) %>%
  mutate(outcome=is.na(outcome))

grid_c
grid_c %>% tabyl(preg) # so we are looking for 4 imputed values of 'preg' (so I've done it right -- there are 4 females with unknown pregnancy status)

mice_c <- mice(df,m=2,maxit=2,seed=3,method="pmm",where=grid_c)
df_imp_c <- complete(mice_c,action="long",include=FALSE)

df_imp_c %>% filter(.imp==1) %>% tabyl(sex,preg) # now I have no pregnant men -- which is good!
df_imp_c %>% filter(.imp==1) %>% tabyl(sex) # but I am missing sex for one person??
df_imp_c %>% filter(.imp==1) %>% tabyl(preg) # have imputed all the pregnancy data that I wanted through -- only 7 NAs (for the 7 men)

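How do I tell mice to impute only specific rows in one column, rather than all of them, while still imputing every row of another column? And why doesn't the 'where' option work the way I expected?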

Many thanks for any help!

2 Answers

1

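I ran into a similar problem: for anyone under 15 years old, I didn't want to impute cells in more than 70 columns. The short piece of code below helped a lot.

Include where=miss.infor.data in your mice() call.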

#copy your dataset
    df2 <- df 

# Put a placeholder (100) in columns 248 to 320 for everyone under age 15
# (which() skips rows where age itself is NA)
    df2[which(df2$age < 15), 248:320] <- 100 

# Build the logical mask: placeholder cells are not NA, so they come out FALSE and are skipped by the where option.
    miss.infor.data <- as.data.frame(lapply(df2, is.na)) 
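
As a minimal, self-contained sketch of the same idea applied to data shaped like the question's: the toy data frame and the names toy, toy_mask_src and where_mask are made up for illustration, and "skip" is an arbitrary placeholder. Cells that must stay missing get the placeholder, and the 'where' mask is then derived with is.na().

```r
# Hypothetical toy data: pregnancy status is only meaningful for females.
toy <- data.frame(
  sex     = factor(c("m", "f", "f", "m", "f", "f")),
  preg    = factor(c(NA, "preg", NA, NA, "not_preg", NA)),
  outcome = c(1, 0, NA, 1, 0, 1)
)

# Work on a copy so the data handed to mice() stays untouched.
toy_mask_src <- toy
toy_mask_src$preg <- as.character(toy_mask_src$preg)

# Placeholder for cells that should never be imputed (preg for males).
# which() drops rows where sex itself is NA.
toy_mask_src$preg[which(toy_mask_src$sex == "m")] <- "skip"

# TRUE = impute this cell, FALSE = leave it alone.
where_mask <- as.data.frame(lapply(toy_mask_src, is.na))
print(where_mask)

# The mask would then be passed alongside the original data, for example:
# imp <- mice::mice(toy, m = 2, maxit = 2, seed = 3, where = where_mask)
```

The placeholder only ever lives in the copy used to build the mask; the data passed to mice() keeps its real NAs.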


0
After a lot of experimenting, it seems mice has a problem here: it won't impute preg when sex is NA. Setting up grid_c as follows seems to solve that:
grid_c <- df %>%
  mutate(preg=case_when(
    (sex=="f"|is.na(sex)) & is.na(preg)==TRUE ~ TRUE,
    TRUE ~ FALSE
  )) %>%
  mutate(sex=is.na(sex)) %>%
  mutate(outcome=is.na(outcome))

Note the change to (sex=="f"|is.na(sex)).

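The drawback is that you end up with some men coded as not_preg. That is technically correct, but you probably want those set to NA as well, which you can do in post-processing. Alternatively, avoid the problem before imputing by adding another category to preg that encodes that men cannot be pregnant (instead of NA).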
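
Both options can be sketched in base R. The data frames below are made-up stand-ins (completed for the output of complete(), raw for the data before imputation), and the level name "not_applicable" is an assumption chosen for illustration.

```r
# Hypothetical completed data, standing in for the output of complete().
completed <- data.frame(
  sex  = factor(c("m", "f", "m", "f")),
  preg = factor(c("not_preg", "preg", "not_preg", "not_preg"))
)

# Option 1: post-processing. Blank out pregnancy status for males.
post <- completed
post$preg[which(post$sex == "m")] <- NA

# Option 2: before imputation, give males an explicit level instead of NA.
raw <- data.frame(
  sex  = factor(c("m", "f", "m", "f")),
  preg = factor(c(NA, "preg", NA, NA))
)
levels(raw$preg) <- c(levels(raw$preg), "not_applicable")
raw$preg[which(raw$sex == "m")] <- "not_applicable"

print(post)
print(raw)
```

With option 2 the only NAs left in preg belong to females (or people of unknown sex), so the default 'where' can be used. Nothing in the recoding itself prevents a female from being imputed as not_applicable, so a check after imputation may still be worthwhile.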

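Now, if you do the above, you will hit a second problem: there are still NAs in your outcome. This seems to be because your test data doesn't contain enough information to impute every missing value. Note that your row 6 has no information at all, so mice has no data about that person to feed into the pmm algorithm.
You can fix this by including other variables that may carry information about the missing values (and that are not NA for the person/row in question). If you don't have such data, exclude that person, since you haven't really captured them in your sample.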
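
A small base R sketch of how such rows might be found and dropped before imputing. The toy data frame is made up for illustration; its row 3 plays the role of the question's row 6.

```r
# Hypothetical toy data; row 3 has no observed value in any column.
toy <- data.frame(
  sex     = factor(c("m", "f", NA, "f")),
  preg    = factor(c(NA, "preg", NA, "not_preg")),
  outcome = c(1, 0, NA, NA)
)

# Count observed (non-missing) cells per row.
n_observed <- rowSums(!is.na(toy))

# Rows with nothing observed carry no information for the imputation model.
empty_rows <- which(n_observed == 0)
print(empty_rows)

# Drop them before calling mice().
toy_kept <- toy[n_observed > 0, , drop = FALSE]
print(nrow(toy_kept))
```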

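Finally, if you keep testing the procedure with data like your example, make the dataset longer. While experimenting I got more warnings (in mice_c$loggedEvents), caused by the small number of cases for some combinations of the categorical variables.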
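
One way to get a longer test dataset is to simulate it. The sketch below is only an illustration: the sample size, the probabilities and the 10% missingness rate are arbitrary assumptions, and sim is a made-up name.

```r
set.seed(3)
n <- 200  # arbitrary size, much longer than the question's example

# Simulated stand-in for the question's data; all values are made up.
sex <- sample(c("m", "f"), n, replace = TRUE)
preg <- ifelse(sex == "f", sample(c("preg", "not_preg"), n, replace = TRUE), NA)
outcome <- rbinom(n, size = 1, prob = 0.4)
sim <- data.frame(sex = factor(sex), preg = factor(preg), outcome = outcome)

# Knock out 10% of each column at random.
for (col in names(sim)) {
  sim[[col]][sample(n, size = n %/% 10)] <- NA
}

# Check how many cases fall in each combination of categories.
print(table(sex = sim$sex, preg = sim$preg, useNA = "ifany"))

# After imputing, anything mice recorded can be inspected with:
# imp$loggedEvents
```

Sparse cells in that table are the kind of thing that tends to show up later as logged events.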
