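I am collecting survey data (using Open Data Kit), and my field team sometimes gets a little creative with how they spell names. As a result, I have one "correct" respondent name, plus a set of age variables that are each tied to a "household member name" variable. Each household has several members of different ages. I want to find the respondent's age.
Here is some fake data that illustrates my problem: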
#the respondent
r = data.frame(name = c("Barack Obama", "George Bush", "Hillary Clinton"))
#a male member
m = data.frame(name = c("Barack Obama","George", "Wulliam Clenton"), age = c(55,59,70)); m$name=as.character(m$name)
#a female member
f = data.frame(name = c("Michelle O","Laura Busch", "Hillary Rodham Clinton"), age = c(54,58,69)); f$name=as.character(f$name)
#if the respondent is the given member, record their age. if not, NA
a = cbind(
ifelse(r$name==m$name,m$age,NA)
,ifelse(r$name==f$name,f$age,NA)
)
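#make a function for plyr that gives me the age of the matched respondent
#(named first_age so it does not overwrite the data frame f defined above)
first_age = function(row){
d = row[!is.na(row)]
#return the first non-missing age in the row, or NA if there is none
if (length(d) == 0) NA else d[1]
}
library(plyr)
b = aaply(a,.margins=1,.fun=first_age)
data.frame(names=r$name,age=b)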
names age
1 Barack Obama 55
2 George Bush NA
3 Hillary Clinton NA
what.I.would.like = data.frame(names=c("Barack Obama", "George Bush", "Hillary Clinton"),age = c(55,59,70))
> what.I.would.like
names age
1 Barack Obama 55
2 George Bush 59
3 Hillary Clinton 70
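In my real data, I have hundreds of people and up to 13 household members per household. I have since changed the survey to record the respondent's age separately, but I still need to clean up the data that was already collected.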
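One possible starting point, sketched here under assumptions rather than offered as a tested solution: replace the exact `==` comparison with an edit distance and, for each respondent, take the age of the household member whose name is closest. The sketch uses base R's `adist` (generalized Levenshtein distance, from the `utils` package). The helper name `closest_age` and the objects `member_names` and `member_ages` are made up for illustration, and the fake data is repeated so the block runs on its own.

```r
# Fake data from the question, with names stored as character
r = data.frame(name = c("Barack Obama", "George Bush", "Hillary Clinton"),
               stringsAsFactors = FALSE)
m = data.frame(name = c("Barack Obama", "George", "Wulliam Clenton"),
               age = c(55, 59, 70), stringsAsFactors = FALSE)
f = data.frame(name = c("Michelle O", "Laura Busch", "Hillary Rodham Clinton"),
               age = c(54, 58, 69), stringsAsFactors = FALSE)

# One column per household member (with real data there would be up to 13)
member_names = cbind(m$name, f$name)
member_ages  = cbind(m$age, f$age)

# closest_age is a hypothetical helper: for respondent i, compute the edit
# distance between the respondent's name and each member name in the same
# household, then return the age of the member with the smallest distance.
# Assumption: every household has at least one non-missing member name.
closest_age = function(i){
  d = adist(r$name[i], member_names[i, ], ignore.case = TRUE)
  member_ages[i, which.min(d)]
}

age = sapply(seq_len(nrow(r)), closest_age)
result = data.frame(names = r$name, age = age)
result
```

On the fake data this picks "George" for George Bush and "Wulliam Clenton" for Hillary Clinton, but only by a margin of one edit in each case, so raw edit distance is fragile. With real data it would probably be worth adding a maximum allowed distance (returning NA above it) and checking the borderline matches by hand.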