Match and count values in a sequence by group in R

3

Here is my data:

group <- c(1,1,1,1,2,2,2,3,3,4,4,4,4)
X1 <- c("A","A","A","A","B","A","B","A","A","B","B","B","B")
X2 <- c("A","A","A","A","B","B","B","A","A","B","B","A","A")
X3 <- c("B","A","A","A","B","B","B","B","B","B","B","B","B")
X4 <- c("A","A","A","B","B","B","A","A","A","B","A","B","B")
X5 <- c("A","A","A","A","B","B","B","A","A","A","B","B","B")
X6 <- c("A","A","A","A","B","A","B","A","A","B","B","A","A")
mydf <- data.frame (group, X1, X2, X3, X4, X5, X6)

So the data looks like this:

   group X1 X2 X3 X4 X5 X6
1      1  A  A  B  A  A  A
2      1  A  A  A  A  A  A
3      1  A  A  A  A  A  A
4      1  A  A  A  B  A  A
5      2  B  B  B  B  B  B
6      2  A  B  B  B  B  A
7      2  B  B  B  A  B  B
8      3  A  A  B  A  A  A
9      3  A  A  B  A  A  A
10     4  B  B  B  B  A  B
11     4  B  B  B  A  B  B
12     4  B  A  B  B  B  A
13     4  B  A  B  B  B  A

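Now I need to compare the first row of each group with the remaining rows of that group. For example, row 1 against row 2:
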
   group X1 X2 X3 X4 X5 X6
1      1  A  A  B  A  A  A
2      1  A  A  A  A  A  A
          TRUE TRUE FALSE TRUE TRUE TRUE

Here only X3 does not match. Mismatch = 1 out of 6 = 1/6 ≈ 17%

Similarly, compare row 3 with the first row of group 1.

   group X1 X2 X3 X4 X5 X6
1      1  A  A  B  A  A  A
3      1  A  A  A  A  A  A

Mismatch = 1/6 ≈ 17%

Likewise, compare row 4 with the first row of group 1.

   group X1 X2 X3 X4 X5 X6
1      1  A  A  B  A  A  A
4      1  A  A  A  B  A  A

Mismatch = 2/6 ≈ 33%

The same goes for group 2, whose first row is row 5. Comparing rows 5 and 6:

     group X1 X2 X3 X4 X5 X6
5      2  B  B  B  B  B  B
6      2  A  B  B  B  B  A

Mismatch = 2/6 ≈ 33%

Similarly, for rows 5 and 7:

         group X1 X2 X3 X4 X5 X6
    5      2  B  B  B  B  B  B
    7      2  B  B  B  A  B  B

Mismatch = 1/6 ≈ 17%

My attempt:

match (mydf[1,], mydf[2,])
match (mydf[1,], mydf[3,])

2
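Could you show the exact output you expect, including its data structure? - flodel
Does every row in the same group get the same score? - josliber
@josilber The first row is compared with row 2 to give a mismatch percentage, then the first row is compared with row 3 to give another, and so on. The idea is that the first row of each group serves as the template. - rdorlearn
2 Answers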

6

Try this:

match_ratio <- function(x)
   cbind(x, match_ratio = rowMeans(mapply(`==`, x[1, -1], x[, -1])))
library(plyr)
ddply(mydf, "group", match_ratio)

#    group X1 X2 X3 X4 X5 X6 match_ratio
# 1      1  A  A  B  A  A  A   1.0000000
# 2      1  A  A  A  A  A  A   0.8333333
# 3      1  A  A  A  A  A  A   0.8333333
# 4      1  A  A  A  B  A  A   0.6666667
# 5      2  B  B  B  B  B  B   1.0000000
# 6      2  A  B  B  B  B  A   0.6666667
# 7      2  B  B  B  A  B  B   0.8333333
# 8      3  A  A  B  A  A  A   1.0000000
# 9      3  A  A  B  A  A  A   1.0000000
# 10     4  B  B  B  B  A  B   1.0000000
# 11     4  B  B  B  A  B  B   0.6666667
# 12     4  B  A  B  B  B  A   0.5000000
# 13     4  B  A  B  B  B  A   0.5000000
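The question asks for the mismatch percentage, whereas `match_ratio` above is the share of columns that agree; the two are related by mismatch = 1 - match_ratio. As a rough sketch of the same idea without plyr, the following uses only base R. The names `first_row`, `vals` and `mismatch_pct` are illustrative, and `stringsAsFactors = FALSE` is set on the assumption that the columns should be compared as plain character strings.

```r
# Data copied from the question
group <- c(1,1,1,1,2,2,2,3,3,4,4,4,4)
X1 <- c("A","A","A","A","B","A","B","A","A","B","B","B","B")
X2 <- c("A","A","A","A","B","B","B","A","A","B","B","A","A")
X3 <- c("B","A","A","A","B","B","B","B","B","B","B","B","B")
X4 <- c("A","A","A","B","B","B","A","A","A","B","A","B","B")
X5 <- c("A","A","A","A","B","B","B","A","A","A","B","B","B")
X6 <- c("A","A","A","A","B","A","B","A","A","B","B","A","A")
mydf <- data.frame(group, X1, X2, X3, X4, X5, X6, stringsAsFactors = FALSE)

# match() returns the position of the first occurrence, so this gives,
# for every row, the index of the first row of its group (the template)
first_row <- match(mydf$group, mydf$group)

# Character matrix of the X columns; vals[first_row, ] lines up each row
# with its group's template row
vals <- as.matrix(mydf[, -1])
differs <- vals != vals[first_row, , drop = FALSE]

# Share of differing columns per row, as a percentage
mydf$mismatch_pct <- unname(100 * rowMeans(differs))
print(mydf)
```

The template rows themselves come out as 0, because each is compared with itself.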

2
Nice! ddply is powerful. My solution is more basic. - hatmatrix

2
## generate pairs of row numbers
rows <- sequence(nrow(mydf))
grid <- subset(expand.grid(Var1=rows,Var2=rows),Var1 > Var2)

## define some functions
comparison1 <- function(a,b,x)
  match(x[a,-1],x[b,-1])

comparison2 <- function(a,b,x)
  x[a,-1]==x[b,-1]

## apply (comparison1 or comparison2)
matches <- t(mapply(comparison1,grid$Var2,grid$Var1,MoreArgs=list(x=mydf)))
dimnames(matches) <- list(paste(grid$Var2,grid$Var1,sep=","),
                          names(mydf)[-1])

If you use comparison1:
> head(matches)
    X1 X2 X3 X4 X5 X6
1,2  1  1 NA  1  1  1
1,3  1  1 NA  1  1  1
1,4  1  1  4  1  1  1
1,5 NA NA  1 NA NA NA
1,6  1  1  2  1  1  1
1,7  4  4  1  4  4  4
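The numbers in this table are positions, not agreement flags: `match(x, table)` returns, for each element of `x`, the index of its first occurrence in `table`, or `NA` if it does not occur. A small sketch with plain character vectors shows how that differs from `==`. The names `row1`, `row2` and `row4` are illustrative, with values copied from rows 1, 2 and 4 of the data.

```r
# X1..X6 values of rows 1, 2 and 4, copied from the data in the question
row1 <- c("A", "A", "B", "A", "A", "A")
row2 <- c("A", "A", "A", "A", "A", "A")
row4 <- c("A", "A", "A", "B", "A", "A")

# match() looks each value of row1 up anywhere in the other row
pos_1_2 <- match(row1, row2)  # 1 1 NA 1 1 1  ("B" is absent from row2)
pos_1_4 <- match(row1, row4)  # 1 1 4 1 1 1   ("B" first appears at position 4)

# == compares position by position, which is what the question needs
eq_1_4 <- row1 == row4        # TRUE TRUE FALSE FALSE TRUE TRUE
print(pos_1_2); print(pos_1_4); print(eq_1_4)
```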

If you use comparison2:
> head(matches)
       X1    X2    X3    X4    X5    X6
1,2  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
1,3  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
1,4  TRUE  TRUE FALSE FALSE  TRUE  TRUE
1,5 FALSE FALSE  TRUE FALSE FALSE FALSE
1,6  TRUE FALSE  TRUE FALSE FALSE  TRUE
1,7 FALSE FALSE  TRUE  TRUE FALSE FALSE

The row names correspond to the pair of row numbers being compared.
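The grid above contains every pair of rows, including pairs from different groups. One possible way to keep only the pairs the question asks about (the first row of a group against the other rows of that group) is sketched below. The names `first_row`, `template_pairs` and `mismatch` are illustrative and not part of the code above, and `stringsAsFactors = FALSE` is an assumption so that values compare as character strings.

```r
# Data copied from the question
group <- c(1,1,1,1,2,2,2,3,3,4,4,4,4)
X1 <- c("A","A","A","A","B","A","B","A","A","B","B","B","B")
X2 <- c("A","A","A","A","B","B","B","A","A","B","B","A","A")
X3 <- c("B","A","A","A","B","B","B","B","B","B","B","B","B")
X4 <- c("A","A","A","B","B","B","A","A","A","B","A","B","B")
X5 <- c("A","A","A","A","B","B","B","A","A","A","B","B","B")
X6 <- c("A","A","A","A","B","A","B","A","A","B","B","A","A")
mydf <- data.frame(group, X1, X2, X3, X4, X5, X6, stringsAsFactors = FALSE)

# All pairs of row numbers with Var1 > Var2, as in the answer
rows <- seq_len(nrow(mydf))
grid <- subset(expand.grid(Var1 = rows, Var2 = rows), Var1 > Var2)

# Index of the first row of each row's group
first_row <- match(mydf$group, mydf$group)

# Keep a pair only when Var2 is the template (first) row of Var1's group
template_pairs <- grid[first_row[grid$Var1] == grid$Var2, ]

# Share of X columns that differ between the template row a and row b
mismatch <- mapply(
  function(a, b) mean(unlist(mydf[a, -1]) != unlist(mydf[b, -1])),
  template_pairs$Var2, template_pairs$Var1
)
names(mismatch) <- paste(template_pairs$Var2, template_pairs$Var1, sep = ",")
print(round(100 * mismatch))
```

As before, each name is the pair of row numbers compared, template row first.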

