Compare each row in a data frame with every other row

3

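I have a tibble in which each row corresponds to a sample with an ID, and each sample has several categories and values. I want to list the pairs of IDs whose values match in two or more categories.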

# A tibble: 3 x 2
ID    data            
<chr> <list>          
1 ID1   <tibble [1 x 3]>
2 ID2   <tibble [1 x 3]>
3 ID3   <tibble [1 x 3]>

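I can use sum(match(x$data[[i]], x$data[[j]], nomatch=0)>0) to get the number of matches between two rows.

I can get the answer by putting match() inside nested loops over i and j. Is there a better way (using the tidyverse) to get the answer?

Thanks.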

For example, here is the original data frame:

   ID category value
1 ID1   length   100
2 ID1     type     L
3 ID1    color  Blue
4 ID2   length   100
5 ID2     type     M
6 ID2    color  Blue
7 ID3   length   150
8 ID3     type     M
9 ID3    color  Blue

The output would look like this:
  ID.a ID.b matches
1  ID1  ID2       2
2  ID1  ID3       1
3  ID2  ID3       2

I used the tidyverse to convert the input data frame into the tibble shown at the beginning, and then used match in a loop:

# A tibble: 3 x 2
ID    data            
<chr> <list>          
1 ID1   <tibble [1 x 3]>
2 ID2   <tibble [1 x 3]>
3 ID3   <tibble [1 x 3]>
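
For reference, the loop-based approach described above could look roughly like the following. This is only a sketch under assumptions: it works on the long data frame rather than the nested tibble, it compares combined "category value" keys, and the names keys, ids and pairs_loop are illustrative rather than taken from the question.

```r
# Example data from the question, rebuilt as a plain data frame
df <- data.frame(
  ID = rep(c("ID1", "ID2", "ID3"), each = 3),
  category = rep(c("length", "type", "color"), times = 3),
  value = c("100", "L", "Blue", "100", "M", "Blue", "150", "M", "Blue"),
  stringsAsFactors = FALSE
)

# One character vector of "category value" keys per ID (hypothetical helper)
keys <- split(paste(df$category, df$value), df$ID)
ids <- names(keys)

# Nested loops over all pairs with i < j, counting matches with match()
res <- list()
for (i in seq_len(length(ids) - 1)) {
  for (j in (i + 1):length(ids)) {
    n <- sum(match(keys[[i]], keys[[j]], nomatch = 0) > 0)
    res[[length(res) + 1]] <- data.frame(ID.a = ids[i], ID.b = ids[j], matches = n)
  }
}
pairs_loop <- do.call(rbind, res)
pairs_loop
```

The number of comparisons grows quadratically with the number of IDs, which is the cost the answers try to avoid writing by hand.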

4 Answers

2

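One base R option uses table and crossprod. Set the lower-triangular values of the crossprod result (including the diagonal) to NA, convert it to "long" format (by converting to a data.frame), and then subset the rows where Freq is not NA.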

out <- with(df, crossprod(table(paste(category, value), ID)))
out[lower.tri(out, diag = TRUE)] <- NA
subset(as.data.frame.table(out), !is.na(Freq))
#    ID ID.1 Freq
#4 ID1  ID2    2
#7 ID1  ID3    1
#8 ID2  ID3    2

Data

df <- structure(list(ID = c("ID1", "ID1", "ID1", "ID2", "ID2", "ID2", 
"ID3", "ID3", "ID3"), category = c("length", "type", "color", 
 "length", "type", "color", "length", "type", "color"), 
 value = c("100", 
 "L", "Blue", "100", "M", "Blue", "150", "M", "Blue")), 
 class = "data.frame", row.names = c(NA, -9L))
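
To see why crossprod yields the pair counts, it may help to look at the intermediate table. The following is a minimal sketch, assuming each (ID, category, value) combination occurs at most once; the names tab and shared are illustrative and not part of the answer above.

```r
# Same example data, rebuilt so the block runs on its own
df <- data.frame(
  ID = rep(c("ID1", "ID2", "ID3"), each = 3),
  category = rep(c("length", "type", "color"), times = 3),
  value = c("100", "L", "Blue", "100", "M", "Blue", "150", "M", "Blue"),
  stringsAsFactors = FALSE
)

# Rows: distinct "category value" keys; columns: IDs; cells: occurrence counts
tab <- table(paste(df$category, df$value), df$ID)
tab

# crossprod(tab) is t(tab) %*% tab, so entry [a, b] sums, over all keys,
# the product of the counts for a and b: the number of keys they share
shared <- crossprod(tab)
shared
```

The diagonal holds each ID's own key count, which is why the answer blanks it out together with the lower triangle. If a key could repeat within an ID, the cells would exceed 1 and the products would count multiplicities rather than distinct matches.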

1
df = read.table(text="
ID category value
ID1   length   100
ID1     type     L
ID1    color  Blue
ID2   length   100
ID2     type     M
ID2    color  Blue
ID3   length   150
ID3     type     M
ID3    color  Blue
", header = TRUE, stringsAsFactors = FALSE)

library(tidyverse)

# create a new column that combines category and value
df = df %>% unite(cat_val, category, value, remove = FALSE)

# create vectorised function that counts matches (on that new value)
# note: the element-wise == assumes every ID lists the same categories in the same order
f = function(x,y) sum(df$cat_val[df$ID == x] == df$cat_val[df$ID == y])
f = Vectorize(f)

data.frame(t(combn(unique(df$ID), 2))) %>%  # create combinations of IDs (as a dataframe)
  mutate(matches = f(X1, X2))               # apply function

#    X1  X2 matches
# 1 ID1 ID2       2
# 2 ID1 ID3       1
# 3 ID2 ID3       2

1

For the sake of completeness, here is also a solution that uses a self-join:

library(data.table)
setDT(x)[x, on = .(category, value), allow.cartesian = TRUE][
  ID < i.ID, .N, by = .(ID1 = ID, ID2 = i.ID)]
#    ID1 ID2 N
# 1: ID1 ID2 2
# 2: ID2 ID3 2
# 3: ID1 ID3 1

Data

x <- readr::read_table(
"i  ID category value
1 ID1   length   100
2 ID1     type     L
3 ID1    color  Blue
4 ID2   length   100
5 ID2     type     M
6 ID2    color  Blue
7 ID3   length   150
8 ID3     type     M
9 ID3    color  Blue")[, -1L]
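
Since the question asks for a tidyverse approach, the same self-join idea can probably be written with dplyr as well. This is a sketch rather than part of the answer above; the suffixes .a and .b and the name pairs_join are chosen here for illustration, and recent dplyr versions may emit a many-to-many relationship warning for this join.

```r
library(dplyr)

# Same example data, rebuilt so the block runs on its own
df <- data.frame(
  ID = rep(c("ID1", "ID2", "ID3"), each = 3),
  category = rep(c("length", "type", "color"), times = 3),
  value = c("100", "L", "Blue", "100", "M", "Blue", "150", "M", "Blue"),
  stringsAsFactors = FALSE
)

pairs_join <- df %>%
  # join the table to itself on the shared (category, value) pairs
  inner_join(df, by = c("category", "value"), suffix = c(".a", ".b")) %>%
  # keep each unordered pair once and drop self-matches
  filter(ID.a < ID.b) %>%
  # count the matching rows per pair of IDs
  count(ID.a, ID.b, name = "matches")

pairs_join
```

Pairs with no match at all do not appear in the result, because the inner join never produces a row for them.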

0

Not sure whether this is any tidier, but we can do the following:

library(tidyverse)

combn_join <- function(x) {
  # all index pairs of the list elements (instead of the hard-coded 1:3)
  idx <- combn(seq_along(x), 2)
  map2(idx[1, ], idx[2, ],
       ~ left_join(x[[.x]], x[[.y]], by = c("category", "value")) %>% 
         select(ID.x, ID.y))
}  

df %>%
  split(.$ID) %>%
  combn_join(.) %>%
  do.call(rbind, .) %>%
  filter(!is.na(ID.y)) %>%
  group_by(ID.x, ID.y) %>%
  summarize(matches = n())

Result:

# A tibble: 3 x 3
# Groups:   ID.x [?]
  ID.x  ID.y  matches
  <fct> <fct>   <int>
1 ID1   ID2         2
2 ID1   ID3         1
3 ID2   ID3         2

Data:

df <- structure(list(ID = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L), .Label = c("ID1", "ID2", "ID3"), class = "factor"), category = structure(c(2L, 
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L), .Label = c("color", "length", 
"type"), class = "factor"), value = structure(c(1L, 4L, 3L, 1L, 
5L, 3L, 2L, 5L, 3L), .Label = c("100", "150", "Blue", "L", "M"
), class = "factor")), .Names = c("ID", "category", "value"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9"))

Content provided by Stack Overflow.