How to calculate a (co-)occurrence matrix from a data frame with several columns using R?

Question


Tags: r, dplyr, igraph, adjacency-matrix, find-occurrences

7

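I'm new to R and currently working with collaboration data in the form of an edge list with 32 columns and around 200,000 rows. I want to create a (co-)occurrence matrix based on the interactions between countries. However, I want to count the number of interactions by the total number of occurrences of an object.

Basic example of the desired outcome

If "England" occurs three times in one row and "China" only once, the result should be the following matrix.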

         England  China
England    3        3
China      3        1
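
A minimal sketch of this single-row case, assuming the intended rule is that each pair of countries receives the larger of their two counts within the row (the name `row_values` is made up for illustration):

```r
# One hypothetical row: "England" three times, "China" once
row_values <- c("England", "England", "England", "China")

# Count each country within the row (names come out in alphabetical order)
n <- c(table(row_values))

# Each cell gets the larger of the two countries' counts
single_row <- outer(n, n, pmax)
single_row
```

The diagonal holds each country's own count and the England/China cell holds 3, as in the matrix above, only with rows and columns in alphabetical order.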

Reproducible example

df <- data.frame(ID = c(1,2,3,4), 
 V1 = c("England", "England", "China", "England"),
 V2 = c("Greece", "England", "Greece", "England"),
V32 = c("USA", "China", "Greece", "England"))

So an example data frame currently looks like this:

ID  V1       V2       ...   V32
1   England  Greece         USA
2   England  England        China
3   China    Greece         Greece
4   England  England        England
.
.
.

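Expected result

I want to count the (co-)occurrences row-wise, independent of order, to get a (co-)occurrence matrix that accounts for the low frequencies of edge loops (e.g. England-England). This should give the following outcome: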

         China   England   Greece   USA
China    2        2         2        0
England  2        6         1        1
Greece   2        1         3        1
USA      0        1         1        1

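What I have tried so far

I have used igraph to get an adjacency matrix of the co-occurrences. However, it counts no more than two interactions between the same two objects, which in some cases leaves me with values far below the actual frequency of the objects per row/publication.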

df <- data.frame(ID = c(1,2,3,4), 
 V1 = c("England", "England", "China", "England"),
 V2 = c("Greece", "England", "Greece", "England"),
V32 = c("USA", "China", "Greece", "England"))

# remove ID column

df[1] <- list(NULL)

# calculate co-occurrences and return as dataframe

library(igraph)
library(Matrix)

countrydf <- graph.data.frame(df)
countrydf2 <- as_adjacency_matrix(countrydf, type = "both", edges = FALSE)
countrydf3 <- as.data.frame(as.matrix(forceSymmetric(countrydf2)))

         China   England   Greece   USA
China    0        0         1        0
England  0        2         1        0
Greece   1        1         0        0
USA      0        0         0        0

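I assume there must be an easy solution using base and/or dplyr and/or table and/or reshape2, similar to [1], [2], [3], [4] or [5], but nothing has worked so far and I was not able to adjust the code to my needs. I also tried to use [6] as a basis; however, the same problem applies there.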

library(tidyr)
library(dplyr)
library(stringr)


# collapse observations into one column

df2 <- df %>% unite(concat, V1:V32, sep = ",")

# calculate weights

df3 <- df2$concat %>%
  str_split(",") %>%
  lapply(function(x){
    expand.grid(x,x,x,x, w = length(x), stringsAsFactors = FALSE)
  }) %>%
  bind_rows

df4 <- apply(df3[, -5], 1, sort) %>%
  t %>%
  data.frame(stringsAsFactors = FALSE) %>%
  mutate(w = df3$w)

I would be glad if someone could point me in the right direction.

- Seb

1

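Your data looks more like a list of transactions. I suggest using the functions in the arules package to process it, which avoids a lot of extra manipulation. - emilliman5

In your "basic example", why is the China-England or England-China entry 3? Could you explain? - ThomasIsCoding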

1

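My understanding is that your final result involves two processes: the off-diagonal shows co-occurrence when countries appear together / there is an edge between them, while the diagonal is just the raw count? - user20650

@user20650 - Yes, that is already very helpful! Thank you! :) - Seb

@ThomasIsCoding - Well, the main idea is to count pairwise. For two connected strings/observations, I want to assign the occurrence value of the more frequent one to the other. - Seb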
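
The rule discussed in these comments can be sketched in base R as follows. This is only an illustration under the assumption that, within each row, a pair of countries receives the larger of their two counts and that the per-row matrices are then summed; the names `countries`, `per_row` and `cooc` are made up here. Looping over rows like this is easy to follow but may be slow on 200,000 rows.

```r
df <- data.frame(ID = c(1, 2, 3, 4),
                 V1 = c("England", "England", "China", "England"),
                 V2 = c("Greece", "England", "Greece", "England"),
                 V32 = c("USA", "China", "Greece", "England"),
                 stringsAsFactors = FALSE)

# All countries that occur anywhere, in a fixed order
countries <- sort(unique(unlist(df[-1])))

# Build one matrix per row, then add them up
per_row <- lapply(seq_len(nrow(df)), function(i) {
  # Counts of every country in row i (0 if absent from the row)
  n <- c(table(factor(unlist(df[i, -1]), levels = countries)))
  # Larger of the two counts for each pair of countries
  m <- outer(n, n, pmax)
  # A pair only counts if both countries are present in the row
  m[n == 0, ] <- 0
  m[, n == 0] <- 0
  m
})
cooc <- Reduce(`+`, per_row)
cooc
```

For the four sample rows this reproduces the expected matrix from the question: the diagonal is the raw count of each country, and England/China and China/Greece both come out as 2.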


3 Answers

1

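Here is an approach using the dplyr and tidyr packages. The idea is to create a data frame that holds the number of occurrences of each country per row, and then join it to itself.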

library(dplyr)

# Create sample data frame
df <- data.frame(ID = c(1,2,3,4), 
                 V1 = c("England", "England", "China", "England"),
                 V2 = c("Greece", "England", "Greece", "England"),
                 V32 = c("USA", "China", "Greece", "England"),
                 stringsAsFactors = FALSE)

# Get the number of occurrences of each country in every row.
row_occurance <- 
  df %>%
  tidyr::gather(key = "identifier", value = "country", -ID) %>%
  group_by(ID, country) %>%
  count()

row_occurance %>%
  # Join row_occurance on itself to simulate the matrix
  left_join(row_occurance, by = "ID") %>%
  # Take the row-wise maximum of the two counts; this handles a country
  # name that is repeated within the same row
  mutate(Occurance = pmax(n.x, n.y)) %>%
  # Group by 2 countries
  group_by(country.x, country.y) %>%
  # Sum the occurance of 2 countries together
  summarise(Occurance = sum(Occurance)) %>%
  # Spread the data to make it in matrix format
  tidyr::spread(key = "country.y", value = "Occurance", fill = 0)

# # A tibble: 4 x 5
# # Groups:   country.x [4]
# country.x China England Greece   USA
# <chr>     <dbl>   <dbl>  <dbl> <dbl>
# China         2       2      2     0
# England       2       6      1     1
# Greece        2       1      3     1
# USA           0       1      1     1

- Nareman Darwish

Thank you very much! Especially for the explanations. - Seb

1

An option using base::table:

df <- data.frame(ID = c(1,2,3,4), 
    V1 = c("England", "England", "China", "England"),
    V2 = c("Greece", "England", "Greece", "England"),
    V3 = c("USA", "China", "Greece", "England"))

#get paired combi and remove those from same country
pairs <- as.data.frame(do.call(rbind, 
    by(df, df$ID, function(x) t(combn(as.character(x[-1L]), 2L)))))
pairs <- pairs[pairs$V1!=pairs$V2, ]

#repeat data frame with columns swap so that 
#upper and lower tri have same numbers and all countries are shown
pairs <- rbind(pairs, data.frame(V1=pairs$V2, V2=pairs$V1))

#tabulate pairs
tab <- table(pairs)

#set diagonals to be the count of countries
cnt <- c(table(unlist(df[-1L])))
diag(tab) <- cnt[names(diag(tab))]

tab

Output:

         V2
V1        China England Greece USA
  China       2       2      2   0
  England     2       6      1   1
  Greece      2       1      3   1
  USA         0       1      1   1

- chinsoon12


- Matt · Accepted Answer

There are probably better ways to do this, but you can try:

library(tidyverse)

df1 <- df %>%
  pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
  xtabs(~ID + Country, data = ., sparse = FALSE) %>%
  crossprod(., .)

df_diag <- df %>%
  pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
  mutate(Country2 = Country) %>%
  xtabs(~Country + Country2, data = ., sparse = FALSE) %>%
  diag()

diag(df1) <- df_diag 

df1

Country   China England Greece USA
  China       2       2      2   0
  England     2       6      1   1
  Greece      2       1      3   1
  USA         0       1      1   1
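
One point that may be worth checking before applying this to the full data: `crossprod()` on the ID-by-country count table adds up the product of the two counts in each row, not their maximum. The two agree whenever at least one of the two countries appears only once in a row, which is the case for every row of the sample data. A small sketch with a made-up row (`row_values` is hypothetical) shows where they differ:

```r
# Hypothetical row in which both countries appear twice
row_values <- c("England", "England", "China", "China")
n <- c(table(row_values))

# What crossprod() contributes for this row: product of the counts
product_rule <- crossprod(rbind(n))

# The pairwise-maximum rule described in the question's comments
max_rule <- outer(n, n, pmax)

product_rule["China", "England"]
max_rule["China", "England"]
```

Here the product rule gives 4 for England/China while the maximum rule gives 2, so which variant is appropriate depends on how repeated countries within a row should be weighted.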