How to compute a (co-)occurrence matrix from a data frame with multiple columns using R?

7

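I'm new to R and currently working with collaboration data that has 32 columns and about 200,000 rows, given as an edge list. I want to create a (co-)occurrence matrix based on the interactions between countries. However, I want to count the number of interactions by the total number of occurrences of an object.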

Basic example of the expected result

If "England" occurs three times in one row and "China" only once, the result should be the following matrix.

         England  China
England    3        3
China      3        1

Reproducible example

df <- data.frame(ID = c(1,2,3,4), 
 V1 = c("England", "England", "China", "England"),
 V2 = c("Greece", "England", "Greece", "England"),
V32 = c("USA", "China", "Greece", "England"))

So a sample data frame currently looks like this:
ID  V1       V2       ...   V32
1   England  Greece         USA
2   England  England        China
3   China    Greece         Greece
4   England  England        England
.
.
.

Expected result

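I would like to count the (co-)occurrences per row, independent of order, to get a (co-)occurrence matrix that accounts for the low frequency of edge loops (e.g. England-England), leading to the following result: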

         China   England   Greece   USA
China    2        2         2        0
England  2        6         1        1
Greece   2        1         3        1
USA      0        1         1        1

What I have tried so far

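I have used igraph to get an adjacency matrix of co-occurrences. However, it counts no more than two interactions between the same two objects, which for some rows/publications leaves the values far below the actual frequency.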

df <- data.frame(ID = c(1,2,3,4), 
 V1 = c("England", "England", "China", "England"),
 V2 = c("Greece", "England", "Greece", "England"),
V32 = c("USA", "China", "Greece", "England"))

# remove ID column

df[1] <- list(NULL)

# calculate co-occurrences and return as dataframe

library(igraph)
library(Matrix)

countrydf <- graph.data.frame(df)
countrydf2 <- as_adjacency_matrix(countrydf, type = "both", edges = FALSE)
countrydf3 <- as.data.frame(as.matrix(forceSymmetric(countrydf2)))


         China   England   Greece   USA
China    0        0         1        0
England  0        2         1        0
Greece   1        1         0        0
USA      0        0         0        0

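I assume there must be a simple solution using base and/or dplyr and/or table and/or reshape2, similar to [1][2][3][4][5], but so far none of that code has worked and I could not adapt it to my needs. I also tried using [6] as a basis; however, the same problem occurs there.

library(tidyr)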
library(dplyr)
library(stringr)


# collapse observations into one column

df2 <- df %>% unite(concat, V1:V32, sep = ",")

# calculate weights

df3 <- df2$concat %>%
  str_split(",") %>%
  lapply(function(x){
    expand.grid(x,x,x,x, w = length(x), stringsAsFactors = FALSE)
  }) %>%
  bind_rows

df4 <- apply(df3[, -5], 1, sort) %>%
  t %>%
  data.frame(stringsAsFactors = FALSE) %>%
  mutate(w = df3$w)

I would be glad if someone could point me in the right direction.


1
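Your data looks more like a list of transactions; I would suggest using the functions in the arules package to process it and avoid a lot of extra manipulation. - emilliman5
In your "basic example", why is the China-England or England-China entry 3? Could you explain? - ThomasIsCoding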
1
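My understanding is that your final result seems to involve two processes: the off-diagonal entries represent co-occurrence, i.e. countries appearing together / having an edge between them, while the diagonal is just the raw count? - user20650
@user20650 - Yes, that is already very helpful! Thank you! :) - Seb
@ThomasIsCoding - Well, the main idea is to count pairwise. I would like to assign the occurrence value of whichever of the two strings/observations occurs most often to the other string it is connected to. - Seb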
3 Answers

4

There are probably better ways to do this, but you could try:

library(tidyverse)

# Off-diagonal: cross-product of the per-ID country counts
df1 <- df %>%
  pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
  xtabs(~ID + Country, data = ., sparse = FALSE) %>%
  crossprod(., .)

# Diagonal: raw number of occurrences of each country
df_diag <- df %>%
  pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
  mutate(Country2 = Country) %>%
  xtabs(~Country + Country2, data = ., sparse = FALSE) %>%
  diag()

diag(df1) <- df_diag 

df1

Country   China England Greece USA
  China       2       2      2   0
  England     2       6      1   1
  Greece      2       1      3   1
  USA         0       1      1   1
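
For the roughly 200,000 rows mentioned in the question, a sparse count table may be lighter on memory. The sketch below is a variant of the same approach, not part of the answer itself: it builds the long format in base R, asks xtabs for a sparse result (which relies on the Matrix package), and converts to a dense matrix only at the end. The names long, tab and cooc are chosen here purely for illustration.

```r
library(Matrix)  # needed for sparse xtabs results and their crossprod method

# Sample data from the question (V32 stands in for the remaining columns)
df <- data.frame(ID = c(1, 2, 3, 4),
                 V1 = c("England", "England", "China", "England"),
                 V2 = c("Greece", "England", "Greece", "England"),
                 V32 = c("USA", "China", "Greece", "England"),
                 stringsAsFactors = FALSE)

# Long format: one (ID, Country) pair per cell, stacked column by column
long <- data.frame(ID = rep(df$ID, times = ncol(df) - 1),
                   Country = unlist(df[-1], use.names = FALSE))

# Sparse ID x Country table of per-row counts
tab <- xtabs(~ ID + Country, data = long, sparse = TRUE)

cooc <- crossprod(tab)        # off-diagonal: sum over IDs of n_i * n_j
diag(cooc) <- colSums(tab)    # diagonal: raw total count per country
cooc <- as.matrix(cooc)       # dense copy, only for printing
cooc
```

Whether this actually pays off depends on how many distinct countries there are; with only a few hundred columns the dense table may be perfectly adequate.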

Thank you very much! I will need some time to figure out what the code is doing, but it works well. - Seb
3
Unless I'm missing something, the same approach can be taken in fewer steps: tab = table(df$ID[row(df[-1])], as.matrix(df[-1])); df1 = crossprod(tab); diag(df1) = colSums(tab); df1 - alexis_laz
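
The condensed one-liner in the comment above can be spelled out step by step. This is only an expanded sketch of that comment using base R; the intermediate names countries, ids and cooc are introduced here for readability and are not from the thread.

```r
# Sample data from the question (V32 stands in for the remaining columns)
df <- data.frame(ID = c(1, 2, 3, 4),
                 V1 = c("England", "England", "China", "England"),
                 V2 = c("Greece", "England", "Greece", "England"),
                 V32 = c("USA", "China", "Greece", "England"),
                 stringsAsFactors = FALSE)

countries <- as.matrix(df[-1])    # character matrix, one row per ID
ids <- df$ID[row(countries)]      # the ID belonging to every cell of that matrix

# Rows: IDs, columns: countries, cells: how often a country occurs for that ID
tab <- table(ids, countries)

cooc <- crossprod(tab)            # off-diagonal: sum over IDs of n_i * n_j
diag(cooc) <- colSums(tab)        # diagonal: raw total count per country
cooc
```

Both ids and countries are flattened column by column, so each ID stays aligned with the country in the same cell.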

1

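Here is an approach using the dplyr and tidyr packages. The whole idea is to create a data frame holding the number of occurrences of each country per row, and then join it to itself.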

library(dplyr)

# Create sample data frame
df <- data.frame(ID = c(1,2,3,4), 
                 V1 = c("England", "England", "China", "England"),
                 V2 = c("Greece", "England", "Greece", "England"),
                 V32 = c("USA", "China", "Greece", "England"),
                 stringsAsFactors = FALSE)

# Get the number of occurrences of each country in every row.
row_occurance <- 
  df %>%
  tidyr::gather(key = "identifier", value = "country", -ID) %>%
  group_by(ID, country) %>%
  count()

row_occurance %>%
  # Join row_occurance on itself to simulate the matrix
  left_join(row_occurance, by = "ID") %>%
  # Take the higher of the two counts row-wise, to handle the case where
  # a country name is repeated within the same row
  mutate(Occurance = pmax(n.x, n.y)) %>%
  # Group by the pair of countries
  group_by(country.x, country.y) %>%
  # Sum the occurrences of each pair of countries over all rows
  summarise(Occurance = sum(Occurance)) %>%
  # Spread the data to put it in matrix format
  tidyr::spread(key = "country.y", value = "Occurance", fill = 0)

# # A tibble: 4 x 5
# # Groups:   country.x [4]
# country.x China England Greece   USA
# <chr>     <dbl>   <dbl>  <dbl> <dbl>
# China         2       2      2     0
# England       2       6      1     1
# Greece        2       1      3     1
# USA           0       1      1     1
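
Note that this answer sums pmax(n.x, n.y) per row, while the crossprod-based answer sums the product of the two counts. On the sample data the two rules give the same matrix, because no row contains two different countries that both occur more than once. The base R sketch below shows where they would diverge; the two-row count table and the helper pair_sum are hypothetical and exist only for this illustration.

```r
# Hypothetical per-row country counts (rows: IDs, columns: countries)
tab <- rbind(c(England = 2, China = 2),
             c(England = 3, China = 1))

# Hypothetical helper: for every pair of countries, sum f(n_i, n_j)
# over all rows in which both countries occur
pair_sum <- function(tab, f) {
  out <- matrix(0, ncol(tab), ncol(tab),
                dimnames = list(colnames(tab), colnames(tab)))
  for (r in seq_len(nrow(tab))) {
    n <- tab[r, ]
    both <- outer(n > 0, n > 0, "&")   # TRUE where both countries are present
    out <- out + both * outer(n, n, f)
  }
  diag(out) <- colSums(tab)            # diagonal: raw counts, as in the question
  out
}

by_max  <- pair_sum(tab, pmax)  # England-China: max(2, 2) + max(3, 1) = 5
by_prod <- pair_sum(tab, `*`)   # England-China: 2 * 2 + 3 * 1 = 7
```

Judging from the asker's comment about assigning the value of the more frequent of the two strings, the pmax rule appears closer to the stated intent, but which rule is right depends on what the counts are supposed to mean.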

Thank you very much! Especially for the explanations. - Seb

1
An option using base::table:
df <- data.frame(ID = c(1,2,3,4), 
    V1 = c("England", "England", "China", "England"),
    V2 = c("Greece", "England", "Greece", "England"),
    V3 = c("USA", "China", "Greece", "England"))

#get paired combi and remove those from same country
pairs <- as.data.frame(do.call(rbind, 
    by(df, df$ID, function(x) t(combn(as.character(x[-1L]), 2L)))))
pairs <- pairs[pairs$V1!=pairs$V2, ]

#repeat data frame with columns swap so that 
#upper and lower tri have same numbers and all countries are shown
pairs <- rbind(pairs, data.frame(V1=pairs$V2, V2=pairs$V1))

#tabulate pairs
tab <- table(pairs)

#set diagonals to be the count of countries
cnt <- c(table(unlist(df[-1L])))
diag(tab) <- cnt[names(diag(tab))]

tab

Output:

         V2
V1        China England Greece USA
  China       2       2      2   0
  England     2       6      1   1
  Greece      2       1      3   1
  USA         0       1      1   1
