Transposing data and computing the Pearson correlation coefficient

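I am very new to coding and need to run some statistics on a dataset, such as the Pearson correlation, but I am having trouble manipulating the data. As far as I understand, I need to transpose my data to compute the Pearson correlation between users, and that is where my problem is. First, the column names become a new row instead of the new column names. Then I get a message saying that my values are not numeric.
I also have some NAs, and I am trying to compute the correlation with this code.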
cor(cr, use = "complete.obs", method = "pearson")
Error in cor(cr1, use = "complete.obs", method = "pearson") : 
  'x' must be numeric

I need the correlation between Victoria and Nuria, which should come out as 0.3651484.

Here is the dput of my dataset:

> dput(cr)
structure(list(User = structure(c(8L, 10L, 2L, 17L, 11L, 1L, 
18L, 9L, 7L, 5L, 3L, 14L, 13L, 4L, 20L, 6L, 16L, 12L, 15L, 19L
), .Label = c("Ana", "Anton", "Bernard", "Carles", "Chris", "Ivan", 
"Jim", "John", "Marc", "Maria", "Martina", "Nadia", "Nerea", 
"Nuria", "Oriol", "Rachel", "Roger", "Sergi", "Valery", "Victoria"
), class = "factor"), Star.Wars.IV...A.New.Hope = c(1L, 5L, NA, 
NA, 4L, 2L, NA, 4L, 5L, 4L, 2L, 3L, 2L, 3L, 4L, NA, NA, 4L, 5L, 
1L), Star.Wars.VI...Return.of.the.Jedi = c(5L, 3L, NA, 3L, 3L, 
4L, NA, NA, 1L, 2L, 1L, 5L, 3L, NA, 4L, NA, NA, 5L, 1L, 2L), 
    Forrest.Gump = c(2L, NA, NA, NA, 4L, 4L, 3L, NA, NA, NA, 
    5L, 2L, NA, 3L, NA, 1L, NA, 1L, NA, 2L), The.Shawshank.Redemption = c(NA, 
    2L, 5L, NA, 1L, 4L, 1L, NA, 4L, 5L, NA, NA, 5L, NA, NA, NA, 
    NA, 5L, NA, 4L), The.Silence.of.the.Lambs = c(4L, 4L, 2L, 
    NA, 4L, NA, 1L, 3L, 2L, 3L, NA, 2L, 4L, 2L, 5L, 3L, 4L, 1L, 
    NA, 5L), Gladiator = c(4L, 2L, NA, 1L, 1L, NA, 4L, 2L, 4L, 
    NA, 5L, NA, NA, NA, 5L, 2L, NA, 1L, 4L, NA), Toy.Story = c(2L, 
    1L, 4L, 2L, NA, 3L, NA, 2L, 4L, 4L, 5L, 2L, 4L, 3L, 2L, NA, 
    2L, 4L, 2L, 2L), Saving.Private.Ryan = c(2L, NA, NA, 3L, 
    4L, 1L, 5L, NA, 4L, 3L, NA, NA, 5L, NA, NA, 2L, NA, NA, 1L, 
    3L), Pulp.Fiction = c(NA, NA, NA, 4L, NA, 4L, 2L, 3L, NA, 
    4L, NA, 1L, NA, NA, 3L, NA, 2L, 5L, 3L, 2L), Stand.by.Me = c(3L, 
    4L, 1L, NA, 1L, 4L, NA, NA, 1L, NA, NA, NA, NA, 4L, 5L, 1L, 
    NA, NA, 3L, 2L), Shakespeare.in.Love = c(2L, 3L, NA, NA, 
    5L, 5L, 1L, NA, 2L, NA, NA, 3L, NA, NA, NA, 5L, 2L, NA, 3L, 
    1L), Total.Recall = c(NA, 2L, 1L, 4L, 1L, 2L, NA, 2L, 3L, 
    NA, 3L, NA, 2L, 1L, 1L, NA, NA, NA, 1L, NA), Independence.Day = c(5L, 
    2L, 4L, 1L, NA, 4L, NA, 3L, 1L, 2L, 2L, 3L, 4L, 2L, 3L, NA, 
    NA, NA, NA, NA), Blade.Runner = c(2L, NA, 4L, 3L, 4L, NA, 
    3L, 2L, NA, NA, NA, NA, NA, 2L, NA, NA, NA, 4L, NA, 5L), 
    Groundhog.Day = c(NA, 2L, 1L, 5L, NA, 1L, NA, 4L, 5L, NA, 
    NA, 2L, 3L, 3L, 2L, 5L, NA, NA, NA, 5L), The.Matrix = c(4L, 
    NA, 1L, NA, 3L, NA, 1L, NA, NA, 2L, 1L, 5L, NA, 5L, NA, 2L, 
    4L, NA, 2L, 4L), Schindler.s.List = c(2L, 5L, 2L, 5L, 5L, 
    NA, NA, 1L, NA, 5L, NA, NA, NA, 1L, 3L, 2L, NA, 2L, NA, 3L
    ), The.Sixth.Sense = c(5L, 1L, 3L, 1L, 5L, 3L, NA, 3L, NA, 
    1L, 2L, NA, NA, NA, NA, 4L, NA, 1L, NA, 5L), Raiders.of.the.Lost.Ark = c(NA, 
    3L, 1L, 1L, NA, NA, 5L, 5L, NA, NA, 1L, NA, 5L, NA, 3L, 3L, 
    NA, 2L, NA, 3L), Babe = c(NA, NA, 3L, 2L, NA, 2L, 2L, NA, 
    5L, NA, 4L, 2L, NA, NA, 1L, 4L, NA, 5L, NA, NA)), .Names = c("User", 
"Star.Wars.IV...A.New.Hope", "Star.Wars.VI...Return.of.the.Jedi", 
"Forrest.Gump", "The.Shawshank.Redemption", "The.Silence.of.the.Lambs", 
"Gladiator", "Toy.Story", "Saving.Private.Ryan", "Pulp.Fiction", 
"Stand.by.Me", "Shakespeare.in.Love", "Total.Recall", "Independence.Day", 
"Blade.Runner", "Groundhog.Day", "The.Matrix", "Schindler.s.List", 
"The.Sixth.Sense", "Raiders.of.the.Lost.Ark", "Babe"), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

Can anyone help me?

2 Answers


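In addition to @Niek's answer, here is a summary. First transpose the data frame with t(), excluding the first column (it contains names, which are not numeric and so cannot be used in the correlation); in the same step, assign those names to the new columns. Then compute the specific correlation. The whole solution looks like this: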

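# cr[[1]] returns a plain vector of names even when cr is a tibble (cr[, 1] would not)
cr2 <- setNames(as.data.frame(t(cr[, -1])), as.character(cr[[1]]))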
with(cr2, cor(Victoria, Nuria, use = "complete.obs"))
[1] 0.3651484

Or, for the whole correlation matrix:

cor(cr2, use = "pairwise.complete.obs")
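
To read a single pair out of the full matrix, you can index it by user name. A minimal sketch on made-up data (the ratings data frame and the user names Ann, Bob and Cat are hypothetical, not taken from the question):

```r
# Hypothetical toy ratings: rows are films, columns are users
# (the same layout as cr2 after transposing).
ratings <- data.frame(
  Ann = c(1, 2, 3, 4, NA),
  Bob = c(2, 4, 6, 8, 10),
  Cat = c(5, NA, 3, 2, 1)
)

# Full user-by-user matrix, each pair using the films both users rated.
cm <- cor(ratings, use = "pairwise.complete.obs")

# One pair, looked up by row and column name.
pair_from_matrix <- cm["Ann", "Bob"]

# The same pair computed directly from the two columns.
pair_direct <- with(ratings, cor(Ann, Bob, use = "complete.obs"))
```

Both values agree, because for a single pair of columns the two deletion rules keep exactly the same films.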

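Thanks for your help, I needed the correlation between Victoria and Nuria. I modified your code slightly and it works: cr1 <- transpose(cr); cr2 <- as.data.frame(sapply(cr1, function(x) as.numeric(x))); with(cr2, cor(V12, V15, use = "complete.obs", method = "pearson")). I just need to rename V1, V2, etc. to the actual names. Is there an easy way to do that? - bgg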
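
On the renaming question in the comment above: one way, sketched here in base R only, is to overwrite the column names with the values of the name column after transposing. The cr_small data frame and its film columns are hypothetical stand-ins for the real data, and this assumes the names are unique.

```r
# Hypothetical miniature of the original layout: one name column, then ratings.
cr_small <- data.frame(
  User  = factor(c("John", "Maria", "Anton")),
  FilmA = c(1L, 5L, NA),
  FilmB = c(5L, 3L, 4L),
  FilmC = c(2L, NA, 3L)
)

# Transpose only the rating columns so the values stay numeric.
cr2_small <- as.data.frame(t(cr_small[, -1]))

# Replace whatever default column names came out of the transpose
# with the actual user names.
names(cr2_small) <- as.character(cr_small$User)

# Columns can now be addressed by user name.
r_john_anton <- with(cr2_small, cor(John, Anton, use = "complete.obs"))
```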

This code should give you the correlation matrix between all users.
cr2 <- t(cr[, 2:21])                     # Transpose (first column contains names)
colnames(cr2) <- as.character(cr[[1]])   # Assign names to columns; cr[[1]] is a plain vector even if cr is a tibble

cor(cr2,use="complete.obs") # Gives an error because there are no complete obs
# Error in cor(cr2, use = "complete.obs") : no complete element pairs

cor(cr2,use="pairwise.complete.obs") # use pairwise deletion

The correlation between Victoria and Nuria is 0.36514837 (using pairwise deletion).
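
To see why the two settings behave differently on a whole matrix, here is a small sketch on invented data (the matrix m and its columns a, b, c are hypothetical). Every row has at least one NA, so listwise deletion leaves nothing to work with, while pairwise deletion still finds shared rows for each pair of columns.

```r
# Hypothetical matrix in which no row is complete.
m <- cbind(
  a = c(1, 2, 3, 4, 5, NA, NA),
  b = c(2, 1, 4, NA, NA, 3, 5),
  c = c(NA, NA, NA, 1, 2, 3, 1)
)

n_complete_rows <- sum(complete.cases(m))

# Listwise deletion needs rows without any NA; capture the error message
# instead of stopping the script.
listwise <- tryCatch(
  cor(m, use = "complete.obs"),
  error = function(e) conditionMessage(e)
)

# Pairwise deletion uses, for each pair of columns, the rows where both are present.
pairwise <- cor(m, use = "pairwise.complete.obs")
```

Here listwise ends up holding the error message, and pairwise is a full 3 x 3 matrix. Note that with pairwise deletion each entry can be based on a different set of rows.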

Edit: to get the correlation between Victoria and Nuria using only listwise deletion, run the code above and then the following.

cr2<-as.data.frame(cr2)
with(cr2, cor(Victoria, Nuria, use = "complete.obs", method = "pearson"))
[1] 0.3651484
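
For just two users, listwise and pairwise deletion keep the same films, which is why both routes give the same number. A sketch using the ratings of Nuria (row 12) and Victoria (row 15) copied from the dput in the question; the variable names are my own:

```r
# Ratings copied from the question's dput, one value per film, in column order.
nuria    <- c(3, 5, 2, NA, 2, NA, 2, NA, 1, NA, 3, NA, 3, NA, 2, 5, NA, NA, NA, 2)
victoria <- c(4, 4, NA, NA, 5, 5, 2, NA, 3, 5, NA, 1, 3, NA, 2, NA, 3, NA, 3, 1)

# Films rated by both users.
both <- complete.cases(nuria, victoria)
n_shared <- sum(both)  # 8

r_listwise <- cor(victoria, nuria, use = "complete.obs")
r_pairwise <- cor(victoria, nuria, use = "pairwise.complete.obs")
r_manual   <- cor(victoria[both], nuria[both])
# All three are 0.3651484, computed from the 8 shared films.
```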

Thanks for your help. I got this error: > colnames(cr2)<-cr[,1] Error in dimnames(x) <- dn : length of 'dimnames' [2] not equal to array extent - bgg
I re-ran the code and did not get that error. Maybe you forgot to exclude the first column when transposing? 'cr2' should be a 20x20 matrix. - Niek
Not sure what happened; this is the only way I could get it to work: cr1 <- transpose(cr); cr2 <- as.data.frame(sapply(cr1, function(x) as.numeric(x))); with(cr2, cor(V12, V15, use = "complete.obs", method = "pearson")). I would really like to use the names instead of the V columns, but at least it works. - bgg
I got it working. I had been running cr <- tbl_df(cr) beforehand, but if I skip that step it works. Not sure why, but I'm glad it works now! Thanks - bgg
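
A likely explanation for the tbl_df difference (my assumption; nobody in the thread confirms it): on a plain data frame, cr[, 1] drops to a vector of names, whereas on a tibble a single-column subset stays a one-column table, so it cannot be used directly as a vector of column names. The sketch below imitates the tibble behaviour in base R with drop = FALSE, so it runs without the tibble package; df is a hypothetical two-user example.

```r
# Hypothetical two-user example.
df <- data.frame(
  User = c("Ana", "Anton"),
  Film = c(1, 2),
  stringsAsFactors = FALSE
)

as_vector <- df[, 1]                # plain data.frame: drops to a vector
as_table  <- df[, 1, drop = FALSE]  # stays a one-column table, like a tibble subset
by_double <- df[[1]]                # always a vector, for data.frame and tibble alike

# Using the [[ ]] form makes the naming step independent of the table class.
m <- t(df[, -1, drop = FALSE])
colnames(m) <- by_double
```

Writing cr[[1]] or cr$User instead of cr[, 1] should therefore work whether or not the data has been converted with tbl_df.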
Exclude the first column when transposing. So try t(cr[,2:21]) or t(cr[,-1]) instead of t(cr) - Niek
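
The reason this matters: t() first converts the data frame to a matrix, and a matrix can hold only one type, so a single text column turns every rating into text, and cor() then reports that 'x' must be numeric. A small sketch on a hypothetical two-user data frame (cr_demo and its columns are invented for illustration):

```r
# Hypothetical miniature of the original layout.
cr_demo <- data.frame(
  User  = c("John", "Maria"),
  FilmA = c(1L, 5L),
  FilmB = c(5L, 3L),
  stringsAsFactors = FALSE
)

with_names    <- t(cr_demo)        # name column included: a character matrix
without_names <- t(cr_demo[, -1])  # ratings only: an integer matrix

type_with    <- typeof(with_names)     # "character"
type_without <- typeof(without_names)  # "integer"
```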
