How to convert a character matrix to a numeric matrix without affecting row/column names


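I am currently trying to run a PCA on log2cpm data from RNA expression. I have preprocessed the data as follows:

  • Load my expression dataset.
  • Filter the genes based on a selection of genes (a scored list) that I want to investigate further.

Setting up the dataset of control and treatment groups: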

library(dplyr)  # provides %>% and filter()

# Load the expression dataset
dataset <- read.table("log2cpm.txt", sep = "\t", header = TRUE, row.names = NULL) %>% na.omit()
dataset <- dataset[!duplicated(dataset$hgnc_symbol), ]
row.names(dataset) <- dataset$hgnc_symbol
# Load the gene database (selection)
gene_DB <- read.table("TableS1.txt", sep = "\t", header = TRUE)
gene_DB <- gene_DB[!duplicated(gene_DB$Symbol), ]
row.names(gene_DB) <- gene_DB$Symbol

I then filtered the genes:

#Filter genes from dataset based on imported database
dataset_filtered <- dataset %>% filter(hgnc_symbol %in% gene_DB$Symbol)

Next I transposed (flipped) the data frame and converted it to a matrix:

data_tsc <- t(as.matrix(dataset_filtered))
colnames(data_tsc) <- c(data_tsc[2, 1:ncol(data_tsc)])
data_tsc <- data_tsc[c(-1, -2), ]

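As you can see in the code, I always try to keep the row names (samples) and column names (genes), so that I can make sense of things and keep track of the 300+ genes when doing the PCA and data processing.
However, this approach breaks down when I run the matrix (data_tsc) through the PCA:

#Run PCA####
pca <- prcomp(data_tsc[, c(1:ncol(data_tsc))], center = TRUE, scale. = TRUE)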

This returns:

Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

After some intense googling I identified the problem: the as.matrix()/t() step converted the numeric values to chr.
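The coercion described above can be reproduced on a toy example. This is only a sketch under assumptions: the data frame `df` and its values are made up, and the idea of selecting numeric columns before `as.matrix()` is one possible way to avoid the problem, not something taken from the thread.

```r
# Toy data frame: one character column (gene symbol) plus numeric samples.
# All names and values here are made up for illustration.
df <- data.frame(
  hgnc_symbol = c("ACVR1", "ADAM17", "AGER"),
  LIG_UT_1 = c(4.89, 6.58, 2.83),
  LIG_UT_2 = c(4.81, 6.48, 2.88),
  stringsAsFactors = FALSE
)

# A matrix can hold only one type, so the character column forces
# every cell to become a character string.
m_all <- t(as.matrix(df))
print(typeof(m_all))

# Keeping only the numeric columns before as.matrix() avoids the coercion.
num_cols <- vapply(df, is.numeric, logical(1))
m_num <- t(as.matrix(df[, num_cols, drop = FALSE]))
colnames(m_num) <- df$hgnc_symbol  # genes become the column names
print(typeof(m_num))
print(dimnames(m_num))
```

With this layout the samples end up as row names and the genes as column names, which is the shape prcomp() expects when each row is one observation.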

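I have tried many times to fix this with functions such as apply, lapply, as.numeric and so on. I have googled a lot of solutions, but every suggestion either scrambles my rows and columns or ruins the whole dataset.

So, is there a quick and simple way to convert the chr values to numeric while still keeping my rows and columns? Thanks a lot! :D

P.S. I am only just learning to code and have run into a few problems.

Edit:

NelsonGon asked me to provide this input: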

dput(head(data_tsc))

which returned:

structure(c("4,891962697", "4,807689723", "5,07457417", "5,086369154", 
"4,914961379", "4,83431453", "6,583923027", "6,482957338", "6,587420199", 
"6,532262901", "6,438933039", "6,448834899", "2,832721409", "2,881398092", 
"2,389231753", "2,780670224", "2,417835957", "2,761576388", "7,494008371", 
"7,58143903", "7,62969704", "7,579694323", "7,438227488", "7,513190279", 
"6,257073157", "6,351044394", "6,313216639", "6,597298125", "6,112566161", 
"6,315617767", "6,822914122", "6,660904066", "6,925653718", "7,379973187", 
"6,804033651", "6,443382931", "5,271577287", "5,510134745", "5,418971124", 
"5,551120518", "5,302474278", "5,552416478", "5,165993558", "5,030291607", 
"5,145076323", "4,905049925", "5,202651513", "5,250135996", "2,827019018", 
"2,626020468", "2,702723667", "2,575260635", "2,30347029", "2,449794083", 
"5,866824758", "5,881522359", "5,913145862", "5,922174742", "5,869024665", 
"5,896680873"), .Dim = c(6L, 10L), .Dimnames = list(c("LIG_UT_1", 
"LIG_UT_2", "LIG_UT_3", "LIG_UT_4", "LIG_UT_5", "LIG_UT_6"), 
    c("ACVR1", "ADAM17", "AGER", "AKT1", "ANPEP", "ANXA1", "AR", 
    "ATM", "AURKA", "AXIN1")))

Edit after the second suggestion: I made a change in read.table():
dataset <- read.table("log2cpm.txt", sep="\t", header = TRUE, row.names =  NULL, dec = ",")

that is, I specified dec = ",".
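To illustrate what the `dec` argument changes, here is a small sketch that reads the same inline text twice. The string `raw` is a made-up stand-in for a tab-separated file with decimal commas, not the real log2cpm.txt.

```r
# Inline stand-in for a tab-separated file with decimal commas
# (made-up values, not the real log2cpm.txt).
raw <- "hgnc_symbol\tLIG_UT_1\tLIG_UT_2\nACVR1\t4,89\t4,81\nADAM17\t6,58\t6,48\n"

# Default dec = ".": "4,89" does not parse as a number, so the column stays text.
default_dec <- read.table(text = raw, sep = "\t", header = TRUE,
                          stringsAsFactors = FALSE)

# dec = ",": the comma is treated as the decimal mark and the column is numeric.
comma_dec <- read.table(text = raw, sep = "\t", header = TRUE, dec = ",",
                        stringsAsFactors = FALSE)

print(sapply(default_dec, class))
print(sapply(comma_dec, class))
```

Note that the gene symbol column is still character in both cases, so converting the whole data frame with as.matrix() would still give a character matrix.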

This produces the following dput output:

structure(c(" 4.8919627", " 4.8076897", " 5.0745742", " 5.0863692", 
"4.9149614", "4.8343145", "6.5839230", "6.4829573", "6.5874202", 
"6.5322629", "6.4389330", "6.4488349", "2.8327214", "2.8813981", 
"2.3892318", "2.7806702", "2.4178360", "2.7615764", "7.4940084", 
"7.5814390", "7.6296970", "7.5796943", "7.4382275", "7.5131903", 
"6.2570732", "6.3510444", "6.3132166", "6.5972981", "6.1125662", 
"6.3156178", "6.8229141", "6.6609041", "6.9256537", "7.3799732", 
"6.8040337", "6.4433829", "5.2715773", "5.5101347", "5.4189711", 
"5.5511205", "5.3024743", "5.5524165", "5.1659936", "5.0302916", 
"5.1450763", "4.9050499", "5.2026515", "5.2501360", "2.8270190", 
"2.6260205", "2.7027237", "2.5752606", "2.3034703", "2.4497941", 
"5.8668248", "5.8815224", "5.9131459", "5.9221747", "5.8690247", 
"5.8966809"), .Dim = c(6L, 10L), .Dimnames = list(c("LIG_UT_1", 
"LIG_UT_2", "LIG_UT_3", "LIG_UT_4", "LIG_UT_5", "LIG_UT_6"), 
    c("ACVR1", "ADAM17", "AGER", "AKT1", "ANPEP", "ANXA1", "AR", 
    "ATM", "AURKA", "AXIN1")))

Following Adam's earlier suggestion, I added dec = "," in read.table() and afterwards used the following code:

dataset_numeric <- apply(data_tsc, 2, as.numeric)
# apply() drops the row names, so restore them from the original matrix
rownames(dataset_numeric) <- rownames(data_tsc)
colMeans(dataset_numeric)

With that I managed to convert the character values to numeric while keeping my rows and columns. The PCA runs fine, and:

is.numeric(dataset_numeric)

[1] TRUE

Thanks for the help, everyone. I was about to lose all my hair out of frustration.


You can provide sample data with dput(head(dataset_name)) for better reproducibility. - NelsonGon
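Oh no! That is meant for adding data to the question, not for solving it. Copy its output and paste it into the question. - NelsonGon
Hey, check the edit. I have added what you suggested :) - NewbieCoder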
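Is it too long? You only need to copy and paste it exactly as it is; that is the only (best) way for people to access your data. - NelsonGon
Yes, it was too long. I added a 10x10 dput(), but the problem is still there, so that was a good insight :) Check the edit. - NewbieCoder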
2 Answers

The problem is probably that the decimal mark is a comma instead of a period. Try converting that first:
dataset_numeric <- sub(",",".",dataset)

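Once that is done, the rest should be fairly straightforward. From here on this is arguably a duplicate of the following question, with the added requirement of keeping the row names: "Convert a character matrix to a numeric matrix". So in this case you can modify it slightly: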
dataset_numeric <- apply(dataset_numeric, 2, as.numeric)
rownames(dataset_numeric) <- rownames(dataset)

Or go with this option:
class(dataset_numeric) <- "numeric"
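The two options above can be compared on a small example. This is a sketch under assumptions: the matrix `m_chr` is a made-up subset shaped like the dput() output in the question, and `storage.mode<-` is swapped in here as a close relative of the class() assignment shown above.

```r
# Small character matrix with decimal commas and dimnames
# (made-up subset shaped like the dput() output in the question).
m_chr <- matrix(
  c("4,89", "4,81", "6,58", "6,48"),
  nrow = 2,
  dimnames = list(c("LIG_UT_1", "LIG_UT_2"), c("ACVR1", "ADAM17"))
)

# sub() keeps the attributes of its input, so dim and dimnames survive.
m_dot <- sub(",", ".", m_chr, fixed = TRUE)

# apply() rebuilds the matrix column by column; as.numeric() drops names,
# so the row names are lost and have to be restored by hand.
m_apply <- apply(m_dot, 2, as.numeric)
rownames(m_apply) <- rownames(m_dot)

# Changing the storage mode converts the values and keeps all attributes.
m_mode <- m_dot
storage.mode(m_mode) <- "double"

print(m_mode)
print(all.equal(m_apply, m_mode))
```

The second variant needs no separate step for the row names, which makes it harder to end up with a matrix whose rows no longer match the samples.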

Test:
prcomp(dataset_numeric, center = TRUE, scale = TRUE)

This runs without errors:
Standard deviations (1, .., p=6):
[1] 2.191373e+00 1.464462e+00 1.331818e+00 1.002092e+00 5.246949e-01 3.755055e-15

Rotation (n x k) = (10 x 6):
               PC1         PC2         PC3         PC4         PC5         PC6
ACVR1  -0.33509491 -0.32378624  0.35207791  0.04650037 -0.22465986 -0.07403592
ADAM17 -0.26169241 -0.47259488 -0.30394898 -0.13763357 -0.18328981  0.41562880
AGER   -0.07354562  0.38073508 -0.56645061  0.26681868 -0.28597500  0.12602119
AKT1   -0.37111066  0.01674254 -0.07923664 -0.48941844  0.56009962  0.31877982
ANPEP  -0.41234145  0.25398752 -0.06276181  0.12397346 -0.28744359  0.12200886
ANXA1  -0.34908735 -0.20718967  0.18610579  0.51004989 -0.01539492  0.28629143
AR     -0.23808868  0.54584757  0.08481153 -0.27218135 -0.07711181  0.16714943
ATM     0.37104240 -0.14079095  0.04995052 -0.44945864 -0.56884559  0.33723134
AURKA  -0.20262305 -0.29758992 -0.57407802 -0.16727601 -0.03025329 -0.51762461
AXIN1  -0.38573848  0.11317416  0.28050560 -0.29761514 -0.32731009 -0.44477115

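Hi, please see the second edit in the main question above :) - NewbieCoder
That should take care of the sub() line. Is the rest still not working? - user10917479
Hey, it worked! Thank you so much. Please see the solution in the original question :) - NewbieCoder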
By the way, I ticked the checkmark to mark this as the accepted answer :) - NewbieCoder

I suggest using dplyr::mutate in R to solve this:
mat.char %>% data.frame() %>% mutate(across(where(is.character), as.numeric)) %>% as.matrix() -> mat.num
