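I am currently trying to run a PCA on log2cpm data from RNA expression. I have preprocessed the data as follows:
- Loading my expression dataset
- Filtering the genes down to a selection (a subset list) that I want to investigate further.
Setting up the dataset for the control and treated groups: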
library(dplyr) # provides %>% and filter()
dataset <- read.table("log2cpm.txt", sep="\t", header = TRUE, row.names = NULL) %>% na.omit() # dataset
dataset <- dataset[!duplicated(dataset$hgnc_symbol), ]
row.names(dataset) <- dataset$hgnc_symbol
# Set gene database
gene_DB <- read.table("TableS1.txt", sep="\t", header = TRUE) #selection
gene_DB <- gene_DB[!duplicated(gene_DB$Symbol), ]
row.names(gene_DB) <- gene_DB$Symbol
I then filtered the genes:
#Filter genes from dataset based on imported database
dataset_filtered <- dataset %>% filter(hgnc_symbol %in% gene_DB$Symbol)
Next, I transposed the data frame and converted it to a matrix:
data_tsc <- t(as.matrix(dataset_filtered))
colnames(data_tsc) <- c(data_tsc[2,1:ncol(data_tsc)])
data_tsc <- data_tsc[c(-1,-2),]
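As you can see in the code, I always try to keep the row names (samples) and column names (genes), so that I can make sense of the PCA and later data handling and keep track of more than 300 genes.
However, this approach fails when I run the matrix (data_tsc) through the PCA.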
#Run PCA####
pca <- prcomp(data_tsc[,c(1:ncol(data_tsc))], center = TRUE,scale. = TRUE)
This returns:
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
After some intense googling, I identified the problem: the as.matrix and t() operations convert the numeric values to chr.
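The coercion described above can be reproduced on a small example. The sketch below uses a made-up data frame (`df` and its values are illustrative stand-ins, not the real `dataset_filtered`) and assumes the expression columns were read in as numeric. It shows that a single character column, such as the gene symbols, is enough to turn the whole matrix into character, and that dropping that column before `as.matrix()` keeps the values numeric.

```r
# Toy stand-in for dataset_filtered (hypothetical values and names).
df <- data.frame(
  hgnc_symbol = c("ACVR1", "ADAM17", "AGER"),
  LIG_UT_1 = c(4.89, 6.58, 2.83),
  LIG_UT_2 = c(4.81, 6.48, 2.88),
  stringsAsFactors = FALSE
)

# A matrix holds one type only, so the character column
# forces every cell to character.
m_chr <- t(as.matrix(df))
typeof(m_chr)

# Keep only the numeric columns, carry the symbols over as row names,
# then transpose so samples are rows and genes are columns.
num_cols <- vapply(df, is.numeric, logical(1))
m_num <- as.matrix(df[, num_cols, drop = FALSE])
rownames(m_num) <- df$hgnc_symbol
m_num <- t(m_num)
typeof(m_num)
```

If the expression columns themselves arrive as character (for example because of a decimal comma in the file), this selection would drop them too, so the import has to be fixed first.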
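I have tried many times to fix this with functions such as apply, lapply, and as.numeric. I have googled plenty of solutions, but every suggestion either scrambles my rows and columns or wrecks the whole dataset.
So, is there a quick and easy way to convert the chr values to numeric while still keeping my rows and columns? Many thanks! :D
P.S. I am just learning to program and have run into a few problems.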
Edit:
NelsonGon asked me to provide this input:
dput(head(data_tsc))
which returned:
structure(c("4,891962697", "4,807689723", "5,07457417", "5,086369154",
"4,914961379", "4,83431453", "6,583923027", "6,482957338", "6,587420199",
"6,532262901", "6,438933039", "6,448834899", "2,832721409", "2,881398092",
"2,389231753", "2,780670224", "2,417835957", "2,761576388", "7,494008371",
"7,58143903", "7,62969704", "7,579694323", "7,438227488", "7,513190279",
"6,257073157", "6,351044394", "6,313216639", "6,597298125", "6,112566161",
"6,315617767", "6,822914122", "6,660904066", "6,925653718", "7,379973187",
"6,804033651", "6,443382931", "5,271577287", "5,510134745", "5,418971124",
"5,551120518", "5,302474278", "5,552416478", "5,165993558", "5,030291607",
"5,145076323", "4,905049925", "5,202651513", "5,250135996", "2,827019018",
"2,626020468", "2,702723667", "2,575260635", "2,30347029", "2,449794083",
"5,866824758", "5,881522359", "5,913145862", "5,922174742", "5,869024665",
"5,896680873"), .Dim = c(6L, 10L), .Dimnames = list(c("LIG_UT_1",
"LIG_UT_2", "LIG_UT_3", "LIG_UT_4", "LIG_UT_5", "LIG_UT_6"),
c("ACVR1", "ADAM17", "AGER", "AKT1", "ANPEP", "ANXA1", "AR",
"ATM", "AURKA", "AXIN1")))
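The strings in this output use a comma as the decimal mark, which `as.numeric()` does not parse. As a minimal sketch (the matrix `x` below is a four-value excerpt of the output above, and `x_num` is just an illustrative name), the comma can be swapped for a point and the matrix rebuilt with its original dimnames:

```r
# Excerpt of the dput() output: two samples, two genes, decimal commas.
x <- matrix(
  c("4,891962697", "4,807689723", "6,583923027", "6,482957338"),
  nrow = 2,
  dimnames = list(c("LIG_UT_1", "LIG_UT_2"), c("ACVR1", "ADAM17"))
)

# as.numeric() cannot read "," as a decimal mark: the result is NA (with a warning).
suppressWarnings(as.numeric(x[1, 1]))

# Replace the comma with a point, convert, and rebuild the matrix
# with the same shape and the same row and column names.
x_num <- matrix(
  as.numeric(sub(",", ".", x, fixed = TRUE)),
  nrow = nrow(x),
  dimnames = dimnames(x)
)
x_num
```

Fixing the decimal mark at import time, with the `dec` argument of `read.table()`, avoids this post-processing altogether.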
Edit after the second suggestion: I changed the call to read.table().
dataset <- read.table("log2cpm.txt", sep="\t", header = TRUE, row.names = NULL, dec = ",")
Specifying dec = "," produces the following dput output:
structure(c(" 4.8919627", " 4.8076897", " 5.0745742", " 5.0863692",
"4.9149614", "4.8343145", "6.5839230", "6.4829573", "6.5874202",
"6.5322629", "6.4389330", "6.4488349", "2.8327214", "2.8813981",
"2.3892318", "2.7806702", "2.4178360", "2.7615764", "7.4940084",
"7.5814390", "7.6296970", "7.5796943", "7.4382275", "7.5131903",
"6.2570732", "6.3510444", "6.3132166", "6.5972981", "6.1125662",
"6.3156178", "6.8229141", "6.6609041", "6.9256537", "7.3799732",
"6.8040337", "6.4433829", "5.2715773", "5.5101347", "5.4189711",
"5.5511205", "5.3024743", "5.5524165", "5.1659936", "5.0302916",
"5.1450763", "4.9050499", "5.2026515", "5.2501360", "2.8270190",
"2.6260205", "2.7027237", "2.5752606", "2.3034703", "2.4497941",
"5.8668248", "5.8815224", "5.9131459", "5.9221747", "5.8690247",
"5.8966809"), .Dim = c(6L, 10L), .Dimnames = list(c("LIG_UT_1",
"LIG_UT_2", "LIG_UT_3", "LIG_UT_4", "LIG_UT_5", "LIG_UT_6"),
c("ACVR1", "ADAM17", "AGER", "AKT1", "ANPEP", "ANXA1", "AR",
"ATM", "AURKA", "AXIN1")))
Based on Adam's previous suggestion to add dec = "," in read.table(), and then to use the following code:
dataset_numeric <- apply(data_tsc, 2, as.numeric)
rownames(dataset_numeric) <- rownames(data_tsc) # apply() keeps the column names but drops the row names
colMeans(dataset_numeric)
I managed to convert the character values to numeric while keeping my rows and columns. The PCA runs fine, and:
is.numeric(dataset_numeric)
[1] TRUE
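For comparison, here is a hedged alternative to the apply() approach: changing the storage mode of the matrix in place, which keeps both row and column names without restoring them by hand. The sketch uses the first three genes from the dput output above; `storage.mode<-` is base R, and this is one possible variant, not the method suggested in the answers.

```r
# First three genes of the character matrix shown above (values already use ".").
vals <- c(
  "4.8919627", "4.8076897", "5.0745742", "5.0863692", "4.9149614", "4.8343145",
  "6.5839230", "6.4829573", "6.5874202", "6.5322629", "6.4389330", "6.4488349",
  "2.8327214", "2.8813981", "2.3892318", "2.7806702", "2.4178360", "2.7615764"
)
data_tsc <- matrix(
  vals,
  nrow = 6,
  dimnames = list(paste0("LIG_UT_", 1:6), c("ACVR1", "ADAM17", "AGER"))
)

# Convert character to double in place; dim and dimnames are preserved.
dataset_numeric <- data_tsc
storage.mode(dataset_numeric) <- "double"

# PCA with samples as rows and genes as columns.
pca <- prcomp(dataset_numeric, center = TRUE, scale. = TRUE)

rownames(pca$x)         # sample names carried through to the scores
rownames(pca$rotation)  # gene names carried through to the loadings
```

Because the names survive the conversion, the scores in `pca$x` stay labelled by sample and the loadings in `pca$rotation` by gene, which makes it easier to keep track of a few hundred genes.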
Thanks everyone for the help; I nearly lost all my hair out of frustration.
Please provide sample data with dput(head(dataset_name)) for better reproducibility. - NelsonGon