Split multiple string columns of a matrix/data frame into new columns

3

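How can I split multiple columns of a data frame/matrix and save the result as a data frame? Each cell holds two characters, but I want one character per cell, with the second character of each pair in the next column.

I tried the following, but I still cannot get a data frame back from the split results.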

geno_splitted_ls <- apply(mGenotype, 2, strsplit, split="") #split each column

do.call("as.data.frame", geno_splitted_ls) #collect results as dataframe fails
lapply(geno_splitted_ls, data.frame) #collect results as dataframe fails

The data looks like this:

> dput(mGenotype)
structure(c("gg", "gg", "gg", "gg", "gt", "gg", "gg", "tg", "gg", 
"gg", "aa", "aa", "ac", "aa", "ca", "aa", "aa", "aa", "aa", "ac", 
"tt", "tt", "ct", "cc", "tt", "tt", "ct", "tc", "tc", "tt", "aa", 
"aa", "ag", "aa", "ga", "ga", "aa", "aa", "aa", "ag", "aa", "aa", 
"aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "tt", "tt", "tt", 
"tt", "tt", "gt", "tt", "tt", "tt", "tt"), dim = c(10L, 6L), dimnames = list(
    NULL, c("genotype1", "genotype2", "genotype3", "genotype4", 
    "genotype5", "genotype6")))

Could you provide the desired output, even if only a row or two? - dandrews
4 Answers

4

Here is a tidyverse-based solution:

library(tidyverse)

genotype <- structure(c("gg", "gg", "gg", "gg", "gt", "gg", "gg", "tg", "gg", 
                        "gg", "aa", "aa", "ac", "aa", "ca", "aa", "aa", "aa", "aa", "ac", 
                        "tt", "tt", "ct", "cc", "tt", "tt", "ct", "tc", "tc", "tt", "aa", 
                        "aa", "ag", "aa", "ga", "ga", "aa", "aa", "aa", "ag", "aa", "aa", 
                        "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "tt", "tt", "tt", 
                        "tt", "tt", "gt", "tt", "tt", "tt", "tt"), dim = c(10L, 6L), dimnames = list(
                          NULL, c("genotype1", "genotype2", "genotype3", "genotype4", 
                                  "genotype5", "genotype6")))

genotype %>%
  as.data.frame() %>%
  mutate(across(everything(), ~str_split(.x, "", simplify = TRUE)))
#>    genotype1.1 genotype1.2 genotype2.1 genotype2.2 genotype3.1 genotype3.2
#> 1            g           g           a           a           t           t
#> 2            g           g           a           a           t           t
#> 3            g           g           a           c           c           t
#> 4            g           g           a           a           c           c
#> 5            g           t           c           a           t           t
#> 6            g           g           a           a           t           t
#> 7            g           g           a           a           c           t
#> 8            t           g           a           a           t           c
#> 9            g           g           a           a           t           c
#> 10           g           g           a           c           t           t
#>    genotype4.1 genotype4.2 genotype5.1 genotype5.2 genotype6.1 genotype6.2
#> 1            a           a           a           a           t           t
#> 2            a           a           a           a           t           t
#> 3            a           g           a           a           t           t
#> 4            a           a           a           a           t           t
#> 5            g           a           a           a           t           t
#> 6            g           a           a           a           g           t
#> 7            a           a           a           a           t           t
#> 8            a           a           a           a           t           t
#> 9            a           a           a           a           t           t
#> 10           a           g           a           a           t           t

Created on 2023-03-19 with reprex v2.0.2


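Is there a particular reason R adds the .1, .2, etc. suffixes automatically? Or can we adjust this behaviour? Say, just for fun, start with .2 and then .1? Many thanks. I am just curious, and this does not affect your answer! - TarJae
1
Hi @TarJae :) The ".1" and ".2" suffixes are just the default; you can "glue" the column names together however you like (see https://github.com/tidyverse/dplyr/blob/main/R/across.R#L29). For example, genotype %>% as.data.frame() %>% transmute(across(everything(), list(`2` = ~str_split(.x, "", simplify = TRUE)[, 1], `1` = ~str_split(.x, "", simplify = TRUE)[, 2]))) will start with "genotype1_2", and the next column will be "genotype1_1". The source code is a bit "complicated", but across() is very flexible. - jared_mamrot
Thanks @jared_mamrot. That makes sense. I need to use the glue functions more! - TarJae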

3

This is perhaps a less concise version of @jared_mamrot's answer above...

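library(dplyr)

as.data.frame(mGenotype) %>% 
  # first character of every column, stored as first_<column>
  mutate(across(everything(),
                ~ substr(., 1, 1),
                .names = "first_{.col}")) %>% 
  # second character of the original columns, stored as second_<column>
  mutate(across(genotype1:genotype6,
                ~ substr(., 2, 2),
                .names = "second_{.col}")) %>% 
  # drop the original two-character columns
  dplyr::select(!starts_with('genotype'))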

But if, like me, you need to work through things step by step first, this may help.
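For readers who prefer to see the same two steps without dplyr, here is a rough base R sketch. It is not part of the answer above; the 2 x 2 `mGenotype` below is a small stand-in for the matrix in the question, and the `first_`/`second_` prefixes simply mirror the `.names` templates used above.

```r
# Stand-in for the question's matrix (assumption: same layout, fewer cells)
mGenotype <- matrix(c("gg", "gt", "aa", "ca"), nrow = 2,
                    dimnames = list(NULL, c("genotype1", "genotype2")))

# Step 1: take the first character of every cell
first <- matrix(substr(mGenotype, 1, 1), nrow = nrow(mGenotype))
colnames(first) <- paste0("first_", colnames(mGenotype))

# Step 2: take the second character of every cell
second <- matrix(substr(mGenotype, 2, 2), nrow = nrow(mGenotype))
colnames(second) <- paste0("second_", colnames(mGenotype))

# Step 3: bind both halves and convert to a data frame
res <- as.data.frame(cbind(first, second))
```

As in the dplyr version, all `first_` columns come before all `second_` columns; reorder the columns afterwards if each pair should sit side by side.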


2

You can split along MARGIN=1, unlist the elements, and transpose with t().

apply(mGenotype, 1, strsplit, split="") |> sapply(unlist) |> t()
#       genotype11 genotype12 genotype21 genotype22 genotype31 genotype32 genotype41 genotype42 genotype51 genotype52 genotype61 genotype62
#  [1,] "g"        "g"        "a"        "a"        "t"        "t"        "a"        "a"        "a"        "a"        "t"        "t"       
#  [2,] "g"        "g"        "a"        "a"        "t"        "t"        "a"        "a"        "a"        "a"        "t"        "t"       
#  [3,] "g"        "g"        "a"        "c"        "c"        "t"        "a"        "g"        "a"        "a"        "t"        "t"       
#  [4,] "g"        "g"        "a"        "a"        "c"        "c"        "a"        "a"        "a"        "a"        "t"        "t"       
#  [5,] "g"        "t"        "c"        "a"        "t"        "t"        "g"        "a"        "a"        "a"        "t"        "t"       
#  [6,] "g"        "g"        "a"        "a"        "t"        "t"        "g"        "a"        "a"        "a"        "g"        "t"       
#  [7,] "g"        "g"        "a"        "a"        "c"        "t"        "a"        "a"        "a"        "a"        "t"        "t"       
#  [8,] "t"        "g"        "a"        "a"        "t"        "c"        "a"        "a"        "a"        "a"        "t"        "t"       
#  [9,] "g"        "g"        "a"        "a"        "t"        "c"        "a"        "a"        "a"        "a"        "t"        "t"       
# [10,] "g"        "g"        "a"        "c"        "t"        "t"        "a"        "g"        "a"        "a"        "t"        "t"     

If you need a data frame, just add another pipe, |> as.data.frame(), at the end.
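A minimal sketch of that last step, assuming R >= 4.1 for the native pipe and a small stand-in for `mGenotype`. The `_a1`/`_a2` suffixes are a hypothetical naming choice, added only because the default names such as `genotype11` are easy to misread.

```r
# Stand-in for the question's matrix (assumption: same layout, fewer cells)
mGenotype <- matrix(c("gg", "gt", "aa", "ca"), nrow = 2,
                    dimnames = list(NULL, c("genotype1", "genotype2")))

# Split row-wise, flatten each row, transpose back to one row per sample
split_df <- apply(mGenotype, 1, strsplit, split = "") |>
  sapply(unlist) |>
  t() |>
  as.data.frame()

# Optional: replace names like "genotype11" with clearer ones (hypothetical suffixes)
names(split_df) <- paste0(rep(colnames(mGenotype), each = 2), c("_a1", "_a2"))
```

The renaming relies on every cell having exactly two characters, as in the question's data.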

1

Using read.fwf in base R:

read.fwf(textConnection(do.call(paste, c(as.data.frame(genotype), sep = ""))),
     widths = rep(1, max(nchar(c(genotype)) * ncol(genotype))))
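The one-liner above can be hard to follow, so here is the same idea broken into steps, as a sketch on a small stand-in for `genotype`. The `colClasses = "character"` argument is an addition of mine, not part of the answer: it is passed through to `read.table` so the single letters are kept as plain strings.

```r
# Stand-in for the genotype matrix (assumption: same layout, fewer cells)
genotype <- matrix(c("gg", "gt", "aa", "ca"), nrow = 2,
                   dimnames = list(NULL, c("genotype1", "genotype2")))

# Step 1: paste the cells of each row into one string, e.g. "ggaa"
row_strings <- do.call(paste, c(as.data.frame(genotype), sep = ""))

# Step 2: one field of width 1 per character (two characters per cell)
widths <- rep(1, 2 * ncol(genotype))

# Step 3: read the strings back as fixed-width fields
res <- read.fwf(textConnection(row_strings), widths = widths,
                colClasses = "character")
```

The result has the default column names `V1`, `V2`, and so on, so the original genotype names need to be reattached if they matter.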
