Let's imagine you're dealing with something like:
mydf <- data.frame(
  V1 = c("peanut butter sandwich", "peanut butter and jam sandwich"),
  V2 = c("2 slices of bread 1 tablespoon peanut butter",
         "2 slices of bread 1 tablespoon peanut butter 1 tablespoon jam"))
mydf
## V1
## 1 peanut butter sandwich
## 2 peanut butter and jam sandwich
## V2
## 1 2 slices of bread 1 tablespoon peanut butter
## 2 2 slices of bread 1 tablespoon peanut butter 1 tablespoon jam
You can start by adding a delimiter that you don't expect to appear in "V2", and then use cSplit from my "splitstackshape" package to reshape the data into a "long" format.
library(splitstackshape)
mydf$V2 <- gsub(" (\\d+)", "|\\1", mydf$V2)
cSplit(mydf, "V2", "|", "long")
## V1 V2
## 1: peanut butter sandwich 2 slices of bread
## 2: peanut butter sandwich 1 tablespoon peanut butter
## 3: peanut butter and jam sandwich 2 slices of bread
## 4: peanut butter and jam sandwich 1 tablespoon peanut butter
## 5: peanut butter and jam sandwich 1 tablespoon jam
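To see what the gsub step contributes on its own, here is a minimal base R sketch on a single string. The names x and marked are only illustrative, and the final strsplit call is a stand-in to show the pieces; it is not how cSplit works internally.

```r
# One ingredient string, taken from the second row of the example data
x <- "2 slices of bread 1 tablespoon peanut butter 1 tablespoon jam"

# Replace each space that precedes a run of digits with "|",
# keeping the digits themselves via the \\1 backreference
marked <- gsub(" (\\d+)", "|\\1", x)
marked

# Splitting on the literal "|" then recovers one ingredient per element
pieces <- strsplit(marked, "|", fixed = TRUE)[[1]]
pieces
```

The first ingredient has no leading space, so it is left alone, which is why the delimiter only appears between ingredients.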
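The following aren't different enough to post as separate answers, since they are variations on @Jota's approach, but I'm sharing them here for completeness: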
"data.table"
The unlisted output of strsplit is automatically expanded into a single column, with V1 repeated for each piece....
library(data.table)
as.data.table(mydf)[, list(
  V2 = unlist(strsplit(as.character(V2), '\\s(?=\\d)', perl=TRUE))), by = V1]
"dplyr" + "tidyr"
You can use unnest from "tidyr" to expand the list column into a long format....
library(dplyr)
library(tidyr)
mydf %>%
  mutate(V2 = strsplit(as.character(V2), " (?=\\d)", perl=TRUE)) %>%
  unnest(V2)
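Both variations rely on the pattern " (?=\\d)" (or '\\s(?=\\d)'), where (?=...) is a Perl-style lookahead: it requires a digit to follow the space but does not consume it. A small base R sketch of the difference, using an illustrative string x that is not part of the original data:

```r
x <- "2 slices of bread 1 tablespoon jam"

# With the lookahead, only the space is matched and removed;
# the digit stays at the start of the next piece
with_lookahead <- strsplit(x, " (?=\\d)", perl = TRUE)[[1]]
with_lookahead

# Without the lookahead, the digit is part of the match,
# so it is dropped from the result
without_lookahead <- strsplit(x, " \\d", perl = TRUE)[[1]]
without_lookahead
```

The lookahead needs perl = TRUE, since R's default regex engine does not support it.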
It might be worth explaining what (?=) does. - tblznbitsstack
You could use setNames to shorten the code. +1 - A5C1D2H2I1M1N2O1R2T1
library(dplyr);library(tidyr);df %>% mutate(txt = strsplit(txt, " (?=\\d)", perl=TRUE)) %>% unnest(txt) ... - A5C1D2H2I1M1N2O1R2T1