Splitting a string based on a number-word pattern

Question

6

I have a data frame that looks like this:

V1                        V2
peanut butter sandwich    2 slices of bread 1 tablespoon peanut butter

What I'm aiming for is:

V1                        V2
peanut butter sandwich    2 slices of bread
peanut butter sandwich    1 tablespoon peanut butter

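I tried splitting the string with strsplit(df$V2, " "), but that splits on every space. I'm not sure whether it's possible to split the string only where a number begins, so that each piece runs from one number up to the character before the next number.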

- yokota

2 Answers

5

Let's imagine you're dealing with something like:

mydf <- data.frame(
  V1 = c("peanut butter sandwich", "peanut butter and jam sandwich"), 
  V2 = c("2 slices of bread 1 tablespoon peanut butter", 
         "2 slices of bread 1 tablespoon peanut butter 1 tablespoon jam"))  

mydf
##                               V1
## 1         peanut butter sandwich
## 2 peanut butter and jam sandwich
##                                                              V2
## 1                  2 slices of bread 1 tablespoon peanut butter
## 2 2 slices of bread 1 tablespoon peanut butter 1 tablespoon jam

You can first insert a delimiter that you don't expect to find in "V2", and then use cSplit from my "splitstackshape" package to get the data into a "long" form.

library(splitstackshape)
mydf$V2 <- gsub(" (\\d+)", "|\\1", mydf$V2)
cSplit(mydf, "V2", "|", "long")
##                                V1                         V2
## 1:         peanut butter sandwich          2 slices of bread
## 2:         peanut butter sandwich 1 tablespoon peanut butter
## 3: peanut butter and jam sandwich          2 slices of bread
## 4: peanut butter and jam sandwich 1 tablespoon peanut butter
## 5: peanut butter and jam sandwich           1 tablespoon jam
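To see what the inserted delimiter does before cSplit reshapes the data, here is a small base R sketch of just the marking and splitting steps. The names v2, marked and pieces are made up for illustration, and this is not meant to describe how cSplit works internally; it also assumes "|" never occurs in the text.

```r
v2 <- c("2 slices of bread 1 tablespoon peanut butter",
        "2 slices of bread 1 tablespoon peanut butter 1 tablespoon jam")

# Replace each space that precedes a run of digits with "|".
# The digits are captured in group 1 and written back, so they are kept.
marked <- gsub(" (\\d+)", "|\\1", v2)
marked[1]
# [1] "2 slices of bread|1 tablespoon peanut butter"

# Split on the literal "|"; fixed = TRUE stops "|" being read as regex alternation.
pieces <- strsplit(marked, "|", fixed = TRUE)

# Number of pieces per original row.
lengths(pieces)
# [1] 2 3
```

If "|" could plausibly appear in your data, pick a different marker character for both the gsub and the split.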

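The following don't warrant separate answers, since they are variations on @Jota's approach, but I'm sharing them here for completeness:

`strsplit` within `data.table`

The list produced by the split gets flattened into a single column....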

library(data.table)
as.data.table(mydf)[, list(
  V2 = unlist(strsplit(as.character(V2), '\\s(?=\\d)', perl=TRUE))), by = V1]

"dplyr" + "tidyr"

You can use unnest from "tidyr" to expand the list column into a long form....

library(dplyr)
library(tidyr)
mydf %>% 
  mutate(V2 = strsplit(as.character(V2), " (?=\\d)", perl=TRUE)) %>% 
  unnest(V2)

- A5C1D2H2I1M1N2O1R2T1

- Jota · Accepted Answer

You can split the string as follows:

txt <- "2 slices of bread 1 tablespoon peanut butter"

strsplit(txt, " (?=\\d)", perl=TRUE)[[1]]
#[1] "2 slices of bread"          "1 tablespoon peanut butter"

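The regex used here looks for a space that is followed by a digit. It uses a zero-width positive lookahead, (?=), to say that if the space is followed by a digit (\\d), then it is the kind of space we want to split on. Why a zero-width lookahead? Because we don't want the digit to be consumed as part of the split delimiter; we only want to match any space that is followed by a digit.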
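As a minimal sketch of why the lookahead matters, compare it with a pattern that consumes the digit. The variable names consumed and kept are only illustrative.

```r
txt <- "2 slices of bread 1 tablespoon peanut butter"

# Consuming pattern: the digit is part of the delimiter, so it is thrown away.
consumed <- strsplit(txt, " \\d")[[1]]
consumed
# [1] "2 slices of bread"         " tablespoon peanut butter"

# Lookahead pattern: only the space is matched, so the digit stays
# at the start of the next piece.
kept <- strsplit(txt, " (?=\\d)", perl = TRUE)[[1]]
kept
# [1] "2 slices of bread"          "1 tablespoon peanut butter"
```

Note that this splits at every space followed by a digit, so it assumes numbers only appear at the start of each piece you want.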

To use this idea and build your data frame, see this example:

item <- c("peanut butter sandwich", "onion carrot mix", "hash browns")
txt <- c("2 slices of bread 1 tablespoon peanut butter", "1 onion 3 carrots", "potato")
df <- data.frame(item, txt, stringsAsFactors=FALSE)

# thanks to Ananda for recommending setNames
split.strings <- setNames(strsplit(df$txt, " (?=\\d)", perl=TRUE), df$item) 
# alternately: 
#split.strings <- strsplit(df$txt, " (?=\\d)", perl=TRUE)
#names(split.strings) <- df$item

stack(split.strings)
#                      values                    ind
#1          2 slices of bread peanut butter sandwich
#2 1 tablespoon peanut butter peanut butter sandwich
#3                    1 onion       onion carrot mix
#4                  3 carrots       onion carrot mix
#5                     potato            hash browns
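If you would rather have the columns named and ordered as in the question (V1, V2) instead of the values/ind layout that stack returns, one possible base R alternative is to repeat each item once per piece it produced. The names parts and long are illustrative and not part of the approach above, and lengths() assumes R 3.2.0 or later.

```r
item <- c("peanut butter sandwich", "onion carrot mix", "hash browns")
txt <- c("2 slices of bread 1 tablespoon peanut butter", "1 onion 3 carrots", "potato")

# Split on spaces that are followed by a digit, as before.
parts <- strsplit(txt, " (?=\\d)", perl = TRUE)

# Repeat each item once per piece, then flatten the pieces into one vector.
long <- data.frame(
  V1 = rep(item, lengths(parts)),
  V2 = unlist(parts),
  stringsAsFactors = FALSE
)

long
```

Rows without any number, such as "potato", pass through unchanged as a single piece.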
