Split a string between a word and a number

Question

15

I have some text like the following:

foo_text <- c(
  "73000 PARIS   74000 LYON",
  "75 000 MARSEILLE 68483 LILLE",
  "60  MARSEILLE 68483 LILLE"
)

I would like to split each element in two after the first word. Expected output:

"73000 PARIS" "74000 LYON" "75000 MARSEILLE" "68483 LILLE" "60 MARSEILLE" "68483 LILLE"

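Note that the number of spaces between two elements is not always the same in the original text (for example, the gap between PARIS and 74000 differs from the gap between MARSEILLE and 68483). Also, the first number sometimes contains a space (e.g. 75 000) and sometimes does not (e.g. 73000). I tried to adapt the approach from another answer, without success: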

(delimitedString = gsub( "^([a-z]+) (.*) ([a-z]+)$", "\\1,\\2", foo_text))

Any ideas on how to achieve this?

- bretauv

4 Answers

3

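You are using gsub with the anchored pattern ^([a-z]+) (.*) ([a-z]+)$. It looks for [a-z] characters at the start and at the end of the string but does not account for digits, and because of the anchors it cannot match several parts of the same string.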
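A quick way to check this, as a minimal base R sketch on the sample data from the question: every element starts with a digit, so ^([a-z]+) can never match and gsub hands the input back unchanged. The variable name res is only for illustration.

```r
foo_text <- c(
  "73000 PARIS   74000 LYON",
  "75 000 MARSEILLE 68483 LILLE",
  "60  MARSEILLE 68483 LILLE"
)

# The anchored pattern needs lowercase letters at the very start of the
# string, but each element begins with a digit, so nothing is replaced.
res <- gsub("^([a-z]+) (.*) ([a-z]+)$", "\\1,\\2", foo_text)

identical(res, foo_text)
#> [1] TRUE
```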

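For your example data, you can instead match every part that starts with a digit, optionally followed by more digits and whitespace, and is then followed by one or more whitespace-separated chunks that contain no digits.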

library(stringr)
s <- c(
  "73000 PARIS   74000 LYON",
  "75 000 MARSEILLE 68483 LILLE",
  "60  MARSEILLE 68483 LILLE"
)
unlist(str_match_all(s, "\\b\\d[\\d\\s]*(?:\\s+[^\\d\\s]+)+"))

Output

[1] "73000 PARIS"      "74000 LYON"       "75 000 MARSEILLE" "68483 LILLE"     
[5] "60  MARSEILLE"    "68483 LILLE"

See an R demo and a regex demo.

- The fourth bird

3

Another possible solution, based on the tidyverse:

library(tidyverse) 

foo_text <- c(
  "73000 PARIS   74000 LYON",
  "75 000 MARSEILLE 68483 LILLE",
  "60  MARSEILLE 68483 LILLE"
)

foo_text %>% 
  str_split("(?<=[:alpha:])\\s+(?=\\d)") %>% flatten %>% 
  str_remove_all("(?<=\\d)\\s+(?=\\d)")

#> [1] "73000 PARIS"     "74000 LYON"      "75000 MARSEILLE" "68483 LILLE"    
#> [5] "60  MARSEILLE"   "68483 LILLE"

- PaulS

3

Here are some other base R options:

> scan(text = gsub("(?<=\\D)\\s+(?=\\d)", "\n", foo_text, perl = TRUE), sep = "\n", what = "character")
Read 6 items
[1] "73000 PARIS"      "74000 LYON"       "75 000 MARSEILLE" "68483 LILLE"
[5] "60  MARSEILLE"    "68483 LILLE"

> read.delim2(text = gsub("(?<=\\D)\\s+(?=\\d)", "\n", foo_text, perl = TRUE), header = FALSE)$V1
[1] "73000 PARIS"      "74000 LYON"       "75 000 MARSEILLE" "68483 LILLE"
[5] "60  MARSEILLE"    "68483 LILLE"
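Both options rely on the same first step: gsub turns the whitespace between a non-digit and a digit into a newline, and scan or read.delim2 then reads each line as one item. The sketch below shows that intermediate result. It also illustrates one caveat that is an assumption about inputs not present in the question: \D matches whitespace too, so a number written with two spaces inside it (the hypothetical "75  000") would be split in the middle.

```r
foo_text <- c(
  "73000 PARIS   74000 LYON",
  "75 000 MARSEILLE 68483 LILLE",
  "60  MARSEILLE 68483 LILLE"
)
pat <- "(?<=\\D)\\s+(?=\\d)"

# Step 1: mark each split point with a newline.
marked <- gsub(pat, "\n", foo_text, perl = TRUE)
marked[1]
#> [1] "73000 PARIS\n74000 LYON"

# Hypothetical input, not from the question: the second space is preceded
# by a space (which \D accepts) and followed by a digit, so it is replaced.
edge <- gsub(pat, "\n", "75  000 MARSEILLE", perl = TRUE)
edge
#> [1] "75 \n000 MARSEILLE"
```

If such inputs are possible, a lookbehind restricted to letters, such as (?<=[A-Z]), avoids the stray split.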

- ThomasIsCoding

- Tim Biegeleisen · Accepted Answer

We can try using strsplit as follows:

foo_text <- c(
    "73000 PARIS   74000 LYON",
    "75 000 MARSEILLE 68483 LILLE",
    "60  MARSEILLE 68483 LILLE"
)
output <- unlist(strsplit(foo_text, "(?<=[A-Z])\\s+(?=\\d)", perl=TRUE))
output

[1] "73000 PARIS"      "74000 LYON"       "75 000 MARSEILLE" "68483 LILLE"
[5] "60  MARSEILLE"    "68483 LILLE"

The regex pattern used here says to split when:

(?<=[A-Z])  what precedes is an uppercase letter
\\s+        split (and consume) on one or more whitespace characters
(?=\\d)     what follows is a digit
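The split leaves the inner whitespace untouched ("75 000 MARSEILLE", "60  MARSEILLE"), whereas the expected output in the question has "75000 MARSEILLE" and "60 MARSEILLE". One way to close that gap, as a sketch rather than part of the original answer, is to add two gsub passes after the split. The variable name parts is only for illustration.

```r
foo_text <- c(
  "73000 PARIS   74000 LYON",
  "75 000 MARSEILLE 68483 LILLE",
  "60  MARSEILLE 68483 LILLE"
)

# Split on whitespace that sits between an uppercase letter and a digit.
parts <- unlist(strsplit(foo_text, "(?<=[A-Z])\\s+(?=\\d)", perl = TRUE))

# Drop whitespace between two digits: "75 000" becomes "75000".
parts <- gsub("(?<=\\d)\\s+(?=\\d)", "", parts, perl = TRUE)

# Collapse any remaining run of whitespace into a single space.
parts <- gsub("\\s+", " ", parts, perl = TRUE)

parts
#> [1] "73000 PARIS"     "74000 LYON"      "75000 MARSEILLE" "68483 LILLE"
#> [5] "60 MARSEILLE"    "68483 LILLE"
```

The order matters: splitting first keeps the digit-joining step from ever seeing two different numbers in the same string.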