Regular expression to extract numbers and a trailing letter or space

Question

7

I'm currently trying to extract data from strings that always arrive in the same format (scraped from a social site that has no API support).

Example strings

53.2k Followers, 11 Following, 1,396 Posts
5m Followers, 83 Following, 1.1m Posts

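I'm currently using the following regular expression: "[0-9]{1,5}([,.][0-9]{1,4})?" to get the numeric part while keeping the comma and dot separators. It produces results like this: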

53.2, 11, 1,396 
5, 83, 1.1

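I need a regular expression that also captures the character following the numeric part, even when that character is a space. For example: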

53.2k, 11 , 1,396
5m, 83 , 1.1m

Any help is much appreciated.

R code to reproduce:

  library(stringr)

  string1 <- ("536.2k Followers, 83 Following, 1,396 Posts")
  string2 <- ("5m Followers, 83 Following, 1.1m Posts")

  info <- str_extract_all(string1,"[0-9]{1,5}([,.][0-9]{1,4})?")
  info2 <- str_extract_all(string2,"[0-9]{1,5}([,.][0-9]{1,4})?")

  info 
  info2

- Permafrost

5 Answers

0

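If you want to capture the character that follows the numeric part even when it is a space, you can keep your pattern and append an optional character class that includes a space, [mk ]?.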

[0-9]{1,5}(?:[,.][0-9]{1,4})?[mk ]?

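You can widen the character class to [a-zA-Z ]? so that it matches any letter. If you want a quantifier that matches either one or more letters or a single space, you can use an alternation: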

[0-9]{1,5}(?:[,.][0-9]{1,4})?(?:[a-zA-Z]+| )?
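
As a quick check, the two patterns above can be run against the sample strings from the question. This is only a sketch: it uses base R with perl = TRUE instead of stringr so that it runs without extra packages, and the helper name extract_all is made up for this illustration.

```r
# Sample strings taken from the question.
string1 <- "536.2k Followers, 83 Following, 1,396 Posts"
string2 <- "5m Followers, 83 Following, 1.1m Posts"

# Optional single trailing character: m, k, or a space.
pattern_class <- "[0-9]{1,5}(?:[,.][0-9]{1,4})?[mk ]?"
# Alternation: a run of letters, or exactly one space.
pattern_alt <- "[0-9]{1,5}(?:[,.][0-9]{1,4})?(?:[a-zA-Z]+| )?"

# Hypothetical helper: return all matches of `pattern` in one string.
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

res1 <- extract_all(string1, pattern_class)
res2 <- extract_all(string2, pattern_class)
res1_alt <- extract_all(string1, pattern_alt)

print(res1)  # "536.2k" "83 " "1,396 "
print(res2)  # "5m" "83 " "1.1m"
```

Note that every number followed directly by a space keeps that space, so "1,396" is returned as "1,396 " as well. If that trailing space is unwanted, trimws() can be applied to the result.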

- The fourth bird

0

We can append an optional letter to the regular expression.

stringr::str_extract_all(string1,"[0-9]{1,5}([,.][0-9]{1,4})?[A-Za-z]?")[[1]]
#[1] "536.2k" "83"     "1,396" 
stringr::str_extract_all(string2,"[0-9]{1,5}([,.][0-9]{1,4})?[A-Za-z]?")[[1]]
#[1] "5m"   "83"   "1.1m"

- Ronak Shah

0

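(Updated my earlier post, which also picked up extra commas and spaces.)
This pattern does what the OP asks: it extracts the letter or space that follows the numeric part, without the extra commas and spaces that the previous version matched: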

(?:[\d]+[.,]?(?=\d*)[\d]*[km ]?)

Previous version: \b(?:[\d.,]+[km\s]?)

Explanation:  
- (?:          indicates non-capturing group
- [\d]+        matches 1 or more digits
- [.,]?(?=\d*) matches 0 or 1 decimal point or comma, followed by a positive lookahead for digits; because \d* also matches zero digits, this lookahead always succeeds (use (?=\d) to actually require a digit)
- [\d]*        matches 0 or more digits
- [km\s]?      matches 0 or 1 of characters within []

53.2k Followers, 11 Following, 1,396 Posts     
5m Followers, 83 Following, 1.1m Posts  
# 53.2k; 11 ; 1,396
# 5m; 83 ; 1.1m

Note that the spaces after 11 and 83, respectively, are kept, as the OP intended.
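
The pattern can be tried in R roughly as follows. This is a sketch under a few assumptions: base R with perl = TRUE is used so that \d and the lookahead are handled by PCRE, and the names get_matches, with_lookahead and without_lookahead are invented for this example. It also compares the pattern against a variant with the lookahead removed, since (?=\d*) can never fail.

```r
txt <- "53.2k Followers, 11 Following, 1,396 Posts"

# The pattern from this answer, with backslashes doubled for an R string.
with_lookahead <- "(?:[\\d]+[.,]?(?=\\d*)[\\d]*[km ]?)"
# The same pattern without the lookahead and the outer group.
without_lookahead <- "\\d+[.,]?\\d*[km ]?"

# Hypothetical helper: return all matches of `pattern` in one string.
get_matches <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

a <- get_matches(txt, with_lookahead)
b <- get_matches(txt, without_lookahead)

print(a)                # "53.2k" "11 " "1,396 "
print(identical(a, b))  # TRUE
```

The last match also carries a trailing space, because "1,396" is followed by a space in the input.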

- SanV

0

Another stringr option:

new_s<-str_remove_all(unlist(str_extract_all(string2,"\\d{1,}.*\\w")),"[A-Za-z]{2,}")
strsplit(new_s," , ")

    #[[1]]
    #[1] "5m"    "83"    "1.1m "

Original answer:

str_remove_all(unlist(str_extract_all(string2,"\\d{1,}\\W\\w+")),"[A-Za-z]{2,}")
#[1] "83 "  "1.1m"
str_remove_all(unlist(str_extract_all(string1,"\\d{1,}\\W\\w+")),"[A-Za-z]{2,}")
#[1] "536.2k" "83 "    "1,396"

- NelsonGon

- Tim Biegeleisen · Accepted Answer

I suggest using the following regex pattern:

[0-9]{1,3}(?:,[0-9]{3})*(?:\\.[0-9]+)?[A-Za-z]*

This pattern produces your expected output. Here is an explanation:

[0-9]{1,3}      match 1 to 3 initial digits
(?:,[0-9]{3})*  followed by zero or more optional thousands groups
(?:\\.[0-9]+)?  followed by an optional decimal component
[A-Za-z]*       followed by an optional text unit

Where possible I prefer base R solutions; here is one approach using gregexpr and regmatches:

txt <- "53.2k Followers, 11 Following, 1,396 Posts"
m <- gregexpr("[0-9]{1,3}(?:,[0-9]{3})*(?:\\.[0-9]+)?[A-Za-z]*", txt)
regmatches(txt, m)

[[1]]
[1] "53.2k"   "11"   "1,396"
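
The same approach extends to several strings at once, since gregexpr and regmatches are vectorised over their input. The sketch below applies the pattern to both sample strings from the question; the names count_pattern and extract_counts are invented for this illustration, and perl = TRUE is an added assumption to make the handling of (?:...) explicit.

```r
# The pattern from this answer, stored once for reuse.
count_pattern <- "[0-9]{1,3}(?:,[0-9]{3})*(?:\\.[0-9]+)?[A-Za-z]*"

# Hypothetical helper: one character vector of matches per input string.
extract_counts <- function(x) {
  regmatches(x, gregexpr(count_pattern, x, perl = TRUE))
}

samples <- c(
  "536.2k Followers, 83 Following, 1,396 Posts",
  "5m Followers, 83 Following, 1.1m Posts"
)

counts <- extract_counts(samples)
print(counts[[1]])  # "536.2k" "83" "1,396"
print(counts[[2]])  # "5m" "83" "1.1m"
```

Unlike the patterns that end in an optional space, this one never includes trailing whitespace, so the results need no trimming. The trade-off is that [A-Za-z]* absorbs any run of letters glued to a number, not only k or m.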