Regex to strip all digits except ordinals

Question

3

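I'm looking for a regex to use with #gsub in Ruby that strips all digits from a string except those that belong to ordinals. Suppose the following leaves me with the string content I want: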

string = "100 red balloons"
strip_digits = string.gsub(/[^a-zA-Z\s]/, '')
=> " red balloons"

How should I modify the regex in strip_digits so that, given:

string = "50th red balloon"

strip_digits returns:

=> "50th red balloon"

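That is, the regex should ignore digits that are part of an ordinal and match them in every other case.

For this example, it is safe to assume that any string of digits immediately followed by an ordinal indicator ("nd", "th", "rd", or "st") is an ordinal.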

- sarkon

1

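If you only need to tweak the regex, you can use gsub(/(\d+(?:th|[rn]d|st))|[^a-z\s]/i, "\\1"). - Wiktor Stribiżew

How would you tell whether the letters after the digits are an ordinal suffix or something else? For example, in TP-Link TL-WR1043ND. - sawa

@sawa No, I don't want to delete the second and third lines. I want to modify the regex on the second line so that it removes digits while ignoring digits ("50") that are part of an ordinal ("50th"). For my example, it is safe to assume that any digits immediately followed by "nd", "st", or "th" are an ordinal. - sarkon

Are those three the only ordinal suffixes you care about? - sawa

@sarkon: Great, I've posted it with a bit of explanation. - Wiktor Stribiżew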

3 Answers

0

You can use the word boundary \b, i.e.:

strip_digits = string.gsub(/\b\d+(?!st|th|rd|nd)\b/, '')

Regex explanation:

\b\d+(?!st|th|rd|nd)\b

Assert position at a word boundary (position preceded or followed—but not both—by a Unicode letter, digit, or underscore) «\b»
Match a single character that is a “digit” (ASCII 0–9 only) «\d+»
   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!st|th|rd|nd)»
   Match this alternative (attempting the next alternative only if this one fails) «st»
      Match the character string “st” literally (case sensitive) «st»
   Or match this alternative (attempting the next alternative only if this one fails) «th»
      Match the character string “th” literally (case sensitive) «th»
   Or match this alternative (attempting the next alternative only if this one fails) «rd»
      Match the character string “rd” literally (case sensitive) «rd»
   Or match this alternative (the entire group fails if this one fails to match) «nd»
      Match the character string “nd” literally (case sensitive) «nd»
Assert position at a word boundary (position preceded or followed—but not both—by a Unicode letter, digit, or underscore) «\b»
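
A minimal sketch to try the word-boundary pattern on the inputs mentioned in this thread. The method name strip_standalone_digits and the constant names are made up for illustration. One observation, as far as I can tell: because the trailing \b already requires a non-word character (or the end of the string) right after the digits, the negative lookahead can never reject a match here, so the shorter /\b\d+\b/ should behave the same way. Note also that digits glued to letters, as in TL-WR1043ND, are left untouched by this approach.

```ruby
# Word-boundary pattern from the answer above.
WITH_LOOKAHEAD = /\b\d+(?!st|th|rd|nd)\b/
# Hypothetical simplification: the same pattern without the lookahead.
WITHOUT_LOOKAHEAD = /\b\d+\b/

SAMPLES = [
  "100 red balloons",
  "50th red balloon",
  "TP-Link TL-WR1043ND"
].freeze

# Removes runs of digits that stand alone as a whole "word".
def strip_standalone_digits(text, pattern = WITH_LOOKAHEAD)
  text.gsub(pattern, '')
end

SAMPLES.each do |s|
  with    = strip_standalone_digits(s, WITH_LOOKAHEAD)
  without = strip_standalone_digits(s, WITHOUT_LOOKAHEAD)
  puts "#{s.inspect} -> #{with.inspect} (same without lookahead: #{with == without})"
end
# "100 red balloons" -> " red balloons" (same without lookahead: true)
# "50th red balloon" -> "50th red balloon" (same without lookahead: true)
# "TP-Link TL-WR1043ND" -> "TP-Link TL-WR1043ND" (same without lookahead: true)
```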

Regex101 demo

- Pedro Lobito

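The OP wrote "ordinals (including 'st' and 'nd' ordinals)", which implies there are other suffixes as well. - sawa

@sawa "For this example, it is safe to assume that any string of digits immediately followed by an ordinal indicator ("nd", "th", "rd", or "st") is an ordinal." - Pedro Lobito

You downvoted the answer without understanding the question. - Pedro Lobito

1

Oh, OK. (You didn't deny it!) - Anyway, problem solved, upvoted. - Scott Weaver

@Sweaver2112 Thanks for the upvote - Pedro Lobito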

0

You can use a negative lookahead (this also collapses extra whitespace):

 t = "And on 3rd day, he created the 1st of his 22 books, not including the 3 that were never published - this was the 2nd time this happened."
 print(t.gsub(/\s*\d+(?!st|th|rd|nd)\s*/, " "))
 #=> And on 3rd day, he created the 1st of his books, not including the that were never published - this was the 2nd time this happened.
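
One caveat worth checking before relying on this pattern: the sample sentence only contains single-digit ordinals. With a multi-digit ordinal such as "50th", \d+ can backtrack to "5", the lookahead then sees "0t" rather than "th", and the leading digit gets replaced. The sketch below shows that effect and one possible guard, adding \d to the lookahead so a match cannot stop in the middle of a number. The constant names and the guarded pattern are my own illustration, not part of the answer above.

```ruby
# Pattern as posted in the answer above.
ORIGINAL = /\s*\d+(?!st|th|rd|nd)\s*/
# Hypothetical variant: also refuse to stop in the middle of a digit run.
GUARDED = /\s*\d+(?!\d|st|th|rd|nd)\s*/

TEXT = "the 50th balloon of 100 popped"

puts TEXT.gsub(ORIGINAL, " ")
# the 0th balloon of popped
puts TEXT.gsub(GUARDED, " ")
# the 50th balloon of popped
```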

IDEONE demo

- Scott Weaver

- Wiktor Stribiżew · Accepted Answer

As a "fix" for your regex, I suggest:

input.gsub(/(\d+(?:th|[rn]d|st))|[^a-z\s]/i, "\\1")

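See the IDEONE demo here

The logic is: match and capture every run of digits that is followed by an ordinal suffix, and restore it with the \1 backreference in the replacement pattern; then use [^a-z\s] (or [^\p{L}\s]) to match (and remove) everything that is neither a letter nor whitespace.

Pattern details: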

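(\d+(?:th|[rn]d|st)) - matches 1 or more digits (\d+) followed by any one of th, rd, nd, or st (the whole substring is stored in capture group #1, which the \1 backreference in the replacement pattern refers to)
| - or
[^a-z\s] - any character other than an ASCII letter (the /i flag makes the match case-insensitive, so both lowercase and uppercase letters count) or whitespace (to avoid removing Unicode letters, use \p{L} instead of a-z).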
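
To see the capture-and-restore idea in action, here is a small sketch comparing the ASCII pattern with the \p{L} variant mentioned above. The method name keep_ordinals, the constant names, and the sample strings containing "café" are assumptions for illustration only. When the second alternative matches, group 1 did not take part in the match, so "\\1" expands to an empty string and the character is dropped.

```ruby
# Pattern from the answer: ordinals are captured, everything else that is
# not an ASCII letter or whitespace is matched without a capture.
ASCII_PATTERN   = /(\d+(?:th|[rn]d|st))|[^a-z\s]/i
# Variant suggested in the pattern details: keep any Unicode letter.
UNICODE_PATTERN = /(\d+(?:th|[rn]d|st))|[^\p{L}\s]/i

# "\\1" puts the captured ordinal back; for other matches it is empty.
def keep_ordinals(text, pattern = ASCII_PATTERN)
  text.gsub(pattern, "\\1")
end

puts keep_ordinals("100 red balloons").inspect
# " red balloons"
puts keep_ordinals("50th red balloon").inspect
# "50th red balloon"
puts keep_ordinals("22ND caf\u00e9, 7 tables").inspect
# "22ND caf  tables"
puts keep_ordinals("22ND caf\u00e9, 7 tables", UNICODE_PATTERN).inspect
# "22ND café  tables"
```

Unlike the word-boundary approach, this pattern also strips punctuation, since anything that is not a letter, whitespace, or part of an ordinal is removed.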