Regex: how to extract the text from the last set of parentheses

4
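What is the right regular expression to extract the string "(procedure)", or more generally the text inside the parentheses, from the following strings?
An example input string: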
Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)
Another example:
Urinary tract infection prophylaxis (procedure)
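Possible approaches:
- Go to the end of the text, search backward for the first opening parenthesis, and take everything from that position to the end of the text as the substring.
- From the start of the text, find the last "(" character and take everything from that position to the end as the substring.
Other strings may look like this (a different "tag" is extracted from each):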
[1] "Xanthoma of eyelid (disorder)"                    "Ventricular tachyarrhythmia (disorder)"          
[3] "Abnormal urine odor (finding)"                    "Coloboma of iris (disorder)"                     
[5] "Macroencephaly (disorder)"                        "Right main coronary artery thrombosis (disorder)"

I am looking for a general regular expression (or, better, a solution in R).
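The second approach listed in the question can be sketched without any regular expression. This is only an illustration: the helper name `last_paren_suffix` is made up for this example, and it returns the suffix including the parentheses, or NA when the string contains no "(".

```r
# Hypothetical helper: return everything from the last "(" to the end of s.
last_paren_suffix <- function(s) {
  # Positions of every literal "(" in s; fixed = TRUE disables regex syntax.
  hits <- gregexpr("(", s, fixed = TRUE)[[1]]
  if (hits[1] == -1) return(NA_character_)  # no "(" found
  substring(s, hits[length(hits)])          # from the last "(" to the end
}

x <- c("Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)",
       "Urinary tract infection prophylaxis (procedure)",
       "Macroencephaly")
suffixes <- vapply(x, last_paren_suffix, character(1), USE.NAMES = FALSE)
print(suffixes)
```

Note that this assumes the last "(" opens the group of interest, so it does not handle nested parentheses at the end of the string.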

3 Answers

5
If the parenthesized text is the last part of the string, this regular expression does the job:
/\(([^()]*)\)$/

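Explanation: it looks for an opening (, matches everything after it that is neither ( nor ), and requires the closing ) to be the last character of the string. https://regex101.com/r/cEsQtf/1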
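Since the question asks for R, here is one possible way to apply this pattern there. It is a sketch rather than part of the original answer: the backslashes are doubled for the R string literal, and the variable names `hits` and `tags` are just illustrative.

```r
x <- c("Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)",
       "Urinary tract infection prophylaxis (procedure)",
       "Abnormal urine odor (finding)",
       "Macroencephaly")

# regexec() reports the full match plus each capture group;
# regmatches() turns those positions into text (character(0) if no match).
hits <- regmatches(x, regexec("\\(([^()]*)\\)$", x))

# Keep capture group 1 (the text inside the parentheses), NA when nothing matched.
tags <- vapply(hits,
               function(h) if (length(h) == 2) h[2] else NA_character_,
               character(1))
print(tags)
```

Unlike a plain `sub()`, this makes a non-matching string visible as NA instead of returning it unchanged.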

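I had been looking for a solution like this. It successfully matches the last set of parentheses even when several other sets precede it. Elegant solution! - user1239087
This solution worked well for me, but I ran into another case where I sometimes want to keep a further pair of parentheses nested inside the last group. This works: FELON IN POSSESSION OF AMMUNITION (ACTUAL POSSESSION) (79023) gives 79023. This does not: FAIL TO DISPLAY REGISTRATION - POSSESSION REQUIRED (320.0605(1)). It should give 320.0605(1). Is there a way to modify this answer to allow nested parentheses? - OscarVanL
1
@OscarVanL I explained how to do that in my answer. - Wiktor Stribiżew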

4

With the right regular expression, sub will do this.

Text = c("Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)",
    "Urinary tract infection prophylaxis (procedure)", 
    "Xanthoma of eyelid (disorder)",                    
    "Ventricular tachyarrhythmia (disorder)",          
    "Abnormal urine odor (finding)",                    
    "Coloboma of iris (disorder)",                   
    "Macroencephaly (disorder)",                        
    "Right main coronary artery thrombosis (disorder)")
sub(".*\\((.*)\\).*", "\\1", Text)
[1] "procedure" "procedure" "disorder"  "disorder"  "finding"   "disorder" 
[7] "disorder"  "disorder"

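Addition: a detailed explanation of the regular expression
The question asks for the contents of the last set of parentheses in the string. The expression is a bit confusing because it uses parentheses in two different ways. One represents the literal parentheses in the string being processed; the other sets up a "capture group" that specifies which part the expression should return. The expression consists of five basic units: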

1. Initial .*   - matches everything up to the final open parenthesis. 
   Note that this is relying on "greedy matching"
2. \\(   ...    \\)   - matches the final set of parentheses. 
   Because ( by itself means something else,  we need to "escape" the 
   parentheses by preceding them with \.  That is, we want the regular
   expression to say   \(  ...  \).  However, \ is also the escape
   character in R string literals, and \( is not a valid string escape,
   so typing "\(" makes R stop with an "unrecognized escape" error.
   So we escape the backslash.  R will interpret   \\(  ... \\)   as
   \( ... \), meaning the literal characters ( and ). 
3. ( ... )       Inside the pair in part 2
   This is making use of the special meaning of parentheses.  When we
   enclose an expression in parentheses, whatever value is inside them 
   will be stored in a variable for later use. That variable is called 
   \1,  which is what was used in the substitution pattern. Again, if 
   we just wrote \1, R would treat it as a string escape (the character
   with octal code 1) rather than a backslash followed by 1. Writing \\1
   is interpreted as the character \ followed by 1, i.e. \1.
4. Central .*    Inside the pair in part 3
   This is what we are looking for,  all characters inside the parentheses.
5. Final   .*
   This is in the expression to match any characters that may follow the 
   final set of parentheses. 
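The double-backslash point in parts 2 and 3 is easy to check at the console. The short sketch below is only illustrative (the variable names are arbitrary); it compares what is typed in the R source with what the regex engine actually receives.

```r
pattern <- ".*\\((.*)\\).*"

# writeLines() shows the string as the regex engine sees it,
# with one backslash per escaped parenthesis.
writeLines(pattern)

# "\\(" is two characters: a backslash and a parenthesis.
len_escaped_paren <- nchar("\\(")

# "\\1" is a backslash followed by 1, whereas "\1" is a single
# control character (octal escape).
len_backref <- nchar("\\1")
len_octal   <- nchar("\1")
```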

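The sub function uses this expression to replace the matched pattern (in this case, every character in the string) with the replacement pattern \1, that is, the contents of the first (and here the only) capture group: the text inside the last pair of parentheses.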
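To see why the greedy leading .* matters, it may help to compare it with a lazy variant on the first example string. This is a small sketch for illustration only; the lazy pattern uses perl = TRUE and is not part of the answer above.

```r
s <- "Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)"

# Greedy leading .* consumes as much as it can, so the capture group
# lands on the last set of parentheses.
last_tag <- sub(".*\\((.*)\\).*", "\\1", s)

# Lazy quantifiers stop at the first opportunity, so the capture group
# lands on the first set of parentheses instead.
first_tag <- sub(".*?\\((.*?)\\).*", "\\1", s, perl = TRUE)

# When nothing matches, sub() returns the input unchanged.
unchanged <- sub(".*\\((.*)\\).*", "\\1", "Macroencephaly")

print(c(last_tag, first_tag, unchanged))
```

The last line is worth keeping in mind: a string without parentheses comes back as-is rather than as NA, so a separate grepl() check is needed if missing tags must be detected.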


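Could you annotate this solution? I assume \1 is a reference to some defined element of the regular expression. It works, but it would be better to understand how it works. - userJT
@userJT - Added to the answer. - G5W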

2
You can extract the text inside (possibly nested) parentheses at the end of a string like this:
x <- c("FELON IN POSSESSION OF AMMUNITION (ACTUAL POSSESSION) (79023)",
"FAIL TO DISPLAY REGISTRATION - POSSESSION REQUIRED (320.0605(1))")
sub(".*(\\(((?:[^()]++|(?1))*)\\))$", "\\2", x, perl=TRUE)

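See the online R demo and the regex demo.
Details:
* `.*` - any zero or more characters other than line breaks, as many as possible.
* `(\(((?:[^()]++|(?1))*)\))` - capturing group 1 (required for the recursion):
  * `\(` - a left parenthesis
  * `((?:[^()]++|(?1))*)` - capturing group 2 (the value): zero or more occurrences of either one or more characters other than parentheses, or the whole capturing group 1 pattern (recursed)
  * `\)` - a right parenthesis
* `$` - end of string.
When there is a match, the whole string is replaced with the value of capturing group 2. If there is no match, the string is left unchanged.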
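Because a non-matching string is returned unchanged, it can be useful to wrap the call so that such strings become NA. The following is a sketch under that assumption; the names `nested_pattern` and `extract_last_group` are invented for this example.

```r
# Same recursive pattern as above; it needs the PCRE engine (perl = TRUE).
nested_pattern <- ".*(\\(((?:[^()]++|(?1))*)\\))$"

# Hypothetical wrapper: value of the last parenthesized group, or NA if none.
extract_last_group <- function(x) {
  out <- sub(nested_pattern, "\\2", x, perl = TRUE)
  # sub() leaves non-matching strings untouched, so flag them explicitly.
  out[!grepl(nested_pattern, x, perl = TRUE)] <- NA_character_
  out
}

x <- c("FELON IN POSSESSION OF AMMUNITION (ACTUAL POSSESSION) (79023)",
       "FAIL TO DISPLAY REGISTRATION - POSSESSION REQUIRED (320.0605(1))",
       "NO PARENTHESES HERE")
res <- extract_last_group(x)
print(res)
```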
