Inconsistency in Python regular expressions

Question

python regex

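For several different regular expressions I have tried, the optional and conditional parts of the expression behave differently in the first match than in subsequent matches. I am using Python, but I have found that this holds generally.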

Here are two similar examples that illustrate the problem:

First example:

Expression:

(?:\w. )?([^,.]*).*(\d{4}\w?)

Text:

J. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.

R. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.

Matches:

Match 1

1. wang Wang

2. 2002

Match 2

1. R

2. 2002

Second example:

Expression:

((?:\w\. )?[^,.]*).*(\d{4}\w?)

Text:

J. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.

R. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.

Matches:

Match 1

1. J. wang Wang

2. 2002

Match 2

1. R

2. 2002

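What am I missing?

I expected different behavior: I thought the matches would be consistent. This is what I think the result should be (I do not yet understand why it is not):

Example 1

Match 1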

1. wang Wang

2. 2002

Match 2

1. wang Wang

2. 2002

Example 2

Match 1

1. J. wang Wang

2. 2002

Match 2

1. R. wang Wang

2. 2002

- cjlovering

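Tools like https://www.debuggex.com/ are very useful for working out regex behavior problems. I suggest giving it a try. - Shadow

@shadow Thanks - I have also been using pythex and regexr. - cjlovering

I recently found a web page where you can evaluate a regex before using it: regex101. - j.barrio

Could you post the exact code you use to match and report the match groups? There may be a problem there. - Yannis P.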

1 Answer

- Marc Lambrichs · Accepted Answer

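In your first example, you expect the second line to yield 'wang Wang'. Example 1 clearly shows that this does not happen.

The first match ends at "2002". The regex then tries to match the remainder, which effectively starts with \n\nR. wang Wang, since no match can begin at the trailing period. In your first regex, the optional non-capturing group cannot match at that position because \w does not match a newline, so it matches zero times. Group 1 then takes over and consumes everything up to the next period, ending up with '\n\nR'.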

(?:                   # non-capturing group
  \w.                 # a word char, then any 1 char, then a space
)?                    # optional: matched 0 or 1 times
(                     # start group 1
[^,.]*                # anything that is not a comma or dot (newlines included), 0 or more times
)                     # end group 1
.*                    # any characters except newline, greedy
(                     # start group 2
\d{4}                 # exactly 4 digits
\w?                   # optionally followed by a word char
)                     # end group 2
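
The behavior described above can be reproduced with a minimal sketch. It assumes the two citations are separated by a blank line, and it uses a shortened stand-in for the text in the question (the name `text` and the abbreviated entries are illustrative, not the asker's actual code or data):

```python
import re

# Shortened stand-in for the two citations, separated by a blank line
# (an assumption about how the original text was laid out).
text = (
    "J. wang Wang, X. Liu. In Proceedings, 2002.\n"
    "\n"
    "R. wang Wang, X. Liu. In Proceedings, 2002."
)

# First expression from the question, unchanged.
pattern = re.compile(r"(?:\w. )?([^,.]*).*(\d{4}\w?)")

results = pattern.findall(text)
print(results)
# [('wang Wang', '2002'), ('\n\nR', '2002')]
```

Group 1 of the second match holds the two newlines plus `R`, which most online testers display as a bare "R".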

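The same applies to the second regex: here too, your non-capturing group (?:\w\. )? does not consume the "R. ", because the match attempt starts at the period and newlines that precede the initial, not at the initial itself.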
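
A quick way to check where the second match really begins is to look at its start offset. This is only a sketch, using the same shortened stand-in text and the same blank-line assumption (variable names are illustrative):

```python
import re

# Shortened stand-in for the two citations (assumed blank-line separator).
text = (
    "J. wang Wang, X. Liu. In Proceedings, 2002.\n"
    "\n"
    "R. wang Wang, X. Liu. In Proceedings, 2002."
)

# Second expression from the question, unchanged.
pattern = re.compile(r"((?:\w\. )?[^,.]*).*(\d{4}\w?)")

matches = list(pattern.finditer(text))
second = matches[1]

# The second match starts on a newline, not on the initial "R".
starts_on_newline = text[second.start()] == "\n"

print(repr(matches[0].group(1)))  # 'J. wang Wang'
print(repr(second.group(1)))      # '\n\nR'
print(starts_on_newline)          # True
```

Because the attempt begins at the newline, the optional group is skipped and `[^,.]*` absorbs the newlines and the initial.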

You can fix it like this: ([A-Z]\.)\s([^.,]+).*(\d{4}) (see Example 3).
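
As a rough check of the suggested expression, the sketch below runs it against the same shortened stand-in text (again assuming a blank line between entries; the variable names are illustrative). Note that this pattern makes the initial mandatory and captures it as its own group, so the name moves to group 2 and the year to group 3:

```python
import re

# Shortened stand-in for the two citations (assumed blank-line separator).
text = (
    "J. wang Wang, X. Liu. In Proceedings, 2002.\n"
    "\n"
    "R. wang Wang, X. Liu. In Proceedings, 2002."
)

# Expression suggested in the answer: a required capital initial,
# whitespace, the name, then the year.
fixed = re.compile(r"([A-Z]\.)\s([^.,]+).*(\d{4})")

results = fixed.findall(text)
print(results)
# [('J.', 'wang Wang', '2002'), ('R.', 'wang Wang', '2002')]
```

Since a match can now only start at a capital letter followed by a period, the leading newlines are skipped and both entries produce consistent groups.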