Regex - comparing two capture groups

Question


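I am trying to create a regex to cut down the amount of spam we receive. The problem is that I am not well versed in regex. What I have so far is mostly copy-paste, tweaking, and searching for more material to help me tweak it.

I decided to try a regex that matches emails containing a link whose visible text is misleading about the hostname it actually points to.

For example: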

<A HREF="http://phishers.org/we_want_your_money.htm">http://someLegitimateSite.com/somewhere </A>

I basically only care about the hostnames, to limit false positives and to avoid more or less legitimate links such as A HREF...>click here!

To date, I have this:

(HREF="http[s]?:\/\/)(?'hostname1'(.*?))[:|\/|"].*?\"\>(http[s]?:\/\/)(?'hostname2'(.*?))[<|\/|:]

According to https://regex101.com/ I have two named capture groups (hostname1 and hostname2), and a whack of other groups that I'm not sure I care about.

What I want to do is match the string if hostname1 and hostname2 are the same. I get the feeling that it involves either a lookbehind or a lookahead, but I honestly don't know.

EDIT: Thanks to Jan for prototyping this. I, as per the comments in his answer, made one quick addition to add the unaccounted for case of image tags. In the case of large websites (BestBuy for example) they store their images on a different content server, which was triggering the rule. I've decided to exclude image tags, which I BELIEVE (in my very non-expert opinion) I have successfully done. YMMV.

href=["']https?:\/\/(?<hostname>[^\/"]+)[^>]+>((?!<IMG).?)(?:https?:\/\/)?(?!.*\k'hostname')

- Networking Guy

This only involves a backreference. See here. However, you may want to switch to an HTML parser to parse your HTML. - Wiktor Stribiżew

I'm no HTML expert, but what happens when a parser encounters a duplicated attribute? <a href="here" href="there"> - user557597

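To be more specific, our spam-filtering solution lets us score emails (or take other actions, such as accept/reject) based on multiple criteria. One of them, which I plan to use, is "raw body" "matches regex" <regex>. Unfortunately, that rules out using a parser. - Networking Guy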

1 Answer


- Jan · Accepted Answer

This depends a bit on the programming language you are using. In PHP, you could come up with something like this:

href=["']https?:\/\/(?<hostname>[^\/]+)[^>]+>(?:https?:\/\/)?\k'hostname'
# match href=, a single or double quote, then http:// or https:// literally
# capture everything up to (but not including) a forward slash in a group called hostname
# followed by one or more characters other than >
# followed by >
# then an optional non-capturing group (?:...)? matching http:// or https://
# finally, try to match the previously captured group called hostname again (a backreference)

If the pattern matches, the link is most likely not spam (the href and the link text agree).

An overview:

<A HREF="http://phishers.org/we_want_your_money.htm">http://someLegitimateSite.com/somewhere </A>
<a href="https://example.com/subfolder">example.com</a> <-- will match, the others not
<a href="http://somebadsite.com">https://somegoodsite.com</a>

A working example can be seen on regex101.com.
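For anyone who wants to try the backreference idea outside PHP, here is a minimal sketch in Python, run against the three sample links above. Python's `re` module spells the named group `(?P<hostname>...)` and the backreference `(?P=hostname)` instead of `(?<hostname>...)` and `\k'hostname'`. `LOOSE` is my transliteration of the pattern; `STRICT`, `SAMPLES`, and `hosts_agree` are hypothetical names and additions of mine, not part of the answer. One thing I noticed while porting it: because the capture is free to backtrack, the transliterated pattern can settle for a partial hostname (the third sample matches on `some`), so the stricter variant pins the capture with a lookahead.

```python
import re

# LOOSE: direct transliteration of the pattern to Python syntax.
LOOSE = re.compile(
    r"""href=["']https?://(?P<hostname>[^/]+)[^>]+>(?:https?://)?(?P=hostname)""",
    re.IGNORECASE,
)

# STRICT: an assumption of this sketch, not part of the answer.
# (?=[/"':]) forces the capture to run up to the first delimiter, so it
# cannot shrink to a partial host; (?![\w.-]) keeps "example.com" from
# matching the start of "example.com.evil.org".
STRICT = re.compile(
    r"""href=["']https?://(?P<hostname>[^/"':]+)(?=[/"':])[^>]*>\s*"""
    r"""(?:https?://)?(?P=hostname)(?![\w.-])""",
    re.IGNORECASE,
)

SAMPLES = [
    '<A HREF="http://phishers.org/we_want_your_money.htm">'
    'http://someLegitimateSite.com/somewhere </A>',
    '<a href="https://example.com/subfolder">example.com</a>',
    '<a href="http://somebadsite.com">https://somegoodsite.com</a>',
]


def hosts_agree(html: str, pattern: re.Pattern = STRICT) -> bool:
    """Return True when a link's visible text starts with its href host."""
    return pattern.search(html) is not None


if __name__ == "__main__":
    for sample in SAMPLES:
        print(hosts_agree(sample), hosts_agree(sample, LOOSE), sample)
```

With these samples, `STRICT` accepts only the second link, while `LOOSE` also accepts the third.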

EDIT: Based on your comment, you want the negative result, which can be achieved with a negative lookahead:

href=["']https?:\/\/(?<hostname>[^\/"]+)[^>]+>(?:https?:\/\/)?(?!.*\k'hostname')
# same as before, except for the last part: (?!...)
# the negative lookahead asserts that the captured hostname does not occur anywhere in the rest of the line

See a working example of this regex here.
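The negative-lookahead variant can be sketched in Python in the same way. This is an adaptation under stated assumptions rather than a drop-in copy: the named group and backreference use Python's `(?P<hostname>...)` and `(?P=hostname)`; the visible text is required to start with `http://` or `https://`, so "click here" links and image tags are ignored, as the question asks; and the host is compared right after the scheme instead of scanning the rest of the line with `.*`. `MISLEADING` and `misleading_hosts` are hypothetical names.

```python
import re

# Assumptions of this sketch (not in the answer):
#  - (?=[/"':]) pins the capture to the full href host,
#  - the link text must itself begin with http:// or https://,
#  - the negative lookahead rejects only an exact host match;
#    (?![\w.-]) means "example.com.evil.org" does not count as "example.com".
MISLEADING = re.compile(
    r"""href=["']https?://(?P<hostname>[^/"':]+)(?=[/"':])[^>]*>\s*"""
    r"""https?://(?!(?P=hostname)(?![\w.-]))""",
    re.IGNORECASE,
)


def misleading_hosts(html: str) -> list:
    """Return the href host of each link whose visible URL names another host."""
    return [m.group("hostname") for m in MISLEADING.finditer(html)]


if __name__ == "__main__":
    body = "\n".join([
        '<A HREF="http://phishers.org/we_want_your_money.htm">'
        'http://someLegitimateSite.com/somewhere </A>',
        '<a href="https://example.com/subfolder">example.com</a>',
        '<a href="http://somebadsite.com">https://somegoodsite.com</a>',
        '<a href="https://shop.example/x"><img src="https://cdn.example/i.png"></a>',
    ])
    print(misleading_hosts(body))
```

If your filter only accepts a raw pattern, the regex inside `MISLEADING` is the part to carry over, after converting the group syntax back to whatever your engine expects.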