Regex - comparing two capture groups


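I am trying to create a regex to cut down the amount of spam we receive. The problem is that I'm not proficient in regex. My work so far has mostly been copy-paste, tweaking, and searching for more material to help me tweak it.

I decided to try using a regex to match emails in which a link's visible text misrepresents the hostname the link actually points to.

For example: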

<A HREF="http://phishers.org/we_want_your_money.htm">http://someLegitimateSite.com/somewhere </A>

I basically only care about the hostnames, both to limit false positives and to avoid flagging more or less legitimate links such as <A HREF=...>click here!</A>

To date, I have this:

(HREF="http[s]?:\/\/)(?'hostname1'(.*?))[:|\/|"].*?\"\>(http[s]?:\/\/)(?'hostname2'(.*?))[<|\/|:]

According to https://regex101.com/ I have two named capture groups (hostname1 and hostname2), and a whack of other groups that I'm not sure I care about.

What I want to do is match the string if hostname1 and hostname2 are the same. I get the feeling that it involves either a lookbehind or a lookahead, but I honestly don't know.

EDIT: Thanks to Jan for prototyping this. I, as per the comments in his answer, made one quick addition to add the unaccounted for case of image tags. In the case of large websites (BestBuy for example) they store their images on a different content server, which was triggering the rule. I've decided to exclude image tags, which I BELIEVE (in my very non-expert opinion) I have successfully done. YMMV.

href=["']https?:\/\/(?<hostname>[^\/"]+)[^>]+>((?!<IMG).?)(?:https?:\/\/)?(?!.*\k'hostname')
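The image-tag exclusion can be tried out with a minimal Python sketch. This is an illustration under assumptions, not the filter's actual behavior: Python's re module writes the named group as (?P<hostname>...) and the backreference as (?P=hostname), the function name flags_mismatch and the sample domains are hypothetical, and the case-insensitive flag is an assumption. One deliberate change: the image check is written as a zero-width lookahead (?!<img) instead of ((?!<IMG).?), because a group that consumes a character appears to hide the hostname from the final lookahead when the link text starts directly with the hostname.

```python
import re

# Hypothetical Python port of the "mismatched hostname" pattern.
# (?!<img) is zero-width: it rejects links whose body starts with an
# image tag without consuming any of the link text.
MISMATCH_NO_IMG = re.compile(
    r"""href=["']https?://(?P<hostname>[^/"]+)[^>]+>(?!<img)(?:https?://)?(?!.*(?P=hostname))""",
    re.IGNORECASE,
)


def flags_mismatch(html: str) -> bool:
    """Return True if a link's text does not contain the href hostname."""
    return MISMATCH_NO_IMG.search(html) is not None


phishing = ('<A HREF="http://phishers.org/we_want_your_money.htm">'
            'http://someLegitimateSite.com/somewhere </A>')
legit = '<a href="https://example.com/subfolder">example.com</a>'
# Made-up domains: an image served from a separate content host.
image_link = ('<a href="https://shop.example.com/deals">'
              '<img src="https://cdn.example.net/banner.png"></a>')

for sample in (phishing, legit, image_link):
    print(flags_mismatch(sample), sample)
```

Only the phishing sample should be flagged here. The check only looks at an image tag that immediately follows the opening tag, so leading whitespace or other markup before the image would need extra handling.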


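This only involves a backreference. See here. However, you may want to switch to an HTML parser to parse your HTML. - Wiktor Stribiżew
I'm not an HTML expert, but what happens when a parser encounters a duplicated attribute? <a href="here" href="there"> - user557597
To be more specific, our spam filtering solution lets us score emails (or take other actions, such as accept/reject) based on multiple criteria. One of them, which I plan to use, is "raw body" "matches regex" <regex>. Unfortunately, that rules out using a parser. - Networking Guy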
1 Answer


This depends somewhat on the programming language you are using. In PHP, you could come up with something like this:

href=["']https?:\/\/(?<hostname>[^\/]+)[^>]+>(?:https?:\/\/)?\k'hostname'
# match href=, a single or double quote, then http:// or https:// literally
# capture everything up to (but not including) the next forward slash in a group called hostname
# followed by anything but >
# followed by >
# optionally match http:// or https:// in a non-capturing group (?:...)
# check whether the previously captured group called hostname can be matched here

If it matches, the link is probably not spam (the href and the link text agree).
An overview:
<A HREF="http://phishers.org/we_want_your_money.htm">http://someLegitimateSite.com/somewhere </A>
<a href="https://example.com/subfolder">example.com</a> <-- will match, the others not
<a href="http://somebadsite.com">https://somegoodsite.com</a>

See a working example on regex101.com
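For readers who want to try the backreference idea outside PHP, here is a minimal Python sketch. It is an illustration under assumptions: Python's re module writes the named group as (?P<hostname>...) and the backreference as (?P=hostname) rather than \k'hostname', and the function name link_text_matches_href and the samples list are hypothetical. It also deviates from the pattern above in one place: a lookahead (?=[/"':]) pins the end of the hostname, because without it backtracking appears able to shrink the capture to a prefix shared with the link text (for instance "some" in the third sample).

```python
import re

# Hypothetical Python port of the backreference pattern.
# (?=[/"':]) requires the hostname to end at a slash, quote, or port colon,
# so the capture cannot be shortened to a partial hostname.
SAME_HOST = re.compile(
    r"""href=["']https?://(?P<hostname>[^/"':]+)(?=[/"':])[^>]+>(?:https?://)?(?P=hostname)""",
    re.IGNORECASE,
)


def link_text_matches_href(html: str) -> bool:
    """Return True if a link's visible text starts with its href hostname."""
    return SAME_HOST.search(html) is not None


samples = [
    '<A HREF="http://phishers.org/we_want_your_money.htm">'
    'http://someLegitimateSite.com/somewhere </A>',
    '<a href="https://example.com/subfolder">example.com</a>',
    '<a href="http://somebadsite.com">https://somegoodsite.com</a>',
]

for sample in samples:
    print(link_text_matches_href(sample), sample)
```

With these three samples, only the second one should be reported as matching.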

EDIT: As per your comment, you want the negated result, which can be achieved with a negative lookahead:

href=["']https?:\/\/(?<hostname>[^\/"]+)[^>]+>(?:https?:\/\/)?(?!.*\k'hostname')
# same as before, except for the last part: (?!...)
# this negative lookahead ensures that the captured hostname does not appear anywhere in the rest of the line

See a working example of this regex here
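The negated version can be sketched in Python as well. This is a rough port under assumptions: Python's re module spells the group (?P<hostname>...) and the backreference (?P=hostname), the case-insensitive flag is an assumption, and the function name is_suspicious and the sample strings are hypothetical.

```python
import re

# Hypothetical Python port of the negative-lookahead pattern: it matches
# when the href hostname does NOT appear in the rest of the line.
MISMATCHED_HOST = re.compile(
    r"""href=["']https?://(?P<hostname>[^/"]+)[^>]+>(?:https?://)?(?!.*(?P=hostname))""",
    re.IGNORECASE,
)


def is_suspicious(html: str) -> bool:
    """Return True if a link's text does not mention the href hostname."""
    return MISMATCHED_HOST.search(html) is not None


phishing = ('<A HREF="http://phishers.org/we_want_your_money.htm">'
            'http://someLegitimateSite.com/somewhere </A>')
legit = '<a href="https://example.com/subfolder">example.com</a>'
bad = '<a href="http://somebadsite.com">https://somegoodsite.com</a>'
plain_text = '<a href="https://example.com/x">click here!</a>'

for sample in (phishing, legit, bad, plain_text):
    print(is_suspicious(sample), sample)
```

Because the scheme in the link text is optional, this port also flags links whose text is plain wording such as "click here!". Making the (?:https?://) group mandatory is one possible way to restrict the rule to link text that looks like a URL.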


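This is very close to what I need - I'm trying to work out how to negate the result, because I need the rule to fire on the non-matches rather than the matches. That way it triggers on the bad emails, rather than applying a score to the ones I want to keep. - Networking Guy
@NetworkingGuy, see my updated answer; you need a negative lookahead. - Jan
Jan, thanks. I had been putting various !'s all over the place, but nothing seemed to work. It looks like it needed more parentheses. - Networking Guy
Jan, I found a specific use case that I hadn't accounted for. I believe I have accounted for it now: an image tag as the "body" of the link. href=["']https?://(?<hostname>[^/"]+)[^>]+>((?!<IMG).)(?:https?://)?(?!.*\k'hostname') - Networking Guy
@Jan Thanks, really helpful. Is there any (alternative) way to achieve the same in VIM? Unfortunately, there is no \k... - daflodedeing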
