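I'm trying to create a regex to limit the amount of spam we receive. The problem is that I'm not well versed in regex. My work mostly consists of copy-pasting, tweaking, and searching for more material to help with the tweaking.
I decided to try using a regex to match emails in which a link misrepresents its hostname.
For example: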
<A HREF="http://phishers.org/we_want_your_money.htm">http://someLegitimateSite.com/somewhere </A>
I basically only care about the hostnames, both to limit false positives and to avoid flagging more or less legitimate links such as <A HREF...>click here!</A>
To date, I have this:
(HREF="http[s]?:\/\/)(?'hostname1'(.*?))[:|\/|"].*?\"\>(http[s]?:\/\/)(?'hostname2'(.*?))[<|\/|:]
According to https://regex101.com/ I have two named capture groups (hostname1 and hostname2), and a whack of other groups that I'm not sure I care about.
What I want to do is match the string only if hostname1 and hostname2 are different. I get the feeling that it involves either a lookbehind or a lookahead, but I honestly don't know.
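To make the goal concrete, here is a rough sketch of the check outside of a single regex: capture both hostnames, then compare them in code. It assumes Python's `re` module (which writes named groups as `(?P<name>...)`), and the names `LINK_RE` and `mismatched_links` are made up for illustration. The pattern is a simplified variant of the one above, not a drop-in replacement for a spam-filter rule.

```python
import re

# Hypothetical pattern: hostname1 comes from the href attribute,
# hostname2 from a URL shown as the visible link text.
LINK_RE = re.compile(
    r'''href=["']https?://(?P<hostname1>[^/:"']+)[^>]*>\s*https?://(?P<hostname2>[^/:<\s]+)''',
    re.IGNORECASE,
)


def mismatched_links(html):
    """Return (href_host, text_host) pairs whose hostnames differ."""
    pairs = []
    for m in LINK_RE.finditer(html):
        href_host = m.group("hostname1").lower()
        text_host = m.group("hostname2").lower()
        if href_host != text_host:
            pairs.append((href_host, text_host))
    return pairs


sample = ('<A HREF="http://phishers.org/we_want_your_money.htm">'
          'http://someLegitimateSite.com/somewhere </A>')
print(mismatched_links(sample))
```

Links whose visible text is not a URL (such as "click here!") never match the pattern, so they are ignored rather than flagged.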
EDIT: Thanks to Jan for prototyping this. Following the comments on his answer, I made one quick addition to cover the previously unaccounted-for case of image tags. Large websites (BestBuy, for example) store their images on a different content server, which was triggering the rule. I've decided to exclude image tags, which I BELIEVE (in my very non-expert opinion) I have successfully done. YMMV.
href=["']https?:\/\/(?<hostname>[^\/"]+)[^>]+>((?!<IMG).?)(?:https?:\/\/)?(?!.*\k'hostname')
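For anyone who wants to try this outside regex101, below is a minimal sketch of the same idea translated to Python's `re` syntax. This is an assumption on my part about how it maps over: Python does not accept `(?<hostname>...)` or `\k'hostname'`, so the named group becomes `(?P<hostname>...)` and the backreference becomes `(?P=hostname)`. The names `MISMATCH_RE` and `looks_suspicious` are hypothetical, and the sample links are invented for the test.

```python
import re

# Python-syntax translation of the pattern above. The negative lookahead
# rejects the link when the href hostname reappears later on the line.
MISMATCH_RE = re.compile(
    r'''href=["']https?://(?P<hostname>[^/"]+)[^>]+>((?!<IMG).?)(?:https?://)?(?!.*(?P=hostname))''',
    re.IGNORECASE,
)


def looks_suspicious(line):
    """True if the href hostname does not reappear after the opening tag."""
    return MISMATCH_RE.search(line) is not None


phish = ('<A HREF="http://phishers.org/we_want_your_money.htm">'
         'http://someLegitimateSite.com/somewhere </A>')
legit = '<a href="https://example.com/a">https://example.com/a</a>'
image = ('<a href="http://bestbuy.com/deal">'
         '<IMG src="http://images.cdn.example/x.png"></a>')

for link in (phish, legit, image):
    print(looks_suspicious(link))
```

Note that the lookahead only checks whether the hostname appears anywhere in the rest of the line, so a mismatched link whose href hostname happens to show up later in the same line would slip through.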
<a href="here" href="there">
- user557597