Regular expression to find URLs within a string

Question

158

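Does anyone know of a regular expression I could use to find URLs within a string? I've found plenty of regular expressions on Google for determining whether an entire string is a URL, but I need to be able to search a whole string for URLs. For example, I would like to find www.google.com and http://yahoo.com in the following string: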

Hello www.google.com World http://yahoo.com

I am not looking for specific URLs in the string. I am looking for all URLs in the string, which is why I need a regular expression.

- user758263

For PHP: preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $string, $match); from https://dev59.com/cXNA5IYBdhLWcg3wkuzO - Avatar

35 Answers

67

I'd say no single regex handles this use case perfectly. I found a fairly solid one:

/(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])/igm

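Some differences/advantages compared with the other patterns posted here:

It does not match email addresses
It does match localhost:12345
It won't detect something like moo.com without http or www

See here for an example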
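
As a rough illustration, the pattern above could be applied from Python along these lines. This is only a sketch: the helper name `find_urls` is hypothetical, and the JavaScript-style `/igm` flags are approximated with `re.IGNORECASE` plus `findall()`.

```python
import re

# Pattern taken from the answer above. The "i" flag maps to re.IGNORECASE;
# the "g" flag is implied by findall(), and "m" has no effect here because
# the pattern contains no ^ or $ anchors (each $ sits inside a character class).
URL_RE = re.compile(
    r"(?:(?:https?|ftp|file)://|www\.|ftp\.)"
    r"(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*"
    r"(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])",
    re.IGNORECASE,
)


def find_urls(text):
    """Return every substring of text that the pattern accepts."""
    # All groups are non-capturing, so findall() yields whole matches.
    return URL_RE.findall(text)


print(find_urls("Hello www.google.com World http://yahoo.com"))
# ['www.google.com', 'http://yahoo.com']
```

Because the final character class excludes trailing punctuation such as `.` and `,`, a URL at the end of a sentence is returned without the sentence's closing period.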

- Stefan Henze

7

This matches www.e, which is not a valid URL. - Ihor Herasymchuk

3

The g option is not supported by every regex implementation (Ruby's built-in one, for example). - Huliax

This one is much better, thanks! - Cake Princess

47

import re

text = """The link of this question: https://dev59.com/rm025IYBdhLWcg3wYE-G
Also there are some urls: www.google.com, facebook.com, http://test.com/method?param=wasd, http://test.com/method?param=wasd&params2=kjhdkjshd
The code below catches all urls in text and returns urls in list."""

# Raw string so the regex backslashes reach the re module unchanged
urls = re.findall(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-&?=%.]+', text)
print(urls)

Output:

[
    'https://dev59.com/rm025IYBdhLWcg3wYE-G', 
    'www.google.com', 
    'facebook.com',
    'http://test.com/method?param=wasd',
    'http://test.com/method?param=wasd&params2=kjhdkjshd'
]

- GooDeeJAY

Kotlin val urlRegex = "(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+" - Akshay Nandwana

2

This drops & parameters in the URL. For example, http://test.com/method?param=wasd&param2=wasd2 loses param2. - TrophyGeek

1

It also lacks support for URLs containing #. - nicolasassi

@TrophyGeek I think you just copied the regex from the first comment, where Akshay forgot to include &. The correct version should be: val urlRegex = "(?:(?:https?|ftp):\\/\\/)?[\\w/\\-?=%.]+\\.[\\w/\\-&?=%.]+". - Alec

1

This also treats hello... as a URL. - mathematics-and-caffeine
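
The comments above point at two gaps: `#` fragments are cut off, and strings such as `hello...` are accepted. One possible way to address both, sketched here under assumptions that are not part of the answer, is to add `#` to the trailing character class and then run each candidate through a second, stricter host check. The names `CANDIDATE_RE`, `HOST_RE`, and `find_urls` are hypothetical.

```python
import re

# The answer's pattern with "#" added to the trailing character class.
CANDIDATE_RE = re.compile(r"(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.[\w/\-&?=%.#]+")

# Assumed extra rule for this sketch: the host must consist of dot-separated
# labels and end in an alphabetic label of at least two letters.
HOST_RE = re.compile(r"^(?:[a-z]+://)?(?:[\w-]+\.)+[a-z]{2,}(?=[/?#:]|$)", re.IGNORECASE)


def find_urls(text):
    """Find loose candidates first, then keep those whose host looks plausible."""
    return [c for c in CANDIDATE_RE.findall(text) if HOST_RE.match(c)]


sample = "Visit http://test.com/a?x=1&y=2#top or www.google.com, but not hello... ok"
print(find_urls(sample))
# ['http://test.com/a?x=1&y=2#top', 'www.google.com']
```

Splitting the work into a permissive search and a separate validation step keeps each expression short, at the cost of a second pass over every candidate.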

15

I wrote one myself:

let regex = /([\w+]+\:\/\/)?([\w\d-]+\.)*[\w-]+[\.\:]\w+([\/\?\=\&\#\.]?[\w-]+)*\/?/gm

It works on all of the following:

https://www.facebook.com
https://app-1.number123.com
http://facebook.com
ftp://facebook.com
http://localhost:3000
localhost:3000/
unitedkingdomurl.co.uk
this.is.a.url.com/its/still=going?wow
shop.facebook.org
app.number123.com
app1.number123.com
app-1.numbEr123.com
app.dashes-dash.com
www.facebook.com
facebook.com
fb.com/hello_123
fb.com/hel-lo
fb.com/hello/goodbye
fb.com/hello/goodbye?okay
fb.com/hello/goodbye?okay=alright
Hello www.google.com World http://yahoo.com
https://www.google.com.tr/admin/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
https://google.com.tr/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
http://google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
ftp://google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
www.google.com.tr/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
www.google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
drive.google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
https://www.example.pl
http://www.example.com
www.example.pl
example.com
http://blog.example.com
http://www.example.com/product
http://www.example.com/products?id=1&page=2
http://www.example.com#up
http://255.255.255.255
255.255.255.255
shop.facebook.org/derf.html

You can check how it performs on regex101 and adjust it as needed.

- wongz

Your regex missed this one when I tested it. It captured only part of the URL: shop.facebook.org/derf.html - David Rector

1

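@DavidRector Thanks! You're absolutely right. I've updated the regex string and the regex101 URL based on your feedback. I added \. at the end of the second-to-last pair of square brackets [ ]. - wongz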

7

This also matches any string of the form alphanumeric.alphanumeric, such as a.r, b.4, or 7.e. These are not valid URLs. - Princy

2

Unfortunately, this also matches times such as 09:00. - Mike Kaply
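
The false positives reported in these comments can be reproduced with a short Python check. This is a sketch only: the pattern is copied from the answer, and the helper name `full_matches` is hypothetical.

```python
import re

# wongz's pattern from above, unchanged apart from Python raw-string quoting.
URL_RE = re.compile(
    r"([\w+]+\:\/\/)?([\w\d-]+\.)*[\w-]+[\.\:]\w+([\/\?\=\&\#\.]?[\w-]+)*\/?"
)


def full_matches(text):
    """Return whole matches; finditer() is used because the pattern has capturing groups."""
    return [m.group(0) for m in URL_RE.finditer(text)]


print(full_matches("shop.facebook.org/derf.html"))  # the case fixed after David Rector's comment
print(full_matches("a.r"))                          # Princy's false positive
print(full_matches("Meet at 09:00"))                # Mike Kaply's false positive
```

The pattern accepts any `word.word` or `word:word` shape, so callers who need fewer false positives would have to add their own check on the host or TLD.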

12

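None of the solutions provided here solved the issues or use cases I had.

What I'm providing here is the best I have found or made so far. I will update it when I find new edge cases it doesn't handle.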

\b
  #Word cannot begin with special characters
  (?<![@.,%&#-])
  #Protocols are optional, but take them with us if they are present
  (?<protocol>\w{2,10}:\/\/)?
  #Domains have to be of a length of 1 chars or greater
  ((?:\w|\&\#\d{1,5};)[.-]?)+
  #The domain ending has to be between 2 to 15 characters
  (\.([a-z]{2,15})
       #If no domain ending we want a port, only if a protocol is specified
       |(?(protocol)(?:\:\d{1,6})|(?!)))
\b
#Word cannot end with @ (made to catch emails)
(?![@])
#We accept any number of slugs, given we have a char after the slash
(\/)?
#If we have endings like ?=fds include the ending
(?:([\w\d\?\-=#:%@&.;])+(?:\/(?:([\w\d\?\-=#:%@&;.])+))*)?
#The last char cannot be one of these symbols .,?!,- exclude these
(?<![.,?!-])
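
The pattern above relies on a named group and a conditional, whose syntax varies between engines. As a sketch of how it might be ported to Python's `re` module: the named group is spelled `(?P<protocol>...)`, `re.VERBOSE` keeps the commented layout, and `re.IGNORECASE` is an assumption, since the answer does not say which flags it expects. The helper name `find_urls` is hypothetical.

```python
import re

# Port of the verbose pattern above. "#" is escaped everywhere so that
# re.VERBOSE never mistakes it for the start of a comment.
URL_RE = re.compile(
    r"""
    \b
    (?<![@.,%&\#-])                       # must not follow one of these characters
    (?P<protocol>\w{2,10}://)?            # optional protocol
    ((?:\w|&\#\d{1,5};)[.-]?)+            # domain labels
    (\.([a-z]{2,15})                      # domain ending, 2 to 15 letters
        |(?(protocol)(?::\d{1,6})|(?!)))  # or a port, only if a protocol matched
    \b
    (?![@])                               # must not be followed by @ (emails)
    (/)?
    (?:([\w?\-=\#:%@&.;])+(?:/(?:([\w?\-=\#:%@&;.])+))*)?
    (?<![.,?!-])                          # must not end in one of these characters
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_urls(text):
    """Return whole matches; the pattern has capturing groups, so use finditer()."""
    return [m.group(0) for m in URL_RE.finditer(text)]


print(find_urls("Hello www.google.com World http://yahoo.com"))
# ['www.google.com', 'http://yahoo.com']
print(find_urls("mail test@testing.com now"))
# []
```

The conditional `(?(protocol)...|(?!))` is what lets `http://localhost:3000/` through while rejecting a bare `localhost:3000`; engines without conditionals would need that rule expressed differently.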

- Squazz

1

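Is there a way to make this more JavaScript-friendly? Named capture groups are not fully functional there, so the protocol value check fails to validate. - einord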

9

I think this regex pattern handles exactly what you want.

/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/

Here is a code example that extracts the URLs:

// The Regular Expression filter
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";

// The Text you want to filter for urls
$text = "The text you want  https://dev59.com/rm025IYBdhLWcg3wYE-G to filter goes here.";

// Collect every URL found in the text into $matches
preg_match_all($reg_exUrl, $text, $matches);
var_dump($matches);

- Yuseferi

8

If you have to be strict about which links you select, I would go for:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

For more information, read:

An Improved Liberal, Accurate Regex Pattern for Matching URLs

- Tommaso Belluzzo

5

Don't do that. It can crash your application through catastrophic backtracking; see http://www.regular-expressions.info/catastrophic.html. - Auric

7

None of the answers above match Unicode characters in a URL, for example: http://google.com?query=đức+filan+đã+search. Here is a solution:

(ftp:\/\/|www\.|https?:\/\/){1}[a-zA-Z0-9\u00a1-\uffff0-]{2,}\.[a-zA-Z0-9\u00a1-\uffff0-]{2,}(\S*)

- Duc Filan

2

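Per RFC 1738, Unicode characters are not allowed in URLs; to comply with the standard they must be percent-encoded. I think this may have changed more recently, though, so it is worth reading https://www.w3.org/International/articles/idn-and-iri/. - mrswadge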

@mrswadge I'm just covering those cases. We can't be sure everyone cares about the standard. Thanks for the information. - Duc Filan

1

This is the only one that worked perfectly for me, including on the following URLs: "http://www.example.com" "www.exmaple.com" "https://example.com" "ftp://example.co.in" "http://www.exmaple.com/?q='me'" - Krissh

6

I use the following regex to find URLs in a string:

/(http|https)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/

- aditya

3

[a-zA-Z]{2,3} is far too crude for matching TLDs; see the official list: https://data.iana.org/TLD/tlds-alpha-by-domain.txt. - Toto
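
One way to act on this comment, sketched here rather than taken from any answer, is to drop the length limit on the TLD and check the captured TLD against a known list instead. `KNOWN_TLDS` below is a small illustrative subset and an assumption of this sketch; a real check would load the full IANA file. The names `URL_RE` and `find_urls` are hypothetical.

```python
import re

# Illustrative subset only (an assumption for this sketch); a real check
# would load the full IANA list referenced in the comment above.
KNOWN_TLDS = {"com", "org", "net", "info", "museum"}

# Same shape as the answer's pattern, but the {2,3} limit is dropped and the
# TLD is captured so it can be compared against KNOWN_TLDS.
URL_RE = re.compile(r"https?://[a-zA-Z0-9.-]+\.([a-zA-Z]+)(?:/\S*)?")


def find_urls(text):
    """Keep only matches whose captured TLD appears in KNOWN_TLDS."""
    return [
        m.group(0)
        for m in URL_RE.finditer(text)
        if m.group(1).lower() in KNOWN_TLDS
    ]


sample = "see http://example.museum/x and https://dev59.com/q and http://host.notatld"
print(find_urls(sample))
# ['http://example.museum/x', 'https://dev59.com/q']
```

With the original `{2,3}` limit, a six-letter TLD such as `museum` cannot be matched in full, whereas the list-based check accepts it and still rejects made-up endings.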

6

I found this one, which covers most of the sample links, including the subdirectory parts.

The regex is:

(?:(?:https?|ftp):\/\/|\b(?:[a-z\d]+\.))(?:(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))?\))+(?:\((?:[^\s()<>]+|(?:\(?:[^\s()<>]+\)))?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))?

- Thilanka Bowala

When I tried this, the ends of sentences were flagged as matches. In the sentence above, the last word "match" and the period are matched. - David Rector

- Rajeev · Accepted Answer

292

This is the one I use:

(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])

It works for me and should work for you too.
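
For reference, a minimal Python sketch of applying this pattern to the string from the question. The helper name `find_urls` is hypothetical, and the forward slashes are left unescaped because Python does not need them escaped.

```python
import re

# Pattern from the answer above, split over two lines for readability.
URL_RE = re.compile(
    r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))"
    r"([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])"
)


def find_urls(text):
    """Return whole matches; the pattern has capturing groups, so use finditer()."""
    return [m.group(0) for m in URL_RE.finditer(text)]


print(find_urls("Hello www.google.com World http://yahoo.com"))
# ['http://yahoo.com']
print(find_urls("Docs are at https://example.com/a?b=1."))
# ['https://example.com/a?b=1']
```

Since the scheme is mandatory in this pattern, `www.google.com` is not returned; the last character class also keeps a sentence-ending period out of the match.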

- Rajeev

12

Don't forget to escape the forward slashes. - Mark

3

It's 2017 now, and Unicode domain names are everywhere. \w may not match international symbols (depending on the regex engine), so ranges are needed instead: a-zA-Z0-9\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF. - Michael Antipin
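
How much this matters depends on the engine. In Python 3, for instance, `\w` in a `str` pattern matches Unicode word characters by default and only falls back to ASCII when `re.ASCII` is passed. The sketch below shows the difference using the accepted answer's pattern; the sample URL is made up for illustration.

```python
import re

PATTERN = (
    r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))"
    r"([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])"
)

unicode_re = re.compile(PATTERN)          # \w is Unicode-aware (Python 3 default for str)
ascii_re = re.compile(PATTERN, re.ASCII)  # \w restricted to [a-zA-Z0-9_]

url = "http://пример.испытание/страница"  # made-up internationalized URL

print(unicode_re.search(url) is not None)  # True
print(ascii_re.search(url) is not None)    # False
```

In engines where `\w` is ASCII-only, explicit ranges like the ones in the comment above would be needed to get the first behaviour.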

5

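This is fine for general purposes, but there are many cases it does not catch. It forces your links to be prefixed with a protocol. If you choose to drop the protocol requirement, the ends of email addresses are accepted, as with test@testing.com. - Squazz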

7

Shouldn't [\w_-] be [\w-]? \w already matches _. See the mozilla docs. - transang

11

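Upvoted, but this answer does not do what the question asks: the pattern requires a scheme, so in the question's string it finds http://yahoo.com but not www.google.com. The answer also lacks an explanation. - prayagupa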
