How to get the domain name from a URL

Question

64

How can I get the domain name from a URL string?

Examples:

+----------------------+------------+
| input                | output     |
+----------------------+------------+
| www.google.com       | google     |
| www.mail.yahoo.com   | mail.yahoo |
| www.mail.yahoo.co.in | mail.yahoo |
| www.abc.au.uk        | abc        |
+----------------------+------------+

Related:

Matching a web address through regex

- Chinmay

4

What about www.abc.def.ghi.au.uk? - Miserable Variable

2

What about "foo.bar.com"? And "foo.com"? - Bombe

This is the second post within minutes on a very similar topic. Homework? (http://stackoverflow.com/questions/568864/maching-a-web-address-through-regex) - gimpf

4

@Chinmay: Your terminology is off in several places. All the inputs you list are domain names, not URLs. This is a URL: http://en.wikipedia.org/wiki/URL, and the domain name in that URL is en.wikipedia.org. - Thanatos

I found this answer very useful: https://dev59.com/eFPTa4cB1Zd3GeqPmclA#4820675. - Philipp

25 Answers

24

A bit late, but:

const urls = [
  'www.abc.au.uk',
  'https://github.com',
  'http://github.ca',
  'https://www.google.ru',
  'http://www.google.co.uk',
  'www.yandex.com',
  'yandex.ru',
  'yandex'
]

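// Strip the scheme ("http://"), a "www." prefix, and everything from the first remaining dot onward.
// The dot after "www" is escaped so that it only matches a literal dot.
urls.forEach(url => console.log(url.replace(/.+\/\/|www\.|\..+/g, '')))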

- Mike K

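This is my favorite answer. Thanks. - Tudor

2

This doesn't work: for the input www.mail.yahoo.co.in the expected output is mail.yahoo, but the actual output is mail. - Quentin

3

No problem, and the accepted answer is fine too, but this approach is more scalable and dynamic. It only falls short in the 10-20% of specific cases you might need to match, and for those you can hard-code a solution the way the accepted answer does. This answer is for the community, not for the asker, who already got an answer 11 years ago. - Mike K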

16

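Extracting the domain name accurately can be tricky, mainly because the domain extension can have two parts (such as .com.au or .co.uk) and a subdomain (prefix) may or may not be present. Listing every domain extension is not an option because there are hundreds of them. EuroDNS.com, for example, lists over 800 domain extensions.

So I wrote a short PHP function that uses 'parse_url()' and a few observations about domain extensions to extract the URL components and the domain name accurately. The function is as follows: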

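function parse_url_all($url){
    // parse_url() only recognises the host when a scheme is present, so add one if it is missing
    $url = substr($url,0,4)=='http'? $url: 'http://'.$url;
    $d = parse_url($url);
    $tmp = explode('.',$d['host']);
    $n = count($tmp);
    if ($n>=2){
        // 4 labels, or 3 labels with a short second-to-last label (e.g. "co" in test.co.uk):
        // assume a two-part extension and keep the last three labels
        if ($n==4 || ($n==3 && strlen($tmp[($n-2)])<=3)){
            $d['domain'] = $tmp[($n-3)].".".$tmp[($n-2)].".".$tmp[($n-1)];
            $d['domainX'] = $tmp[($n-3)];
        } else {
            // otherwise assume a one-part extension and keep the last two labels
            $d['domain'] = $tmp[($n-2)].".".$tmp[($n-1)];
            $d['domainX'] = $tmp[($n-2)];
        }
    }
    return $d;
}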

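This simple function works in almost every case. There are a few exceptions, but they are very rare.

To demonstrate/test the function, you can use the following: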

$urls = array('www.test.com', 'test.com', 'cp.test.com' .....);
echo "<div style='overflow-x:auto;'>";
echo "<table>";
echo "<tr><th>URL</th><th>Host</th><th>Domain</th><th>Domain X</th></tr>";
foreach ($urls as $url) {
    $info = parse_url_all($url);
    echo "<tr><td>".$url."</td><td>".$info['host'].
    "</td><td>".$info['domain']."</td><td>".$info['domainX']."</td></tr>";
}
echo "</table></div>";

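For the URLs listed, the output looks like this:

As you can see, the domain and the domain without its extension are extracted consistently, whatever URL is passed to the function.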
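
For readers without PHP at hand, here is a rough Python sketch of the same heuristic. It is an illustrative port, not part of the original function: it reuses the name parse_url_all for readability, relies on urllib.parse from the standard library, and returns only the host, domain and domainX keys. One difference to be aware of is that Python lowercases the hostname. The last call shows the short-label case (www.cnn.com), where the length-based guess picks the wrong label.

```python
from urllib.parse import urlparse

def parse_url_all(url):
    """Guess the registered domain from label count and label length (heuristic only)."""
    # Like the PHP version: add a scheme so the parser can recognise the host.
    if not url.startswith("http"):
        url = "http://" + url
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    n = len(labels)
    result = {"host": host, "domain": None, "domainX": None}
    if n >= 2:
        # 4 labels, or 3 labels with a short second-to-last label:
        # assume a two-part extension such as .co.uk and keep three labels.
        if n == 4 or (n == 3 and len(labels[n - 2]) <= 3):
            result["domain"] = ".".join(labels[n - 3:])
            result["domainX"] = labels[n - 3]
        else:
            result["domain"] = ".".join(labels[n - 2:])
            result["domainX"] = labels[n - 2]
    return result

for sample in ["www.test.com", "test.co.uk", "www.test.co.uk", "www.cnn.com"]:
    info = parse_url_all(sample)
    print(sample, "->", info["domain"], "/", info["domainX"])
```

The first three samples give test.com, test.co.uk and test.co.uk with domainX equal to test. For www.cnn.com the heuristic returns www.cnn.com with domainX equal to www, because a three-letter label looks the same as a short second-level extension.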

I hope this helps.

- Clinton

1

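Clinton said: "So I wrote a short PHP function that uses 'parse_url()' and a few observations about domain extensions to extract the URL components and the domain name accurately." Does anyone have a JavaScript version of this function? - JMichaelTX

1

Nice script. Is it still safe to use today? - garry man

Thanks. I still use it in many applications that involve URL and domain checks, and it works every time. - Clinton

I don't have PHP available to test your code. Would sub1.sub2.test.co.it work with it? - ToiletGuy

The code currently handles up to 4 levels in a domain. By changing the innermost if statement you can easily extend it to 5 levels (as in your example). - Clinton

1

This is a nice little script that works in 95% of cases. Thanks! Just wanted to point out that it fails when the domain name is 3 letters or fewer (for example www.cnn.com), so be careful if you are just copying and pasting. The problem is that there is no way to tell whether the domain is "www" with "cnn.com" as the TLD or "cnn" with "com" as the TLD. In this case it is obviously easy to tell, but you would need to know every TLD to be sure. - Jake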

9

/^(?:www\.)?(.*?)\.(?:com|au\.uk|co\.in)$/
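
To see what this pattern does with the inputs from the question, here is a small Python check. The pattern is copied unchanged; the helper name extract_name and the extra www.google.de input are only for illustration.

```python
import re

# The pattern from this answer, unchanged.
PATTERN = re.compile(r"^(?:www\.)?(.*?)\.(?:com|au\.uk|co\.in)$")

def extract_name(host):
    """Return the captured name, or None when the suffix is not in the pattern."""
    match = PATTERN.match(host)
    return match.group(1) if match else None

hosts = ["www.google.com", "www.mail.yahoo.com",
         "www.mail.yahoo.co.in", "www.abc.au.uk", "www.google.de"]
for host in hosts:
    print(host, "->", extract_name(host))
```

The four inputs from the question come out as google, mail.yahoo, mail.yahoo and abc. www.google.de gives None, because .de is not one of the three suffixes hard-coded in the pattern.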

- J.F. Sebastian

I think the examples are only there to illustrate a general rule. This only works for the inputs given in the original post. - Marko

9

There are two ways

Using split

Then parse that string

var domain;
// find & remove protocol (http, ftp, etc.) and get domain
if (url.indexOf('://') > -1) {
    domain = url.split('/')[2];
} else if (url.indexOf('//') === 0) {
    // protocol-relative URL such as //example.com/path
    domain = url.split('/')[2];
} else {
    domain = url.split('/')[0];
}

// find & remove port number
domain = domain.split(':')[0];

Using a regular expression

 var r = /:\/\/(.[^/]+)/;
 "https://dev59.com/kG435IYBdhLWcg3wkA5e".match(r)[1]
 // => "dev59.com"

Hope this helps.

- Fizer Khan

This works, but it requires the URL to include a protocol. - master_dodo

4

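This is not possible without a TLD list to compare against, because there are many cases like http://www.db.de/ or http://bbc.co.uk/, which a regex would interpret as the domains db.de (correct) and co.uk (wrong).

But even then you won't succeed if your list does not contain second-level domains (SLDs). URLs like https://liverpool.gov.uk.com/ would be interpreted as gov.uk.com (wrong).

That is why all browsers use Mozilla's Public Suffix List: https://en.wikipedia.org/wiki/Public_Suffix_List

You can use it in your code by importing it from this URL: https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat

Feel free to extend my function, which extracts only the domain name. It doesn't use regex and it is fast: http://www.programmierer-forum.de/domainnamen-ermitteln-t244185.htm#3471878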
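
As a rough illustration of a suffix-list lookup (this is not the linked function), here is a minimal Python sketch. It assumes the rules are already available as text in the list's one-rule-per-line format with // comments. SAMPLE_RULES is a hand-picked stand-in for the real file, and the helper names are made up for this example. Wildcard (*.) and exception (!) rules from the real list are not handled.

```python
# Hand-picked sample in Public Suffix List format; NOT the real file.
SAMPLE_RULES = """
// comment lines start with two slashes
com
de
in
co.in
uk
co.uk
"""

def load_suffixes(text):
    """Collect plain rules, skipping blank lines and // comments."""
    suffixes = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("//"):
            suffixes.add(line.lower())
    return suffixes

def registrable_domain(host, suffixes):
    """Return the longest matching suffix plus one label to its left, else None."""
    labels = host.lower().rstrip(".").split(".")
    # Candidates run from the whole host down to the last label,
    # so the first hit is the longest matching suffix.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # i == 0 means the host itself is a public suffix.
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None

SUFFIXES = load_suffixes(SAMPLE_RULES)
for host in ["www.db.de", "bbc.co.uk", "www.mail.yahoo.co.in", "co.uk"]:
    print(host, "->", registrable_domain(host, SUFFIXES))
```

With this sample, www.db.de gives db.de, bbc.co.uk stays bbc.co.uk, and www.mail.yahoo.co.in gives yahoo.co.in. A bare co.uk returns None, since it is itself a suffix.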

- mgutt

4

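I don't know of any library for this, but the string manipulation of domain names is easy enough.

The hard part is knowing whether the name sits at the second or the third level. For that you need a data file that you maintain (for .uk, for example, it is not always the third level; some organisations, such as bl.uk and jet.uk, exist at the second level).

The Mozilla Firefox source code includes such a data file. Check the Mozilla license to see whether you can reuse it.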

- Richard

3

from urllib.parse import urlparse  # Python 3 (the Python 2 module was called urlparse)

GENERIC_TLDS = [
    'aero', 'asia', 'biz', 'com', 'coop', 'edu', 'gov', 'info', 'int', 'jobs',
    'mil', 'mobi', 'museum', 'name', 'net', 'org', 'pro', 'tel', 'travel', 'cat'
    ]

def get_domain(url):
    hostname = urlparse(url.lower()).netloc
    if hostname == '':
        # Force the recognition as a full URL
        hostname = urlparse('http://' + url.lower()).netloc

    # Remove the 'user:passw' and ':port' parts
    hostname = hostname.split('@')[-1].split(':')[0]
    # Remove a leading 'www.'; lstrip('www.') would strip any leading 'w' or '.' characters
    if hostname.startswith('www.'):
        hostname = hostname[4:]
    hostname = hostname.split('.')

    num_parts = len(hostname)
    if (num_parts < 3) or (len(hostname[-1]) > 2):
        return '.'.join(hostname[:-1])
    if len(hostname[-2]) > 2 and hostname[-2] not in GENERIC_TLDS:
        return '.'.join(hostname[:-1])
    if num_parts >= 3:
        return '.'.join(hostname[:-2])

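This code is not guaranteed to work with every URL, and it does not filter out URLs that are grammatically correct but invalid, such as 'example.uk'.

However, it will do the job in most cases.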

- Juan-Pablo Scaletti

2

Basically, what you want is:

google.com        -> google.com    -> google
www.google.com    -> google.com    -> google
google.co.uk      -> google.co.uk  -> google
www.google.co.uk  -> google.co.uk  -> google
www.google.org    -> google.org    -> google
www.google.org.uk -> google.org.uk -> google

Optionally:

www.google.com     -> google.com    -> www.google
images.google.com  -> google.com    -> images.google
mail.yahoo.co.uk   -> yahoo.co.uk   -> mail.yahoo
mail.yahoo.com     -> yahoo.com     -> mail.yahoo
www.mail.yahoo.com -> yahoo.com     -> mail.yahoo

You don't need to build an ever-changing regex. Just look at the second-to-last part of the name and you will match 99% of domains correctly.

(co|com|gov|net|org)

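If it is one of these, you need to keep the last three dot-separated parts, otherwise the last two. Simple. Now, my regex wizardry is no match for that of some other SO users, so the best way I've found to achieve this is with some code, assuming you have already stripped off the path: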

 my @d=split /\./,$domain;                    # split the domain part into an array
 $c=@d;                                       # count how many parts
 $dest=$d[$c-2].'.'.$d[$c-1];                 # use the last 2 parts
 if ($d[$c-2]=~m/^(co|com|gov|net|org)$/) {   # is the second-last part exactly one of these?
   $dest=$d[$c-3].'.'.$dest;                  # if so, add a third part
 };
 print $dest;                                 # show it

Going by your question, if you only need the name:

 my @d=split /\./,$domain;                    # split the domain part into an array
 $c=@d;                                       # count how many parts
 if ($d[$c-2]=~m/^(co|com|gov|net|org)$/) {   # is the second-last part exactly one of these?
   $dest=$d[$c-3];                            # if so, give the third last
   $dest=$d[$c-4].'.'.$dest if ($c>3);        # optional bit
 } else {
   $dest=$d[$c-2];                            # else the second last
   $dest=$d[$c-3].'.'.$dest if ($c>2);        # optional bit
 };
 print $dest;                                 # show it

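I like this approach because it is maintenance-free. Unless you want to validate that it is actually a legitimate domain, but that is somewhat pointless, because you are most likely only using this to process log files, and an invalid domain wouldn't end up there in the first place.

If you want to match "unofficial" subdomains such as bozo.za.net, bozo.au.uk, or bozo.msf.ru, just add (za|au|msf) to the regex.

I'd love to see someone do all of this with just a regex; I'm sure it's possible.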
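
For comparison, here is a minimal Python sketch of the same second-to-last-part heuristic. It reproduces the first table above (registered domain and bare name) and leaves out the optional subdomain handling. The names SECOND_LEVEL and split_domain are made up for this example, and the guard for hosts with fewer than three parts is an addition that the Perl snippets do not have.

```python
# Second-level labels that signal a two-part suffix such as .co.uk or .org.uk.
SECOND_LEVEL = {"co", "com", "gov", "net", "org"}

def split_domain(host):
    """Return (registered domain, name without suffix) using the heuristic."""
    parts = host.lower().split(".")
    if len(parts) < 2:
        # Nothing to split, e.g. "localhost".
        return host, host
    # The suffix spans two labels only when the second-to-last label is a known
    # second-level label AND there is still a label left over for the name.
    two_part = len(parts) > 2 and parts[-2] in SECOND_LEVEL
    keep = 3 if two_part else 2
    domain = ".".join(parts[-keep:])
    name = parts[-keep]
    return domain, name

for host in ["google.com", "www.google.com", "google.co.uk",
             "www.google.co.uk", "www.google.org", "www.google.org.uk"]:
    print(host, "->", split_domain(host))
```

Every host in that loop gives google as the name, with google.com, google.co.uk, google.org or google.org.uk as the registered domain.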

- dagelf

1

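I know you actually asked about regex and not about a specific language. But in JavaScript you can do it like this. Maybe other languages can parse URLs in a similar way.

Simple JavaScript solution

// str must be an absolute URL including the scheme, otherwise new URL() throws
const domain = (new URL(str)).hostname.replace(/^www\./, "");

Leaving this solution in JS for completeness.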
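
Python's standard library can do much the same. The sketch below uses urllib.parse.urlsplit; the function name hostname_without_www is made up for this example. Prefixing "//" when no scheme is present is an addition that the JavaScript one-liner does not have.

```python
from urllib.parse import urlsplit

def hostname_without_www(url):
    """Return the lowercased host of a URL with a leading 'www.' removed."""
    # urlsplit only fills in the host when the string has a scheme or starts with "//".
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

print(hostname_without_www("https://www.google.co.uk/search?q=x"))
print(hostname_without_www("www.mail.yahoo.com"))
print(hostname_without_www("http://user:pw@Example.com:8080/path"))
```

The three calls print google.co.uk, mail.yahoo.com and example.com. As with the JavaScript version, this returns the full hostname minus "www.", not the name without its suffix.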

- Ariel M.

- pi · Accepted Answer

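I once had to write such a regex for a company. The solution was:

Get a list of every ccTLD and gTLD available. Your first stop should be IANA. The list from Mozilla looks great at first sight, but it lacks ac.uk, for example, so it isn't really usable for this.
Join the list as in the example below. Warning: ordering is important! If org.uk appeared after uk, example.org.uk would match org instead of example.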

Example regex:

([^.]+)\.(com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)$

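This worked really well and also matched odd, unofficial top-level domains such as de.com.

The upside:

Very fast, if the regex is well optimized

The downside of this solution is of course:

A handwritten regex has to be updated manually whenever ccTLDs change or are added. Tedious job!
Very large regex, so not very readable.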
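
One way to ease the maintenance burden is to generate the regex from a plain list instead of writing it by hand. The Python sketch below is only an illustration under assumptions: SUFFIXES is a deliberately tiny, made-up list (a real one would be built from IANA data or similar), and build_pattern and split_host are hypothetical helper names. The generated pattern puts a literal dot between the name and the suffix and is applied with a search anchored at the end of the host.

```python
import re

# Hypothetical, deliberately tiny suffix list for illustration only.
SUFFIXES = ["uk", "com", "org", "co.uk", "org.uk", "ac.uk", "de.com"]

def build_pattern(suffixes):
    """Build one regex: a single label, a literal dot, then any listed suffix."""
    # Longest suffixes first, following the ordering advice above.
    ordered = sorted(suffixes, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    return re.compile(r"([^.]+)\.(" + alternation + r")$", re.IGNORECASE)

PATTERN = build_pattern(SUFFIXES)

def split_host(host):
    """Return (name, suffix), or None when no listed suffix matches."""
    match = PATTERN.search(host)
    return (match.group(1), match.group(2)) if match else None

for host in ["www.example.org.uk", "example.de.com", "bbc.co.uk", "localhost"]:
    print(host, "->", split_host(host))
```

This gives ('example', 'org.uk'), ('example', 'de.com') and ('bbc', 'co.uk') for the first three hosts, and None for localhost. Regenerating the pattern whenever the list changes removes the manual editing step. The regex itself is still large for a full list.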