How to extract the top-level domain (TLD) from a URL

Question

69

How do I extract the domain name from a URL, excluding any subdomains?

My initial, simplistic attempt was:

'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

This works for http://www.foo.com but not for http://www.foo.com.au. Is there a proper way to do this without relying on special knowledge about valid TLDs or country codes (since they change)?
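To make the failure concrete, here is a minimal sketch of the same one-liner wrapped in a function. It assumes Python 3, where the function lives in urllib.parse; the name naive_domain is made up for illustration.

```python
from urllib.parse import urlparse

def naive_domain(url):
    # Keep only the last two dot-separated labels of the host.
    return ".".join(urlparse(url).netloc.split(".")[-2:])

print(naive_domain("http://www.foo.com"))     # foo.com
print(naive_domain("http://www.foo.com.au"))  # com.au (the registered name "foo" is lost)
```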

Thanks

- hoju

2

There is a related earlier question on Stack Overflow: https://dev59.com/InRB5IYBdhLWcg3wn4fb - Conspicuous Compiler

1

+1: The "simplistic attempt" in this question works well for me, even though, ironically, it did not work for its author. - ArtOfWarfare

Similar question: https://dev59.com/cmYq5IYBdhLWcg3whRHV - user2314737

8 Answers

54

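No, there is no "intrinsic" way of knowing that (for example) zap.co.it is a subdomain (because Italy's registrar does sell domains such as co.it) while zap.co.uk is not (because the UK's registrar does not sell domains such as co.uk, only ones like zap.co.uk).

You will just have to use an auxiliary table (or an online source) that tells you which TLDs behave peculiarly, like the UK's and Australia's. There is no way to work that out just by staring at the string without such extra semantic knowledge. (Of course the rules can change eventually, but if you can find a good online source, that source will change accordingly, one hopes!)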
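A minimal sketch of the auxiliary-table idea, assuming Python 3. The table TWO_LABEL_SUFFIXES and the function registered_domain are hypothetical names, and the three entries are only examples; a real table would be far longer and would need to be kept up to date.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny auxiliary table of suffixes that
# span two labels. A real table would come from a maintained source.
TWO_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au"}

def registered_domain(url):
    # hostname is None for URLs without a scheme; this sketch assumes one.
    labels = urlsplit(url).hostname.split(".")
    if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES:
        # The suffix takes two labels, so keep one more label in front.
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

print(registered_domain("http://www.foo.com"))     # foo.com
print(registered_domain("http://www.foo.com.au"))  # foo.com.au
```

Anything missing from the table silently falls back to the two-label guess, which is exactly why the table has to come from a maintained source.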

- Alex Martelli

42

Using this file of effective TLDs, which someone else found on Mozilla's website:

from urllib.parse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt", encoding="utf8") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde","co","uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds:
        if (exception_candidate in tlds):
            return ".".join(url_elements[i:])
        if (candidate in tlds or wildcard_candidate in tlds):
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print(get_domain("http://abcde.co.uk", tlds))

results in:

abcde.co.uk

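If someone could let me know how to rewrite the above in a more Pythonic way, I would appreciate it. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know whether ValueError is the best exception to raise. Comments?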
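One possible tidier variant, offered as a sketch rather than a definitive answer. It assumes Python 3, takes the rules as a set passed in by the caller, and walks forward through the labels instead of using negative indices. The name registrable_domain is made up, and unlike the code above it raises when the host is itself a suffix.

```python
from urllib.parse import urlsplit

def registrable_domain(url, rules):
    """rules: a set of suffix-list entries such as 'co.uk', '*.ck', '!www.ck'."""
    # hostname is None for URLs without a scheme; this sketch assumes one.
    labels = urlsplit(url).hostname.split(".")
    # Try the longest candidate suffix first, then drop leading labels.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself can be registered.
            return candidate
        if candidate in rules or wildcard in rules:
            if i == 0:
                raise ValueError("Host is itself a public suffix: " + candidate)
            # The suffix matched, so keep one extra label in front of it.
            return ".".join(labels[i - 1:])
    raise ValueError("Domain not in the list of suffix rules")

rules = {"uk", "co.uk", "*.ck", "!www.ck"}
print(registrable_domain("http://abcde.co.uk", rules))  # abcde.co.uk
print(registrable_domain("http://x.www.ck", rules))     # www.ck
```

ValueError seems a reasonable choice here, since the argument has the right type but a value the rules cannot handle.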

- Markus

10

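If you are going to call getDomain() a lot in practice, such as when extracting domains from a large log file, I would recommend making tlds a set, e.g. tlds = set([line.strip() for line in tldFile if line[0] not in "/\n"]). This gives you constant-time lookup for each check of whether an item is in tlds. I saw a speedup of about 1500 times for the lookups (set vs. list), and about a 60 times speedup for my entire operation of extracting domains from a ~20 million line log file (from 6 hours down to 6 minutes). - Bryce Thomas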

1

This is awesome! Just one question: is the effective_tld_names.dat file also updated for new domains such as .amsterdam, .vodka and .wtf? - kramer65

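The Mozilla Public Suffix List is regularly maintained, and there are now several Python libraries that include it. See http://publicsuffix.org/ and the other answers on this page. - tripleee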

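Some updates for getting this right in 2021: the file is now called public_suffix_list.dat, and Python will complain if you do not specify that it should read the file as UTF-8. Specify the encoding explicitly: with open("public_suffix_list.dat", encoding="utf8") as tld_file. - Andrei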
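Putting the set suggestion and the encoding note together, a loader might look roughly like this. It is a sketch assuming Python 3 and a local copy of the list; load_suffix_rules is a made-up name.

```python
def load_suffix_rules(path):
    """Read suffix rules from a local copy of the list into a set."""
    rules = set()
    # The list contains non-ASCII entries, so name the encoding explicitly.
    with open(path, encoding="utf8") as suffix_file:
        for line in suffix_file:
            line = line.strip()
            # Skip blank lines and "//" comment lines.
            if line and not line.startswith("//"):
                rules.add(line)
    return rules
```

Calling load_suffix_rules("public_suffix_list.dat") would then return a set, so each `in` check is a hash lookup rather than a scan of the whole list.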

42

Using python tld

https://pypi.python.org/pypi/tld

Install

pip install tld

Get the TLD name as a string from the given URL

from tld import get_tld
print(get_tld("http://www.google.co.uk"))

co.uk

or without the protocol

from tld import get_tld

get_tld("www.google.co.uk", fix_protocol=True)

co.uk

Get the TLD as an object

from tld import get_tld

res = get_tld("http://some.subdomain.google.co.uk", as_object=True)

res
# 'co.uk'

res.subdomain
# 'some.subdomain'

res.domain
# 'google'

res.tld
# 'co.uk'

res.fld
# 'google.co.uk'

res.parsed_url
# SplitResult(
#     scheme='http',
#     netloc='some.subdomain.google.co.uk',
#     path='',
#     query='',
#     fragment=''
# )

Get the first-level domain name as a string from the given URL

from tld import get_fld

get_fld("http://www.google.co.uk")
# 'google.co.uk'

- Artur Barseghyan

2

This will become more unreliable with the new gTLDs. - Sjaak Trekhaak

1

Hey, thanks for pointing this out. I guess that once the new gTLDs are actually in use, a proper fix could make its way into the tld package. - Artur Barseghyan

3

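As said, if gTLDs become widely used and turn out to be problematic, a proper fix will be provided. In the meantime, if you are very concerned about gTLDs, you can always catch the tld.exceptions.TldDomainNotFound exception and carry on with whatever you were doing, even if the domain has not been found. - Artur Barseghyan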

1

Is it just me, or does tld.get_tld() actually return a fully qualified domain name rather than a top-level domain? - Marian

get_tld("http://www.google.co.uk", as_object=True).extension would print out "co.uk". - Artur Barseghyan

2

There are many, many TLDs. Here's the list:

http://data.iana.org/TLD/tlds-alpha-by-domain.txt

Here's another list:

http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

Here's another list:

http://www.iana.org/domains/root/db/

- S.Lott

1

That doesn't help, because it doesn't tell you which ones have an "extra level", like co.uk. - Lennart Regebro

Lennart: It does help; you can wrap them as optional groups within a regex. - lprsd
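A rough sketch of what the regex idea might look like, assuming Python 3. The SUFFIXES list is hypothetical and hand-written: the IANA file only supplies entries like "uk" and "au", so the two-label entries here would still have to come from somewhere else, which is Lennart's point.

```python
import re

# Hypothetical short list. "co.uk" and "com.au" are NOT in the IANA
# TLD file and were added by hand for this illustration.
SUFFIXES = ["co.uk", "com.au", "com", "uk", "au"]

# Try longer suffixes first within the alternation.
alternation = "|".join(
    re.escape(s) for s in sorted(SUFFIXES, key=len, reverse=True)
)
DOMAIN_RE = re.compile(r"([a-z0-9-]+)\.(" + alternation + r")$", re.IGNORECASE)

def split_host(host):
    """Return (name, suffix) for a bare host name, or None if no suffix matches."""
    match = DOMAIN_RE.search(host)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(split_host("www.foo.com.au"))  # ('foo', 'com.au')
print(split_host("www.foo.com"))     # ('foo', 'com')
```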

0

Until get_tld is updated for all the new TLDs, I pull the TLD from the error. Sure, it's bad code, but it works.

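import re
from tld import get_tld

# Meant as a method on a class that has a content_url attribute.
# Renamed so that it no longer shadows (and recursively calls) tld.get_tld.
def get_tld_with_fallback(self):
  try:
    return get_tld(self.content_url)
  except Exception as e:
    # Pull the domain out of the library's error message.
    re_domain = re.compile("Domain ([^ ]+) didn't match any existing TLD name!")
    matchObj = re_domain.findall(str(e))
    if matchObj:
      return matchObj[0]
    raise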

- Russ Savage

-1

Here's how I handle it:

if not url.startswith('http'):
    url = 'http://'+url
website = urlparse.urlparse(url)[1]
domain = ('.').join(website.split('.')[-2:])
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
if not match:
    sys.exit(2)
elif not match.group(0):
    sys.exit(2)

- Ryan Buckley

3

There is a TLD called .travel. It won't work with the above code. - Sri
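To see what happens with .travel, here is the same regular expression run in isolation (a sketch assuming Python 3; matched_text is a made-up helper). Because the final group allows at most four letters, the match is cut short instead of failing outright.

```python
import re

# The pattern from the answer above, unchanged.
PATTERN = re.compile(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', re.I)

def matched_text(domain):
    match = PATTERN.search(domain)
    return match.group(0) if match else None

print(matched_text("foo.com"))         # foo.com
print(matched_text("example.travel"))  # example.trav
```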

-1

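In Python I used to use tldextract, until it failed on a URL like www.mybrand.sa.com, parsing it as subdomain='order.mybrand', domain='sa', suffix='com'!!

So, finally, I decided to write this method.

IMPORTANT NOTE: this only works with URLs that have a subdomain in them. It is not meant to replace more advanced libraries like tldextract.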

def urlextract(url):
  url_split=url.split(".")
  if len(url_split) <= 2:
      raise Exception("Full url required with subdomain:",url)
  return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}
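
A short usage sketch for the method above (repeated here so the snippet runs on its own, assuming Python 3). It expects a bare host name rather than a full URL, and it always treats the second label as the domain, so hosts with more than one subdomain label come out wrong.

```python
def urlextract(url):
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise Exception("Full url required with subdomain:", url)
    return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}

print(urlextract("www.mybrand.sa.com"))
# {'subdomain': 'www', 'domain': 'mybrand', 'suffix': 'sa.com'}

# With two subdomain labels, the second label is mistaken for the domain:
print(urlextract("a.b.example.com"))
# {'subdomain': 'a', 'domain': 'b', 'suffix': 'example.com'}
```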

- Korayem

- Acorn · Accepted Answer

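Here's a great Python module that someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, which is maintained by Mozilla volunteers.

Quote:

tldextract, on the other hand, knows what all gTLDs (generic top-level domains) and ccTLDs (country code top-level domains) look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.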