Extract the second-level domain from a domain name? - Python

Question

javascript jquery python html django

8

I have a list of domains, for example:

site.co.uk
site.com
site.me.uk
site.jpn.com
site.org.uk
site.it

These domains can also contain third- and fourth-level domains, for example:

test.example.site.org.uk
test2.site.com

I need to extract the second-level domain, which should be site in all of these cases.

Any ideas? :)

- RadiantHex

Similar question: https://dev59.com/fHNA5IYBdhLWcg3wKai3 - Tomasz Zieliński

6 Answers

6

Following @kohlehydrat's suggestion:

from urllib.request import urlopen


class TldMatcher(object):
    # use class vars for lazy loading
    MASTERURL = "http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1"
    TLDS = None

    @classmethod
    def loadTlds(cls, url=None):
        url = url or cls.MASTERURL

        # grab master list; urlopen() yields bytes in Python 3, so decode first
        with urlopen(url) as response:
            lines = response.read().decode('utf-8').splitlines()

        # strip comments and blank lines
        lines = [ln for ln in (ln.strip() for ln in lines) if ln and not ln.startswith('//')]

        cls.TLDS = set(lines)

    def __init__(self):
        if TldMatcher.TLDS is None:
            TldMatcher.loadTlds()

    def getTld(self, url):
        best_match = None
        chunks = url.split('.')

        # walk from the shortest suffix to the longest, keeping the longest match
        for start in range(len(chunks)-1, -1, -1):
            test = '.'.join(chunks[start:])
            # wildcard form of the same suffix, e.g. "*.uk" for "co.uk"
            startest = '.'.join(['*']+chunks[start+1:])

            if test in TldMatcher.TLDS or startest in TldMatcher.TLDS:
                best_match = test

        return best_match

    def get2ld(self, url):
        tld = self.getTld(url)
        if tld is None:
            # no suffix from the list matched, so there is nothing to count from
            raise ValueError("no known TLD in '{0}'".format(url))
        urls = url.split('.')
        tlds = tld.split('.')
        # the label immediately to the left of the matched TLD
        return urls[-1 - len(tlds)]


def test_TldMatcher():
    matcher = TldMatcher()

    test_urls = [
        'site.co.uk',
        'site.com',
        'site.me.uk',
        'site.jpn.com',
        'site.org.uk',
        'site.it'
    ]

    errors = 0
    for u in test_urls:
        res = matcher.get2ld(u)
        if res != 'site':
            print("Error: found '{0}', should be 'site'".format(res))
            errors += 1

    if errors == 0:
        print("Passed!")
    return (errors == 0)

- Hugh Bothwell

5

Use Python TLD (https://pypi.python.org/pypi/tld). Install it with $ pip install tld.

from tld import get_tld, get_fld

print(get_tld("http://www.google.co.uk"))
# co.uk

print(get_fld("http://www.google.co.uk"))
# google.co.uk

- Artur Barseghyan

3

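The problem is that first-level and second-level suffixes are mixed together.

A simple solution...

Build a list of possible site suffixes, ordered from narrow to common: "co.uk", "uk", "co.jp", "jp", "com"

Then check whether a suffix matches the end of the domain. If it does, the next part to its left is the site.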
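The two steps above can be sketched roughly as follows. This is only an illustration: the `SUFFIXES` list and the `site_name` helper are hypothetical names, and the list is deliberately tiny, whereas a real one would need to cover every suffix you expect to meet.

```python
# Hypothetical, deliberately tiny suffix list (a real one would be far longer).
# Ordered from narrow to common so that "co.uk" is tried before "uk".
SUFFIXES = ["co.uk", "org.uk", "me.uk", "jpn.com", "co.jp", "uk", "jp", "com", "it"]


def site_name(domain, suffixes=SUFFIXES):
    """Return the label just left of the first matching suffix, or None."""
    domain = domain.lower().rstrip(".")
    for suffix in suffixes:
        # Require a dot before the suffix so "uk" does not match "site.truk".
        if domain.endswith("." + suffix):
            remainder = domain[: -(len(suffix) + 1)]
            return remainder.split(".")[-1]
    return None


for name in ["site.co.uk", "site.jpn.com", "test.example.site.org.uk", "test2.site.com"]:
    print(name, "->", site_name(name))
```

Because the first match wins, the ordering carries the logic: putting "uk" ahead of "co.uk" would make the function return "co" for site.co.uk.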

- mmv-ru

2

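The only possible way is to use a list of all top-level domains (such as .com or .co.uk). You then scan this list and check each domain against it. I don't see any other way, at least not without accessing the internet at runtime.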

- kohlehydrat

1

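Even with internet access at runtime you would still need the list. Whether third-level or second-level domains are sold to end users is decided by the authority for each ccTLD. I think some even reserve certain second-level domains and sell third-level domains under them, while selling other second-level domains directly. You would also need to maintain the list, because these things change (and that is before considering newly created ccTLDs). - Quentin

Thanks! Do you have any idea where I can get the list? It feels like an impossible task :S - RadiantHex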

1

@Hugh Bothwell

Your example does not handle special domains such as parliament.uk, which are marked with "!" in the file (e.g. !parliament.uk).
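One way the "!" entries could be handled is sketched below. It is not the code from the linked repository: `RULES`, `public_suffix` and `second_level` are hypothetical names, and the rule set is a made-up miniature that assumes the file contains both `*.uk` and `!parliament.uk`, as the description above implies. The idea is that an exception rule overrides a wildcard, so the public suffix becomes the excepted name minus its leftmost label.

```python
# Hypothetical mini rule set in the style of the Mozilla file (illustration only).
RULES = {"com", "*.uk", "!parliament.uk", "bg", "1.bg"}


def public_suffix(domain, rules=RULES):
    """Return the public suffix of domain under rules, or None if nothing matches."""
    labels = domain.lower().split(".")
    best = None
    # Go from the shortest candidate suffix to the longest.
    for start in range(len(labels) - 1, -1, -1):
        candidate = ".".join(labels[start:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is registrable, so the
            # suffix is the candidate without its leftmost label.
            return ".".join(labels[start + 1:])
        wildcard = ".".join(["*"] + labels[start + 1:])
        if candidate in rules or wildcard in rules:
            best = candidate
    return best


def second_level(domain, rules=RULES):
    """Return the label directly left of the public suffix, or None."""
    suffix = public_suffix(domain, rules)
    labels = domain.lower().split(".")
    if suffix is None or len(labels) <= len(suffix.split(".")):
        return None
    return labels[-1 - len(suffix.split("."))]


for name in ["parliament.uk", "www.parliament.uk", "niki.bg", "niki.1.bg", "www.niki.uk"]:
    print(name, "->", second_level(name))
```

With these rules www.niki.uk yields "www", because `*.uk` treats niki.uk as the suffix. The results depend entirely on the rule set supplied, so a current copy of the list may behave differently.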

I made some changes to your code so that it looks more like the PHP function I used before.

I also added the option to load the data from a local file.
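Loading from a local file might look something like the sketch below. `parse_rules` and `load_rules` are hypothetical helper names, not functions from the linked repository, and the only assumption about the file format is the one the earlier answer already relies on: blank lines and lines starting with `//` are ignored.

```python
def parse_rules(lines):
    """Return the set of rules from text lines, skipping blanks and '//' comments."""
    rules = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("//"):
            rules.add(line)
    return rules


def load_rules(path):
    """Read a locally saved copy of the list (assumed to be UTF-8 text)."""
    with open(path, encoding="utf-8") as handle:
        return parse_rules(handle)


# Small inline sample standing in for the contents of a saved file.
sample = """
// United Kingdom
*.uk
!parliament.uk

com
"""
print(sorted(parse_rules(sample.splitlines())))
```

Keeping the parsing separate from the file access means the same function can be fed lines from a download, a local copy, or a test fixture.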

I also tested it with some domains, for example:

niki.bg, niki.1.bg
parliament.uk
niki.at, niki.co.at
niki.us, niki.ny.us
niki.museum, niki.national.museum
www.niki.uk - this is reported as OK because of the "*" in the Mozilla file.

Feel free to contact me through GitHub so that I can add you as a co-author.

The GitHub repository is here:

https://github.com/nmmmnu/TLDExtractor/blob/master/TLDExtractor.py

- Nick

- CrayonViolent · Accepted Answer

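There is no reliable way to get it. Subdomains are arbitrary, and the list of domain extensions is huge and grows every day. The best you can do is check against a huge list of domain extensions and keep that list up to date.

List: http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1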