Extracting FQDNs (fully qualified domain names) from a text file with Python

3

I'm trying to write a Python script that downloads text files from a list of URLs and concatenates them into a single file. Here is my code:

import urllib
import urllib.request
import re
import sys

original_stdout = sys.stdout

with open("blocklist_urls.txt", "r") as a:
    urls = a.readlines()

retrieved_pages = []
for url in urls:
    retrieved_pages.append(urllib.request.urlopen(url).read())

with open('blocklist_raw.txt', 'w') as f:
   for page in retrieved_pages:
       sys.stdout = f
       decoded_line = page.decode("utf-8")
       print(decoded_line)
       sys.stdout = original_stdout

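It does fetch the text files from every URL I listed in the text file, and it runs without errors. It creates blocklist_raw.txt, which contains several blocklists one after another, formatted like this: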
# AdAway default blocklist
# Blocking mobile ad providers and some analytics providers
#
# Project home page:
# https://github.com/AdAway/adaway.github.io/
#
# Fetch the latest version of this file:
# https://raw.githubusercontent.com/AdAway/adaway.github.io/master/hosts.txt
#
# License:
# CC Attribution 3.0 (http://creativecommons.org/licenses/by/3.0/)
#
# Contributions by:
# Kicelo, Dominik Schuermann.
# Further changes and contributors maintained in the commit history at
# https://github.com/AdAway/adaway.github.io/commits/master
#
# Contribute:
# Create an issue at https://github.com/AdAway/adaway.github.io/issues
#



# [163.com]
127.0.0.1 analytics.163.com
127.0.0.1 crash.163.com
127.0.0.1 crashlytics.163.com
127.0.0.1 iad.g.163.com

# [1mobile.com]
127.0.0.1 ads.1mobile.com
127.0.0.1 api4.1mobile.com

# [1rx.io]
127.0.0.1 sync.1rx.io
127.0.0.1 tag.1rx.io

# [206ads.com]
127.0.0.1 s.206ads.com

# [247-inc.net]
127.0.0.1 api.247-inc.net
127.0.0.1 tie.247-inc.net

nav.booksonlineclub.com
navi.businessconsults.net
navi.earthsolution.org
nci.bigdepression.net
nci.dnsweb.org
nci.safalife.com
ncih.dnsweb.org
ncsc.businessconsults.net
ne.hugesoft.org
nes.nationtour.net
net.firefoxupdata.com
net.infosupports.com
new.arrowservice.net
new.booksonlineclub.com
new.firefoxupdata.com
new.globalowa.com
newport.bigdepression.net
newport.infosupports.com
newport.safalife.com
news.advanbusiness.com
news.aoldaily.com
news.aolon1ine.com

# Oldest record: 2021-09-03T02:06:30+02:00
# Number of source websites: 873
# Number of source subdomains: 1990306
# Number of source DNS records: ~2E9 + 1298142
#
# Input rules: asns: 6, zones: 48
# Subsequent rules: asns: 6, hostnames: 122196, ip4s: 64, zones: 48
# … no duplicates: asns: 6, hostnames: 89794, zones: 48
# Output rules: hostnames: 122196
#

0.0.0.0 0001.ya-man.com
0.0.0.0 0002.onlyminerals.jp
0.0.0.0 000.affex.org

# Title: NoTrack Malware Blocklist
# Description: Domains classified as malware, phishing or adware
# Author: QuidsUp
# License: GNU General Public License v3.0
# Home: https://quidsup.net/notrack/blocklist.php
# @ GitLab : https://gitlab.com/quidsup/notrack-blocklists
# Updated: 08 Sep 2021
#LatestVersion 21.08
# Domain Count: 348
#===============================================================

2track.info #Adware - Malware
4dsply.com #Adware - Malware
acountscr.cool #Adware Lnkr - Malware
ad2up.com #Adware - Malware
adaranth.com #Adware - Malware
adbigline.network #Malware - Malware
addr.cx #Adware Lnkr - Malware
adfuture.cn #Android Trojan - Malware
adsunflower.com #Android Trojan - Malware
adultsonly.pro #Generic - Malware

Is there a way to keep only the FQDNs in blocklist_raw.txt and remove all the other text? Any pointers would be greatly appreciated. Thank you.

Edit: here is the code I have so far. I have never written Python before, so a lot of it may not be very clear:

#!/usr/bin/env python3

import urllib
import urllib.request
import re
import sys

original_stdout = sys.stdout

def removeStr(val):
    if val.count('.') >= 2:
         if val.count('/') <= 0:
              return val

with open("blocklist_urls.txt", "r") as a:
    urls = a.readlines()

retrieved_pages = []

for url in urls:
    retrieved_pages.append(urllib.request.urlopen(url).read())

# Decode every page (not just the last one) and split it into whitespace-separated tokens
tokens = []
for page in retrieved_pages:
    tokens.extend(page.decode("utf-8").split())

# Keep hostname-like tokens and drop the hosts-file address placeholders
placeholders = {'0.0.0.0', '127.0.0.1'}
urls_filtered_raw = "\n".join(
    t for t in filter(removeStr, tokens) if t not in placeholders)

with open('blocklist_raw.txt', 'w') as b:
    b.write(urls_filtered_raw + "\n")

links = set()

with open('blocklist_raw.txt', 'r') as fp:
    for line in fp.readlines():
        links.add(line)

with open('blocklist_raw.txt', 'w') as fp:
    for line in links:
        fp.write(line)

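with open('blocklist_raw.txt', 'r') as sorted_urls_raw:
    # read().split() also strips the trailing newline from each host
    sorted_urls_list = sorted_urls_raw.read().split()

# Sort on reversed labels so hosts group by TLD, then domain, then subdomain
split_hosts = []
for h in sorted_urls_list:
    segments = h.split('.')
    segments.reverse()
    split_hosts.append(segments)

split_hosts.sort()
for segments in split_hosts:
    segments.reverse()
    print(".".join(segments))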


I'd like to find a way to sort the output alphabetically and write the result back to the file. Thanks, everyone.
2 Answers

0

Here is one way:

    import re

    # Hostname part, a literal dot, then a TLD (or second-level domain plus TLD)
    fqdns_re = re.compile(
        r'^(([a-zA-Z]{1})|([a-zA-Z]{1}[a-zA-Z]{1})|'
        r'([a-zA-Z]{1}[0-9]{1})|([0-9]{1}[a-zA-Z]{1})|'
        r'([a-zA-Z0-9][-_.a-zA-Z0-9]{0,61}[a-zA-Z0-9]))\.'
        r'([a-zA-Z]{2,13}|[a-zA-Z0-9-]{2,30}\.[a-zA-Z]{2,3})$'
    )
    # Split each row on '#', whitespace and '/'
    splits_re = re.compile(r'[#\s/]')

    def match(word):
        m = fqdns_re.match(word)
        if m:
            return m.group(0)

    with open('/tmp/blocklist_raw.txt') as f:
        fqdns = [word for row in f
                    for word in splits_re.split(row) if match(word)]
    print(sorted(fqdns))

This returns:

['000.affex.org', '0001.ya-man.com', '0002.onlyminerals.jp', '2track.info', '4dsply.com', 'acountscr.cool', 'ad2up.com', 'adaranth.com', 'adaway.github.io', 'adaway.github.io', 'adaway.github.io', 'adaway.github.io', 'adbigline.network', 'addr.cx', 'adfuture.cn', 'ads.1mobile.com', 'adsunflower.com', 'adultsonly.pro', 'analytics.163.com', 'api.247-inc.net', 'api4.1mobile.com', 'blocklist.php', 'crash.163.com', 'crashlytics.163.com', 'creativecommons.org', 'github.com', 'github.com', 'github.com', 'gitlab.com', 'hosts.txt', 'iad.g.163.com', 'nav.booksonlineclub.com', 'navi.businessconsults.net', 'navi.earthsolution.org', 'nci.bigdepression.net', 'nci.dnsweb.org', 'nci.safalife.com', 'ncih.dnsweb.org', 'ncsc.businessconsults.net', 'ne.hugesoft.org', 'nes.nationtour.net', 'net.firefoxupdata.com', 'net.infosupports.com', 'new.arrowservice.net', 'new.booksonlineclub.com', 'new.firefoxupdata.com', 'new.globalowa.com', 'newport.bigdepression.net', 'newport.infosupports.com', 'newport.safalife.com', 'news.advanbusiness.com', 'news.aoldaily.com', 'news.aolon1ine.com', 'quidsup.net', 'raw.githubusercontent.com', 's.206ads.com', 'sync.1rx.io', 'tag.1rx.io', 'tie.247-inc.net']
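
Note that the result above still contains names taken from comment lines (github.com, creativecommons.org), file names that merely look like domains (hosts.txt, blocklist.php), and duplicates. A minimal sketch of one way to tighten this, assuming it is acceptable to discard everything after a `#` on each line. The helper name `extract_hosts` and the simplified pattern are illustrative assumptions, not part of the answer above, and the pattern will not match internationalized (`xn--`) TLDs:

```python
import re

# Simplified hostname pattern (an assumption for this sketch): dot-separated
# labels of letters, digits and hyphens, ending in an alphabetic TLD.
HOST_RE = re.compile(r'^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$')

def extract_hosts(lines):
    """Return the sorted, de-duplicated hostnames found in blocklist lines."""
    hosts = set()
    for line in lines:
        # Drop comments first so URLs inside them are never considered
        line = line.split('#', 1)[0].strip().lower()
        for word in line.split():
            if HOST_RE.match(word):
                hosts.add(word)
    return sorted(hosts)

sample = [
    "# https://github.com/AdAway/adaway.github.io/",
    "127.0.0.1 analytics.163.com",
    "0.0.0.0 000.affex.org",
    "2track.info #Adware - Malware",
    "nav.booksonlineclub.com",
    "127.0.0.1 analytics.163.com",
]
print(extract_hosts(sample))
```

For the sample lines this prints ['000.affex.org', '2track.info', 'analytics.163.com', 'nav.booksonlineclub.com']: the comment line and the address placeholders are gone, and the repeated entry appears once.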

0

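The simplest approach is probably to use the IANA top-level domain database (.com, .org, .net, and so on). Build a regular expression pattern from that list to find every string of the form "*.tld":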

# Additional imports (urllib.request is already imported by the script in the question)
import re
import urllib.request

# Get TLD database
resp = urllib.request.urlopen('http://data.iana.org/TLD/tlds-alpha-by-domain.txt')

# Create a reverse sorted list of TLD ('.com' must be before '.co')
tld = sorted([tld.strip().lower().decode('utf-8')
                  for tld in resp.readlines()[1:]], reverse=True)

# Compile the regex pattern
FQDN = re.compile(fr"([^\s]*\.(?:{'|'.join(tld)}))")


# Find all fqdn
with open('blocklist_raw.txt') as fp:
    fqdn_list = []
    for line in fp.readlines():
        line = line.strip().lower()

        # Remove comments and blank lines
        if (len(line) == 0) or line.startswith('#'):
            continue

        # Extract FQDN
        fqdn = FQDN.findall(line)
        if fqdn:
            fqdn_list.append(fqdn[0])

Output:

>>> fqdn_list
['analytics.163.com',
 'crash.163.com',
 'crashlytics.163.com',
 'iad.g.163.com',
 'ads.1mobile.com',
 'api4.1mobile.com',
 'sync.1rx.io',
 'tag.1rx.io',
 's.206ads.com',
 'api.247-inc.net',
 'tie.247-inc.net',
 'nav.booksonlineclub.com',
 'navi.businessconsults.net',
 'navi.earthsolution.org',
 'nci.bigdepression.net',
 'nci.dnsweb.org',
 'nci.safalife.com',
 'ncih.dnsweb.org',
 'ncsc.businessconsults.net',
 'ne.hugesoft.org',
 'nes.nationtour.net',
 'net.firefoxupdata.com',
 'net.infosupports.com',
 'new.arrowservice.net',
 'new.booksonlineclub.com',
 'new.firefoxupdata.com',
 'new.globalowa.com',
 'newport.bigdepression.net',
 'newport.infosupports.com',
 'newport.safalife.com',
 'news.advanbusiness.com',
 'news.aoldaily.com',
 'news.aolon1ine.com',
 '0001.ya-man.com',
 '0002.onlyminerals.jp',
 '000.affex.org',
 '2track.info',
 '4dsply.com',
 'acountscr.cool',
 'ad2up.com',
 'adaranth.com',
 'adbigline.network',
 'addr.cx',
 'adfuture.cn',
 'adsunflower.com',
 'adultsonly.pro']
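
The list above is in file order. The question also asks for the names to be sorted and written back to a file, so here is one possible sketch, assuming a list like `fqdn_list` is already in memory. The function names `domain_key` and `write_sorted` and the output file name `blocklist_sorted.txt` are placeholders. Sorting on reversed labels (so that hosts under the same domain stay together) is a design choice borrowed from the question's own code; a plain `sorted()` gives strict alphabetical order instead:

```python
def domain_key(host):
    """Sort key: the labels in reverse order, so the TLD compares first."""
    return host.split('.')[::-1]

def write_sorted(hosts, path):
    """De-duplicate hosts, sort them by domain, and write one per line to path."""
    ordered = sorted(set(hosts), key=domain_key)
    with open(path, 'w') as fp:
        fp.writelines(h + "\n" for h in ordered)
    return ordered

fqdn_list = ['tag.1rx.io', 'crash.163.com', 'sync.1rx.io',
             'analytics.163.com', 'crash.163.com', '000.affex.org']
print(write_sorted(fqdn_list, 'blocklist_sorted.txt'))
```

With this sample the written order is analytics.163.com, crash.163.com, sync.1rx.io, tag.1rx.io, 000.affex.org: grouped by TLD first, then by domain, then by subdomain.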

I'm not a regex expert, but I think this is a good starting point. - Corralien
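
One thing to watch with a pattern built this way: without an end anchor, a token such as `example.comx` still matches on its `example.com` prefix. A minimal offline sketch of the same idea with a negative lookahead added. The short `tlds` list stands in for the full IANA file and, like the helper name `find_fqdn`, is an assumption made for illustration:

```python
import re

# Stand-in for the IANA list (assumption); the real file has over a thousand entries
tlds = ['co', 'com', 'io', 'net', 'org']

# Escape each TLD and try longer ones first. The trailing (?!\S) requires the
# TLD to end the whitespace-delimited token, so 'example.comx' is rejected.
alternation = '|'.join(re.escape(t) for t in sorted(tlds, key=len, reverse=True))
FQDN = re.compile(rf"(\S+\.(?:{alternation}))(?!\S)")

def find_fqdn(line):
    """Return the first FQDN-like token in a line, or None if there is none."""
    line = line.strip().lower()
    # Skip blank lines and full-line comments, as in the answer above
    if not line or line.startswith('#'):
        return None
    m = FQDN.search(line)
    return m.group(1) if m else None

for line in ["127.0.0.1 ads.1mobile.com", "example.comx", "addr.co #note", "# comment"]:
    print(repr(line), '->', find_fqdn(line))
```

Because the lookahead forces the match to reach the end of the token, the order of the alternatives no longer affects correctness; the length-based ordering is kept only to mirror the original approach.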
