I'm trying to write a Python script that downloads text files from a list of URLs and concatenates them into a single file. Here is my code:
import urllib.request
import sys

original_stdout = sys.stdout

with open("blocklist_urls.txt", "r") as a:
    urls = a.readlines()

retrieved_pages = []
for url in urls:
    retrieved_pages.append(urllib.request.urlopen(url).read())

with open('blocklist_raw.txt', 'w') as f:
    for page in retrieved_pages:
        sys.stdout = f
        decoded_line = page.decode("utf-8")
        print(decoded_line)
    sys.stdout = original_stdout
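As an aside, the same concatenation can be done without swapping `sys.stdout`, by writing to the file handle directly; a minimal sketch, where the sample bytes stand in for what `urlopen(...).read()` would have returned:

```python
# Sketch: concatenate already-downloaded pages without redirecting sys.stdout.
# `retrieved_pages` here holds sample bytes standing in for urlopen(...).read().
retrieved_pages = [b"# example blocklist\n0.0.0.0 ads.example.com\n"]

with open("blocklist_raw.txt", "w") as f:
    for page in retrieved_pages:
        # decode() mirrors the original code; write() replaces the print() redirect
        f.write(page.decode("utf-8") + "\n")
```

Writing through the handle avoids having to restore `sys.stdout` afterwards and keeps the redirect from leaking if an exception occurs mid-loop.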
It does fetch the text files from every URL I list in the text file, with no errors. It creates blocklist_raw.txt, which contains several blocklists in a format like this:
# AdAway default blocklist
# Blocking mobile ad providers and some analytics providers
#
# Project home page:
# https://github.com/AdAway/adaway.github.io/
#
# Fetch the latest version of this file:
# https://raw.githubusercontent.com/AdAway/adaway.github.io/master/hosts.txt
#
# License:
# CC Attribution 3.0 (http://creativecommons.org/licenses/by/3.0/)
#
# Contributions by:
# Kicelo, Dominik Schuermann.
# Further changes and contributors maintained in the commit history at
# https://github.com/AdAway/adaway.github.io/commits/master
#
# Contribute:
# Create an issue at https://github.com/AdAway/adaway.github.io/issues
#
# [163.com]
127.0.0.1 analytics.163.com
127.0.0.1 crash.163.com
127.0.0.1 crashlytics.163.com
127.0.0.1 iad.g.163.com
# [1mobile.com]
127.0.0.1 ads.1mobile.com
127.0.0.1 api4.1mobile.com
# [1rx.io]
127.0.0.1 sync.1rx.io
127.0.0.1 tag.1rx.io
# [206ads.com]
127.0.0.1 s.206ads.com
# [247-inc.net]
127.0.0.1 api.247-inc.net
127.0.0.1 tie.247-inc.net
nav.booksonlineclub.com
navi.businessconsults.net
navi.earthsolution.org
nci.bigdepression.net
nci.dnsweb.org
nci.safalife.com
ncih.dnsweb.org
ncsc.businessconsults.net
ne.hugesoft.org
nes.nationtour.net
net.firefoxupdata.com
net.infosupports.com
new.arrowservice.net
new.booksonlineclub.com
new.firefoxupdata.com
new.globalowa.com
newport.bigdepression.net
newport.infosupports.com
newport.safalife.com
news.advanbusiness.com
news.aoldaily.com
news.aolon1ine.com
# Oldest record: 2021-09-03T02:06:30+02:00
# Number of source websites: 873
# Number of source subdomains: 1990306
# Number of source DNS records: ~2E9 + 1298142
#
# Input rules: asns: 6, zones: 48
# Subsequent rules: asns: 6, hostnames: 122196, ip4s: 64, zones: 48
# … no duplicates: asns: 6, hostnames: 89794, zones: 48
# Output rules: hostnames: 122196
#
0.0.0.0 0001.ya-man.com
0.0.0.0 0002.onlyminerals.jp
0.0.0.0 000.affex.org
# Title: NoTrack Malware Blocklist
# Description: Domains classified as malware, phishing or adware
# Author: QuidsUp
# License: GNU General Public License v3.0
# Home: https://quidsup.net/notrack/blocklist.php
# @ GitLab : https://gitlab.com/quidsup/notrack-blocklists
# Updated: 08 Sep 2021
#LatestVersion 21.08
# Domain Count: 348
#===============================================================
2track.info #Adware - Malware
4dsply.com #Adware - Malware
acountscr.cool #Adware Lnkr - Malware
ad2up.com #Adware - Malware
adaranth.com #Adware - Malware
adbigline.network #Malware - Malware
addr.cx #Adware Lnkr - Malware
adfuture.cn #Android Trojan - Malware
adsunflower.com #Android Trojan - Malware
adultsonly.pro #Generic - Malware
Is there a way to keep only the FQDNs in blocklist_raw.txt and strip out all the other text? Any pointers are much appreciated. Thanks.
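One possible direction (not the asker's code) is a regular expression with the `re` module: match dot-separated hostname labels while rejecting purely numeric tokens, so the `0.0.0.0` / `127.0.0.1` redirect addresses are skipped. A minimal sketch with a hypothetical pattern, run against a few lines from the sample above:

```python
import re

# Hypothetical pattern: two or more dot-separated alphanumeric labels.
# The negative lookahead rejects all-numeric tokens such as 127.0.0.1.
FQDN = re.compile(
    r"\b(?!(?:\d+\.)+\d+\b)"
    r"[a-zA-Z0-9][a-zA-Z0-9-]*(?:\.[a-zA-Z0-9][a-zA-Z0-9-]*)+\b"
)

sample = """# [163.com]
127.0.0.1 analytics.163.com
0.0.0.0 0001.ya-man.com
2track.info #Adware - Malware
"""

hosts = []
for line in sample.splitlines():
    # Drop the comment part before matching, so "#Adware" notes and
    # URLs in header comments are ignored.
    line = line.split("#")[0]
    hosts.extend(FQDN.findall(line))
print(hosts)
```

Note this keeps single-dot domains like `2track.info` too, which the dot-count filter in the edit below would discard even though they appear in the NoTrack list.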
EDIT: Here is the code I have so far. I've never written Python before, so a lot of it probably isn't very clean:
#!/usr/bin/env python3
import urllib.request
import sys

original_stdout = sys.stdout

# Keep only tokens that look like FQDNs: at least two dots and no slashes.
def removeStr(val):
    if val.count('.') >= 2:
        if val.count('/') <= 0:
            return val

with open("blocklist_urls.txt", "r") as a:
    urls = a.readlines()

retrieved_pages = []
for url in urls:
    retrieved_pages.append(urllib.request.urlopen(url).read())

with open('blocklist_raw.txt', 'w') as b:
    sys.stdout = b
    for page in retrieved_pages:
        decoded_line = page.decode("utf-8")
        each_line = "\n".join(filter(removeStr, decoded_line.split()))
        # Drop the 0.0.0.0 / 127.0.0.1 redirect addresses, then skip the
        # blank lines they leave behind.
        urls_filtered_raw = each_line.replace('0.0.0.0', '').replace('127.0.0.1', '')
        for line in urls_filtered_raw.split("\n"):
            if line:
                print(line)
    sys.stdout = original_stdout

# Remove duplicate lines.
links = set()
with open('blocklist_raw.txt', 'r') as fp:
    for line in fp.readlines():
        links.add(line)
with open('blocklist_raw.txt', 'w') as fp:
    for line in links:
        fp.write(line)

# Sort by reversed domain segments (TLD first) and print the result.
split_hosts = []
with open('blocklist_raw.txt', 'r') as sorted_urls_raw:
    for h in sorted_urls_raw:
        segments = h.strip().split('.')
        segments.reverse()
        split_hosts.append(segments)
split_hosts.sort()
for segments in split_hosts:
    segments.reverse()
    print(".".join(segments))
I want to find a way to sort the output alphabetically and write the result back to the file. Thanks, everyone.
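For the sorting question: `sorted()` with a key function that reverses the dot-separated segments produces the same TLD-first ordering the last loop builds by hand, and the result can be written straight back out. A minimal sketch, with a small in-memory host list standing in for the file contents and a hypothetical output filename:

```python
# Sample hosts standing in for the lines of blocklist_raw.txt.
hosts = ["news.aoldaily.com", "ads.1mobile.com", "analytics.163.com"]

# The key reverses the segments, so hosts group by TLD, then domain, then
# subdomain -- no manual reverse/sort/reverse round trip needed.
hosts_sorted = sorted(hosts, key=lambda h: h.split(".")[::-1])

with open("blocklist_sorted.txt", "w") as out:  # hypothetical output name
    for h in hosts_sorted:
        out.write(h + "\n")

print(hosts_sorted)
```

For a plain alphabetical ordering instead, drop the `key=` argument entirely and `sorted(hosts)` will do.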