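Python objects take up somewhat more memory than the same values do on disk; each object carries a small overhead for its reference count, and in a set you also have to account for the cached hash value stored per entry.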
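As a rough illustration of that overhead, here is a minimal sketch that compares the bytes one input line occupies in the file with the size of the Python objects built from it. The sample line is taken from the question's data; the numbers reported by `sys.getsizeof` are specific to CPython and vary by version, so treat them as indicative only.

```python
import sys

line = "yahoo.com|89.45.3.5"
domain, ip = line.split("|")

# Bytes this record takes in the file, including the trailing newline.
on_disk = len(line.encode("utf-8")) + 1

# Size of the two str objects alone (object header, cached hash, length, data).
in_memory = sys.getsizeof(domain) + sys.getsizeof(ip)

print("on disk:", on_disk, "bytes")
print("as two str objects:", in_memory, "bytes")

# Storing the pair in a set adds a tuple and the set's own hash table on top.
pairs = {(domain, ip)}
print("tuple:", sys.getsizeof((domain, ip)), "bytes")
print("set container:", sys.getsizeof(pairs), "bytes")
```

`sys.getsizeof` reports only the container itself, not the objects it refers to, so the real footprint of the set is the sum of all of the figures above.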
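Don't read all those objects into (Python) memory; use a database instead. Python ships with a library for the SQLite database, so you can use that to convert your files into a database. You can then build your output file from the database: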
import csv
import sqlite3
from itertools import islice

# `files` (the input paths) and `outputfile` are the names from your own code.
conn = sqlite3.connect('/tmp/ipaddresses.db')
conn.execute('CREATE TABLE IF NOT EXISTS ipaddress (domain, ip)')
# The UNIQUE index is what lets INSERT OR IGNORE skip repeated pairs.
conn.execute('''\
    CREATE UNIQUE INDEX IF NOT EXISTS domain_ip_idx
        ON ipaddress(domain, ip)''')

for filename in files:
    print(filename)
    # Python 3: the csv module expects text files opened with newline=''.
    with open(filename, newline='') as f:
        reader = csv.reader(f, delimiter='|')
        cursor = conn.cursor()
        while True:
            with conn:  # one transaction per batch
                batch = list(islice(reader, 10000))
                if not batch:
                    break
                cursor.executemany(
                    'INSERT OR IGNORE INTO ipaddress VALUES(?, ?)',
                    batch)

# Build the ip index after loading, to support the ORDER BY below.
conn.execute('CREATE INDEX IF NOT EXISTS ip_idx ON ipaddress(ip)')

with open(outputfile, 'w', newline='') as outfh:
    writer = csv.writer(outfh, delimiter='|')
    cursor = conn.cursor()
    cursor.execute('SELECT ip, domain from ipaddress order by ip')
    writer.writerows(cursor)
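This processes your input data in batches of 10000 rows, and creates an index to sort on once everything has been inserted. Building the index takes some time, but it will all fit in your available memory.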
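The batching step can be looked at in isolation. The following is a minimal sketch of the same `islice` pattern wrapped in a generator; the helper name `batches` and the sample rows are made up for illustration and are not part of the code above.

```python
from itertools import islice


def batches(iterable, size):
    """Yield lists of up to `size` items until `iterable` is exhausted."""
    iterator = iter(iterable)
    while True:
        # islice pulls at most `size` items and leaves the rest for next time.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical rows standing in for what csv.reader would produce.
rows = (("host%d.example" % i, "10.0.0.%d" % (i % 4)) for i in range(10))
sizes = [len(batch) for batch in batches(rows, 4)]
print(sizes)  # prints [4, 4, 2]
```

Only one batch is held in memory at a time, which is why the loader's memory use stays flat regardless of file size. On Python 3.12 and later, `itertools.batched` offers a similar building block that yields tuples instead of lists.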
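The UNIQUE index created at the start ensures that only unique domain / IP address pairs are inserted (so only the unique domains per IP address are tracked); the INSERT OR IGNORE statement skips any pair that is already present in the database.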
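To see the deduplication on its own, here is a small sketch against a throwaway in-memory database (`:memory:` is used here only for the illustration; the answer's code writes to a file). It counts the stored rows, and also shows what a plain INSERT does when it hits the same index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for this sketch
conn.execute("CREATE TABLE ipaddress (domain, ip)")
conn.execute("CREATE UNIQUE INDEX domain_ip_idx ON ipaddress(domain, ip)")

rows = [
    ("yahoo.com", "89.45.3.5"),
    ("bbc.com", "45.67.33.2"),
    ("yahoo.com", "89.45.3.5"),    # exact duplicate pair: skipped
    ("myname.com", "45.67.33.2"),  # same ip, different domain: kept
]
with conn:
    conn.executemany("INSERT OR IGNORE INTO ipaddress VALUES (?, ?)", rows)

(stored,) = conn.execute("SELECT COUNT(*) FROM ipaddress").fetchone()
print(stored)  # prints 3

# Without OR IGNORE, the same duplicate raises an error instead.
rejected = False
try:
    with conn:
        conn.execute("INSERT INTO ipaddress VALUES (?, ?)", rows[0])
except sqlite3.IntegrityError as exc:
    rejected = True
    print("rejected:", exc)
```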
Here is a short demo using just the sample input you gave:
>>> import sqlite3
>>> import csv
>>> import sys
>>> from itertools import islice
>>> conn = sqlite3.connect('/tmp/ipaddresses.db')
>>> conn.execute('CREATE TABLE IF NOT EXISTS ipaddress (domain, ip)')
<sqlite3.Cursor object at 0x106c62730>
>>> conn.execute('''\
... CREATE UNIQUE INDEX IF NOT EXISTS domain_ip_idx
... ON ipaddress(domain, ip)''')
<sqlite3.Cursor object at 0x106c62960>
>>> reader = csv.reader('''\
... yahoo.com|89.45.3.5
... bbc.com|45.67.33.2
... yahoo.com|89.45.3.5
... myname.com|45.67.33.2
... '''.splitlines(), delimiter='|')
>>> cursor = conn.cursor()
>>> while True:
...     with conn:
...         batch = list(islice(reader, 10000))
...         if not batch:
...             break
...         cursor.executemany(
...             'INSERT OR IGNORE INTO ipaddress VALUES(?, ?)',
...             batch)
...
<sqlite3.Cursor object at 0x106c62810>
>>> conn.execute('CREATE INDEX IF NOT EXISTS ip_idx ON ipaddress(ip)')
<sqlite3.Cursor object at 0x106c62960>
>>> writer = csv.writer(sys.stdout, delimiter='|')
>>> cursor = conn.cursor()
>>> cursor.execute('SELECT ip, domain from ipaddress order by ip')
<sqlite3.Cursor object at 0x106c627a0>
>>> writer.writerows(cursor)
45.67.33.2|bbc.com
45.67.33.2|myname.com
89.45.3.5|yahoo.com
multiprocessing.pool
Create a pool of worker processes and map it over the list of files. That way each file is handled by a separate worker process. - Darth Kotik
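One possible reading of this suggestion, as a hedged sketch: each worker parses one file and returns its unique (ip, domain) pairs, and the parent merges them. The function names `read_pairs`, `merge_pairs` and `process_files` are invented for this example. Note the assumption that the merged result fits in memory, which is precisely what the SQLite approach avoids, so this only helps when parsing speed rather than memory is the bottleneck.

```python
import csv
from multiprocessing import Pool


def read_pairs(path):
    """Worker: return the unique (ip, domain) pairs found in one file."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="|")
        # Input rows are domain|ip; skip anything that is not a clean pair.
        return {(row[1], row[0]) for row in reader if len(row) == 2}


def merge_pairs(per_file):
    """Combine the per-file sets and sort by ip, then domain (as text)."""
    merged = set()
    for pairs in per_file:
        merged |= pairs
    return sorted(merged)


def process_files(paths, workers=None):
    """Map read_pairs over the paths using a pool of worker processes."""
    with Pool(processes=workers) as pool:
        return merge_pairs(pool.map(read_pairs, paths))
```

On platforms that start workers by spawning a fresh interpreter (Windows, and macOS by default), call `process_files(files)` from under an `if __name__ == "__main__":` guard so the workers can import the module safely.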