在Python中读取超大文件

Question

在Python中读取超大文件

14

我有一个384MB的文本文件，里面有5000万行。每行包含两个用空格隔开的整数：一个键和一个值。该文件按键排序。我需要在Python中查找大约200个键的值的有效方法。

我的当前方法如下所示。它需要30秒钟。肯定有更有效率的Python解决方案，可以将这个时间缩短到最多几秒钟。

# list contains a sorted list of the keys we need to lookup
# there is a sentinel at the end of list to simplify the code
# we use pointer to iterate through the list of keys
for line in fin:
  line = map(int, line.split())
  while line[0] == list[pointer].key:
    list[pointer].value = line[1]
    pointer += 1
  while line[0] > list[pointer].key:
    pointer += 1
  if pointer >= len(list) - 1:
    break # end of list; -1 is due to sentinel

编写了二分查找+搜索解决方案（感谢kigurai！）：

entries = 24935502 # number of entries
width   = 18       # fixed width of an entry in the file padded with spaces
                   # at the end of each line
for i, search in enumerate(list): # list contains the list of search keys
  left, right = 0, entries-1 
  key = None
  while key != search and left <= right:
    mid = (left + right) / 2
    fin.seek(mid * width)
    key, value = map(int, fin.readline().split())
    if search > key:
      left = mid + 1
    else:
      right = mid - 1
  if key != search:
    value = None # for when search key is not found
  search.result = value # store the result of the search

- moinudin

我很想看到你的最终解决方案，因为它似乎是由没有提供代码的答案组合而成的。 - Paolo Bergantino

8个回答

7

对 S.Lotts 的回答进行了轻微优化：

from collections import defaultdict
keyValues= defaultdict(list)
targetKeys= # some list of keys as strings
for line in fin:
    key, value = line.split()
    if key in targetKeys:
        keyValues[key].append( value )

由于我们使用的是字典而不是列表，因此键值不必是数字。这样可以节省每行代码中map()操作和字符串转整数的时间。如果您想让键值是数字，请在最后进行转换，这样您只需要为每个键值转换一次，而不是每5000万行都要转换。

- Chris Upchurch

4

“list [pointer]”是什么意思并不清楚。但请考虑以下内容。

from collections import defaultdict
keyValues= defaultdict(list)
targetKeys= # some list of keys
for line in fin:
    key, value = map( int, line.split())
    if key in targetKeys:
        keyValues[key].append( value )

- S.Lott

这比我在问题中提到的方法慢。 :(PS：我添加了一些注释来更好地解释我的代码片段。 - moinudin

3

这是一个对文本文件进行递归二分查找的算法。

import os, stat

class IntegerKeyTextFile(object):
    def __init__(self, filename):
        self.filename = filename
        self.f = open(self.filename, 'r')
        self.getStatinfo()

    def getStatinfo(self):
        self.statinfo = os.stat(self.filename)
        self.size = self.statinfo[stat.ST_SIZE]

    def parse(self, line):
        key, value = line.split()
        k = int(key)
        v = int(value)
        return (k,v)

    def __getitem__(self, key):
        return self.findKey(key)

    def findKey(self, keyToFind, startpoint=0, endpoint=None):
        "Recursively search a text file"

        if endpoint is None:
            endpoint = self.size

        currentpoint = (startpoint + endpoint) // 2

        while True:
            self.f.seek(currentpoint)
            if currentpoint <> 0:
                # may not start at a line break! Discard.
                baddata = self.f.readline() 

            linestart = self.f.tell()
            keyatpoint = self.f.readline()

            if not keyatpoint:
                # read returned empty - end of file
                raise KeyError('key %d not found'%(keyToFind,))

            k,v = self.parse(keyatpoint)

            if k == keyToFind:
                print 'key found at ', linestart, ' with value ', v
                return v

            if endpoint == startpoint:
                    raise KeyError('key %d not found'%(keyToFind,))

            if k > keyToFind:
                return self.findKey(keyToFind, startpoint, currentpoint)
            else:
                return self.findKey(keyToFind, currentpoint, endpoint)

在 jEdit 中创建的样本文本文件似乎有效：

>>> i = integertext.IntegerKeyTextFile('c:\\sampledata.txt')
>>> i[1]
key found at  0  with value  345
345

通过缓存找到的键值并使用缓存确定未来的起始查找点，可以显著提高性能。

- Joe Koberg

很好的技巧，可以确保它从换行符开始，但由于我对输入文件有完全控制，所以更快的方法是将其格式化为每行具有固定宽度。请参见我在原始问题末尾的实现。 - moinudin

3

我会使用内存映射：http://docs.python.org/library/mmap.html。这样你就可以像在内存中存储文件一样使用，但操作系统会决定哪些页面应该从文件中读取。

- Bastien Léonard

2

如果您对文件格式有控制权，则“排序和二分搜索”响应是正确的。关键在于，这仅适用于具有固定大小和偏移量的记录（我应该说仅适用于具有固定长度记录的易于操作情况）。

对于具有固定长度的记录，您可以轻松地使用seek()在已排序的文件中查找您的键。

- Will Hartung

0

一种可能的优化方法是使用file.readlines(..)中的sizehint选项进行缓冲。这允许您将多个行加载到内存中，总计约为sizehint字节。

- Il-Bhima

0

你需要使用 seek() 实现二分查找。

- vartec

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Hannes Ovrén · Accepted Answer

如果你只需要五千万行中的200行，那么读取全部内容到内存是浪费的。我会将搜索关键字列表进行排序，然后使用seek()或类似的方法对文件应用二分查找。这样就不需要将整个文件读入内存，我认为这应该能加快速度。