Pickle file too large to load.

20
The problem I'm having is that I have a very large pickle file (2.6 GB) that I'm trying to open, but every time I do I get a memory error. I realize now that I should have used a database to store all the information, but it's too late for that. The pickle file contains dates and text from US congressional records crawled from the internet (the crawl took about 2 weeks to run).
Is there any way to incrementally access the information I dumped into the pickle file, or to convert it into an SQL database or some other format that I can open without re-entering all the data? I really don't want to spend another 2 weeks re-crawling the congressional records and feeding the data into a database.
Thanks so much for your help!
EDIT*
Code for how the objects were pickled:
def save_objects(objects): 
    with open('objects.pkl', 'wb') as output: 
        pickle.dump(objects, output, pickle.HIGHEST_PROTOCOL)

def Main():   
    Links()
    file = open('datafile.txt', 'w')
    objects = []
    with open('links2.txt', 'rb') as infile:
        for link in infile: 
            print(link)
            title, text, date = Get_full_text(link)
            article=Doccument(title, date, text)
            if text != None:
                write_to_text(date, text)
                objects.append(article)
                save_objects(objects)

Here is the program that errors out:

def Main():
    file = open('objects1.pkl', 'rb') 
    object = pickle.load(file)

Googling suggests: https://code.google.com/p/streaming-pickle/. I don't know whether it works. - Robᵩ
If you are using a 32-bit build of Python, adding more RAM is unlikely to help. - Robᵩ
1
Can you provide a sample program that demonstrates how you "incrementally dumped to the pickle file"? - Robᵩ
1
Speaking of which, can you provide a short, complete program that demonstrates the memory error you are seeing? - Robᵩ
Why SQLite rather than PostgreSQL? - Jwan622
3 Answers

44

Looks like you're in a bit of a pickle! ;-) Hopefully after this you'll never use pickle again. It's just not a very good data storage format.

Anyways, for this answer I'm assuming that your Document class looks a bit like this. If not, comment with your actual Document class:

class Document(object): # <-- object part is very important! If it's not there, the format is different!
    def __init__(self, title, date, text): # assuming all strings
        self.title = title
        self.date = date
        self.text = text

Anyways, I made some simple test data with this class:

d = [Document(title='foo', text='foo is good', date='1/1/1'), Document(title='bar', text='bar is better', date='2/2/2'), Document(title='baz', text='no one likes baz :(', date='3/3/3')]

Then I pickled it with pickle format 2 (pickle.HIGHEST_PROTOCOL for Python 2.x):

>>> s = pickle.dumps(d, 2)
>>> s
'\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03(U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b(h\x04U\x052/2/2q\x0ch\x06U\rbar is betterq\rh\x08U\x03barq\x0eubh\x01)\x81q\x0f}q\x10(h\x04U\x053/3/3q\x11h\x06U\x13no one likes baz :(q\x12h\x08U\x03bazq\x13ube.'

And disassembled it with pickletools:

>>> pickletools.dis(s)
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: c        GLOBAL     '__main__ Document'
   25: q        BINPUT     1
   27: )        EMPTY_TUPLE
   28: \x81     NEWOBJ
   29: q        BINPUT     2
   31: }        EMPTY_DICT
   32: q        BINPUT     3
   34: (        MARK
   35: U            SHORT_BINSTRING 'date'
   41: q            BINPUT     4
   43: U            SHORT_BINSTRING '1/1/1'
   50: q            BINPUT     5
   52: U            SHORT_BINSTRING 'text'
   58: q            BINPUT     6
   60: U            SHORT_BINSTRING 'foo is good'
   73: q            BINPUT     7
   75: U            SHORT_BINSTRING 'title'
   82: q            BINPUT     8
   84: U            SHORT_BINSTRING 'foo'
   89: q            BINPUT     9
   91: u            SETITEMS   (MARK at 34)
   92: b        BUILD
   93: h        BINGET     1
   95: )        EMPTY_TUPLE
   96: \x81     NEWOBJ
   97: q        BINPUT     10
   99: }        EMPTY_DICT
  100: q        BINPUT     11
  102: (        MARK
  103: h            BINGET     4
  105: U            SHORT_BINSTRING '2/2/2'
  112: q            BINPUT     12
  114: h            BINGET     6
  116: U            SHORT_BINSTRING 'bar is better'
  131: q            BINPUT     13
  133: h            BINGET     8
  135: U            SHORT_BINSTRING 'bar'
  140: q            BINPUT     14
  142: u            SETITEMS   (MARK at 102)
  143: b        BUILD
  144: h        BINGET     1
  146: )        EMPTY_TUPLE
  147: \x81     NEWOBJ
  148: q        BINPUT     15
  150: }        EMPTY_DICT
  151: q        BINPUT     16
  153: (        MARK
  154: h            BINGET     4
  156: U            SHORT_BINSTRING '3/3/3'
  163: q            BINPUT     17
  165: h            BINGET     6
  167: U            SHORT_BINSTRING 'no one likes baz :('
  188: q            BINPUT     18
  190: h            BINGET     8
  192: U            SHORT_BINSTRING 'baz'
  197: q            BINPUT     19
  199: u            SETITEMS   (MARK at 153)
  200: b        BUILD
  201: e        APPENDS    (MARK at 5)
  202: .    STOP

Looks complicated! But it's really not that bad. pickle is basically a stack machine: each ALL_CAPS identifier you see is an opcode, which manipulates the internal "stack" in some way in order to decode. This would matter more if we were trying to parse some complicated structure, but luckily we're just making a simple list of essentially tuple-like objects. All this "code" does is construct a bunch of objects on the stack and then push the entire stack into a list.

The one thing we do need to care about are the "BINPUT" / "BINGET" opcodes you see. Basically, these are for "memoization", to reduce the data footprint: pickle saves strings with BINPUT <id>, and then, if they come up again, instead of re-dumping them it simply uses BINGET <id> to retrieve them from the cache.
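
To see the memo at work, here's a throwaway sketch of my own (not part of the question's data): pickling a list that contains the same string object twice shows a BINPUT on the first occurrence and a BINGET on the second:

import pickle
import pickletools

s = 'repeated'
# first occurrence: SHORT_BINSTRING 'repeated' followed by BINPUT;
# second occurrence: just a BINGET referencing the memo slot
pickletools.dis(pickle.dumps([s, s], 2))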

Also, another complication! There's more than just SHORT_BINSTRING - there's the normal BINSTRING for strings longer than 255 bytes, as well as some fun Unicode variants. I'll assume that you're using Python 2 with all-ASCII strings. Again, comment if this isn't a correct assumption.
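
If you want to check which string opcodes your own data actually uses, disassembling a short and a long string side by side makes the split visible (again, just a quick sketch):

import pickle
import pickletools

pickletools.dis(pickle.dumps('x' * 10, 2))   # SHORT_BINSTRING with a 1-byte length
pickletools.dis(pickle.dumps('x' * 300, 2))  # BINSTRING with a 4-byte length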

OK, so we need to stream through the file until we hit a '\x81' byte (NEWOBJ). Then we need to scan forward until we hit a '(' (MARK) character. Then, until we hit a 'u' (SETITEMS), we read key/value string pairs - there should be 3 pairs in total, one for each field.

So let's do this. Here's my script for reading the pickle data in a streaming fashion. It's far from perfect, since I hacked it together just for this answer, and you'll need to modify it a lot, but it's a good start.

pickledata = '\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03(U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b(h\x04U\x052/2/2q\x0ch\x06T\x14\x05\x00\x00bar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterq\rh\x08U\x03barq\x0eubh\x01)\x81q\x0f}q\x10(h\x04U\x053/3/3q\x11h\x06U\x13no one likes baz :(q\x12h\x08U\x03bazq\x13ube.'

# simulate a file here
import StringIO
picklefile = StringIO.StringIO(pickledata)

import pickle # just for opcode names
import struct # binary unpacking

def try_memo(f, v, cache):
    opcode = f.read(1)
    if opcode == pickle.BINPUT:
        cache[f.read(1)] = v
    elif opcode == pickle.LONG_BINPUT:
        print 'skipping LONG_BINPUT to save memory, LONG_BINGET will probably not be used'
        f.read(4)
    else:
        f.seek(f.tell() - 1) # rewind

def try_read_string(f, opcode, cache):
    if opcode in [ pickle.SHORT_BINSTRING, pickle.BINSTRING ]:
        length_type = 'B' if opcode == pickle.SHORT_BINSTRING else 'i'  # 1-byte unsigned vs 4-byte length
        str_length = struct.unpack(length_type, f.read(struct.calcsize(length_type)))[0]
        value = f.read(str_length)
        try_memo(f, value, cache)  # use the cache that was passed in
        return value
    elif opcode == pickle.BINGET:
        return cache[f.read(1)]
    elif opcode == pickle.LONG_BINGET:
        raise Exception('Unexpected LONG_BINGET? Key ' + f.read(4))
    else:
        raise Exception('Invalid opcode ' + opcode + ' at pos ' + str(f.tell()))

memo_cache = {}
while True:
    c = picklefile.read(1)
    if c == pickle.NEWOBJ:
        while picklefile.read(1) != pickle.MARK:
            pass # scan forward to field instantiation
        fields = {}
        while True:
            opcode = picklefile.read(1)
            if opcode == pickle.SETITEMS:
                break
            key = try_read_string(picklefile, opcode, memo_cache)
            value = try_read_string(picklefile, picklefile.read(1), memo_cache)
            fields[key] = value
        print 'Document', fields
        # insert to sqllite
    elif c == pickle.STOP:
        break

This correctly read my test data written at pickle format 2 (modified to have one long string):

$ python picklereader.py
Document {'date': '1/1/1', 'text': 'foo is good', 'title': 'foo'}
Document {'date': '2/2/2', 'text': 'bar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is better', 'title': 'bar'}
Document {'date': '3/3/3', 'text': 'no one likes baz :(', 'title': 'baz'}
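
To run this against your real dump rather than the simulated StringIO, you would just swap in the actual file handle, using the filename from your failing program:

picklefile = open('objects1.pkl', 'rb')  # stream straight from disk, no full load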

Good luck!


12

You did not pickle your data incrementally. You pickled it monolithically and repeatedly. Each time around the loop, you destroyed whatever output data you had (open(...,'wb') truncates the output file) and re-wrote all of it again. Additionally, if your program ever stopped and then restarted with new input data, the old output data was lost.

I do not know why objects didn't cause an out-of-memory error while you were pickling, since it grew to the same size as the object that pickle.load() wants to create.

Here is how you could have created the pickle file incrementally:

def save_objects(objects): 
    with open('objects.pkl', 'ab') as output:  # Note: `ab` appends the data
        pickle.dump(objects, output, pickle.HIGHEST_PROTOCOL)

def Main():
    ...
    #objects=[] <-- lose the objects list
    with open('links2.txt', 'rb') as infile:
        for link in infile: 
            ... 
            save_objects(article)

Then you could read the pickle file incrementally like this:

import pickle
with open('objects.pkl', 'rb') as pickle_file:
    try:
        while True:
            article = pickle.load(pickle_file)
            print article
    except EOFError:
        pass
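
And since you asked about converting to an SQL database: here is a minimal sketch of combining that incremental read with SQLite (the database filename, table name, and column layout are my own assumptions; unpickling also requires the Document class to be importable under the same module path it was pickled with):

import pickle
import sqlite3

conn = sqlite3.connect('articles.db')  # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS documents (title TEXT, date TEXT, text TEXT)')

with open('objects.pkl', 'rb') as pickle_file:
    try:
        while True:
            article = pickle.load(pickle_file)  # one Document per dump() call
            conn.execute('INSERT INTO documents VALUES (?, ?, ?)',
                         (article.title, article.date, article.text))
    except EOFError:
        pass

conn.commit()
conn.close()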

The options I can think of:

  • Try cPickle (see the import sketch after this list). It might help.
  • Try streaming-pickle.
  • Read your pickle file in a 64-bit environment with lots and lots of RAM.
  • Re-crawl the original data, this time actually storing it incrementally, or storing it in a database. Without the inefficiency of constantly re-writing your pickle output file, your crawl might even go significantly faster.
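
For the cPickle option, the swap is a one-line import change (Python 2 only; a sketch):

try:
    import cPickle as pickle  # C implementation: same API, much faster
except ImportError:
    import pickle             # pure-Python fallback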

1
Thank you so much. I ran the crawler on one machine and then tried to view the pickle file on another. It worked on the original computer, which has more memory. - Vineeth Bhuvanagiri
5
OK, then the answer is obvious: extract the data from the pickle file on the original computer. - Robᵩ

0
I very recently had a very similar case - an 11 GB pickle file. I did not try to load it incrementally on my machine, since I didn't have enough time to implement my own incremental loader or to adapt an existing one to my case.
What I did instead was spin up a large instance with enough memory at a cloud hosting provider (not expensive if it only runs for a few hours), upload the file to that server over SSH (SCP), and simply load it there to analyze it and re-write it into a more suitable format.
Not a programming solution, but a time-effective one (and cheap).
