Error when reading AVRO files in Python

3

I have successfully installed Apache Avro in Python. I then tried to read Avro files into Python by following the instructions below.

https://avro.apache.org/docs/1.8.1/gettingstartedpython.html

I have a bunch of Avro files in a directory, and that directory is set as the working path in Python. Here is my code:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
for user in reader:
   print (user)
reader.close()

However, it returns the following error:
Traceback (most recent call last):
  File "I:\DJ data\read avro.py", line 5, in <module>
    reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 349, in __init__
    self._read_header()
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 459, in _read_header
    META_SCHEMA, META_SCHEMA, self.raw_decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
    return self.read_record(writer_schema, reader_schema, decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 725, in read_record
    field_val = self.read_data(field.type, readers_field.type, decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 515, in read_data
    return self.read_fixed(writer_schema, reader_schema, decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 568, in read_fixed
    return decoder.read(writer_schema.size)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 170, in read
    input_bytes = self.reader.read(n)
  File "I:\Program Files\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 863: character maps to <undefined>

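I know that a schema is created first in the guide's example. But what is an avsc file? How should I create it, and the corresponding schema, in my case? Ideally, I would like to read the Avro files into Python and either save them to disk as CSV or keep them as a dataframe/list for further analysis. I am using Python 3 on Windows 7.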

Edit: I tried Stephane's code, but it returned a new error.

Traceback (most recent call last):
  File "I:\DJ data\read avro.py", line 5, in <module>
    reader = DataFileReader(open("part-00000-of-01733.avro", "rb"), DatumReader())
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 352, in __init__
    self.codec = self.GetMeta('avro.codec').decode('utf-8')
AttributeError: 'NoneType' object has no attribute 'decode'

Edit 2: Stephane's code works most of the time, but it sometimes fails with an assertion error like this one.

Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 42, in <module>
for user in reader:
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 522, in __next__
datum = self.datum_reader.read(self.datum_decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 480, in read
return self.read_data(self.writer_schema, self.reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
return self.read_record(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 725, in read_record
field_val = self.read_data(field.type, readers_field.type, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 523, in read_data
return self.read_union(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 689, in read_union
return self.read_data(selected_writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 493, in read_data
return self.read_data(writer_schema, s, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 503, in read_data
return decoder.read_utf8()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 248, in read_utf8
input_bytes = self.read_bytes()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 241, in read_bytes
return self.read(nbytes)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 171, in read
assert (len(input_bytes) == n), input_bytes
AssertionError: b'BlackRock Group\n\n17 December 2015\n\nFORM 8.3\n\nPUBLIC OPENING POSITION DISCLOSURE/DEALING DISCLOSURE BY\n\nA PERSON WITH INTERESTS IN RELEVANT SECURITIES REPRESENTING 1% OR MORE\n\nRule 8.3 of the Takeover Code (the "Code") \n\n\n   1.         KEY INFORMATION \n \n (a) Full name of discloser:                                                                        BlackRock, Inc. \n-------------------------------------------------------------------------------------------------  ----------------- \n (b) Owner or controller of interests and short positions disclosed, if diffe
1 Answer

7
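You are using Windows and Python 3.
In Python 3, open opens files in text mode by default. This means that on every subsequent read, Python will try to decode the file content from some character set into Unicode.
You did not specify a character set, so Python tries to decode the content as charmap-encoded (cp1252 here, the default on your Windows installation, as the traceback shows).
Obviously, your avro file is not encoded with that charmap, so decoding fails with an exception.
As far as I remember, the avro header is binary content anyway, not text (not sure). So you should first try to open the file without any decoding: reader = DataFileReader(open("part-00000-of-01733.avro", 'rb'), DatumReader()) (note the 'rb', i.e. binary mode)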
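The difference between the two modes can be reproduced without Avro at all. The sketch below writes five bytes to a temporary file (the Avro magic bytes `Obj\x01` followed by the byte 0x90 from the traceback) and reads them back both ways. The helper name `read_both_ways` is made up for illustration, and `encoding="cp1252"` is passed explicitly only so that the example behaves the same on machines where cp1252 is not the default.

```python
import os
import tempfile


def read_both_ways(payload):
    """Write payload to a temp file, then read it in text and binary mode.

    Returns (text_result, binary_result). text_result is the exception
    instance if decoding failed, otherwise the decoded string.
    """
    fd, path = tempfile.mkstemp(suffix=".avro")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)

        # Text mode: Python decodes the raw bytes with the given codec.
        try:
            with open(path, "r", encoding="cp1252") as f:
                text_result = f.read()
        except UnicodeDecodeError as exc:
            text_result = exc

        # Binary mode: the bytes come back untouched.
        with open(path, "rb") as f:
            binary_result = f.read()
    finally:
        os.remove(path)
    return text_result, binary_result


text_result, binary_result = read_both_ways(b"Obj\x01\x90")
print(type(text_result).__name__)
print(binary_result)
```

Byte 0x90 has no mapping in cp1252, which is why the text-mode read fails while the binary read simply returns the bytes.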
Edit: as for the next problem (the AttributeError), you have hit a known bug that is not fixed in 1.8.1. Until the next release is out, you can try the following:
from avro import schema  # the class below calls schema.Parse
from avro.datafile import (DataFileReader, DataFileWriter, DataFileException,
                           VALID_CODECS, SCHEMA_KEY)
from avro.io import DatumReader, DatumWriter
from avro import io as avro_io


class MyDataFileReader(DataFileReader):
    def __init__(self, reader, datum_reader):
        """Initializes a new data file reader.

        Args:
          reader: Open file to read from.
          datum_reader: Avro datum reader.
        """
        self._reader = reader
        self._raw_decoder = avro_io.BinaryDecoder(reader)
        self._datum_decoder = None  # Maybe reset at every block.
        self._datum_reader = datum_reader

        # read the header: magic, meta, sync
        self._read_header()

        # ensure codec is valid
        avro_codec_raw = self.GetMeta('avro.codec')
        if avro_codec_raw is None:
            self.codec = "null"
        else:
            self.codec = avro_codec_raw.decode('utf-8')
        if self.codec not in VALID_CODECS:
            raise DataFileException('Unknown codec: %s.' % self.codec)

        self._file_length = self._GetInputFileLength()

        # get ready to read
        self._block_count = 0
        self.datum_reader.writer_schema = (
            schema.Parse(self.GetMeta(SCHEMA_KEY).decode('utf-8')))


# Open in binary mode ("rb"); text mode triggers the UnicodeDecodeError above.
reader = MyDataFileReader(open("part-00000-of-01733.avro", "rb"), DatumReader())
for user in reader:
    print (user)
reader.close()

It is odd that such a silly bug made it into a release; that is not a sign of mature code!
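As for the CSV part of the question: for a record schema, each item yielded by the reader is a plain Python dict, so the standard `csv` module should be enough. This is a minimal sketch, assuming flat records (no nested records or arrays). `records_to_csv` is a hypothetical helper, and the sample dicts and temporary output path stand in for what `reader` would yield and for a real file name.

```python
import csv
import os
import tempfile


def records_to_csv(records, csv_path):
    """Write an iterable of flat dicts to csv_path and return the row count.

    The column order is taken from the keys of the first record.
    """
    records = iter(records)
    try:
        first = next(records)
    except StopIteration:
        return 0  # nothing to write, no file is created

    count = 1
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(first.keys()),
                                extrasaction="ignore")
        writer.writeheader()
        writer.writerow(first)
        for record in records:
            writer.writerow(record)
            count += 1
    return count


# Stand-in data; with Avro this would be records_to_csv(reader, "out.csv").
sample = [
    {"name": "Alyssa", "favorite_number": 256},
    {"name": "Ben", "favorite_number": 7},
]
out_path = os.path.join(tempfile.mkdtemp(), "users.csv")
row_count = records_to_csv(sample, out_path)
```

If pandas is installed, `pandas.DataFrame(list(reader))` is a shorter route to a dataframe, at the cost of holding every record in memory at once.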


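Here is the link: https://issues.apache.org/jira/browse/AVRO-1741. The patch does not appear to have been merged into Avro Python3 1.8.1. - Stephane Martin
Thanks for the code. But it seems to contain some errors. I guess that in Python 3, open should be called with "rb" rather than "r", right? Also, inside the class block it says that "avro_io" and "VALID_CODECS" are undefined. How can I fix that? - ycenycute
Yes, 'rb'. As for the undefined symbols, I have added the necessary imports. - Stephane Martin
Great, it works! But how can I convert reader to another type, such as a string? I need to parse the text further. Thanks a lot for your help. - ycenycute
Hi, there seems to be another problem when reading the Avro files. Could you please take a look at the edited question? - ycenycute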