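I am trying to extract text from a large number of PDFs using PDFMiner's Python bindings. The module I wrote works for many of the PDFs, but for a subset of them I get this somewhat cryptic error message:
IPython stack trace: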
/usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
331 break
332 else:
--> 333 raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
334 if self.catalog.get('Type') is not LITERAL_CATALOG:
335 if STRICT:
PDFSyntaxError: No /Root object! - Is this really a PDF?
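Naturally, I immediately checked whether these PDFs were corrupted, but they open and read normally. Is there any way to read these PDFs despite the missing root object? I'm not really sure where to start.
Many thanks!
Edit:
I tried PyPDF to get a differential diagnosis. Here is the stack trace: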
In [50]: pdf = pyPdf.PdfFileReader(file(fail, "rb"))
---------------------------------------------------------------------------
PdfReadError Traceback (most recent call last)
/home/louist/Desktop/pdfs/indir/<ipython-input-50-b7171105c81f> in <module>()
----> 1 pdf = pyPdf.PdfFileReader(file(fail, "rb"))
/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in __init__(self, stream)
372 self.flattenedPages = None
373 self.resolvedObjects = {}
--> 374 self.read(stream)
375 self.stream = stream
376 self._override_encryption = False
/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in read(self, stream)
708 line = self.readNextEndLine(stream)
709 if line[:5] != "%%EOF":
--> 710 raise utils.PdfReadError, "EOF marker not found"
711
712 # find startxref entry - the location of the xref table
PdfReadError: EOF marker not found
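Quonux suggested that PDFMiner might stop parsing after it encounters the first EOF marker. The PyPDF trace above seems to suggest that is not the case, but I'm at a loss. Any ideas?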
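One check that does not depend on either library is to look at the raw bytes and see where the `%PDF-` header and the `%%EOF` markers actually sit. The following is only a rough sketch: the helper name `inspect_pdf_markers` and the keys of the returned dict are made up for illustration, and it scans bytes without parsing the PDF structure, so it can only hint at junk before the header, data after the last `%%EOF`, or a missing `startxref`.

```python
def inspect_pdf_markers(path):
    """Report where the PDF header and %%EOF markers sit in the raw bytes.

    Hypothetical diagnostic helper: it does no real PDF parsing.
    """
    with open(path, "rb") as f:
        data = f.read()

    # Offset of the "%PDF-" header; -1 means it was not found at all,
    # and a value above 0 means there are bytes before the header.
    header_offset = data.find(b"%PDF-")

    # Collect the offset of every "%%EOF" marker in the file.
    eof_offsets = []
    pos = data.find(b"%%EOF")
    while pos != -1:
        eof_offsets.append(pos)
        pos = data.find(b"%%EOF", pos + 1)

    # Number of bytes that follow the last "%%EOF" marker (None if no marker).
    if eof_offsets:
        trailing = len(data) - (eof_offsets[-1] + len(b"%%EOF"))
    else:
        trailing = None

    return {
        "size": len(data),
        "header_offset": header_offset,
        "eof_offsets": eof_offsets,
        "trailing_bytes": trailing,
        "has_startxref": b"startxref" in data,
    }
```

Running this on one failing file (for example `inspect_pdf_markers(fail)`) and on one file that parses fine, then comparing the two results, should show whether the failing files have no `%%EOF` at all, several of them, or a lot of data after the last one. That would at least narrow down whether the EOF handling is the problem or whether the trailer itself is missing.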