Reading a PDF file line by line with Python

Question

6

I used the following code to read a PDF file, but it does not read anything. What could be the cause?

from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")
contents = reader.pages[0].extractText().split("\n")
print(contents)

The output is [u''] rather than the contents of the file.

- Rahul Pipalia

Do page numbers other than 0 work? Are you sure the PDF contains text, and not just images or graphics? - mkrieger1

7 Answers

0

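The problem is probably one of two things: (1) the text is not on the first page, which is a user error; or (2) PyPDF2 cannot extract the text, which is a bug in PyPDF2.

Unfortunately, the second case still occurs with some PDF files.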
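
One way to tell the two cases apart is to extract every page rather than only page 0 and see which pages yield any text at all. The sketch below is only an illustration: it assumes the newer pypdf package (the successor to PyPDF2) with its PdfReader and extract_text names, and the helper names pages_with_text, extract_page_texts, and diagnose are made up for this example.

```python
def pages_with_text(page_texts):
    """Return the zero-based indices of pages whose extracted text is non-blank."""
    return [i for i, text in enumerate(page_texts) if text and text.strip()]


def extract_page_texts(pdf_path):
    """Extract the text of every page (assumes the third-party pypdf package)."""
    # Imported here so that pages_with_text above has no dependencies.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Pages that contain only images typically yield an empty string.
    return [page.extract_text() or "" for page in reader.pages]


def diagnose(pdf_path):
    """Print which of the two cases seems to apply to the given file."""
    texts = extract_page_texts(pdf_path)
    found = pages_with_text(texts)
    if not found:
        print("No page yielded text: the PDF may be scanned images, "
              "or text extraction failed.")
    elif 0 not in found:
        print("Page 0 is empty, but text was found on pages:", found)
    else:
        print("Page 0 contains text; first lines:", texts[0].splitlines()[:5])
```

Calling diagnose("example.pdf") on the file from the question would then indicate whether the text simply sits on a later page or whether no page could be extracted.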

- Martin Thoma

0

Maybe this can help you read the PDF.

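import pyPdf

def getPDFContent(path):
    content = ""
    # PDFs must be opened in binary mode; keep the file open while reading pages
    with open(path, "rb") as p:
        pdf_content = pyPdf.PdfFileReader(p)
        # Loop over the actual page count instead of a hard-coded 10
        for i in range(pdf_content.getNumPages()):
            content += pdf_content.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces and collapse all runs of whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content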

- Tejas Thakar

0

def getTextPDF(pdfFileName, password=''):
    """Extract text from a PDF and return it as a list of sentences."""
    from PyPDF2 import PdfFileReader
    from nltk import sent_tokenize  # needs the NLTK tokenizer data to be downloaded

    # Keep the file open until every page has been read
    with open(pdfFileName, 'rb') as pdf_file:
        read_pdf = PdfFileReader(pdf_file)
        if password != '':
            read_pdf.decrypt(password)
        text = []
        for i in range(read_pdf.getNumPages()):
            text.append(read_pdf.getPage(i).extractText())
    # Concatenate the pages and drop all line breaks before tokenizing
    text = ''.join(text).replace("\n", '')
    return sent_tokenize(text)

- thrinadhn

0

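I think you need to specify the drive name; it is missing from your path. For example, "D:/Users/Rahul/Desktop/Dfiles/106_2015_34-76357.pdf". I tried it and had no problems.

Alternatively, if you want to locate the file with the os module but have not actually tied it to your directory, you can try the following: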

from PyPDF2 import PdfFileReader
import os

def find(name, path):
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)

directory = find('106_2015_34-76357.pdf', 'D:/Users/Rahul/Desktop/Dfiles/')

f = open(directory, 'rb')

reader = PdfFileReader(f)

contents = reader.getPage(0).extractText().split('\n')

f.close()

print(contents)

The find function comes from Nadia Alramli's answer to "Find a file in Python".
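
As a possible variation on the same idea, the lookup can be written with the standard pathlib module so that a missing file is easy to detect before the PDF is opened. This is a sketch only; the function name find_pdf is hypothetical and is not part of the answer above.

```python
from pathlib import Path


def find_pdf(name, base_dir):
    """Return the first file named `name` under `base_dir`, or None if absent."""
    # rglob("*") walks base_dir recursively, much like os.walk
    for candidate in Path(base_dir).rglob("*"):
        if candidate.is_file() and candidate.name == name:
            return candidate
    return None
```

A caller can check the result for None and report a clear error, instead of passing None to open() and getting a less helpful exception.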

- Ahaha

0

To read files from multiple folders within a directory, you can use the following code. This example reads PDF files:

import os
from tika import parser

path = "/usr/local/"  # directory to search
for r, d, f in os.walk(path):  # walk through all subdirectories
    for file in f:
        if file.lower().endswith(".pdf"):  # read only PDF files
            file_join = os.path.join(r, file)  # build the full path
            file_data = parser.from_file(file_join)  # parse the PDF file
            text = file_data['content']  # extracted text
            print(text)  # print the content

- Anush

This is too complicated; it mostly demonstrates directory traversal. Answer the question that was asked. - Kickaha

-2

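Hello Rahul Pipalia,

If PyPDF2 is not installed in your Python environment, install the PyPDF2 module first.

Installation steps on Ubuntu (installs python-pypdf)

First, open a terminal.
Then type sudo apt-get install python-pypdf

Solution

Try the following code: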

# Import the library
import PyPDF2

# Name of the file to read, including the ".pdf" extension;
# PDFs must be opened in binary mode
pdf_file = open('Your_Pdf_File_Name.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()

# Zero-based index of the page to read, from 0 to number_of_pages - 1
page_number = 0
page = read_pdf.getPage(page_number)

page_content = page.extractText()

# Display the content of the page
print(page_content)

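Please download the PDF from the following link and try this code with it: https://www.dropbox.com/s/4qad66r2361hvmu/sample.pdf?dl=1

I hope my answer helps.
If you have any questions, please leave a comment.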

- Er CEO Vora Mayur

- Piyush Rumao · Accepted Answer

import re
from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")

for page in reader.pages:
    text = page.extractText()
    text_lower = text.lower()
    for line in text_lower.splitlines():
        if re.search("abc", line):
            print(line)

I use this to iterate over the PDF page by page and search for keywords in it, then do further processing.
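
For the further processing it can help to know where each hit occurred. The following is a minimal sketch of the same page-by-page, line-by-line search that also reports page and line numbers. The function name find_matching_lines is hypothetical, and the usage comment assumes the newer pypdf package rather than the PdfFileReader API used above.

```python
import re


def find_matching_lines(page_texts, pattern):
    """Yield (page_number, line_number, line) for each line matching `pattern`.

    Both numbers are 1-based and the match is case-insensitive.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    for page_number, text in enumerate(page_texts, start=1):
        # splitlines() yields whole lines; iterating the string itself
        # would yield single characters
        for line_number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield page_number, line_number, line.strip()


# Usage with a real file (assumes pypdf is installed):
#   from pypdf import PdfReader
#   texts = [page.extract_text() or "" for page in PdfReader("example.pdf").pages]
#   for hit in find_matching_lines(texts, "abc"):
#       print(hit)
```

Because the function takes plain strings, it can be tried on sample text without a PDF at hand.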