Extracting images from a PDF in sequence with Python

Question

3

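I am trying to extract images from a PDF using PyMuPDF (fitz). My PDF has multiple images on a single page, and I number the images sequentially as I save them. I have found that the images are not extracted in the correct order: sometimes extraction starts from the bottom of the page, sometimes from the top, and so on. Is there a way to modify my code so that the images are extracted in the correct order? Here is the code I am using: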

import fitz  # PyMuPDF

filename = "document.pdf"
doc = fitz.open(filename)

for i in range(len(doc)):
    p_no = i + 1   # 1-based page number used in the file name
    img_num = 0    # per-page image counter
    # get_page_images was called getPageImageList in older PyMuPDF releases
    for img in doc.get_page_images(i):
        xref = img[0]  # cross-reference number of the image object
        pix = fitz.Pixmap(doc, xref)
        img_num += 1
        if pix.n - pix.alpha >= 4:
            # CMYK: convert to RGB before saving
            pix = fitz.Pixmap(fitz.csRGB, pix)
        # Pixmap.save was called writeImage in older PyMuPDF releases
        pix.save("%s-%s.jpg" % (p_no, img_num))
        pix = None  # release the pixmap

Here is a sample page from the PDF file:

- Sabster

Maybe you could sort the images with doc.getPageImageList(i).sort()?

No, that does not work. I get the following error: TypeError: 'NoneType' object is not iterable
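The error in the comment above is consistent with how Python lists behave: `list.sort()` sorts in place and returns `None`, so looping over its result fails. The sketch below illustrates this with made-up tuples standing in for image list entries (they are not real PyMuPDF output). Note that even a successful sort here would order by the first field (the xref), which says nothing about where an image sits on the page.

```python
# Made-up stand-ins for image list entries; the first field plays the xref.
entries = [(12, "Im2"), (7, "Im1"), (9, "Im3")]

in_place = list(entries).sort()   # list.sort() reorders the list and returns None
ordered = sorted(entries)         # sorted() returns a new, ordered list

try:
    for entry in in_place:        # iterating over None raises the TypeError
        pass
    error = None
except TypeError as exc:
    error = str(exc)

print(in_place)   # None
print(ordered)    # ordered by xref, not by position on the page
print(error)
```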

1 Answer

- Ecko · Accepted Answer

I had the same problem and used the following code:

import fitz  # PyMuPDF
import io
from PIL import Image

file = "file_path"
pdf_file = fitz.open(file)

for page_index in range(len(pdf_file)):
    # get the page itself
    page = pdf_file[page_index]
    # get_images was called getImageList in older PyMuPDF releases
    image_list = page.get_images()
    # report the number of images found on this page
    if image_list:
        print(f"[+] Found {len(image_list)} images in page {page_index}")
    else:
        print("[!] No images found on the given pdf page", page_index)
    for image_index, img in enumerate(image_list, start=1):
        print(img)
        print(image_index)
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes (extract_image was extractImage in older releases)
        base_image = pdf_file.extract_image(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk; passing a path lets PIL open and close the file
        image.save(f"image{page_index+1}_{image_index}.{image_ext}")

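Most likely, the way to go is to locate each "img" entry on the page and sort the entries by position. If you have a better idea or solution, please let me know; I would be glad to hear further suggestions.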
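One way the "locate and sort" idea might look is sketched below. It is not a tested solution from this thread. It assumes a PyMuPDF version that provides `Page.get_image_rects`, and it assumes that "correct order" means top to bottom, then left to right. The helper names `reading_order` and `save_images_in_order` are made up for illustration.

```python
def reading_order(items):
    """Sort (label, (x0, y0, x1, y1)) pairs top-to-bottom, then left-to-right.

    Assumes PyMuPDF page coordinates, where y grows downward from the top edge.
    """
    return sorted(items, key=lambda item: (item[1][1], item[1][0]))


def save_images_in_order(pdf_path):
    """Hypothetical helper: save each page's images numbered by position."""
    import fitz  # PyMuPDF; imported here so reading_order works without it

    doc = fitz.open(pdf_path)
    for page_index, page in enumerate(doc, start=1):
        placed = []
        for img in page.get_images(full=True):
            xref = img[0]
            # One rectangle per place the image is drawn on this page.
            for rect in page.get_image_rects(xref):
                placed.append((xref, (rect.x0, rect.y0, rect.x1, rect.y1)))
        for image_index, (xref, _) in enumerate(reading_order(placed), start=1):
            base_image = doc.extract_image(xref)
            name = f"image{page_index}_{image_index}.{base_image['ext']}"
            with open(name, "wb") as out:
                out.write(base_image["image"])


# Sorting alone can be tried without a PDF, using made-up rectangles:
sample = [("c", (300, 400, 350, 450)), ("a", (50, 100, 100, 150)), ("b", (200, 100, 250, 150))]
print([label for label, _ in reading_order(sample)])  # ['a', 'b', 'c']
```

The sort key compares the exact top edge, so images that sit side by side with slightly different tops may not come out left to right; rounding `y0` to a tolerance would be one way to handle that. An image drawn more than once on a page is saved once per occurrence.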