Extracting images from a PDF with Apache Tika

Question

image pdf apache-tika

5

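Apache Tika 1.6 can extract inline images from PDF documents. However, I have been struggling to get it to work.

My use case is that I want some code that extracts the content, and separately any images, from any document (not necessarily a PDF). The results are then passed into an Apache UIMA pipeline.

I have been able to extract images from other file types by using a custom parser (built on an AutoParser) that converts the document to HTML and then saves the images out separately. When I try the same with a PDF, though, the image tags don't even appear in the HTML, let alone give me access to the files.

Could someone suggest how I might achieve the above, preferably with some code examples of how to extract inline images from PDFs with Tika 1.6?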

- James Baker

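TIKA-1268 and TIKA-1396 are both marked as fixed in 1.6. Are you sure you're really using Tika 1.6? - Gagravarr

Assuming the file marked 1.6 on the website and named tika-app-1.6.jar is actually Tika 1.6, then I'm sure! - James Baker

Are you testing image extraction with the Tika app's --extract flag? - Gagravarr

I'm trying to do it programmatically, but I have also tried the --extract flag and the GUI, and neither managed to find any images in the document. - James Baker

It looks like you'll need to hop onto one of those bugs and flag that it isn't fully fixed yet. - Gagravarr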

2 Answers

3

Try the code below; the returned ContentHandler holds your converted XML content.

public ContentHandler convertPdf(byte[] content, String path, String filename) throws IOException, SAXException, TikaException {

    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    ContentHandler handler =   new ToXMLContentHandler();
    PDFParser parser = new PDFParser(); 

    PDFParserConfig config = new PDFParserConfig();
    config.setExtractInlineImages(true);
    config.setExtractUniqueInlineImagesOnly(true);

    parser.setPDFParserConfig(config);


    EmbeddedDocumentExtractor embeddedDocumentExtractor = 
            new EmbeddedDocumentExtractor() {
        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }
        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            Path outputFile = new File(path+metadata.get(Metadata.RESOURCE_NAME_KEY)).toPath();
            Files.copy(stream, outputFile);
        }
    };

    context.set(PDFParser.class, parser);
    context.set(EmbeddedDocumentExtractor.class,embeddedDocumentExtractor );

    try (InputStream stream = new ByteArrayInputStream(content)) {
        parser.parse(stream, handler, metadata, context);
    }

    return handler;
}

- Goran
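
The extractor above builds its target by concatenating path and the resource name, and calls Files.copy without options, which throws FileAlreadyExistsException if the same image was already written by an earlier run. Below is a minimal sketch of how the save step might be written instead, using only the JDK. The class and method names (EmbeddedSaver, save) are hypothetical and not part of Tika; the idea would be to call save(stream, outputDir, name) from inside parseEmbedded.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Tika: writes one embedded stream into a directory.
final class EmbeddedSaver {
    private EmbeddedSaver() {}

    static Path save(InputStream stream, Path outputDir, String name) throws IOException {
        // Create the directory (and any missing parents) if it does not exist yet.
        Files.createDirectories(outputDir);
        // resolve() inserts the separator, so outputDir need not end with one.
        Path target = outputDir.resolve(name);
        // Overwrite a file left behind by an earlier run instead of failing.
        Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes; in the answer above the stream would come from Tika.
        Path dir = Files.createTempDirectory("tika-images");
        byte[] fake = "not a real image".getBytes(StandardCharsets.UTF_8);
        Path first = save(new ByteArrayInputStream(fake), dir, "image0.png");
        Path second = save(new ByteArrayInputStream(fake), dir, "image0.png");
        System.out.println(first.equals(second) + " " + Files.size(second));
    }
}
```

Saving twice under the same name succeeds here because of REPLACE_EXISTING; whether overwriting is the right policy depends on your pipeline.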

- Cardin · Accepted Answer

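It is possible to extract images with an AutoDetectParser, without relying on PDFParser. This code also works for extracting images from docx, pptx, and similar files.

Here I have a parseDocument() and a setPdfConfig() function, both of which make use of an AutoDetectParser.

1. Create an AutoDetectParser.
2. Add an EmbeddedDocumentExtractor to a ParseContext.
3. Add the AutoDetectParser to the same ParseContext.
4. Add a PDFParserConfig to the same ParseContext.
5. Pass that ParseContext to AutoDetectParser.parse().

The images are saved to a folder in the same location as the source file, named <sourceFile>_/.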

private static void setPdfConfig(ParseContext context) {
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);
    pdfConfig.setExtractUniqueInlineImagesOnly(true);

    context.set(PDFParserConfig.class, pdfConfig);
}

private static String parseDocument(String path) {
    String xhtmlContents = "";

    AutoDetectParser parser = new AutoDetectParser();
    ContentHandler handler = new ToXMLContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    EmbeddedDocumentExtractor embeddedDocumentExtractor = 
            new EmbeddedDocumentExtractor() {
        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }
        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            Path outputDir = new File(path + "_").toPath();
            Files.createDirectories(outputDir);

            Path outputPath = new File(outputDir.toString() + "/" + metadata.get(Metadata.RESOURCE_NAME_KEY)).toPath();
            Files.deleteIfExists(outputPath);
            Files.copy(stream, outputPath);
        }
    };

    context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);
    context.set(AutoDetectParser.class, parser);

    setPdfConfig(context);

    try (InputStream stream = new FileInputStream(path)) {
        parser.parse(stream, handler, metadata, context);
        xhtmlContents = handler.toString();
    } catch (IOException | SAXException | TikaException e) {
        // Parsing failed: log it and fall through to return the empty string
        e.printStackTrace();
    }

    return xhtmlContents;
}
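
One thing the extractor above takes for granted is that metadata.get(Metadata.RESOURCE_NAME_KEY) returns a plain file name. Metadata.get returns null when the key is not set, and a name taken from inside a document could in principle contain directory components. The following is a minimal sketch, using only the JDK, of a guard that could be applied before building outputPath. The class name EmbeddedNames, its methods, and the "embedded-<index>.bin" fallback are all hypothetical choices for illustration, not part of Tika.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: turns a resource name into a safe file name.
final class EmbeddedNames {
    private EmbeddedNames() {}

    // Returns the last path segment of resourceName, or a generated
    // fallback name when the name is missing or is not a usable file name.
    static String safeName(String resourceName, int index) {
        String fallback = "embedded-" + index + ".bin";
        if (resourceName == null || resourceName.trim().isEmpty()) {
            return fallback;
        }
        // Treat both separator styles alike, then keep only the last segment.
        String normalised = resourceName.replace('\\', '/');
        String last = normalised.substring(normalised.lastIndexOf('/') + 1);
        if (last.isEmpty() || last.equals(".") || last.equals("..")) {
            return fallback;
        }
        return last;
    }

    // Resolves the sanitised name against the output directory.
    static Path resolve(Path outputDir, String resourceName, int index) {
        return outputDir.resolve(safeName(resourceName, index));
    }

    public static void main(String[] args) {
        Path dir = Paths.get("report.pdf_");
        System.out.println(resolve(dir, "image0.jpg", 0));
        System.out.println(resolve(dir, null, 1));
        System.out.println(resolve(dir, "../../etc/passwd", 2));
    }
}
```

The index parameter is assumed to be a counter the caller increments for each embedded resource, so that unnamed resources do not overwrite one another.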