Creating a tagged PDF with PDFBox

Question

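Is it possible to create a tagged PDF (PDF/UA) with PDFBox? PDFBox appears to have an API for this (the package org.apache.pdfbox.pdmodel.documentinterchange.taggedpdf), but I could not find any tutorials or code examples.

With the following code I generated a PDF file containing an image. The NVDA screen reader (in my case) recognizes it and reads "... graphic Alternate Description". However, the accessibility checker PAC 2 reports an error: "Image object not tagged".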

        PDDocument doc = new PDDocument();
        PDPage page = new PDPage();
        doc.addPage(page);
        PDDocumentCatalog documentCatalog = doc.getDocumentCatalog();

        PDImageXObject pdImage = PDImageXObject.createFromFile(imagePath, doc);
        PDPageContentStream contents = new PDPageContentStream(doc, page);
        contents.drawImage(pdImage, 100, 600, pdImage.getWidth() / 2, pdImage.getHeight() / 2);
        contents.close();

        PDStructureTreeRoot treeRoot = new PDStructureTreeRoot();
        PDStructureElement structureElement = new PDStructureElement(StandardStructureTypes.Figure, treeRoot);
        structureElement.setPage(page);

        PDMarkedContent markedImg = new PDMarkedContent(COSName.IMAGE, new COSDictionary());
        markedImg.addXObject(pdImage);

        structureElement.appendKid(markedImg);
        structureElement.setAlternateDescription("Alternate Description");
        treeRoot.appendKid(structureElement);
        documentCatalog.setStructureTreeRoot(treeRoot);
        // ....
        doc.save(fileName);

Could you provide some explanation and/or code examples on this topic?

- Leonid Muzyka

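Unfortunately there are no examples, mostly because, as far as I know, none of us has worked on creating such files. (I am a PDFBox committer.) The only thing I can do for you is fix any bugs you may find. What you could do is create a file with another tool, then inspect its structure with the PDFBox PDFDebugger and replicate it. - Tilman Hausherr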

@TilmanHausherr, thanks for pointing me to PDFDebugger. The main question now is how to write a PDStructureElement directly into a PDPageContentStream. - Leonid Muzyka

I guess you mean BMC, BDC, EMC, MP and DP. For now you need to use the (deprecated) "raw" methods. Or you could create a request in JIRA to get some new methods :-) - Tilman Hausherr

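PDFBox 1.8 can create PDF/A, but only PDF/A-1b, not PDF/A-1a, which covers PDF/UA. I have not found any information on whether PDFBox 2.0 supports PDF/A-1a. If PDF/A documents generated with PDFBox 2 have no accessibility tags, I would assume it is not supported yet? - Tsundoku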

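@leomuz, do you have Acrobat? You can run the accessibility checker in Acrobat and see whether it reports the same error as PAC 2. You can also look at the tag tree (View > Show/Hide > Navigation Panes > Tags). If you do not have Acrobat, contact me privately and I can look at the file for you. See my Stack Overflow profile for how to reach me. I cannot help with PDFBox, but seeing where the error occurs might help. - slugolicious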

OpenHTMLtoPDF now supports tagged PDF. See the wiki page on accessible PDFs: https://github.com/danfickle/openhtmltopdf/wiki/PDF-Accessibility-(PDF-UA,-WCAG,-Section-508)-Support - Daniel F

1 Answer

- mlovell · Accepted Answer

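I have published an example that demonstrates how to create an accessible PDF with PDFBox 2: https://github.com/martinlovell/accessible-pdfbox-example.

The code in the question is missing a few things. The marked content needs alt text, and I believe you need to supply MCIDs for that marked content.

The example project demonstrates what you need in more detail.

Roughly: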

PDPageContentStream contents = new PDPageContentStream(doc, page);

// the content in the stream needs an id
int mcid = 5;
COSDictionary dictionary = new COSDictionary();
dictionary.setInt(COSName.MCID, mcid);

// wrap image drawing in marked content
contents.beginMarkedContent(COSName.IMAGE, PDPropertyList.create(dictionary));
contents.drawImage(pdImage, 100, 600, pdImage.getWidth() / 2, pdImage.getHeight() / 2);
contents.endMarkedContent();

contents.close();

PDStructureTreeRoot treeRoot = new PDStructureTreeRoot();
documentCatalog.setStructureTreeRoot(treeRoot);
PDStructureElement structureElement = new PDStructureElement(StandardStructureTypes.Figure, treeRoot);
structureElement.setPage(page);
structureElement.setAlternateDescription("Alternate Description");

// Set alt text on marked content for structure.  
// This is the dictionary with the mcid used in beginMarkedContent.
dictionary.setString(COSName.ALT, "Alternate Description");
PDMarkedContent markedImg = new PDMarkedContent(COSName.IMAGE, dictionary);
markedImg.addXObject(pdImage);
structureElement.appendKid(markedImg);
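
For context, the beginMarkedContent/endMarkedContent calls above bracket the image drawing with the BDC and EMC operators in the page content stream, and the MCID is what ties that bracketed content to the structure element. The sketch below uses plain Java with no PDFBox calls to show what such a marked-content sequence looks like in raw PDF syntax. The class name, the method name, and the resource name Im1 are hypothetical. The property dictionary is written inline here, which is one form the PDF specification allows; PDFBox may instead reference it by name from the page resources.

```java
final class MarkedContentSketch {

    /**
     * Builds a marked-content sequence in raw PDF syntax:
     * a tag, an inline property dictionary carrying the MCID,
     * the BDC operator, the bracketed content, and the closing EMC operator.
     */
    public static String markedContent(String tag, int mcid, String body) {
        return "/" + tag + " <</MCID " + mcid + ">> BDC\n"
                + body + "\n"
                + "EMC";
    }

    public static void main(String[] args) {
        // "/Im1 Do" paints an image XObject; Im1 is a made-up resource name.
        System.out.println(markedContent("Image", 5, "/Im1 Do"));
    }
}
```

Opening a generated file in PDFDebugger, as suggested in the comments, is a way to check what the real content stream contains.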