How to extract highlighted text from a PDF using iTextSharp?

Question

According to the article "iTextSharp PDF Reading highlighed text (highlight annotations) using C#", this code:

for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
    if (annots != null) {
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}

extracts the PDF annotations. But why does the following, otherwise identical code not work for highlights (specifically, PdfName.HIGHLIGHT does not work)?

for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);
    if (annots != null) {
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}

- John Stevensons

2 Answers

Here is a complete example of extracting highlighted text using iTextSharp:

using System;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public void GetRectAnno()
{
    // Locate the sample PDF relative to the project root (two levels above the working directory).
    string appRootDir = new DirectoryInfo(Environment.CurrentDirectory).Parent.Parent.FullName;
    string filePath = appRootDir + "/PDFs/" + "anot.pdf";

    try
    {
        using (PdfReader reader = new PdfReader(filePath))
        {
            // PDF page numbers are 1-based.
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                PdfDictionary page = reader.GetPageN(i);
                PdfArray annots = page.GetAsArray(PdfName.ANNOTS);
                if (annots == null)
                    continue; // this page has no annotations

                foreach (PdfObject annot in annots.ArrayList)
                {
                    // Resolve the indirect reference to the annotation dictionary.
                    PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annot);
                    PdfName subType = annotationDic.GetAsName(PdfName.SUBTYPE);

                    // Only highlight annotations are of interest (null-safe comparison).
                    if (!PdfName.HIGHLIGHT.Equals(subType))
                        continue;

                    // Rect is the bounding box of the whole annotation.
                    PdfArray coordinates = annotationDic.GetAsArray(PdfName.RECT);
                    Console.Write("HighLight at Rectangle {0} with QuadPoints {1}\n", coordinates, annotationDic.GetAsArray(PdfName.QUADPOINTS));

                    Rectangle rect = new Rectangle(
                        coordinates.GetAsNumber(0).FloatValue,
                        coordinates.GetAsNumber(1).FloatValue,
                        coordinates.GetAsNumber(2).FloatValue,
                        coordinates.GetAsNumber(3).FloatValue);

                    // Keep only the text whose position falls inside the rectangle.
                    RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
                    ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);

                    // Show the extracted text on the console.
                    Console.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
                }
            }
        }
    }
    catch (Exception ex)
    {
        // Report the failure instead of silently swallowing it.
        Console.Error.WriteLine(ex);
    }
}

- Hassan Nazeer

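If a highlight spans multiple lines and starts or ends in the middle of a line, you may extract too much. Consider checking the QuadPoints instead of the Rect. For example, this question discusses a similar situation, albeit with a different library, and this answer discusses the details. - mkl

You can use PdfArray quadPoints = annotationDic.GetAsArray(PdfName.QUADPOINTS); to get the annotation's quad points. - Gangula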

Content provided by Stack Overflow.

- Bruno Lowagie · Accepted Answer

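Please take a look at Table 30 of ISO-32000-1 (also known as the PDF Reference). It is titled "Entries in a page object". Among these entries, you will find a key named Annots. Its value is:

(Optional) An array of annotation dictionaries that shall contain indirect references to all annotations associated with the page (see 12.5, "Annotations").

You won't find a key named Highlight, so it is only normal that the array returned is null when you use this line: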

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);

You need to get the annotations the way you did before:

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);

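Now you need to loop over this array and look for annotations whose Subtype equals Highlight. This type of annotation is listed in Table 169 of ISO-32000-1, titled "Annotation types".

In other words, your assumption that the page dictionary contains an entry with the key Highlight was wrong, and if you read the whole specification you will discover another wrong assumption: you wrongly assume that the highlighted text is stored in the Contents entry of the annotation. This reveals a lack of understanding of the nature of annotations versus page content.

The text you are looking for is stored in the page's content stream. The content stream of a page is independent of the page's annotations. Hence, to get the highlighted text, you need to get the coordinates stored in the Highlight annotation (in its QuadPoints array) and use these coordinates to parse the text that is present in the page content at those coordinates.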
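
As a rough sketch of that last step, the helper below splits a flat QuadPoints array into groups of eight numbers (four corner points) and computes one axis-aligned bounding box per quadrilateral. `QuadPointHelper` and `ToBoundingBoxes` are hypothetical names used for illustration and are not part of iTextSharp. The sketch assumes the highlighted text is not rotated, and it takes the minimum and maximum of the four corners so that it does not depend on the order in which a PDF producer writes them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of iTextSharp): turns QuadPoints into bounding boxes.
public static class QuadPointHelper
{
    // Returns one {llx, lly, urx, ury} box for each group of eight QuadPoints numbers.
    public static List<float[]> ToBoundingBoxes(IList<float> quadPoints)
    {
        if (quadPoints == null)
            throw new ArgumentNullException(nameof(quadPoints));
        if (quadPoints.Count % 8 != 0)
            throw new ArgumentException("QuadPoints must contain a multiple of 8 numbers.", nameof(quadPoints));

        var boxes = new List<float[]>();
        for (int i = 0; i < quadPoints.Count; i += 8)
        {
            float minX = float.MaxValue, minY = float.MaxValue;
            float maxX = float.MinValue, maxY = float.MinValue;

            // Each quadrilateral is four (x, y) pairs; corner order is ignored on purpose.
            for (int j = 0; j < 8; j += 2)
            {
                float x = quadPoints[i + j];
                float y = quadPoints[i + j + 1];
                minX = Math.Min(minX, x);
                maxX = Math.Max(maxX, x);
                minY = Math.Min(minY, y);
                maxY = Math.Max(maxY, y);
            }

            boxes.Add(new[] { minX, minY, maxX, maxY });
        }
        return boxes;
    }
}
```

Each resulting box could then be turned into an iTextSharp Rectangle and passed to a RegionTextRenderFilter, one per highlighted line, rather than using the annotation's single Rect. The input numbers can be read from the QuadPoints PdfArray with GetAsNumber(k).FloatValue.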