How do I highlight text or a word in a PDF file using iTextSharp?

3

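I need to search for a word in an existing PDF file, highlight the matching text or word, and then save the PDF file.

I have an idea: using PdfAnnotation.CreateMarkup we could find the position of the text and then add a background color to it... but I don't know how to implement it :(

Please help me out.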

4 Answers

4
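This is one of those "sounds easy but is actually really complicated" things. See Mark's posts here and here. Ultimately you will probably be pointed to LocationTextExtractionStrategy. Good luck! And if you do find out how to do this, please post it here; several people are wondering exactly what you are wondering!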

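Has anyone found a solution as of 2016? I am dealing with the same problem. I used LocationTextExtractionStrategy and captured the coordinates of the glyphs, but I cannot highlight text that spans multiple lines. The solution here (https://www.tallcomponents.com/pdfcontrols/highlight-text) creates a new annotation whenever the Y coordinate differs, which is not the desired result. - CSR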
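
On the multi-line problem raised in the comment above: in the PDF format a single highlight annotation can carry several quadrilaterals in its QuadPoints array (eight numbers each), so one annotation can cover text that wraps onto another line. The following is a minimal C# sketch, assuming iTextSharp 5.x and assuming you have already collected one rectangle per line of the match. The class name, method name and parameters are hypothetical, and the corner order of each quad is the one commonly used with iTextSharp rather than something this thread confirms.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class MultiLineHighlight
{
    // lineRects: one rectangle per line of the matched text, in page coordinates
    // (hypothetical input, e.g. produced by a text extraction strategy).
    public static void AddHighlight(string sourceFile, string destinationFile,
                                    int page, IList<Rectangle> lineRects, string note)
    {
        if (lineRects.Count == 0) return;

        // Build eight numbers per line, plus a bounding box enclosing every line.
        var quads = new List<float>();
        float left = float.MaxValue, bottom = float.MaxValue;
        float right = float.MinValue, top = float.MinValue;
        foreach (Rectangle r in lineRects)
        {
            quads.AddRange(new[] { r.Left, r.Bottom, r.Right, r.Bottom,
                                   r.Left, r.Top,    r.Right, r.Top });
            left = Math.Min(left, r.Left);
            bottom = Math.Min(bottom, r.Bottom);
            right = Math.Max(right, r.Right);
            top = Math.Max(top, r.Top);
        }

        PdfReader reader = new PdfReader(sourceFile);
        using (FileStream fs = new FileStream(destinationFile, FileMode.Create))
        {
            PdfStamper stamper = new PdfStamper(reader, fs);

            // One annotation, many quads: the highlight follows the text across lines.
            PdfAnnotation highlight = PdfAnnotation.CreateMarkup(
                stamper.Writer, new Rectangle(left, bottom, right, top), note,
                PdfAnnotation.MARKUP_HIGHLIGHT, quads.ToArray());
            highlight.Color = BaseColor.YELLOW;
            stamper.AddAnnotation(highlight, page);

            stamper.Close();   // close the stamper before the reader
        }
        reader.Close();
    }
}
```

Whether a given viewer renders every quad of such an annotation is worth checking with the viewers you care about.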

4

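I have found how to do this. In case someone needs to get words or sentences with their locations (coordinates) from a PDF document, you can find it in this sample project here. I used VB.NET 2010. Remember to add a reference to the iTextSharp DLL in the project.

I added my own TextExtraction strategy class, based on the class LocationTextExtractionStrategy. I focused on the TextChunks, because they already have these coordinates.

There are some known limitations, for example:

  • Multi-line searches (phrases) are not possible; only characters, words or one-line sentences are allowed.
  • Rotated text is not handled.
  • I have not tested it on PDFs with landscape page orientation, but I think some modifications may be needed there.
  • If you need to draw these highlights/rectangles over a watermark, you will need to add or modify some code, but only code in the form; it is unrelated to the text/location extraction process.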
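
For readers who cannot open the sample project, the general idea of a strategy that records coordinates can be sketched as follows. This is not the class from the project above: it is a minimal C# illustration, assuming an iTextSharp 5.x release whose TextRenderInfo exposes GetAscentLine and GetDescentLine, and the names ChunkRect and RectCollectingStrategy are made up. Like the project, it assumes unrotated text.

```csharp
using System.Collections.Generic;
using iTextSharp.text;
using iTextSharp.text.pdf.parser;

// Hypothetical holder: a piece of rendered text and the rectangle it occupies.
public class ChunkRect
{
    public string Text;
    public Rectangle Rect;
}

public class RectCollectingStrategy : LocationTextExtractionStrategy
{
    public readonly List<ChunkRect> Chunks = new List<ChunkRect>();

    // Called by the parser once for every text-showing operation on the page.
    public override void RenderText(TextRenderInfo renderInfo)
    {
        base.RenderText(renderInfo);   // keep the normal text extraction working

        // Lower-left corner from the descent line, upper-right from the ascent line.
        Vector bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        Vector topRight = renderInfo.GetAscentLine().GetEndPoint();

        Chunks.Add(new ChunkRect
        {
            Text = renderInfo.GetText(),
            Rect = new Rectangle(bottomLeft[Vector.I1], bottomLeft[Vector.I2],
                                 topRight[Vector.I1], topRight[Vector.I2])
        });
    }
}
```

The list is filled as a side effect of calling PdfTextExtractor.GetTextFromPage(reader, page, strategy) with an instance of this class. A search term can be split across several chunks, so matching still has to join the chunks of a line first.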

1
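@Jcis, I actually managed to find a workaround for handling multiple searches, using your example as a starting point. I used your project as a reference in a C# project and altered what it does. Instead of just highlighting the search term, I draw a white rectangle around it and then place a form field using the rectangle's coordinates. I also had to switch the content byte to GetOverContent so that the search text is blocked out completely. What I actually did was create a string array of search terms and then, with a for loop, create as many different text fields as I need.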
        // Requires: using System; using System.IO; using System.Text.RegularExpressions;
        // htmlToPdf and finalhtmlToPdf are the two PDF file paths, defined elsewhere.
        Test.Form1 formBuilder = new Test.Form1();

        string[] fields = new string[] { "%AccountNumber%", "%MeterNumber%", "%EmailFieldHolder%", "%AddressFieldHolder%", "%EmptyFieldHolder%", "%CityStateZipFieldHolder%", "%emptyFieldHolder1%", "%emptyFieldHolder2%", "%emptyFieldHolder3%", "%emptyFieldHolder4%", "%emptyFieldHolder5%", "%emptyFieldHolder6%", "%emptyFieldHolder7%", "%emptyFieldHolder8%", "%SiteNameFieldHolder%", "%SiteNameFieldHolderWithExtraSpace%" };

        // Ping-pong between the two files: each pass reads `source`, writes `target`,
        // deletes the consumed source and then swaps the roles for the next field.
        string source = htmlToPdf;
        string target = finalhtmlToPdf;
        foreach (string field in fields)
        {
            // "%EmailFieldHolder%" -> "EmailFieldHolder" -> "Email"
            string placeholder = field.Split('%')[1];
            string fieldName = Regex.Split(placeholder, "Field")[0];
            formBuilder.PDFTextGetter(field, StringComparison.CurrentCultureIgnoreCase, source, target, fieldName);
            File.Delete(source);

            string swap = source;
            source = target;
            target = swap;
        }
        // After the loop, `source` is the path of the most recently written file.

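Using your example, this passes the PDFTextGetter function back and forth between the two files until I have the finished product. It works really well, and it would not have been possible without your initial project, so thank you very much. I also altered your VB code to do the text field mapping, like so: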

            For Each rect As iTextSharp.text.Rectangle In MatchesFound
                'Add a slightly enlarged rectangle around the match to the current path
                cb.Rectangle(rect.Left, rect.Bottom + 1, rect.Width, rect.Height + 4)
                'Place a text field on the match; Fields is a counter that keeps the field names unique
                Dim field As New TextField(stamper.Writer, rect, FieldName & Fields)
                Dim form = stamper.AcroFields
                Dim fieldKeys = form.Fields.Keys
                stamper.AddAnnotation(field.GetTextField(), page)
                Fields += 1
            Next

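I just wanted to share how I got my work done with your project as the backbone. It even increments the field names the way I need. I also had to add a new parameter to your function, but that is not worth listing here. Thanks again for this great head start.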


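Great! I'm glad it worked for you. My code was made for people who want to put a rectangle over words to hide them; in that particular case it can be done by just calling Fill() without setting any color. I changed it to the highlighting example so that I could post my code in this thread. - Jcis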

1

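Thanks, Jcis!

After hours of research and thinking I found your solution, and it helped me solve my problem.

There were two small bugs.

First: the stamper must be closed before the reader, otherwise an exception is thrown.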

Public Sub PDFTextGetter(ByVal pSearch As String, ByVal SC As StringComparison, ByVal SourceFile As String, ByVal DestinationFile As String)
    Dim stamper As iTextSharp.text.pdf.PdfStamper = Nothing
    Dim cb As iTextSharp.text.pdf.PdfContentByte = Nothing

    Me.Cursor = Cursors.WaitCursor
    If File.Exists(SourceFile) Then
        Dim pReader As New PdfReader(SourceFile)

        stamper = New iTextSharp.text.pdf.PdfStamper(pReader, New System.IO.FileStream(DestinationFile, FileMode.Create))
        PB.Value = 0 : PB.Maximum = pReader.NumberOfPages
        For page As Integer = 1 To pReader.NumberOfPages
            Dim strategy As myLocationTextExtractionStrategy = New myLocationTextExtractionStrategy

            'cb = stamper.GetUnderContent(page)
            cb = stamper.GetOverContent(page)
            Dim state As New PdfGState()
            state.FillOpacity = 0.3F
            cb.SetGState(state)

            'Send some data contained in PdfContentByte. For me the first is always zero and the second 100, but I'm not sure whether this could change in some cases
            strategy.UndercontentCharacterSpacing = cb.CharacterSpacing
            strategy.UndercontentHorizontalScaling = cb.HorizontalScaling

            'It's not really needed to get the text back, but we have to call this line ALWAYS, 
            'because it triggers the process that will get all chunks from PDF into our strategy Object
            Dim currentText As String = PdfTextExtractor.GetTextFromPage(pReader, page, strategy)

            'The real getter process starts in the following line
            Dim MatchesFound As List(Of iTextSharp.text.Rectangle) = strategy.GetTextLocations(pSearch, SC)

            'Set the fill color of the shapes. No border is used because it would make the rectangle bigger,
            'but a thin border could be a solution if the current rectangle is not big enough to cover all the text it should cover
            cb.SetColorFill(BaseColor.PINK)

            'MatchesFound contains the rectangle of every match, so do whatever you want with it. This highlights each one in YELLOW
            '(the fill color is set again between SaveState and RestoreState below, so the PINK set above is never used for a fill):

            For Each rect As iTextSharp.text.Rectangle In MatchesFound
                ' cb.Rectangle(rect.Left, rect.Bottom, rect.Width, rect.Height)
                cb.SaveState()
                cb.SetColorFill(BaseColor.YELLOW)
                cb.Rectangle(rect.Left, rect.Bottom, rect.Width, rect.Height)
                cb.Fill()
                cb.RestoreState()
            Next
            'cb.Fill()

            PB.Value = PB.Value + 1
        Next
        stamper.Close()
        pReader.Close()
    End If
    Me.Cursor = Cursors.Default

End Sub

Second: the solution you provided does not work when the search text is in the last line of the extracted text.
    Public Function GetTextLocations(ByVal pSearchString As String, ByVal pStrComp As System.StringComparison) As List(Of iTextSharp.text.Rectangle)
        Dim FoundMatches As New List(Of iTextSharp.text.Rectangle)
        Dim sb As New StringBuilder()
        Dim ThisLineChunks As List(Of TextChunk) = New List(Of TextChunk)
        Dim bStart As Boolean, bEnd As Boolean
        Dim FirstChunk As TextChunk = Nothing, LastChunk As TextChunk = Nothing
        Dim sTextInUsedChunks As String = vbNullString

        'Iterate one step past the last chunk: on the extra pass chunk is Nothing,
        'which flushes the final line so that it is searched as well, including its last chunk.
        For j As Integer = 0 To locationalResult.Count
            Dim chunk As TextChunk = If(j < locationalResult.Count, locationalResult(j), Nothing)

            If ThisLineChunks.Count > 0 AndAlso (chunk Is Nothing OrElse Not chunk.SameLine(ThisLineChunks.Last)) Then
                If sb.ToString.IndexOf(pSearchString, pStrComp) > -1 Then
                    Dim sLine As String = sb.ToString

                    'Check how many times the Search String is present in this line:
                    Dim iCount As Integer = 0
                    Dim lPos As Integer
                    lPos = sLine.IndexOf(pSearchString, 0, pStrComp)
                    Do While lPos > -1
                        iCount += 1
                        If lPos + pSearchString.Length > sLine.Length Then Exit Do Else lPos = lPos + pSearchString.Length
                        lPos = sLine.IndexOf(pSearchString, lPos, pStrComp)
                    Loop

                    'Process each match found in this Text line:
                    Dim curPos As Integer = 0
                    For i As Integer = 1 To iCount
                        Dim sCurrentText As String, iFromChar As Integer, iToChar As Integer

                        iFromChar = sLine.IndexOf(pSearchString, curPos, pStrComp)
                        curPos = iFromChar
                        iToChar = iFromChar + pSearchString.Length - 1
                        sCurrentText = vbNullString
                        sTextInUsedChunks = vbNullString
                        FirstChunk = Nothing
                        LastChunk = Nothing

                        'Get first and last Chunks corresponding to this match found, from all Chunks in this line
                        For Each chk As TextChunk In ThisLineChunks
                            sCurrentText = sCurrentText & chk.text

                            'Check if we entered the part where we had found a matching String then get this Chunk (First Chunk)
                            If Not bStart AndAlso sCurrentText.Length - 1 >= iFromChar Then
                                FirstChunk = chk
                                bStart = True
                            End If

                            'Keep getting Text from Chunks while we are in the part where the matching String had been found
                            If bStart And Not bEnd Then
                                sTextInUsedChunks = sTextInUsedChunks & chk.text
                            End If

                            'If we get out the matching String part then get this Chunk (last Chunk)
                            If Not bEnd AndAlso sCurrentText.Length - 1 >= iToChar Then
                                LastChunk = chk
                                bEnd = True
                            End If

                            'If we already have first and last Chunks enclosing the Text where our String pSearchString has been found 
                            'then it's time to get the rectangle, GetRectangleFromText Function below this Function, there we extract the pSearchString locations
                            If bStart And bEnd Then
                                FoundMatches.Add(GetRectangleFromText(FirstChunk, LastChunk, pSearchString, sTextInUsedChunks, iFromChar, iToChar, pStrComp))
                                curPos = curPos + pSearchString.Length
                                bStart = False : bEnd = False
                                Exit For
                            End If
                        Next
                    Next
                End If
                sb.Clear()
                ThisLineChunks.Clear()
            End If

            'Nothing left to add on the extra flushing pass
            If chunk Is Nothing Then Exit For
            ThisLineChunks.Add(chunk)
            sb.Append(chunk.text)
        Next

        Return FoundMatches
    End Function
