iTextSharp: replace text in an existing PDF without losing formatting

Question


c# pdf itext

10

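I have been searching the internet for two weeks and found some interesting solutions, but none of them seems to give me an answer.

My goal:

I want to find text in a static PDF file and replace it with other text, while keeping the design of the content. Is it really that hard?

I found one way, but it loses all layout information: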

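    // Extracts plain text only: fonts, positions, and images are discarded,
    // so the result cannot be written back as a formatted PDF.
    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
        }

        // Replace once, after all pages have been collected.
        text.Replace(txt_SuchenNach.Text, txt_ErsetzenMit.Text);
        return text.ToString();
    }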

My second attempt works much better, but it requires form fields to be present so that I can modify the text inside them:

    string fileNameExisting = path;
    string fileNameNew = @"C:\TEST.pdf";

    using (FileStream existingFileStream = new FileStream(fileNameExisting, FileMode.Open))
    using (FileStream newFileStream = new FileStream(fileNameNew, FileMode.Create))
    {
        // Open the PDF
        PdfReader pdfReader = new PdfReader(existingFileStream);
        PdfStamper stamper = new PdfStamper(pdfReader, newFileStream);

        // Replace the search text in the value of every form field
        var form = stamper.AcroFields;
        var fieldKeys = form.Fields.Keys;
        foreach (string fieldKey in fieldKeys)
        {
            var value = pdfReader.AcroFields.GetField(fieldKey);
            form.SetField(fieldKey, value.Replace(txt_SuchenNach.Text, txt_ErsetzenMit.Text));
        }

        // Make the text fields non-editable (they then look like normal text)
        stamper.FormFlattening = true;

        stamper.Close();
        pdfReader.Close();
    }

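This keeps the formatting of the rest of the text and changes only the text I searched for. What I need is a solution for text that is not inside a form field.

Thanks for all your answers and your help.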

- Kevin Plaul

2

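"Is it really that hard?" Yes, in general it is. Do you know about font subsetting? What do you do if you insert a character that is not in the existing subset? You would need to find out which font was originally used (not always easy) and then have that font installed on your system. (And there are other problems besides this one; I see this is a duplicate question.) - Jongware

Hi Jongware, I know there is already a post similar to mine, but it has no code, only a "maybe", and an answer of "NO" is not a good answer. =) But thanks for your comment. I hate PDF - Kevin Plaul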

1

"No, it cannot be done" is a good answer. No matter how long you search the internet, you will not find a way to walk from the UK to the US. - Jongware

2 Answers

3

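I have worked on a similar requirement. These are the steps I used:

Step 1: Locate the source PDF file and decide on the destination file path.

Step 2: Read the source PDF file and find the location of the string that needs to be replaced.

Step 3: Replace the existing string with the new one.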

using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using PDFExtraction;    
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;

namespace PDFReplaceTextUsingItextSharp
{
    public partial class ExtractPdf : System.Web.UI.Page
    {
        static iTextSharp.text.pdf.PdfStamper stamper = null;
        protected void Page_Load(object sender, EventArgs e)
        {

        }

        protected void Replace_Click(object sender, EventArgs e)
        {
            string ReplacingVariable = txtReplace.Text; 
            string sourceFile = "Source File Path";
            string descFile = "Destination File Path";
            PdfReader pReader = new PdfReader(sourceFile);
            stamper = new iTextSharp.text.pdf.PdfStamper(pReader, new System.IO.FileStream(descFile, System.IO.FileMode.Create));
            PDFTextGetter("ExistingVariableinPDF", ReplacingVariable , StringComparison.CurrentCultureIgnoreCase, sourceFile, descFile);
            stamper.Close();
            pReader.Close();
        }


        /// <summary>
        /// Searches the PDF for the locations of a given string, covers each match with a white rectangle
        /// and draws the text from the replacingText parameter on top of it.
        /// </summary>
        /// <param name="pSearch">String to search for</param>
        /// <param name="replacingText">Replacement string</param>
        /// <param name="SC">String comparison mode (for example, case-insensitive)</param>
        /// <param name="SourceFile">Path of the source file</param>
        /// <param name="DestinationFile">Path of the destination file</param>
        public static void PDFTextGetter(string pSearch, string replacingText, StringComparison SC, string SourceFile, string DestinationFile)
        {
            try
            {
                iTextSharp.text.pdf.PdfContentByte cb = null;
                iTextSharp.text.pdf.PdfContentByte cb2 = null;
                iTextSharp.text.pdf.PdfWriter writer = null;
                iTextSharp.text.pdf.BaseFont bf = null;

                if (System.IO.File.Exists(SourceFile))
                {
                    PdfReader pReader = new PdfReader(SourceFile);


                    for (int page = 1; page <= pReader.NumberOfPages; page++)
                    {
                        myLocationTextExtractionStrategy strategy = new myLocationTextExtractionStrategy();
                        cb = stamper.GetOverContent(page);
                        cb2 = stamper.GetOverContent(page);

                        // Pass some state from the PdfContentByte to the strategy. In my tests the first value
                        // was always zero and the second 100, but I'm not sure whether this can change in some cases.
                        strategy.UndercontentCharacterSpacing = (int)cb.CharacterSpacing;
                        strategy.UndercontentHorizontalScaling = (int)cb.HorizontalScaling;

                        // The returned text is not really needed, but this call must ALWAYS be made,
                        // because it triggers the parsing that feeds all text chunks of the page into the strategy object.
                        string currentText = PdfTextExtractor.GetTextFromPage(pReader, page, strategy);

                        // The actual search starts on the following line
                        List<iTextSharp.text.Rectangle> MatchesFound = strategy.GetTextLocations(pSearch, SC);

                        // Set the fill color of the covering rectangles. No border is used because it would make the rectangle bigger,
                        // but a thin border could help if the rectangle turns out too small to cover all the text it should cover.
                        cb.SetColorFill(BaseColor.WHITE);

                        // MatchesFound contains the bounding rectangle of every match. Each one is painted over in white
                        // and the replacement text is drawn on top of it:

                        foreach (iTextSharp.text.Rectangle rect in MatchesFound)
                        {
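                            // Reset the fill to white for every match: the text drawing below switches the fill
                            // color to black, and cb and cb2 both come from GetOverContent for the same page.
                            cb.SetColorFill(BaseColor.WHITE);

                            // Paint over the match; the rectangle width is hard-coded to 60 points
                            cb.Rectangle(rect.Left, rect.Bottom, 60, rect.Height);
                            cb.Fill();

                            // Draw the replacement text at the position of the match, in 9 pt Helvetica Bold
                            cb2.SetColorFill(BaseColor.BLACK);
                            bf = BaseFont.CreateFont(BaseFont.HELVETICA_BOLD, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
                            cb2.SetFontAndSize(bf, 9);

                            cb2.BeginText();
                            cb2.ShowTextAligned(iTextSharp.text.Element.ALIGN_LEFT, replacingText, rect.Left, rect.Bottom, 0);
                            cb2.EndText();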
                        }

                    }

                    // Release the reader that was opened only for text extraction
                    pReader.Close();
                }

            }
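            catch (Exception ex)
            {
                // Do not swallow errors silently: log the failure and let the caller handle it
                System.Diagnostics.Trace.TraceError(ex.ToString());
                throw;
            }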

        }

    }
}

- Pradeep Kumar

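Where do you actually "replace"? In particular, where do you remove the original text, and where do you add the new text using the same style as the original? - mkl

cb = stamper.GetOverContent(page); cb2 = stamper.GetOverContent(page); Here cb gets the text content of the PDF page and cb2 gets the white background of the PDF page. First we search for the location of the existing string and store it in the "MatchesFound" variable, then we fill white over the existing string using cb.SetColorFill(BaseColor.WHITE). After that we loop over the MatchesFound object and draw the new string at the same position as the whitened-out string. I hope you get what I mean. - Pradeep Kumar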

1

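Filling white over the existing string is not "removing" it, because the original text can still be copied and pasted. As long as the PDF only needs to be printed that is fine, but if it has to be distributed electronically this can become a problem that is hard to solve. - mkl

Yes, for distribution this is not feasible. It is only workable when the PDF form is downloaded after the modification. - Pradeep Kumar

Do you have an update for itextpdf v7? PdfStamper no longer seems to exist in v7 :( - CularBytes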

4

Could you please add your custom derived class "myLocationTextExtractionStrategy"? What does it do? - Tech Yogesh


- Eugene · Accepted Answer

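The general problem is that a text object may use an embedded font in which only specific glyphs are assigned to specific letters. For example, if you have a text object containing the text "abcdef", the embedded font may contain glyphs for only these letters ("abcdef") and for no others. So if you replace "abcdef" with "xyz", the PDF will not display "xyz", because no glyphs are available to render those letters.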
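To make the subset problem concrete, here is a minimal sketch in plain C# that does not touch any PDF library. It assumes you already know which characters the embedded subset covers (passed in as a string). `SubsetCheck` and `MissingGlyphs` are hypothetical names used only for illustration; they are not part of iTextSharp.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SubsetCheck
{
    // Returns the distinct characters of `replacement` for which the
    // (assumed, already known) subset `subsetChars` has no glyph.
    public static List<char> MissingGlyphs(string subsetChars, string replacement)
    {
        // Characters the embedded subset can render
        var available = new HashSet<char>(subsetChars);

        // Keep every replacement character that is not covered, once each
        return replacement.Where(c => !available.Contains(c)).Distinct().ToList();
    }
}
```

For the example above, `SubsetCheck.MissingGlyphs("abcdef", "xyz")` returns the three characters x, y and z, so the replacement could not be rendered with the embedded subset, whereas a replacement such as "fade" would return an empty list.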

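So I suggest the following workflow:

- iterate through all text objects;
- create a new text object on top of the PDF page with the same properties (font, position, and so on) but with different text; this step may require you to have the same font installed that the original PDF uses, but you can check the installed fonts and use another font for the new text object. This way iTextSharp or another PDF tool will embed a new font object for the new text object;
- once the duplicate text object has been created, delete the original text object;
- process every text object using the workflow above;
- save the modified PDF document to a new file.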
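The "create a new text object with its own font" step could be sketched roughly as follows, assuming iTextSharp 5.x. The class name `OverlayTextSketch`, the method `DrawReplacement`, and all its parameters are placeholders for illustration: the font file path must point to a font you actually have, and the page number and coordinates would have to come from your own text location logic. The sketch only draws the new text with a freshly embedded font; it does not delete the original text object, which needs content stream editing and is not shown here.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class OverlayTextSketch
{
    // Draws `replacement` on the given page at (x, y) using a font that is
    // embedded into the output file, so its glyphs are guaranteed to be present.
    public static void DrawReplacement(string sourcePath, string targetPath, string fontPath,
                                       int page, float x, float y, float fontSize, string replacement)
    {
        PdfReader reader = new PdfReader(sourcePath);
        try
        {
            using (FileStream output = new FileStream(targetPath, FileMode.Create))
            {
                PdfStamper stamper = new PdfStamper(reader, output);

                // Load the font from a file (assumed path) and ask iTextSharp to embed it
                BaseFont font = BaseFont.CreateFont(fontPath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

                // Content drawn here is painted on top of the existing page content
                PdfContentByte canvas = stamper.GetOverContent(page);
                canvas.BeginText();
                canvas.SetFontAndSize(font, fontSize);
                canvas.ShowTextAligned(Element.ALIGN_LEFT, replacement, x, y, 0);
                canvas.EndText();

                // Closing the stamper writes the modified document to the output stream
                stamper.Close();
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

A call might look like `OverlayTextSketch.DrawReplacement("in.pdf", "out.pdf", @"C:\Windows\Fonts\arial.ttf", 1, 72f, 700f, 9f, "xyz")`, where every argument is an assumed example value. Because the original text stays in the content stream, it remains selectable and extractable unless it is removed separately.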