The last method called in PDFBox's PDFTextStripper class that still has text with position information (before it is reduced to plain text) is:
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
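This is the place to intercept, because the method receives pre-processed and, in particular, already sorted TextPosition objects (provided sorting was requested in the first place).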
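(Actually, I would have preferred to intercept in the calling method writeLine, which, judging by the names of its parameters and local variables, has all the TextPosition instances of a line and calls writeString once per word. Unfortunately, the PDFBox developers have declared that method private... well, maybe this will change by the final 2.0.0 release... hint, hint. Update: unfortunately, it did not change in the release... sigh)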
Furthermore, it helps to use a helper class that wraps sequences of TextPosition instances to make the code clearer.
With this in mind, one can search for the variables like this:
List<TextPositionSequence> findSubwords(PDDocument document, int page, final String searchTerm) throws IOException
{
    final List<TextPositionSequence> hits = new ArrayList<TextPositionSequence>();
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            // Wrap the chunk so it can be searched like a string
            TextPositionSequence word = new TextPositionSequence(textPositions);
            String string = word.toString();
            int fromIndex = 0;
            int index;
            // Collect every occurrence of the search term in this chunk
            while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
            {
                hits.add(word.subSequence(index, index + searchTerm.length()));
                fromIndex = index + 1;
            }
            super.writeString(text, textPositions);
        }
    };
    // Sorting is required for writeString to receive the characters in reading order
    stripper.setSortByPosition(true);
    stripper.setStartPage(page);
    stripper.setEndPage(page);
    // The extracted text itself is discarded; only the collected hits matter
    stripper.getText(document);
    return hits;
}
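The loop in writeString above resumes the scan one character after each hit, so overlapping occurrences are reported as well. As a minimal sketch, here is just that loop on a plain String, without PDFBox. The class and method names are made up for illustration, and the guard against an empty search term is an addition that the original method does not have.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: isolates the indexOf loop used in writeString.
final class SubstringSearchSketch
{
    // Returns the start index of every occurrence of searchTerm in text,
    // including overlapping ones, because the scan resumes at index + 1.
    static List<Integer> findAll(String text, String searchTerm)
    {
        List<Integer> starts = new ArrayList<>();
        // Added guard (assumption, not in the original): with an empty term
        // indexOf keeps succeeding and the loop below would not end.
        if (searchTerm.isEmpty())
            return starts;
        int fromIndex = 0;
        int index;
        while ((index = text.indexOf(searchTerm, fromIndex)) > -1)
        {
            starts.add(index);
            fromIndex = index + 1;
        }
        return starts;
    }
}
```

For example, SubstringSearchSketch.findAll("aaa", "aa") yields the start indices 0 and 1.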
This uses the following helper class:
import java.util.List;

import org.apache.pdfbox.text.TextPosition;

// A window [start, end) over a list of TextPosition objects, usable as a CharSequence
public class TextPositionSequence implements CharSequence
{
    public TextPositionSequence(List<TextPosition> textPositions)
    {
        this(textPositions, 0, textPositions.size());
    }

    public TextPositionSequence(List<TextPosition> textPositions, int start, int end)
    {
        this.textPositions = textPositions;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length()
    {
        return end - start;
    }

    @Override
    public char charAt(int index)
    {
        // Only the first character of each TextPosition is used
        TextPosition textPosition = textPositionAt(index);
        String text = textPosition.getUnicode();
        return text.charAt(0);
    }

    @Override
    public TextPositionSequence subSequence(int start, int end)
    {
        // The arguments are relative to this window, so shift both by this.start
        return new TextPositionSequence(textPositions, this.start + start, this.start + end);
    }

    @Override
    public String toString()
    {
        StringBuilder builder = new StringBuilder(length());
        for (int i = 0; i < length(); i++)
        {
            builder.append(charAt(i));
        }
        return builder.toString();
    }

    public TextPosition textPositionAt(int index)
    {
        return textPositions.get(start + index);
    }

    public float getX()
    {
        return textPositions.get(start).getXDirAdj();
    }

    public float getY()
    {
        return textPositions.get(start).getYDirAdj();
    }

    public float getWidth()
    {
        if (end == start)
            return 0;
        // Right edge of the last character minus left edge of the first
        TextPosition first = textPositions.get(start);
        TextPosition last = textPositions.get(end - 1);
        return last.getWidthDirAdj() + last.getXDirAdj() - first.getXDirAdj();
    }

    final List<TextPosition> textPositions;
    final int start, end;
}
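The index arithmetic in subSequence and getWidth is easy to get wrong, so here is a stripped-down sketch of the same windowing idea that runs without PDFBox. Glyph is a hypothetical stand-in for TextPosition that carries only a character, an x coordinate and a width, and GlyphWindow mirrors the structure of TextPositionSequence. Neither name exists in PDFBox.

```java
import java.util.List;

// Hypothetical stand-in for TextPosition: one character with its x position and width.
final class Glyph
{
    final char ch;
    final float x;
    final float width;

    Glyph(char ch, float x, float width)
    {
        this.ch = ch;
        this.x = x;
        this.width = width;
    }
}

// Mirrors TextPositionSequence: a [start, end) window over a shared list.
final class GlyphWindow implements CharSequence
{
    private final List<Glyph> glyphs;
    private final int start, end;

    GlyphWindow(List<Glyph> glyphs, int start, int end)
    {
        this.glyphs = glyphs;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length()
    {
        return end - start;
    }

    @Override
    public char charAt(int index)
    {
        return glyphs.get(start + index).ch;
    }

    @Override
    public GlyphWindow subSequence(int from, int to)
    {
        // from and to are relative to this window, so both are shifted by start
        return new GlyphWindow(glyphs, start + from, start + to);
    }

    @Override
    public String toString()
    {
        StringBuilder builder = new StringBuilder(length());
        for (int i = 0; i < length(); i++)
            builder.append(charAt(i));
        return builder.toString();
    }

    float getX()
    {
        return glyphs.get(start).x;
    }

    float getWidth()
    {
        if (end == start)
            return 0;
        // Right edge of the last glyph minus left edge of the first
        Glyph first = glyphs.get(start);
        Glyph last = glyphs.get(end - 1);
        return last.x + last.width - first.x;
    }
}
```

With six glyphs spelling abcdef, each 10 units wide and placed at x = 0, 10, 20 and so on, the window obtained by subSequence(2, 5) followed by subSequence(1, 3) covers "de", starts at x = 30 and is 20 units wide. Nested windows keep referring to the original list, which is why a hit found inside a chunk still yields page coordinates.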
To merely output the positions, widths, final letters, and final letter positions of the hits, you can then use this:
void printSubwords(PDDocument document, String searchTerm) throws IOException
{
    System.out.printf("* Looking for '%s'\n", searchTerm);
    for (int page = 1; page <= document.getNumberOfPages(); page++)
    {
        List<TextPositionSequence> hits = findSubwords(document, page, searchTerm);
        for (TextPositionSequence hit : hits)
        {
            TextPosition lastPosition = hit.textPositionAt(hit.length() - 1);
            System.out.printf(" Page %s at %s, %s with width %s and last letter '%s' at %s, %s\n",
                    page, hit.getX(), hit.getY(), hit.getWidth(),
                    lastPosition.getUnicode(), lastPosition.getXDirAdj(), lastPosition.getYDirAdj());
        }
    }
}
For testing, I created a small test file using MS Word.
The output of this test
@Test
public void testVariables() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("Variables.pdf");
            PDDocument document = PDDocument.load(resource);    )
    {
        System.out.println("\nVariables.pdf\n-------------\n");
        printSubwords(document, "${var1}");
        printSubwords(document, "${var 2}");
    }
}
is:
Variables.pdf
-------------
* Looking for '${var1}'
Page 1 at 164.39648, 158.06 with width 34.67856 and last letter '}' at 193.22, 158.06
Page 1 at 188.75699, 174.13995 with width 34.58806 and last letter '}' at 217.49, 174.13995
Page 1 at 167.49583, 190.21997 with width 38.000168 and last letter '}' at 196.22, 190.21997
Page 1 at 176.67009, 206.18 with width 35.667114 and last letter '}' at 205.49, 206.18
* Looking for '${var 2}'
Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81
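I was a bit surprised that ${var 2} was found whenever it sits on a single line. After all, the PDFBox code had led me to assume that the writeString method I overrode only receives single words; it looks as if it receives longer parts of a line than mere words...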
If you need other data from the grouped TextPosition instances, simply enhance TextPositionSequence accordingly.
Does the writeString snippet output the characters in the correct order? If not, did you initialize the PDFTextStripper with setSortByPosition(true)? - mkl