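Is there a way to use Tika from Python to parse only the first page of a PDF, or to extract metadata from only the first page? Right now, when I pass in a PDF file, it parses every page. I looked at this question: "Is it possible to extract text by page for word/pdf files using Apache Tika?" However, that answer mostly covers Java, which I am not familiar with. I am hoping for a Python solution. Thanks!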
from tika import parser
# Start the Tika server first with: java -jar tika-server1.18.jar
parsedPDF = parser.from_file('C:\\path\\to\\dir\\sample.pdf')
fulltext = parsedPDF['content']
metadata_dict = parsedPDF['metadata']
title = metadata_dict['title']
author = metadata_dict['Author']  # currently picks up every name across, say, 15 pages; I only want the names from the first page
pages = metadata_dict['xmpTPg:NPages']
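
One possible approach, sketched here rather than offered as a definitive answer: ask Tika for XHTML output by passing `xmlContent=True` to `parser.from_file`, then keep only the first page's text. This assumes Tika's PDF parser wraps each page in a `<div class="page">` element, which is worth checking against your own server's output. The helper names `split_pages` and `first_page_text` are made up for illustration. Note that Tika still parses the whole file; this only selects the first page's text afterwards.

```python
from html.parser import HTMLParser


class _PageSplitter(HTMLParser):
    """Collect the text found inside each <div class="page"> element."""

    def __init__(self):
        super().__init__()
        self.pages = []    # one text string per page, in document order
        self._depth = 0    # <div> nesting depth inside the current page; 0 means outside
        self._chunks = []  # text fragments of the page being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            # A nested <div> inside a page: track it so we find the right closing tag.
            self._depth += 1
        elif "page" in (dict(attrs).get("class") or "").split():
            # Assumption: every PDF page is emitted as <div class="page">.
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.pages.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


def split_pages(xhtml):
    """Return a list with the text of each page found in the XHTML string."""
    splitter = _PageSplitter()
    splitter.feed(xhtml)
    splitter.close()
    return splitter.pages


def first_page_text(pdf_path):
    """Hypothetical helper: text of the first page only, or '' if no page is found."""
    # Imported here so split_pages can be used without tika installed.
    from tika import parser

    parsed = parser.from_file(pdf_path, xmlContent=True)
    pages = split_pages(parsed["content"] or "")
    return pages[0] if pages else ""
```

With the Tika server running, `first_page_text('C:\\path\\to\\dir\\sample.pdf')` would return only the first page's text, which you could then search for author names. The `metadata` dictionary is left unchanged by this approach, since fields such as `Author` describe the document as a whole, not individual pages.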