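I am writing a script that converts scanned PDF files into lines of text to be entered into a database. I use re.findall with a list of regular expressions to pull specific values out of the string that tesseract extracts. When a regular expression does not find what I am trying to match, I want it to return "Error" so that I can see where the problem is. I have tried several if/else statements, but I can't seem to get them to notice the None value.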
from wand.image import Image as Img
import ghostscript
from PIL import Image
import pytesseract
import re
import os

def get_text_from_pdf(pendingpdf,pendingimg):
    with Img(filename=pendingpdf, resolution=300) as img:
        img.compression_quality = 99
        img.save(filename=pendingimg)
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
    extractedtext = pytesseract.image_to_string(Image.open(pendingimg))
    os.unlink(pendingimg)
    return extractedtext

def get_results(vendor,extracted_string,results):
    for v in vendor:
        pattern = re.compile(v)
        for match in re.findall(pattern,extracted_string):
            if type(match) is str:
                results.append(match)
            else:
                results.append("Error")
    return results

pendingpdf = r'J:\TBHscan07022019090315001.pdf'
pendingimg = 'Test1.jpg'
aggind = ["^(\w+)(?:.+)\n+3600",
          "Ticket: (nonsensewordstothrowerror)",
          "Ticket: \d+\s([0-9|/]+)",
          "Product: (\w+.+)\n",
          "Quantity: ([\d\.]+)",
          "Truck (\w+)"]
vendor = aggind
extracted_string = get_text_from_pdf(pendingpdf,pendingimg)
results = []
print(get_results(vendor,get_text_from_pdf(pendingpdf,pendingimg),results))
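try except block? - saha rudra
if else is more Pythonic, but the OP still needs to know when an exception is thrown. - Lucas Wieloch
A try except block is unnecessary; re.findall does not raise an exception when there is no match. - Alex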
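
To illustrate the last comment: when nothing matches, re.findall returns an empty list rather than None, so the inner for loop in the question never runs for that pattern and the else branch is never reached. A minimal sketch of one way to record "Error" instead is to check whether the list is empty before looping. The sample_text string and the shortened pattern list below are made up for illustration (they stand in for the tesseract output, which is not shown in the question), and get_results here builds its own list instead of taking one as an argument.

```python
import re

def get_results(patterns, extracted_string):
    """Collect matches per pattern, or "Error" when a pattern finds nothing."""
    results = []
    for p in patterns:
        matches = re.findall(p, extracted_string)
        if matches:
            # One or more matches: keep them all.
            results.extend(matches)
        else:
            # re.findall returned [] (not None), so flag this pattern.
            results.append("Error")
    return results

# Hypothetical stand-in for the OCR output.
sample_text = "Ticket: 12345 07/02/2019\nProduct: Gravel\nQuantity: 21.5\nTruck T42\n"

patterns = [
    r"Ticket: (nonsensewordstothrowerror)",
    r"Ticket: \d+\s([0-9/]+)",
    r"Quantity: ([\d.]+)",
    r"Truck (\w+)",
]

print(get_results(patterns, sample_text))
# ['Error', '07/02/2019', '21.5', 'T42']
```

Because an empty list is falsy, `if matches:` is enough; no try/except is needed. Raw strings (the r prefix) are used for the patterns so that backslashes such as \d reach the regex engine unchanged.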