在Python Tesseract OCR中提高准确性

3
我正在使用pytesseractopenCV在Python的简单Django应用程序中,从图像文件中提取孟加拉语文本。我有一个表单,让你上传图片,并在单击提交按钮时通过jQuery的ajax调用将其发送到服务器端,以从图像中提取文本以实现OCR(光学字符识别)。

模板部分:

 <div style="text-align: center;">
 <div id="result" class="text-center"></div>
    <form enctype="multipart/form-data" id="ocrForm" action="{% url 'process_image' %}" method="post"> <!-- Do not forget to add: enctype="multipart/form-data" -->
        {% csrf_token %}
        {{ form }}
        <button type="submit" class="btn btn-success">OCRzed</button>
    </form>

    <br><br><hr>
    <div id="content" style="width: 50%; margin: 0 auto;">
        
    </div>
</div>


<script type="text/javascript">




 $(document).ready(function(){ 
        function submitFile(){
            var fd = new FormData();
            fd.append('file', getFile())
            $("#result").html('<span class="wait">Please wait....</span>');

            $('#content').html('');
            $.ajax({
                url: "{% url 'process_image' %}",
                type: "POST",
                data: fd,
                processData: false,
                contentType: false,
                success: function(data){
                    // console.log(data.content);

            $("#result").html('');

                    if(data.content){
                        $('#content').html(
                            "<p>" + data.content + "</p>"
                        )
                    }  
                }
            })
        }

        function getFile(){
            var fp = $("#file_id")
            var item = fp[0].files
            return item[0]
        }

        // Submit the file for OCRization
        $("#ocrForm").on('submit', function(event){
            event.preventDefault();
            submitFile()
        })
    });






</script>

urls.py文件包含:

from django.urls import path, re_path
from .views import *

urlpatterns = [
 path('process_image', OcrView.process_image, name='process_image') ,
]

视图部分:

from django.contrib.auth.models import User
from django.shortcuts  import render, redirect, get_object_or_404
from .forms import NewTopicForm
from .models import Board, Topic, Post
from django.shortcuts import render
from django.http import HttpResponse
from django.http import Http404
    
from django.http import JsonResponse
from django.views.generic import FormView
    
from django.views.decorators.csrf import csrf_exempt
import json
import cv2
import numpy as np
    
import pytesseract    # ======= > Add
try:
     from PIL import Image
except:
        import Image

def ocr(request):
    return render(request, 'ocr.html')
    #    {'board': board,'form':form})    

# get grayscale image
def get_grayscale(image):
         return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# noise removal
def remove_noise(image):
         return cv2.medianBlur(image,5)
 
#thresholding
def thresholding(image):
         return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

#dilation
def dilate(image):
         kernel = np.ones((5,5),np.uint8)
         return cv2.dilate(image, kernel, iterations = 1)
    
#erosion
def erode(image):
       kernel = np.ones((5,5),np.uint8)
       return cv2.erode(image, kernel, iterations = 1)

#opening - erosion followed by dilation
def opening(image):
        kernel = np.ones((5,5),np.uint8)
        return cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)

#canny edge detection
def canny(image):
        return cv2.Canny(image, 100, 200)

#skew correction
def deskew(image):
       coords = np.column_stack(np.where(image > 0))
       angle = cv2.minAreaRect(coords)[-1]
       if angle < -45:
         angle = -(90 + angle)
       else:
         angle = -angle
       (h, w) = image.shape[:2]
       center = (w // 2, h // 2)
       M = cv2.getRotationMatrix2D(center, angle, 1.0)
       rotated = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
       return rotated

#template matching
def match_template(image, template):
       return cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
 
class OcrView(FormView):
    form_class = UploadForm
    template_name = 'ocr.html'
    success_url = '/'

    
    @csrf_exempt
    def process_image(request):
        if request.method == 'POST':
          response_data = {}
          upload = request.FILES['file']
        
        filestr = request.FILES['file'].read()
        #convert string data to numpy array
        npimg = np.fromstring(filestr, np.uint8)
        image = cv2.imdecode(npimg, cv2.IMREAD_UNCHANGED)

        # image=Image.open(upload)
        gray = get_grayscale(image)
        thresh = thresholding(gray)
        opening1 = opening(gray)
        canny1 = canny(gray)
       
        pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
        # content = pytesseract.image_to_string(Image.open(upload), lang = 'ben')

        # content = pytesseract.image_to_string( image, lang = 'ben')

        content = pytesseract.image_to_string( image, lang = 'eng+ben')

        #   data_ben = process_image("test_ben.png", "ben")
        response_data['content'] = content

        return JsonResponse(response_data)

我在下面附上了一个示例图像,当我将其作为输入文件时,从中提取的文本精度不够令人满意。输入图像如下: Input file 我附上了提取文本的屏幕截图,其中错误单词用红色下划线标出。请注意,空格和缩进没有被保留。提取文本的屏幕截图如下: enter image description here 在上面的代码片段中,我使用以下代码行进行了图像处理:
gray = get_grayscale(image)
thresh = thresholding(gray)
opening1 = opening(gray)
canny1 = canny(gray)

接下来,我将已处理的图像提供给Tesseract:

content = pytesseract.image_to_string( image, lang = 'eng+ben')

但是我困惑的地方在于,在处理前或处理后,我都没有保存过这个图像。因此,当我使用上面的代码时,我不确定是供应给 Tesserect 引擎的是处理过的图像还是未经处理的图像。

Q1) 在处理完图像后,我需要保存它然后再提供给 Tesserect 引擎吗?如果是的话,如何操作?

Q2) 我需要采取哪些其他步骤来提高准确性?

NB:即使您不熟悉孟加拉语,我认为这不会有任何问题,因为您可以查看红色下划线的单词并进行比较。

编辑:

TL;DR: 您只需查看 view.pyurls.py 文件中的代码,并排除模板代码以便更容易理解。


你找到解决办法了吗? - Gary Chen
@XueQing,还没有。 - Istiaque Ahmed
1个回答

1

问题1:不需要保存图像。图像已存储在变量image中。

问题2:您实际上没有在应用于图像后处理函数(即变量canny1)上进行OCR。下面的代码将依次对image执行处理步骤,然后对存储在canny1中的后处理图像应用OCR。

gray = get_grayscale(image)
thresh = thresholding(gray)
opening1 = opening(thresh )
canny1 = canny(opening1 )

content = pytesseract.image_to_string( canny1 , lang = 'eng+ben')

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接