有人知道如何使用苹果的视觉框架进行实时文本识别吗?

4

我似乎找不到一种不依赖文档扫描仪(VNDocumentCameraViewController)、而是改用AVFoundation自行实现的方法。我正在尝试创建一个功能:用户点击一个按钮后即开始扫描文本,并将结果保存到某个文本视图中,而不需要用户再去点击相机快门、确认扫描、手动保存等等。

我已经成功使用对象检测,但无法使其用于文本识别。因此,是否有任何方法可以使用苹果的Vision框架进行实时文本识别?非常感谢任何帮助。


1
你尝试过这个吗?https://developer.apple.com/documentation/vision/recognizing_text_in_images - aiwiguna
1
只要你能从AVCaptureSession的输出中取得UIImage,就可以按处理普通UIImage的方式来编写识别代码。 - aiwiguna
所以..我的解决方案有效吗? :) - Pranav Kasetti
是的!非常感谢你! - notary
1
我制作了一个可以实现这个功能的应用程序。这是文章,这是代码库 - aheze
显示剩余2条评论
1个回答

5
为了提高性能,我更倾向于不将CMSampleBuffer转换为UIImage,而是使用以下代码创建AVCaptureVideoPreviewLayer以进行实时视频:
/// A UIView whose backing layer is an `AVCaptureVideoPreviewLayer`,
/// so live camera frames are rendered directly without manual drawing.
class CameraFeedView: UIView {
    // Typed handle to the backing layer; set once in init. IUO is safe
    // because `layerClass` guarantees the layer's concrete type.
    private var previewLayer: AVCaptureVideoPreviewLayer!
    
    // Make UIKit create the backing layer as a preview layer.
    override class var layerClass: AnyClass {
        AVCaptureVideoPreviewLayer.self
    }
    
    /// Creates the view and attaches it to a running (or soon-to-run) session.
    /// - Parameters:
    ///   - frame: Initial frame of the view.
    ///   - session: The capture session whose video feed should be displayed.
    ///   - videoOrientation: Orientation applied to the preview connection.
    init(frame: CGRect, session: AVCaptureSession, videoOrientation: AVCaptureVideoOrientation) {
        super.init(frame: frame)
        let preview = layer as? AVCaptureVideoPreviewLayer
        previewLayer = preview
        preview?.session = session
        preview?.videoGravity = .resizeAspect
        preview?.connection?.videoOrientation = videoOrientation
    }
    
    // Storyboard/XIB instantiation is intentionally unsupported.
    required init?(coder: NSCoder) {
        fatalError("init(coder:) has not been implemented")
    }
}

一旦您拥有了这个,您就可以使用 Vision 处理实时视频数据:

// NOTE(review): this class overrides `viewDidLoad` and references `view`,
// so it presumably should be declared as
// `class CameraViewController: UIViewController, AVCaptureVideoDataOutputSampleBufferDelegate`
// — confirm against the original (English) answer; as written it would not compile.
class CameraViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
  
  // Serial background queue on which sample buffers are delivered,
  // keeping Vision work off the main thread.
  private let videoDataOutputQueue = DispatchQueue(label: "CameraFeedDataOutput", qos: .userInitiated,
                                                   attributes: [], autoreleaseFrequency: .workItem)
  // Full-screen label used to display the most recent recognized string.
  private var drawingView: UILabel = {
    let view = UILabel(frame: UIScreen.main.bounds)
    view.font = UIFont.boldSystemFont(ofSize: 30.0)
    view.textColor = .red
    view.translatesAutoresizingMaskIntoConstraints = false
    return view
  }()
  // The configured capture session; nil until setupAVSession() succeeds.
  private var cameraFeedSession: AVCaptureSession?
  // Live preview view wrapping AVCaptureVideoPreviewLayer (see CameraFeedView).
  private var cameraFeedView: CameraFeedView!

  /// Builds the camera capture pipeline as soon as the view loads.
  override func viewDidLoad() {
    super.viewDidLoad()
    do {
      try setupAVSession()
    } catch {
      // Include the underlying error instead of swallowing it — a bare
      // "failed" message makes camera-permission/hardware issues undiagnosable.
      print("setup av session failed: \(error)")
    }
  }

  /// Discovers the back wide-angle camera, wires its input and a video data
  /// output into an `AVCaptureSession`, installs a live preview view, and
  /// starts the session.
  /// - Throws: `SetupError` when the camera cannot be found, its input cannot
  ///   be created, or the input cannot be added to the session.
  func setupAVSession() throws {
    // Failures during configuration. The original code's guard branches only
    // printed and fell through, which does not compile (a guard's else branch
    // must exit scope) — throwing also honors this function's `throws`.
    enum SetupError: Error {
      case cameraNotFound
      case cannotCreateInput
      case cannotAddInput
    }
    
    // Create device discovery session for a wide angle camera
    let wideAngle = AVCaptureDevice.DeviceType.builtInWideAngleCamera
    let discoverySession = AVCaptureDevice.DiscoverySession(deviceTypes: [wideAngle], mediaType: .video, position: .back)
    
    // Select a video device, make an input
    guard let videoDevice = discoverySession.devices.first else {
      print("Could not find a wide angle camera device.")
      throw SetupError.cameraNotFound
    }
    
    guard let deviceInput = try? AVCaptureDeviceInput(device: videoDevice) else {
      print("Could not create video device input.")
      throw SetupError.cannotCreateInput
    }
    
    let session = AVCaptureSession()
    session.beginConfiguration()
    // We prefer a 1080p video capture but if camera cannot provide it then fall back to highest possible quality
    if videoDevice.supportsSessionPreset(.hd1920x1080) {
      session.sessionPreset = .hd1920x1080
    } else {
      session.sessionPreset = .high
    }
    
    // Add a video input
    guard session.canAddInput(deviceInput) else {
      print("Could not add video device input to the session")
      throw SetupError.cannotAddInput
    }
    session.addInput(deviceInput)
    
    let dataOutput = AVCaptureVideoDataOutput()
    if session.canAddOutput(dataOutput) {
      session.addOutput(dataOutput)
      // Add a video data output; drop late frames so Vision never backs up.
      dataOutput.alwaysDiscardsLateVideoFrames = true
      dataOutput.videoSettings = [
        String(kCVPixelBufferPixelFormatTypeKey): Int(kCVPixelFormatType_420YpCbCr8BiPlanarFullRange)
      ]
      // Deliver frames on the dedicated background queue.
      dataOutput.setSampleBufferDelegate(self, queue: videoDataOutputQueue)
    } else {
      print("Could not add video data output to the session")
    }
    let captureConnection = dataOutput.connection(with: .video)
    captureConnection?.preferredVideoStabilizationMode = .standard
    captureConnection?.videoOrientation = .portrait
    // Always process the frames
    captureConnection?.isEnabled = true
    session.commitConfiguration()
    cameraFeedSession = session
    
    // Get the interface orientation from the window scene to set proper video
    // orientation on the capture connection.
    // NOTE(review): at viewDidLoad time `view.window` is typically still nil,
    // so this falls through to .portrait — confirm whether this should run
    // later (e.g. viewDidAppear) if landscape startup matters.
    let videoOrientation: AVCaptureVideoOrientation
    switch view.window?.windowScene?.interfaceOrientation {
      case .landscapeRight:
        videoOrientation = .landscapeRight
      default:
        videoOrientation = .portrait
    }
    
    // Create and setup video feed view
    cameraFeedView = CameraFeedView(frame: view.bounds, session: session, videoOrientation: videoOrientation)
    setupVideoOutputView(cameraFeedView)
    // NOTE(review): startRunning() blocks; Apple recommends calling it off the
    // main thread — kept here to preserve the answer's behavior.
    cameraFeedSession?.startRunning()
  }

一旦你设置好了AVCaptureSession,还需要实现两个关键部分:采样缓冲区的代理方法和Vision请求的处理函数。
  /// Sample-buffer delegate callback: runs Vision text recognition on every
  /// delivered camera frame (invoked on `videoDataOutputQueue`).
  func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    // Wrap the raw frame for Vision; no UIImage conversion is needed.
    // NOTE(review): orientation `.down` is what the answer used — presumably
    // matched to the portrait capture connection; verify on device.
    let handler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer, orientation: .down)
    let textRequest = VNRecognizeTextRequest(completionHandler: textDetectHandler)
    
    do {
      // Perform the text-detection request.
      try handler.perform([textRequest])
    } catch {
      print("Unable to perform the request: \(error).")
    }
  }
  
  /// Completion handler for `VNRecognizeTextRequest`: extracts the best
  /// candidate string from each observation and shows the first one.
  func textDetectHandler(request: VNRequest, error: Error?) {
    guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
    
    // Collect the top-ranked candidate string from every observation.
    var recognizedStrings: [String] = []
    for observation in observations {
      if let best = observation.topCandidates(1).first {
        recognizedStrings.append(best.string)
      }
    }
    
    // UI mutation must happen on the main queue; Vision calls back on the
    // queue the request was performed on.
    DispatchQueue.main.async {
      self.drawingView.text = recognizedStrings.first
    }
  }
}

注意,您可能希望处理每个recognizedStrings以选择置信度最高的那一个,但这只是一个概念证明。您还可以添加一个边界框,文档中有相关示例。

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接