Single-letter speech recognition with SFSpeechRecognizer?

Question


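I'm writing a spelling app. I've been using SFSpeechRecognizer, but it handles single letters poorly; my guess is that it is looking for spoken phrases. I've searched on SFSpeechRecognizer for a while and found nothing about getting it to recognize single letters better. As a workaround, I've had to build a list of what SFSpeechRecognizer outputs when each letter is spoken and validate against that list. Is there a setting in SFSpeechRecognizer that makes it handle individually spoken letters better?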

- Will

Or is there another framework better suited to single-letter recognition? - Will

2 Answers


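Although this thread is old, I got some good results that I can share with anyone passing by.

The "trick" I use is to make each letter correspond to a "word", or something close to one: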

recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
guard let recognitionRequest = recognitionRequest else { fatalError("Unable to create a SFSpeechAudioBufferRecognitionRequest object") }
recognitionRequest.shouldReportPartialResults = true

// Associate a letter to a word or an onomatopoeia
var letters = ["Hey": "A",
               "Bee": "B",
               "See": "C",
               "Dee": "D",
               "He": "E",
               "Eff": "F",
               "Gee": "G",
               "Atch": "H"]

// This tells the speech recognition to focus on those words
recognitionRequest.contextualStrings = Array(letters.keys)

Then, when audio arrives in the recognitionTask, we look in the dictionary to find which letter the recognized word is associated with.

recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error  in
    var isFinal = false
    
    if let result = result {
        isFinal = result.isFinal
        
        let bestTranscription = result.bestTranscription
  
        // extract the confidence the recognizer has in this word 
        let confidence = bestTranscription.segments.isEmpty ? -1 : bestTranscription.segments[0].confidence
        
        print("Best \(result.bestTranscription.formattedString) - Confidence: \(confidence)")
        
        // Only keep results with some confidence 
        if confidence > 0 {
            
            // If the transcription matches one of our keys (ignoring case), retrieve the letter
            let detected = bestTranscription.formattedString.lowercased()
            if let match = letters.first(where: { $0.key.lowercased() == detected }) {
                print("Letter: \(match.value)")
                
                // And stop recording afterwards 
                self.stopRecording()
            }
        }
    }

    if error != nil || isFinal {
        // The rest of the boilerplate from Apple's doc sample project... 
    }
}
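
The lookup step in the handler above can be pulled into a small helper that can be tested without the Speech framework. The following is a minimal sketch rather than part of the original answer; `spokenWords`, `lettersByLowercasedWord`, and `letter(forTranscription:)` are illustrative names, and the table simply mirrors the eight entries shown earlier.

```swift
import Foundation

// Hypothetical spoken-word-to-letter table, mirroring the one in the answer.
let spokenWords: [String: String] = [
    "Hey": "A", "Bee": "B", "See": "C", "Dee": "D",
    "He": "E", "Eff": "F", "Gee": "G", "Atch": "H",
]

// Lowercased copy of the table, built once so each lookup is a plain dictionary access.
let lettersByLowercasedWord: [String: String] = Dictionary(
    uniqueKeysWithValues: spokenWords.map { ($0.key.lowercased(), $0.value) }
)

/// Returns the letter for a transcription, ignoring case and surrounding whitespace,
/// or nil when the transcription is not one of the known words.
func letter(forTranscription transcription: String) -> String? {
    let normalized = transcription
        .trimmingCharacters(in: .whitespacesAndNewlines)
        .lowercased()
    return lettersByLowercasedWord[normalized]
}
```

Inside the recognition handler, this would be called with `result.bestTranscription.formattedString`; a nil return means the recognizer heard something outside the table.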

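Notes:

Setting shouldReportPartialResults to true is important; otherwise the recognizer waits quite a while before sending a result.
In my tests, once recognitionRequest.contextualStrings is set, the confidence tends to spike when one of those strings is recognized. You can probably raise the confidence threshold to 0.3 or 0.4.
Covering the whole alphabet may take a long time, because one word sometimes gets recognized as another. It took a lot of trial and error to get good results for just the first 8 letters (for example, I tried "age" for "H", but it kept being recognized as "Hey", i.e. "A").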
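
The confidence threshold mentioned in the notes could be applied with a small filter like the one below. This is a sketch under assumptions: `Candidate`, `minimumConfidence`, and `shouldAccept` are hypothetical names, and 0.3 is just the lower end of the range suggested above, not a tuned value.

```swift
/// Hypothetical stand-in for the two values read from the recognition result.
struct Candidate {
    let text: String       // e.g. bestTranscription.formattedString
    let confidence: Float  // e.g. the first segment's confidence
}

/// Assumed threshold, taken from the 0.3 to 0.4 range suggested in the notes.
let minimumConfidence: Float = 0.3

/// Returns true when the recognizer's confidence reaches the threshold and the
/// candidate text is one of the expected words (which must be stored lowercased).
func shouldAccept(_ candidate: Candidate,
                  expectedWords: Set<String>,
                  threshold: Float = minimumConfidence) -> Bool {
    guard candidate.confidence >= threshold else { return false }
    return expectedWords.contains(candidate.text.lowercased())
}

// Example: a zero-confidence partial guess is rejected, a confident match is accepted.
let expected: Set<String> = ["hey", "bee", "gee"]
let partialGuess = shouldAccept(Candidate(text: "Gee", confidence: 0.0), expectedWords: expected)
let confidentMatch = shouldAccept(Candidate(text: "Gee", confidence: 0.864), expectedWords: expected)
```

Here `partialGuess` is false and `confidentMatch` is true, so the early zero-confidence report is ignored until the recognizer commits.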

Some results:

Best Gee - Confidence: 0.0  
// ... after a while, half a second maybe ...  
Best Gee - Confidence: 0.864  
Found G

(Tested with Apple's sample project: https://developer.apple.com/documentation/speech/recognizing_speech_in_live_audio)

- Olympiloutre


- Ganpat · Accepted Answer

See this answer: https://dev59.com/nFgQ5IYBdhLWcg3wlk_P#42925643

Declare a String variable to store the recognized words.

Create a Timer when the audio session starts:
strWords = ""
var timer = Timer.scheduledTimer(timeInterval: 2, target: self, selector: #selector(didFinishTalk), userInfo: nil, repeats: false)
In the recognitionTask(with:) result handler, add the following:
if let result = result { strWords = result.bestTranscription.formattedString }
When the timer fires and didFinishTalk is called:
if strWords == "" {
    // Nothing recognized yet: restart the 2-second timer
    timer.invalidate()
    timer = Timer.scheduledTimer(timeInterval: 2, target: self, selector: #selector(didFinishTalk), userInfo: nil, repeats: false)
} else {
    // Do something with strWords
}
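
The timer steps above can be gathered into one small class. This is only a sketch of that idea: `SpeechPauseWatcher` and its methods are hypothetical names, and it uses the block-based `Timer.scheduledTimer(withTimeInterval:repeats:block:)` rather than a selector, which is a substitution, not what the answer shows.

```swift
import Foundation

/// Hypothetical helper wrapping the timer logic described in the steps above.
final class SpeechPauseWatcher {
    private(set) var words = ""   // latest transcription (strWords in the answer)
    private var timer: Timer?
    private let interval: TimeInterval
    private let onFinished: (String) -> Void

    init(interval: TimeInterval = 2, onFinished: @escaping (String) -> Void) {
        self.interval = interval
        self.onFinished = onFinished
    }

    /// Call when the audio session starts.
    func start() {
        words = ""
        scheduleTimer()
    }

    /// Call from the recognition handler with the latest transcription.
    func update(transcription: String) {
        words = transcription
    }

    /// Runs when the timer fires: keep waiting if nothing was heard,
    /// otherwise stop the timer and report the words.
    func timerFired() {
        if words.isEmpty {
            scheduleTimer()
        } else {
            timer?.invalidate()
            timer = nil
            onFinished(words)
        }
    }

    private func scheduleTimer() {
        timer?.invalidate()
        // The timer only fires on a thread with a running run loop (normally the main thread).
        timer = Timer.scheduledTimer(withTimeInterval: interval, repeats: false) { [weak self] _ in
            self?.timerFired()
        }
    }
}
```

In an app, `start()` would be called alongside the audio engine, `update(transcription:)` from the recognition handler, and the `onFinished` closure would receive the recognized text once the interval elapses with something in `words`.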