Google Speech Recognition API: timestamp for each word?

Question

Tags: audio, speech-recognition, speech-to-text, speech, google-speech-api

25

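Google's speech recognition API can transcribe an audio file (WAV, MP3, etc.) if you send a request to http://www.google.com/speech-api/v2/recognize?...

For example, I said "one two three four five" in a WAV file, and the Google API returned this: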

{
  u'alternative':
  [
    {u'transcript': u'12345'},
    {u'transcript': u'1 2 3 4 5'},
    {u'transcript': u'one two three four five'}
  ],
  u'final': True
}

Question: is it possible to get the time (in seconds) at which each word was spoken?

With my example:

['one', 0.23, 0.80], ['two', 1.03, 1.45], ['three', 1.79, 2.35], etc.

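That is, the word "one" was spoken between 00:00:00.23 and 00:00:00.80,
and the word "two" between 00:00:01.03 and 00:00:01.45 (in seconds).

PS: I am looking for an API that supports languages other than English, especially French.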

- Basj

Hm? As far as I know, the Google speech API does support French, doesn't it? - Ctx

@Ctx Yes, but it doesn't support per-word timestamps. - Basj

3 Answers

13

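Edit (2020): this is now possible, see the other answers.

This is not possible with the Google API.

If you want word timestamps, you can use other APIs, for example:

Vosk-API - a free offline speech recognition API (disclosure: I am the primary author of Vosk).

SpeechMatics - a SaaS speech recognition API

IBM's speech recognition API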
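
As a rough illustration of what per-word timestamps from such an API can look like, here is a minimal sketch for the Vosk case. It assumes the recognizer was set up with word output enabled (`SetWords(True)` on a `KaldiRecognizer`) and that its `Result()` string is JSON with a `result` list whose entries carry `word`, `start`, `end` and `conf`. The helper name `words_with_times` and the sample JSON are made up for illustration; the times are simply the ones from the question.

```python
import json


def words_with_times(result_json):
    """Return [[word, start, end], ...] from a Vosk-style result JSON string."""
    data = json.loads(result_json)
    # "result" is missing when nothing was recognized or word output is off,
    # so fall back to an empty list instead of raising KeyError.
    return [[w["word"], w["start"], w["end"]] for w in data.get("result", [])]


# Hypothetical sample in the assumed shape (not real recognizer output).
sample = (
    '{"result": ['
    '{"conf": 1.0, "start": 0.23, "end": 0.80, "word": "one"}, '
    '{"conf": 0.98, "start": 1.03, "end": 1.45, "word": "two"}], '
    '"text": "one two"}'
)

print(words_with_times(sample))
```

The list-of-lists shape matches what the question asks for, so the same structure can be produced whichever engine ends up being used.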

- Nikolay Shmyrev

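Thanks! Have you tried these three APIs? Are they as good as Google's? Every day I'm amazed at how powerful Google's speech recognition is. (I dictate my text messages out loud to my Android phone, and it makes almost no mistakes!) - Basj

They should be comparable in terms of accuracy. - Nikolay Shmyrev

Unfortunately, none of them seems to support French. - Basj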

6

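We tried the IBM BlueMix speech API for this purpose, but found its accuracy to be very poor. Even for a single, clearly pronounced word such as "spoon", the result could be "moon", "room", "doom", "bloom" or "whom". That was with the keyword set configured to ("spoon") and a low acceptance probability. As the original post mentions, IBM does provide start and end times for each word (which Google apparently does not), but the accuracy is too low to be usable. - Hephaestus

@Hephaestus, which vendor did you find to be the most accurate? Google? - Andy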

9

Yes, this is entirely possible. All you need to do is:

Set enable_word_time_offsets=True in the config

config = types.RecognitionConfig(
        ....
        enable_word_time_offsets=True)

Then, for each word in the alternative, you can print its start time and end time like this:

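# `result` is the response object returned by the recognition call.
for res in result.results:
    # The first alternative is the most likely transcription.
    alternative = res.alternatives[0]
    print(u'Transcript: {}'.format(alternative.transcript))
    print('Confidence: {}'.format(alternative.confidence))

    for word_info in alternative.words:
        word = word_info.word
        start_time = word_info.start_time
        end_time = word_info.end_time
        # Offsets are durations with `seconds` and `nanos` fields here;
        # some newer client versions may return datetime.timedelta instead.
        print('Word: {}, start_time: {}, end_time: {}'.format(
            word,
            start_time.seconds + start_time.nanos * 1e-9,
            end_time.seconds + end_time.nanos * 1e-9))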

This prints output in the following format:

Transcript:  Do you want me to give you a call back?
Confidence: 0.949534416199
Word: Do, start_time: 1466.0, end_time: 1466.6
Word: you, start_time: 1466.6, end_time: 1466.7
Word: want, start_time: 1466.7, end_time: 1466.8
Word: me, start_time: 1466.8, end_time: 1466.9
Word: to, start_time: 1466.9, end_time: 1467.1
Word: give, start_time: 1467.1, end_time: 1467.2
Word: you, start_time: 1467.2, end_time: 1467.3
Word: a, start_time: 1467.3, end_time: 1467.4
Word: call, start_time: 1467.4, end_time: 1467.6
Word: back?, start_time: 1467.6, end_time: 1467.7
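
To get from that output to the `[word, start, end]` lists the question asks for, something along these lines might work. This is only a sketch: `to_seconds` and `word_timings` are hypothetical helper names, and the `SimpleNamespace` objects are stand-ins for the API's word objects so the snippet runs without the Google client. It assumes the offsets are either Duration-like objects with `seconds` and `nanos` (as in the code above) or `datetime.timedelta` values, which some newer client versions appear to return.

```python
from datetime import timedelta
from types import SimpleNamespace


def to_seconds(offset):
    """Convert a word offset to float seconds."""
    if isinstance(offset, timedelta):
        return offset.total_seconds()
    # Assumed Duration-like object exposing `seconds` and `nanos`.
    return offset.seconds + offset.nanos * 1e-9


def word_timings(words):
    """Build [[word, start, end], ...] with times rounded to 2 decimals."""
    return [
        [w.word, round(to_seconds(w.start_time), 2), round(to_seconds(w.end_time), 2)]
        for w in words
    ]


# Stand-in data (not real API output), using the times from the question.
demo = [
    SimpleNamespace(
        word="one",
        start_time=timedelta(seconds=0.23),
        end_time=timedelta(seconds=0.80),
    ),
    SimpleNamespace(
        word="two",
        start_time=SimpleNamespace(seconds=1, nanos=30_000_000),
        end_time=SimpleNamespace(seconds=1, nanos=450_000_000),
    ),
]

print(word_timings(demo))
```

With a real response you would pass `alternative.words` instead of `demo`.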

Source: https://cloud.google.com/speech-to-text/docs/async-time-offsets

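The linked page describes asynchronous time offsets in the Google Cloud Speech-to-Text API. With this feature you can get the start and end timestamps of each word in an audio file. It supports multiple languages, including English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish and Turkish.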

- Ishmeet Kaur

- deweydb · Accepted Answer

I believe the other answers are now outdated. This is now possible with the Google Cloud Speech API: https://cloud.google.com/speech/docs/async-time-offsets