Can't get results from the Google Speech-to-Text API when streaming audio from the web

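I want to stream audio from the web and convert it to text using the Python google-cloud-speech API. I have integrated it into my Django Channels code. For the front end I copied this code directly, and the back end is the code shown below. The problem is that I get no exception or error, but I also get no results from the Google API. What I have tried:
  • I set a debug point inside the loop of the process function, but control never enters the loop.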

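  • I went through the Java code here and tried to understand it. I set it up locally and debugged it. One thing I understood is that in the Java code the method onWebSocketBinary receives an integer array, which we send from the front end like this: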

  socket.send(Int16Array.from(floatSamples.map(function (n) {return n * MAX_INT;})));
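  • In Java they convert it to a ByteString and then send it to Google. In Django, I set debug points and noticed that the data arrives as a binary string, so I felt I did not need to do anything to it. I still tried a few ways of converting it to an integer array, but that did not work because Google expects the bytes themselves (you can see the attempts in the commented-out code below).

  • I read this sample code from Google, and I am doing the same thing. I don't understand what I am doing wrong here.

  • Django code: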

    import json
    
    from channels.generic.websocket import WebsocketConsumer
    
    # Imports the Google Cloud client library
    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types
    
    # Instantiates a client
    client = speech.SpeechClient()
    language_code = "en-US"
    streaming_config = None
    
    
    class SpeechToTextConsumer(WebsocketConsumer):
        def connect(self):
            self.accept()
    
        def disconnect(self, close_code):
            pass
    
        def process(self, streaming_recognize_response: types.StreamingRecognitionResult):
            for response in streaming_recognize_response:
                if not response.results:
                    continue
                result = response.results[0]
                self.send(text_data=json.dumps(result))
    
        def receive(self, text_data=None, bytes_data=None):
            global streaming_config
            if text_data:
                data = json.loads(text_data)
                rate = data["sampleRate"]
                config = types.RecognitionConfig(
                    encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
                    sample_rate_hertz=rate,
                    language_code=language_code,
                )
                streaming_config = types.StreamingRecognitionConfig(
                    config=config, interim_results=True, single_utterance=False
                )
                types.StreamingRecognizeRequest(streaming_config=streaming_config)
                self.send(text_data=json.dumps({"message": "processing..."}))
            if bytes_data:
                # bytes_data = bytes_data[math.floor(len(bytes_data) / 2) :]
                # bytes_data = bytes_data.lstrip(b"\x00")
                # bytes_data = int.from_bytes(bytes_data, "little")
                stream = [bytes_data]
                requests = (
                    types.StreamingRecognizeRequest(audio_content=chunk) for chunk in stream
                )
                responses = client.streaming_recognize(streaming_config, requests)
                self.process(responses)
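
One thing worth checking in the consumer above: `receive` starts a new `streaming_recognize` call for every binary WebSocket frame, and each call gets a request stream holding only that one chunk. That may be why the loop in `process` never yields a result. A commonly used alternative is to push incoming chunks onto a queue and feed one long-lived generator to a single `streaming_recognize` call running in a background thread. The sketch below shows only the queue-and-generator part with the standard library. `AudioChunkBuffer` and its method names are hypothetical, and the wiring described in the trailing comments is an assumption that has not been tested against the Speech API.

```python
import queue
import threading


class AudioChunkBuffer:
    """Thread-safe hand-off: one thread calls put(), another iterates chunks()."""

    _CLOSED = None  # sentinel that marks the end of the audio stream

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, chunk: bytes) -> None:
        """Called for every binary WebSocket frame."""
        self._queue.put(chunk)

    def close(self) -> None:
        """Called when the socket disconnects; ends the generator."""
        self._queue.put(self._CLOSED)

    def chunks(self):
        """Yield chunks as they arrive, blocking until the next one is ready."""
        while True:
            chunk = self._queue.get()
            if chunk is self._CLOSED:
                return
            yield chunk


if __name__ == "__main__":
    buf = AudioChunkBuffer()
    received = []
    # Stand-in for the thread that would run streaming_recognize.
    worker = threading.Thread(target=lambda: received.extend(buf.chunks()))
    worker.start()
    buf.put(b"\x01\x00")
    buf.put(b"\x02\x00")
    buf.close()
    worker.join()
    print(received)

# Hypothetical wiring in the consumer (assumption, untested):
#   on the config message: create buf, then start a thread that runs
#       requests = (types.StreamingRecognizeRequest(audio_content=c)
#                   for c in buf.chunks())
#       self.process(client.streaming_recognize(streaming_config, requests))
#   on each binary frame:  buf.put(bytes_data)
#   in disconnect():       buf.close()
```

With this shape the recognizer sees one continuous request stream for the whole connection instead of many one-chunk streams.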
    

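    It is important that you also post the server-side code that receives the WebSocket request ... providing a minimal, complete server and client example will encourage more people to help ... I have built something like this, and it is not too hard. - Scott Stensland
    @ScottStensland Thanks for your help. The back-end code basically works fine, no problems there ... I just need to know how to stream the audio. - Lokesh Sanapalli
    How do you test the back-end code? Does it work with your voice stream? - Ali Asgari
    @Lokesh Sanapalli, did you find a solution? - Mišel Ademi
    @MišelAdemi No, still waiting. - Lokesh Sanapalli
    1 Answer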

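    I ran into a similar problem while building a virtual AI assistant, and I think I can offer some help. I am not an expert, but I found a way to use Google's speech-to-text engine. I used Python's speech_recognition library (install it with pip install SpeechRecognition) and imported it as sr. From there you can call r.recognize_google(audio) on a Recognizer instance r to use Google's API. You do not need an account because the library already ships with a default key, and it is easy to set up and integrate with, for example, Django. Here is a very useful link that I recommend, and here is a link to the documentation. Below is a useful program that takes an audio file and transcribes it with every available speech recognition service. You can use whichever service you like: Sphinx runs offline, while Google's API needs no sign-up because the library already includes a key.
    #!/usr/bin/env python3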
    
    import speech_recognition as sr
    
    # obtain path to "english.wav" in the same folder as this script
    from os import path
    AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "english.wav")
    # AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "french.aiff")
    # AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "chinese.flac")
    
    # use the audio file as the audio source
    r = sr.Recognizer()
    with sr.AudioFile(AUDIO_FILE) as source:
        audio = r.record(source)  # read the entire audio file
    
    # recognize speech using Sphinx
    try:
        print("Sphinx thinks you said " + r.recognize_sphinx(audio))
    except sr.UnknownValueError:
        print("Sphinx could not understand audio")
    except sr.RequestError as e:
        print("Sphinx error; {0}".format(e))
    
    # recognize speech using Google Speech Recognition
    try:
        # for testing purposes, we're just using the default API key
        # to use another API key, use `r.recognize_google(audio, key="GOOGLE_SPEECH_RECOGNITION_API_KEY")`
        # instead of `r.recognize_google(audio)`
        print("Google Speech Recognition thinks you said " + r.recognize_google(audio))
    except sr.UnknownValueError:
        print("Google Speech Recognition could not understand audio")
    except sr.RequestError as e:
        print("Could not request results from Google Speech Recognition service; {0}".format(e))
    
    # recognize speech using Google Cloud Speech
    GOOGLE_CLOUD_SPEECH_CREDENTIALS = r"""INSERT THE CONTENTS OF THE GOOGLE CLOUD SPEECH JSON CREDENTIALS FILE HERE"""
    try:
        print("Google Cloud Speech thinks you said " + r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS))
    except sr.UnknownValueError:
        print("Google Cloud Speech could not understand audio")
    except sr.RequestError as e:
        print("Could not request results from Google Cloud Speech service; {0}".format(e))
    
    # recognize speech using Wit.ai
    WIT_AI_KEY = "INSERT WIT.AI API KEY HERE"  # Wit.ai keys are 32-character uppercase alphanumeric strings
    try:
        print("Wit.ai thinks you said " + r.recognize_wit(audio, key=WIT_AI_KEY))
    except sr.UnknownValueError:
        print("Wit.ai could not understand audio")
    except sr.RequestError as e:
        print("Could not request results from Wit.ai service; {0}".format(e))
    
    # recognize speech using Microsoft Azure Speech
    AZURE_SPEECH_KEY = "INSERT AZURE SPEECH API KEY HERE"  # Microsoft Speech API keys 32-character lowercase hexadecimal strings
    try:
        print("Microsoft Azure Speech thinks you said " + r.recognize_azure(audio, key=AZURE_SPEECH_KEY))
    except sr.UnknownValueError:
        print("Microsoft Azure Speech could not understand audio")
    except sr.RequestError as e:
        print("Could not request results from Microsoft Azure Speech service; {0}".format(e))
    
    # recognize speech using Microsoft Bing Voice Recognition
    BING_KEY = "INSERT BING API KEY HERE"  # Microsoft Bing Voice Recognition API keys 32-character lowercase hexadecimal strings
    try:
        print("Microsoft Bing Voice Recognition thinks you said " + r.recognize_bing(audio, key=BING_KEY))
    except sr.UnknownValueError:
        print("Microsoft Bing Voice Recognition could not understand audio")
    except sr.RequestError as e:
        print("Could not request results from Microsoft Bing Voice Recognition service; {0}".format(e))
    
    # recognize speech using Houndify
    HOUNDIFY_CLIENT_ID = "INSERT HOUNDIFY CLIENT ID HERE"  # Houndify client IDs are Base64-encoded strings
    HOUNDIFY_CLIENT_KEY = "INSERT HOUNDIFY CLIENT KEY HERE"  # Houndify client keys are Base64-encoded strings
    try:
        print("Houndify thinks you said " + r.recognize_houndify(audio, client_id=HOUNDIFY_CLIENT_ID, client_key=HOUNDIFY_CLIENT_KEY))
    except sr.UnknownValueError:
        print("Houndify could not understand audio")
    except sr.RequestError as e:
        print("Could not request results from Houndify service; {0}".format(e))
    
    # recognize speech using IBM Speech to Text
    IBM_USERNAME = "INSERT IBM SPEECH TO TEXT USERNAME HERE"  # IBM Speech to Text usernames are strings of the form XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
    IBM_PASSWORD = "INSERT IBM SPEECH TO TEXT PASSWORD HERE"  # IBM Speech to Text passwords are mixed-case alphanumeric strings
    try:
        print("IBM Speech to Text thinks you said " + r.recognize_ibm(audio, username=IBM_USERNAME, password=IBM_PASSWORD))
    except sr.UnknownValueError:
        print("IBM Speech to Text could not understand audio")
    except sr.RequestError as e:
        print("Could not request results from IBM Speech to Text service; {0}".format(e))
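
The program above reads its audio from a file, whereas the question deals with raw 16-bit PCM chunks arriving over a WebSocket. One possible bridge, sketched here with only the standard library, is to wrap the buffered PCM bytes in an in-memory WAV container. The function name `pcm16_to_wav` is hypothetical, and the sketch assumes little-endian 16-bit samples (what the browser's Int16Array typically produces) and a single channel by default.

```python
import io
import wave


def pcm16_to_wav(pcm: bytes, sample_rate: int, channels: int = 1) -> io.BytesIO:
    """Wrap raw little-endian 16-bit PCM in a WAV container held in memory."""
    bytes_per_frame = 2 * channels  # 2 bytes per sample, one sample per channel
    if len(pcm) % bytes_per_frame != 0:
        raise ValueError("PCM length is not a whole number of 16-bit frames")

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)           # header sizes are patched on close
    buffer.seek(0)                     # rewind so a reader starts at the header
    return buffer


if __name__ == "__main__":
    # Two samples: silence, then the maximum positive 16-bit value.
    wav_file = pcm16_to_wav(b"\x00\x00\xff\x7f", sample_rate=16000)
    with wave.open(wav_file, "rb") as reader:
        print(reader.getframerate(), reader.getnframes())
```

The returned buffer could then stand in for AUDIO_FILE in `sr.AudioFile(...)`, assuming the installed SpeechRecognition version accepts file-like objects. Note that this transcribes a finished recording rather than a live stream.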
    

    Hope this helps in some way!
