Splitting a speech audio file into words in Python

36

I feel like this is a fairly common problem, but I haven't yet found a suitable answer. I have many audio files of human speech that I would like to split into individual words; this could be done heuristically by looking at pauses in the waveform, but can anyone point me to a function/library in Python that does this automatically?


3
You are looking for SpeechRecognition; the library specifically provides an example of transcribing an audio file into text. Next time, try a Google search first :) - Akshat Mahajan
7
I'm hoping for a function that splits an audio file on words, not one that transcribes it. That capability may be implicit in transcription, but it isn't the same thing. I'm familiar with the SpeechRecognition package. - user3059201
3
In real speech there are no boundaries between words; you say "how are you" as a single chunk with no acoustic cues. If you want to break it into words, you need to transcribe it. - Nikolay Shmyrev
4
That's not entirely true. If you look at the waveform of any speech, it's fairly obvious where the words and the pauses are. - user3059201
2
For most spoken languages, the boundaries between lexical units are difficult to identify... One might expect the inter-word spaces used by many written languages... to correspond to pauses in their spoken version, but that is only true in very slow speech, when the speaker deliberately inserts those pauses. In normal speech, one typically finds many consecutive words with no pauses between them, and quite often the final sounds of one word blend smoothly into the initial sounds of the next. - Nikolay Shmyrev
5 Answers

46

An easier way to do this is with the pydub module. Its recently added silence utilities do all the heavy lifting, such as setting up the silence threshold, setting up the silence length, etc., and simplify the code significantly compared with the other methods mentioned.

Below is a demo implementation, inspired by the example found here.

Setup:

I had an audio file with the spoken English letters from A to Z in the file "a-z.wav". A sub-directory splitAudio was created in the current working directory. Upon executing the demo code, the file was split into 26 separate files, each audio file storing one syllable.

Observations: some of the syllables were cut off, possibly needing modification of the following parameters:
min_silence_len=500
silence_thresh=-16

You may want to tune these to your own requirements.

Demo code:

from pydub import AudioSegment
from pydub.silence import split_on_silence

sound_file = AudioSegment.from_wav("a-z.wav")
audio_chunks = split_on_silence(sound_file, 
    # must be silent for at least half a second
    min_silence_len=500,

    # consider it silent if quieter than -16 dBFS
    silence_thresh=-16
)

for i, chunk in enumerate(audio_chunks):

    out_file = ".//splitAudio//chunk{0}.wav".format(i)
    print "exporting", out_file
    chunk.export(out_file, format="wav")

Output:

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>> 
exporting .//splitAudio//chunk0.wav
exporting .//splitAudio//chunk1.wav
exporting .//splitAudio//chunk2.wav
exporting .//splitAudio//chunk3.wav
exporting .//splitAudio//chunk4.wav
exporting .//splitAudio//chunk5.wav
exporting .//splitAudio//chunk6.wav
exporting .//splitAudio//chunk7.wav
exporting .//splitAudio//chunk8.wav
exporting .//splitAudio//chunk9.wav
exporting .//splitAudio//chunk10.wav
exporting .//splitAudio//chunk11.wav
exporting .//splitAudio//chunk12.wav
exporting .//splitAudio//chunk13.wav
exporting .//splitAudio//chunk14.wav
exporting .//splitAudio//chunk15.wav
exporting .//splitAudio//chunk16.wav
exporting .//splitAudio//chunk17.wav
exporting .//splitAudio//chunk18.wav
exporting .//splitAudio//chunk19.wav
exporting .//splitAudio//chunk20.wav
exporting .//splitAudio//chunk21.wav
exporting .//splitAudio//chunk22.wav
exporting .//splitAudio//chunk23.wav
exporting .//splitAudio//chunk24.wav
exporting .//splitAudio//chunk25.wav
exporting .//splitAudio//chunk26.wav
>>> 
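If the default settings clip or merge letters, a common tweak is to express the threshold relative to the clip's own loudness rather than as an absolute value. A minimal sketch of that idea (the specific numbers are assumptions to tune, not part of the original answer), using pydub's dBFS property and the keep_silence parameter:

from pydub import AudioSegment
from pydub.silence import split_on_silence

sound_file = AudioSegment.from_wav("a-z.wav")

audio_chunks = split_on_silence(
    sound_file,
    min_silence_len=300,                   # require only 300 ms of silence to split
    silence_thresh=sound_file.dBFS - 16,   # 16 dB below the clip's average loudness
    keep_silence=100                       # keep 100 ms of padding around each chunk
)
print(len(audio_chunks), "chunks")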

Words need to have clear gaps between them for this approach to work. - pouya
Yes, I have been looking into this myself, but as "pouya" mentioned, pydub or pyAudioAnalysis will only work if there are large gaps between words, which will not be the case in any practical situation! The problem also exists in the opposite direction: some words may get split into syllables if the speaker is not a native speaker and takes time to pronounce certain words. - Deepak Agarwal

4

Use IBM STT. With timestamps=true you will get the word breaks along with the times at which the system detects them to have been spoken.

There are a lot of other cool features, like word_alternatives_threshold to get other possible words and word_confidence to get the confidence with which the system predicts each word. Set word_alternatives_threshold to between 0.1 and 0.01 to get a real idea.

This needs a sign-on, after which you can use the generated username and password.

IBM STT is already part of the speechrecognition module mentioned above, but to get the word timestamps you need to modify the function.

An extracted and modified form looks like this:

import base64
import json
from urllib.parse import urlencode            # Python 3; on Python 2 use urllib / urllib2 equivalents
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

import speech_recognition as sr

# IBM_USERNAME / IBM_PASSWORD are the credentials generated for your IBM STT instance
def extracted_from_sr_recognize_ibm(audio_data, username=IBM_USERNAME, password=IBM_PASSWORD, language="en-US", show_all=False, timestamps=False,
                                    word_confidence=False, word_alternatives_threshold=0.1):
    assert isinstance(username, str), "``username`` must be a string"
    assert isinstance(password, str), "``password`` must be a string"

    flac_data = audio_data.get_flac_data(
        convert_rate=None if audio_data.sample_rate >= 16000 else 16000,  # audio samples should be at least 16 kHz
        convert_width=None if audio_data.sample_width >= 2 else 2  # audio samples should be at least 16-bit
    )
    url = "https://stream-fra.watsonplatform.net/speech-to-text/api/v1/recognize?{}".format(urlencode({
        "profanity_filter": "false",
        "continuous": "true",
        "model": "{}_BroadbandModel".format(language),
        "timestamps": "{}".format(str(timestamps).lower()),
        "word_confidence": "{}".format(str(word_confidence).lower()),
        "word_alternatives_threshold": "{}".format(word_alternatives_threshold)
    }))
    request = Request(url, data=flac_data, headers={
        "Content-Type": "audio/x-flac",
        "X-Watson-Learning-Opt-Out": "true",  # prevent requests from being logged, for improved privacy
    })
    authorization_value = base64.standard_b64encode("{}:{}".format(username, password).encode("utf-8")).decode("utf-8")
    request.add_header("Authorization", "Basic {}".format(authorization_value))

    try:
        response = urlopen(request, timeout=None)
    except HTTPError as e:
        raise sr.RequestError("recognition request failed: {}".format(e.reason))
    except URLError as e:
        raise sr.RequestError("recognition connection failed: {}".format(e.reason))
    response_text = response.read().decode("utf-8")
    result = json.loads(response_text)

    # return results
    if show_all: return result
    if "results" not in result or len(result["results"]) < 1 or "alternatives" not in result["results"][0]:
        raise Exception("Unknown Value Exception")

    transcription = []
    for utterance in result["results"]:
        if "alternatives" not in utterance:
            raise Exception("Unknown Value Exception. No Alternatives returned")
        for hypothesis in utterance["alternatives"]:
            if "transcript" in hypothesis:
                transcription.append(hypothesis["transcript"])
    return "\n".join(transcription)
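The function above still only returns the joined transcript. To actually use the word timings, call it with show_all=True and timestamps=True and read the timestamps field of each alternative in the raw JSON. A rough sketch of that post-processing (the field layout follows IBM's documented response format; audio_data is a speech_recognition.AudioData instance as in the function above):

result = extracted_from_sr_recognize_ibm(audio_data, show_all=True, timestamps=True)

word_times = []
for utterance in result.get("results", []):
    for hypothesis in utterance.get("alternatives", []):
        # each entry is [word, start_seconds, end_seconds]
        for word, start, end in hypothesis.get("timestamps", []):
            word_times.append((word, start, end))

for word, start, end in word_times:
    print("{:<15s} {:6.2f} - {:6.2f}".format(word, start, end))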

3
You could look at Audiolab. It provides a decent API to convert voice samples into numpy arrays; the Audiolab module uses the libsndfile C library to do the heavy lifting. You can then parse the arrays for lower values to find the pauses.
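As a rough illustration of the "parse the array for lower values" idea, here is a minimal sketch; the file name, threshold, and minimum pause length are assumptions to tune, and any reader that yields a numpy array (Audiolab, scipy.io.wavfile, soundfile, ...) will do:

import numpy as np
from scipy.io import wavfile  # stand-in for audiolab; any numpy-array reader works

rate, samples = wavfile.read("speech.wav")   # hypothetical mono input file
samples = samples.astype(np.float32)
peak = np.abs(samples).max()
if peak > 0:
    samples /= peak                          # normalise to [-1, 1]

threshold = 0.02                             # "low value" cutoff, tune per recording
min_pause = int(0.2 * rate)                  # ignore pauses shorter than 200 ms

quiet = np.abs(samples) < threshold
# find runs of consecutive quiet samples
edges = np.flatnonzero(np.diff(np.concatenate(([0], quiet.view(np.int8), [0]))))
runs = edges.reshape(-1, 2)
pauses = [(start, end) for start, end in runs if end - start >= min_pause]
print("candidate pauses (sample index ranges):", pauses)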

2

pyAudioAnalysis can segment an audio file if the words are clearly separated (which is rarely the case in natural speech). The package is relatively easy to use:


python pyAudioAnalysis/pyAudioAnalysis/audioAnalysis.py silenceRemoval -i SPEECH_AUDIO_FILE_TO_SPLIT.mp3 --smoothing 1.0 --weight 0.3

More details can be found on my blog.
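The same silence-removal step can also be driven from Python rather than the command line. A minimal sketch, assuming a recent pyAudioAnalysis release where the functions are named read_audio_file and silence_removal (older releases spell them readAudioFile and silenceRemoval, so adjust to your installed version):

from pyAudioAnalysis import audioBasicIO
from pyAudioAnalysis import audioSegmentation as aS

# function names assume a recent pyAudioAnalysis release
sampling_rate, signal = audioBasicIO.read_audio_file("SPEECH_AUDIO_FILE_TO_SPLIT.wav")
segments = aS.silence_removal(signal, sampling_rate,
                              0.020, 0.020,        # short-term window / step, in seconds
                              smooth_window=1.0,   # same smoothing as the CLI example above
                              weight=0.3,          # same weight as the CLI example above
                              plot=False)
# segments is a list of [start_seconds, end_seconds] pairs of non-silent regions
print(segments)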

1

Here is my variant of a function, which will probably be easier to modify for your needs:

from scipy.io.wavfile import write as write_wav
import numpy as np
import librosa

def zero_runs(a):
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return ranges

def split_in_parts(audio_path, out_dir):
    # Some constants
    min_length_for_silence = 0.01 # seconds
    percentage_for_silence = 0.01 # eps value for silence
    required_length_of_chunk_in_seconds = 60 # Chunk will be around this value not exact
    sample_rate = 16000 # Set to None to use default

    # Load audio
    waveform, sampling_rate = librosa.load(audio_path, sr=sample_rate)

    # Create mask of silence
    eps = waveform.max() * percentage_for_silence
    silence_mask = (np.abs(waveform) < eps).astype(np.uint8)

    # Find where silence start and end
    runs = zero_runs(silence_mask)
    lengths = runs[:, 1] - runs[:, 0]

    # Left only large silence ranges
    min_length_for_silence = min_length_for_silence * sampling_rate
    large_runs = runs[lengths > min_length_for_silence]
    lengths = lengths[lengths > min_length_for_silence]

    # Mark only center of silence
    silence_mask[...] = 0
    for start, end in large_runs:
        center = (start + end) // 2
        silence_mask[center] = 1

    min_required_length = required_length_of_chunk_in_seconds * sampling_rate
    chunks = []
    prev_pos = 0
    for i in range(min_required_length, len(waveform), min_required_length):
        start = i
        end = i + min_required_length
        next_pos = start + silence_mask[start:end].argmax()
        part = waveform[prev_pos:next_pos].copy()
        prev_pos = next_pos
        if len(part) > 0:
            chunks.append(part)

    # Add last part of waveform
    part = waveform[prev_pos:].copy()
    chunks.append(part)
    print('Total chunks: {}'.format(len(chunks)))

    new_files = []
    for i, chunk in enumerate(chunks):
        out_file = out_dir + "chunk_{}.wav".format(i)
        print("exporting", out_file)
        write_wav(out_file, sampling_rate, chunk)
        new_files.append(out_file)

    return new_files
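
For example, calling it might look like this (the file name and output directory are placeholders; note that out_dir is simply prefixed to the chunk names, so it should end with a path separator and already exist):

new_files = split_in_parts("speech.wav", "./splitAudio/")  # hypothetical paths
print(new_files)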
