Detect duplicate MP3 files with different bitrates and/or different ID3 tags?

Question

python file mp3 duplicates id3

14

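How can I detect duplicate MP3 files that may be encoded at different bitrates (but are the same song) and whose ID3 tags may be incorrect? Preferably with Python.

I know I can compute an MD5 checksum of the file contents, but that won't work across different bitrates, and I don't know whether ID3 tags affect the resulting checksum. Should I re-encode MP3 files that have a different bitrate and then compute the checksum? What do you recommend?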

- Antonio Melé

Here is an easy-to-use Python library that does exactly this: https://github.com/worldveil/dejavu - lollercoaster

10 Answers

4

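As others have said, simple checksums won't detect duplicates that differ in bitrate or ID3 tags. What you need is an audio fingerprinting algorithm. The Python Audioprocessing Suite has one, but I can't vouch for its reliability.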

http://rudd-o.com/new-projects/python-audioprocessing

- piquadrat

3

For the tagging problem, Picard may be a very good bet. If you want to extract bitrate information from two potential duplicates, have a look at mp3guessenc.

- huitseeker

2

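The Dejavu project is written in Python and does exactly what you are asking for: https://github.com/worldveil/dejavu

It also supports many common formats (.wav, .mp3, etc.) and can find the time offset of a clip within the original audio.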

- lollercoaster

2

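I don't think simple checksums will ever work:

ID3 tags affect the MD5.
Different encoders encode the same song differently, so the checksums will differ.
Different bitrates produce different checksums.
Re-encoding an MP3 at a different bitrate will probably sound terrible, and it will certainly differ from the original audio compressed in a single step.

I think you will have to compare ID3 tags, song length, and file names.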
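The first point in the list above can be illustrated with a minimal sketch that hashes only the bytes between a leading ID3v2 tag and a trailing ID3v1 tag. The helper names `strip_id3` and `audio_md5` are made up for this example, and other tag formats (APEv2, for instance) are ignored. This makes the checksum independent of those two tag types only; files encoded at different bitrates or by different encoders will still hash differently.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return the bytes between a leading ID3v2 tag and a trailing ID3v1 tag."""
    start = 0
    if len(data) >= 10 and data[:3] == b"ID3":
        # The ID3v2 tag size is stored as four "syncsafe" bytes (7 bits each).
        size = ((data[6] & 0x7F) << 21 | (data[7] & 0x7F) << 14
                | (data[8] & 0x7F) << 7 | (data[9] & 0x7F))
        start = 10 + size
        if data[5] & 0x10:  # footer flag: ten more bytes follow the tag
            start += 10
    end = len(data)
    # An ID3v1 tag is the last 128 bytes of the file and starts with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]


def audio_md5(data: bytes) -> str:
    """MD5 of the file contents with ID3v1/ID3v2 tags removed."""
    return hashlib.md5(strip_id3(data)).hexdigest()
```

For a file on disk, something like `audio_md5(open(path, "rb").read())` would give a checksum that stays the same when only these tags are edited.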

- Douglas Leeder

2

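Re-encoding at the same bitrate won't work. In fact it may make things worse, because transcoding (re-encoding at a different bitrate) changes the nature of the compression, and recompressing an already compressed file alters it significantly.

This is a bit out of my league, but I would approach the problem by looking at the waveform of the MP3. You can do that either by converting the MP3 to an uncompressed .wav file or by running the analysis on the MP3 itself. There should be a library for this. Just a word of warning: it is an expensive operation.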
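One way to compare waveforms, sketched under the assumption that both files have already been decoded to mono PCM at the same sample rate and loaded as lists of numbers: reduce each signal to a coarse loudness envelope and correlate the envelopes. The function names and the window size are illustrative choices, not taken from any library. The sketch also assumes both files start at the same point in the song; it does no alignment.

```python
import math


def rms_envelope(samples, window=1024):
    """Root-mean-square level of each non-overlapping window of samples."""
    envelope = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        envelope.append(math.sqrt(sum(s * s for s in chunk) / window))
    return envelope


def pearson(a, b):
    """Pearson correlation of two sequences, truncated to the shorter one."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    a, b = a[:n], b[:n]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def envelope_similarity(samples_a, samples_b, window=1024):
    """Correlation of the two loudness envelopes, between -1 and 1."""
    return pearson(rms_envelope(samples_a, window), rms_envelope(samples_b, window))
```

Because Pearson correlation ignores overall scale, a copy that is simply louder or quieter still scores close to 1.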

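Another idea is to scan the files with ReplayGain. If they are the same song, they should be tagged with the same gain. This only works for the exact same song from the exact same album: I know of several cases where reissues were remastered at a higher volume, which changes the replay gain.

EDIT:
You might want to check out http://www.speech.kth.se/snack/, which apparently can do spectrogram visualization. I imagine any library that can visualize spectrograms can help you compare them.

This link from the official Python page may also be helpful.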

- nemo

1

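You can use AcoustID, the successor to MusicBrainz's PUID:

AcoustID is an open source project that aims to create a free database of audio fingerprints with mapping to the MusicBrainz metadata database and provide a web service for audio file identification using this database...

...the fingerprint is sent, along with some necessary metadata, to the AcoustID database to identify the song...

You will find various client libraries and web service examples at https://acoustid.org/.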

- PeterCo

1

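I'd use length as my primary heuristic. That is what iTunes does when it tries to identify a CD using the Gracenote database. Measure the lengths in milliseconds rather than seconds. Remember, this is only a heuristic: you should definitely listen to any detected duplicates before deleting them.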
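A minimal sketch of the length heuristic, assuming the durations have already been read into a dictionary mapping file name to length in milliseconds. The function name `candidate_pairs` and the 500 ms tolerance are illustrative assumptions. It only proposes candidates; each pair still needs to be checked by ear or by a fingerprint.

```python
def candidate_pairs(durations_ms, tolerance_ms=500):
    """Return pairs of files whose lengths differ by at most tolerance_ms."""
    items = sorted(durations_ms.items(), key=lambda kv: kv[1])
    pairs = []
    for i, (name_a, len_a) in enumerate(items):
        for name_b, len_b in items[i + 1:]:
            # Items are sorted by length, so later ones are only further away.
            if len_b - len_a > tolerance_ms:
                break
            pairs.append((name_a, name_b))
    return pairs
```

For example, `candidate_pairs({"a.mp3": 215000, "b.mp3": 215300, "c.mp3": 301000})` pairs `a.mp3` with `b.mp3` and leaves `c.mp3` alone.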

- splicer

1

I was looking for something similar and found this:
http://www.lastfm.es/user/nova77LF/journal/2007/10/12/4kaf_fingerprint_(command_line)_client

Hope it helps.

- Menda

0

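First, decode the files to PCM and make sure they all use one specific sample rate, which you can choose beforehand (e.g. 16 kHz). Songs with a different sample rate have to be resampled. A high sample rate is not required, since you need a fuzzy comparison anyway, but too low a sample rate loses too much detail.

You can use the following commands for that: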

ffmpeg -i audio1.mkv -ar 16000 -c:a pcm_s24le output1.wav
ffmpeg -i audio2.mkv -ar 16000 -c:a pcm_s24le output2.wav

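Below is Python code that computes a similarity score for two audio files. The score is a correlation between 0 and 1; multiply it by 100 for a 0 to 100 scale. It works by generating fingerprints from the audio files and comparing them with cross correlation.

It requires Chromaprint and FFmpeg to be installed. It also doesn't work for short audio files. If that is a problem, you can slow the audio down as in this guide, but be aware that this adds a little noise.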

# correlation.py
import subprocess
import sys

# seconds to sample audio file for
sample_time = 500
# number of points to scan cross correlation over
span = 150
# step size (in points) of cross correlation
step = 1
# minimum number of points that must overlap in cross correlation
min_overlap = 20
# report match when cross correlation has a peak exceeding threshold
threshold = 0.5


# calculate fingerprint
def calculate_fingerprints(filename):
    # pass the arguments as a list so file names with spaces are handled safely
    fpcalc_out = subprocess.run(
        ['fpcalc', '-raw', '-length', str(sample_time), filename],
        capture_output=True, text=True, check=True).stdout
    fingerprint_index = fpcalc_out.find('FINGERPRINT=')
    if fingerprint_index < 0:
        raise Exception('fpcalc returned no fingerprint for %s' % filename)
    # convert fingerprint to list of integers
    raw = fpcalc_out[fingerprint_index + len('FINGERPRINT='):]
    return list(map(int, raw.split(',')))


# returns correlation between lists
def correlation(listx, listy):
    if len(listx) == 0 or len(listy) == 0:
        # Error checking in main program should prevent us from ever being
        # able to get here.
        raise Exception('Empty lists cannot be correlated.')
    if len(listx) > len(listy):
        listx = listx[:len(listy)]
    elif len(listx) < len(listy):
        listy = listy[:len(listx)]

    # count the bits that agree in each pair of 32-bit fingerprint values
    covariance = 0
    for i in range(len(listx)):
        covariance += 32 - bin(listx[i] ^ listy[i]).count("1")
    covariance = covariance / float(len(listx))
    return covariance / 32


# return cross correlation, with listy offset from listx
def cross_correlation(listx, listy, offset):
    if offset > 0:
        listx = listx[offset:]
        listy = listy[:len(listx)]
    elif offset < 0:
        offset = -offset
        listy = listy[offset:]
        listx = listx[:len(listy)]
    if min(len(listx), len(listy)) < min_overlap:
        # Overlap too small to be meaningful. Return 0.0 rather than None so
        # the result list stays aligned with the offsets and can be compared.
        return 0.0
    return correlation(listx, listy)


# cross correlate listx and listy with offsets from -span to span
def compare(listx, listy, span, step):
    if span > min(len(listx), len(listy)):
        # Error checking in main program should prevent us from ever being
        # able to get here.
        raise Exception('span >= sample size: %i >= %i\n' % (span, min(len(listx), len(listy))) + 'Reduce span, reduce crop or increase sample_time.')

    corr_xy = []
    for offset in range(-span, span + 1, step):
        corr_xy.append(cross_correlation(listx, listy, offset))
    return corr_xy


# return index of maximum value in list
def max_index(listx):
    max_index = 0
    max_value = listx[0]
    for i, value in enumerate(listx):
        if value > max_value:
            max_value = value
            max_index = i
    return max_index


def get_max_corr(corr, source, target):
    max_corr_index = max_index(corr)
    max_corr_offset = -span + max_corr_index * step
    print("max_corr_index = ", max_corr_index, "max_corr_offset = ", max_corr_offset)
    # report matches
    if corr[max_corr_index] > threshold:
        print('%s and %s match with correlation of %.4f at offset %i' % (source, target, corr[max_corr_index], max_corr_offset))
    return max_corr_offset


def correlate(source, target):
    fingerprint_source = calculate_fingerprints(source)
    fingerprint_target = calculate_fingerprints(target)
    corr = compare(fingerprint_source, fingerprint_target, span, step)
    return get_max_corr(corr, source, target)


if __name__ == "__main__":
    # usage: python correlation.py SOURCE_FILE TARGET_FILE
    if len(sys.argv) != 3:
        sys.exit('usage: python correlation.py SOURCE_FILE TARGET_FILE')
    correlate(sys.argv[1], sys.argv[2])

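The code was converted to Python 3 from https://shivama205.medium.com/audio-signals-comparison-23e431ed2207.

Now you need to choose a threshold, for example 90% (the `threshold` variable, set to 0.5 in the code above). Anything above it is considered a duplicate.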
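As a rough sketch of that thresholding step, the following compares two raw fingerprints (lists of 32-bit integers, such as those printed by `fpcalc -raw`) bit by bit and applies a percentage cut-off. The names `similarity_percent` and `is_duplicate` are made up for this example, and it assumes the two fingerprints are already aligned, so it does not search over offsets.

```python
def similarity_percent(fp_a, fp_b):
    """Percentage of matching bits between two aligned fingerprints."""
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        raise ValueError("empty fingerprints cannot be compared")
    matching = 0
    for a, b in zip(fp_a[:n], fp_b[:n]):
        # XOR marks the differing bits; the mask keeps exactly 32 of them.
        matching += 32 - bin((a ^ b) & 0xFFFFFFFF).count("1")
    return 100.0 * matching / (32 * n)


def is_duplicate(fp_a, fp_b, threshold=90.0):
    """True when the bit similarity exceeds the threshold (in percent)."""
    return similarity_percent(fp_a, fp_b) > threshold
```

Note that two sequences of independent random bits would already agree on about half of their bits, so a score near 50% says little; that is one reason to prefer a cut-off well above 50%.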

- Alejandro Garcia

- ΤΖΩΤΖΙΟΥ · Accepted Answer

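This is exactly the problem that the people at the old AudioScrobbler, and now at MusicBrainz, have been working on for a long time. For the time being, the Python project that can help you is Picard, which tags audio files (not only MPEG 1 Layer 3 files) with a GUID (actually, several of them). From then on, matching the tags is quite simple.

If you prefer to do this as a project of your own, libofa might be of help.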
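Once every file carries such an identifier, finding duplicates reduces to grouping files by it. A minimal sketch, assuming the identifiers have already been read from the tags into a dictionary mapping file path to identifier (reading the tags themselves is left out, and the name `group_by_identifier` is illustrative):

```python
from collections import defaultdict


def group_by_identifier(identifiers):
    """Group file paths that share the same identifier tag.

    `identifiers` maps file path -> identifier string (or None if untagged).
    Only identifiers shared by more than one file are returned.
    """
    groups = defaultdict(list)
    for path, ident in identifiers.items():
        if ident:  # skip files without an identifier
            groups[ident].append(path)
    return {ident: sorted(paths)
            for ident, paths in groups.items() if len(paths) > 1}
```

Files at different bitrates that were tagged with the same identifier end up in the same group, regardless of their checksums.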