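Any of the string similarity metrics mentioned here would work. However, further normalizing the text in a very specific way can give better results.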
TL;DR
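From my experience in speech recognition and handwriting recognition, I think your problem is best solved by text normalization followed by a word error rate score.
A good, quick overview of this process can be found in the Amazon AWS Machine Learning Blog post on evaluating speech recognition. A good (and somewhat standard) tool for both steps (normalization and scoring) is NIST's SCTK. First use rfilter1 to normalize the text, then use sclite to get a score. Based on the score, decide which strings you consider a match.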
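More details
I think there are three research/application areas whose problems are very similar to yours: 1) speech recognition (a field where "abbreviations and other small details may differ"); and, with related solutions, 2) optical character recognition and 3) handwriting recognition (two fields where "data is entered by humans, and abbreviations and other small details may differ").
It is very useful to look at how automatic or human transcription/recognition is scored, and at the problem of searching for strings in any such transcription. In my experience in these fields, the best similarity comparison comes from a Levenshtein distance computed over words rather than characters; divided by the number of words in the reference, this is known as the word error rate. Normalizing the text for things like case, punctuation, and abbreviations makes the comparison better.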
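To make the metric concrete, here is a minimal Python sketch of a word-level Levenshtein distance turned into a word error rate. The function name word_error_rate is my own, and the sketch assumes both strings are already normalized and whitespace-tokenized; it is an illustration of the idea, not how sclite is implemented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because the distance is counted in words, a dropped first name costs one error no matter how many characters it has, which is usually what you want when comparing names and titles.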
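A quick example
It looks like you are using C or C++. sclite and rfilter1 are mostly written in C. I hope this example using bash + sclite is sufficient.
The contents of law.glm, a very simple GLM file (a global mapping file, that is, pairs of search-and-replace rules):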
;;
* name "law.glm"
* desc "Showing extra normalization"
* format = "NIST1" ;; other option is NIST2
* max_nrules = "1000" ;; allocating space (I can update this if necessary)
* copy_no_hit = "T" ;; don't ignore the line if there isn't a match
* case_sensitive = "F"
. => / __ [ ] ;; changes only if there's a space after it
, => / __ [ ]
? => / __ [ ]
! => / __ [ ]
versus => v / [ ] __ [ ] ;; changes only if there's a space before & after
vs => v / [ ] __ [ ]
& => and / [ ] __ [ ]
llp => limited liability partnership / [ ] __ [ ]
llc => limited liability company / [ ] __ [ ]
it's => it is / [ ] __ [ ]
shoppe => shop / [ ] __ [ ]
mister => Mr / [ ] __ [ ]
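For readers without SCTK at hand, the effect of these rules can be roughly approximated in a few lines of Python. This is only a sketch under my own assumptions: the RULES table and the normalize function are hypothetical names, everything is lowercased up front, and trailing punctuation is stripped per whitespace-separated token. It does not reproduce rfilter1's actual matching behavior.

```python
# Hypothetical rule table mirroring the word-level rules in law.glm.
RULES = {
    "versus": "v",
    "vs": "v",
    "&": "and",
    "llp": "limited liability partnership",
    "llc": "limited liability company",
    "it's": "it is",
    "shoppe": "shop",
    "mister": "mr",
}


def normalize(text: str) -> str:
    """Lowercase, strip trailing punctuation, and expand known abbreviations."""
    words = []
    for token in text.lower().split():
        token = token.rstrip(".,?!")  # punctuation that was followed by a space
        if not token:
            continue
        # A rule may expand one token into several words.
        words.extend(RULES.get(token, token).split())
    return " ".join(words)
```

For example, normalize("Harper v. Huey & Luey, LLP") yields "harper v huey and luey limited liability partnership".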
Now, in bash:
$ first="Henry C. Harper v. The Law Offices of Huey & Luey, LLP (spk1_1)"
$ second="Harper v. The Law Offices of Huey & Luey, LLP (spk1_1)"
$ echo "${first}" > first.txt
$ echo "${second}" > second.txt
$ rfilter1 law.glm < first.txt > first_glm_normalized.txt
$ tr 'A-Z' 'a-z' < first_glm_normalized.txt > first_normalized.txt
$ rfilter1 law.glm < second.txt > second_glm_normalized.txt
$ tr 'A-Z' 'a-z' < second_glm_normalized.txt > second_normalized.txt
$ sclite -r first_normalized.txt -h second_normalized.txt -i rm -o snt stdout
===============================================================================
SENTENCE LEVEL REPORT FOR THE SYSTEM:
Name: second_normalized.txt
===============================================================================
SPEAKER spk1
id: (spk1_1)
Scores: (
REF: HENRY C harper v the law offices of huey and luey limited liability partnership
HYP: ***** * harper v the law offices of huey and luey limited liability partnership
Eval: D D
Correct = 85.7% 12 ( 12)
Substitutions = 0.0% 0 ( 0)
Deletions = 14.3% 2 ( 2)
Insertions = 0.0% 0 ( 0)
Errors = 14.3% 2 ( 2)
Ref. words = 14 ( 14)
Hyp. words = 12 ( 12)
Aligned words = 14 ( 14)
-------------------------------------------------------------------------------
$
So that is a word error rate of 14.3% (2 deletions out of 14 reference words).
Now, let's look at a legal case name that shouldn't match.
$ third="Larry Viola versus The Law Office of Mister Scrooge McDuck, Limited Liability Corporation (spk1_1)"
$ echo "${third}" > third.txt
$ rfilter1 law.glm < third.txt > third_glm_normalized.txt
$ tr 'A-Z' 'a-z' < third_glm_normalized.txt > third_normalized.txt
$ sclite -r first_normalized.txt -h third_normalized.txt -i rm -o snt stdout
===============================================================================
SENTENCE LEVEL REPORT FOR THE SYSTEM:
Name: third_normalized.txt
===============================================================================
SPEAKER spk1
id: (spk1_1)
Scores: (
REF: HENRY C HARPER v the law OFFICES of HUEY AND LUEY limited liability PARTNERSHIP
HYP: ***** LARRY VIOLA v the law OFFICE of MR SCROOGE MCDUCK limited liability CORPORATION
Eval: D S S S S S S S
Correct = 42.9% 6 ( 6)
Substitutions = 50.0% 7 ( 7)
Deletions = 7.1% 1 ( 1)
Insertions = 0.0% 0 ( 0)
Errors = 57.1% 8 ( 8)
Ref. words = 14 ( 14)
Hyp. words = 13 ( 13)
Aligned words = 14 ( 14)
-------------------------------------------------------------------------------
$
You may need to run a number of strings through the scoring (comparison) process to come up with a heuristic for where to separate True from False.
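One simple way to derive such a heuristic, sketched in Python: score a handful of pairs you have labeled by hand, then put the cut-off halfway between the worst matching score and the best non-matching score. The names pick_threshold and is_match are hypothetical, and the midpoint rule is just one reasonable assumption; with more data you might prefer something more robust.

```python
def pick_threshold(scored_pairs):
    """Choose a WER cut-off from (wer, should_match) pairs labeled by hand."""
    pairs = list(scored_pairs)
    match_wers = [wer for wer, should_match in pairs if should_match]
    non_match_wers = [wer for wer, should_match in pairs if not should_match]
    if not match_wers or not non_match_wers:
        raise ValueError("need at least one matching and one non-matching pair")

    highest_match = max(match_wers)
    lowest_non_match = min(non_match_wers)
    if highest_match >= lowest_non_match:
        # No single cut-off separates the two groups cleanly.
        raise ValueError("matching and non-matching scores overlap")
    return (highest_match + lowest_non_match) / 2


def is_match(wer, threshold):
    """Treat two strings as the same case name when their WER is at or below the cut-off."""
    return wer <= threshold
```

With only the two scores from the example (2/14 for the pair that should match, 8/14 for the pair that should not), this gives a cut-off of 5/14, about 35.7%.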