比较JavaScript字符串返回可能性的百分比

139

我正在寻找一个 JavaScript 函数,可以比较两个字符串并返回它们相似的可能性。我已经查看了 Soundex,但对于多词字符串或非名称并不是很好。我正在寻找像这样的函数:

    function compare(strA,strB){
    
    }
    
    compare("Apples","apple") = Some X Percentage.

这个函数可以处理所有类型的字符串,包括数字、多词值和名称。也许有一个简单的算法可以使用?

最终,这些都不适合我的目的,所以我使用了这个:

     function compare(c, u) {
            var incept = false;
            var ca = c.split(",");
            u = clean(u);
            //ca = correct answer array (Collection of all correct answer)
            //caa = a single correct answer word array (collection of words of a single correct answer)
            //u = array of user answer words cleaned using custom clean function
            for (var z = 0; z < ca.length; z++) {
                caa = $.trim(ca[z]).split(" ");
                var pc = 0;
                for (var x = 0; x < caa.length; x++) {
                    for (var y = 0; y < u.length; y++) {
                        if (soundex(u[y]) != null && soundex(caa[x]) != null) {
                            if (soundex(u[y]) == soundex(caa[x])) {
                                pc = pc + 1;
                            }
                        }
                        else {
                            if (u[y].indexOf(caa[x]) > -1) {
                                pc = pc + 1;
                            }
                        }
                    }
                }
                if ((pc / caa.length) > 0.5) {
                    return true;
                }
            }
            return false;
        }
        
        // create object listing the SOUNDEX values for each letter
        // -1 indicates that the letter is not coded, but is used for coding
        //  0 indicates that the letter is omitted for modern census archives
        //                              but acts like -1 for older census archives
        //  1 is for BFPV
        //  2 is for CGJKQSXZ
        //  3 is for DT
        //  4 is for L
        //  5 is for MN my home state
        //  6 is for R
        function makesoundex() {
            this.a = -1
            this.b = 1
            this.c = 2
            this.d = 3
            this.e = -1
            this.f = 1
            this.g = 2
            this.h = 0
            this.i = -1
            this.j = 2
            this.k = 2
            this.l = 4
            this.m = 5
            this.n = 5
            this.o = -1
            this.p = 1
            this.q = 2
            this.r = 6
            this.s = 2
            this.t = 3
            this.u = -1
            this.v = 1
            this.w = 0
            this.x = 2
            this.y = -1
            this.z = 2
        }
        
        var sndx = new makesoundex()
        
        // check to see that the input is valid
        function isSurname(name) {
            if (name == "" || name == null) {
                return false
            } else {
                for (var i = 0; i < name.length; i++) {
                    var letter = name.charAt(i)
                    if (!(letter >= 'a' && letter <= 'z' || letter >= 'A' && letter <= 'Z')) {
                        return false
                    }
                }
            }
            return true
        }
        
        // Collapse out directly adjacent sounds
        // 1. Assume that surname.length>=1
        // 2. Assume that surname contains only lowercase letters
        function collapse(surname) {
            if (surname.length == 1) {
                return surname
            }
            var right = collapse(surname.substring(1, surname.length))
            if (sndx[surname.charAt(0)] == sndx[right.charAt(0)]) {
                return surname.charAt(0) + right.substring(1, right.length)
            }
            return surname.charAt(0) + right
        }
        
        // Collapse out directly adjacent sounds using the new National Archives method
        // 1. Assume that surname.length>=1
        // 2. Assume that surname contains only lowercase letters
        // 3. H and W are completely ignored
        function omit(surname) {
            if (surname.length == 1) {
                return surname
            }
            var right = omit(surname.substring(1, surname.length))
            if (!sndx[right.charAt(0)]) {
                return surname.charAt(0) + right.substring(1, right.length)
            }
            return surname.charAt(0) + right
        }
        
        // Output the coded sequence
        function output_sequence(seq) {
            var output = seq.charAt(0).toUpperCase() // Retain first letter
            output += "-" // Separate letter with a dash
            var stage2 = seq.substring(1, seq.length)
            var count = 0
            for (var i = 0; i < stage2.length && count < 3; i++) {
                if (sndx[stage2.charAt(i)] > 0) {
                    output += sndx[stage2.charAt(i)]
                    count++
                }
            }
            for (; count < 3; count++) {
                output += "0"
            }
            return output
        }
        
        // Compute the SOUNDEX code for the surname
        function soundex(value) {
            if (!isSurname(value)) {
                return null
            }
            var stage1 = collapse(value.toLowerCase())
            //form.result.value=output_sequence(stage1);
        
            var stage1 = omit(value.toLowerCase())
            var stage2 = collapse(stage1)
            return output_sequence(stage2);
        
        }
        
        function clean(u) {
            var u = u.replace(/\,/g, "");
            u = u.toLowerCase().split(" ");
            var cw = ["ARRAY OF WORDS TO BE EXCLUDED FROM COMPARISON"];
            var n = [];
            for (var y = 0; y < u.length; y++) {
                var test = false;
                for (var z = 0; z < cw.length; z++) {
                    if (u[y] != "" && u[y] != cw[z]) {
                        test = true;
                        break;
                    }
                }
                if (test) {
        //Don't use & or $ in comparison
                    var val = u[y].replace("$", "").replace("&", "");
                    n.push(val);
                }
            }
            return n;
        }

当样本数量很大时,计算字符串相似度得分的有效方法是什么? - Andreas
我正在测试这个,仍然很难找到完美的一个。经典的例子就是这种情况。比如问题是“前两任总统是谁?”,有人回答“亚当斯和华盛顿”。与“乔治·华盛顿 约翰·亚当斯”进行字符串比较应该大约为50%。 - Brad Ruderman
1
oof,依赖于jQuery? - Shawn Whinnery
11个回答

231

这里是基于Levenshtein距离的答案https://en.wikipedia.org/wiki/Levenshtein_distance

function similarity(s1, s2) {
  var longer = s1;
  var shorter = s2;
  if (s1.length < s2.length) {
    longer = s2;
    shorter = s1;
  }
  var longerLength = longer.length;
  if (longerLength == 0) {
    return 1.0;
  }
  return (longerLength - editDistance(longer, shorter)) / parseFloat(longerLength);
}

计算编辑距离

function editDistance(s1, s2) {
  s1 = s1.toLowerCase();
  s2 = s2.toLowerCase();

  var costs = new Array();
  for (var i = 0; i <= s1.length; i++) {
    var lastValue = i;
    for (var j = 0; j <= s2.length; j++) {
      if (i == 0)
        costs[j] = j;
      else {
        if (j > 0) {
          var newValue = costs[j - 1];
          if (s1.charAt(i - 1) != s2.charAt(j - 1))
            newValue = Math.min(Math.min(newValue, lastValue),
              costs[j]) + 1;
          costs[j - 1] = lastValue;
          lastValue = newValue;
        }
      }
    }
    if (i > 0)
      costs[s2.length] = lastValue;
  }
  return costs[s2.length];
}

使用方法

similarity('Stack Overflow','Stack Ovrflw')

返回 0.8571428571428571


您可以在下面进行操作:

function checkSimilarity(){
  var str1 = document.getElementById("lhsInput").value;
  var str2 = document.getElementById("rhsInput").value;
  document.getElementById("output").innerHTML = similarity(str1, str2);
}

function similarity(s1, s2) {
      var longer = s1;
      var shorter = s2;
      if (s1.length < s2.length) {
        longer = s2;
        shorter = s1;
      }
      var longerLength = longer.length;
      if (longerLength == 0) {
        return 1.0;
      }
      return (longerLength - editDistance(longer, shorter)) / parseFloat(longerLength);
    }

    function editDistance(s1, s2) {
      s1 = s1.toLowerCase();
      s2 = s2.toLowerCase();

      var costs = new Array();
      for (var i = 0; i <= s1.length; i++) {
        var lastValue = i;
        for (var j = 0; j <= s2.length; j++) {
          if (i == 0)
            costs[j] = j;
          else {
            if (j > 0) {
              var newValue = costs[j - 1];
              if (s1.charAt(i - 1) != s2.charAt(j - 1))
                newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
              costs[j - 1] = lastValue;
              lastValue = newValue;
            }
          }
        }
        if (i > 0)
          costs[s2.length] = lastValue;
      }
      return costs[s2.length];
    }
<div><label for="lhsInput">String 1:</label> <input type="text" id="lhsInput" oninput="checkSimilarity()" /></div>
<div><label for="rhsInput">String 2:</label> <input type="text" id="rhsInput" oninput="checkSimilarity()" /></div>
<div>Match: <span id="output">No Input</span></div>


1
改进了几个单词:var similarity2 = function(s1, s2){ var split1 = s1.split(' '); var split2 = s2.split(' '); var sum = 0; var max = 0; var temp = 0; for(var i=0; i<split1.length;i++){ max = 0; for(var j=0; j<split2.length;j++){ temp = similarity(split1[i], split2[j]); if(max < temp) max = temp; } console.log(max); sum += max / split1.length;
} return sum; };
- infinito84
@overlord1234 上面的方法适用于像这样的字符串吗:9e27dbb9ff6eea70821c02b4457cbc6b7eb8e12a64f46c192c3a05f1bc1519acd101193dac157c6233d9d773f9b364ca210d6287f9efa00bfc656746782905be? - QWERTY
@hyperfkcb 正在实现编辑距离算法,该算法计算有多少个字符位置是错误的(多或少),因此为了计算百分比,他正在取最长可能的编辑距离值(longerLength)并执行 (longerLength - editDistance) / longerLength。 - infinito84
1
然而,对于长字符串来说速度太慢了。 - upupming
2
但它不适用于大写字母。它不区分大小写! - user10021033
显示剩余5条评论

26

使用 这个 函数库来计算字符串相似度对我非常有效!

以下是示例 -

var similarity = stringSimilarity.compareTwoStrings("Apples","apple");    // => 0.88

3
当我们在本地添加软件包时,会出现这种情况。但是,我们可以使用CDN来减少打包大小。以下是CDN链接 - https://www.jsdelivr.com/package/npm/lodash - https://www.jsdelivr.com/package/npm/string-similarity - Tushar Walzade
10
他们已经移除了大多数依赖项,包括 Lodash。 - Lovenkrands
这很奇怪,但是 "KIA" 和 "Kia" 都会产生 0.0。¯_(ツ)_/¯ - Nigrimmist
@Nigrimmist,您可以尝试将它们转换为小写字母。 - Tushar Walzade
@TusharWalzade 这是我做的,是的,谢谢 :) - Nigrimmist

16

这是一个非常简单的函数,它执行比较并根据等效性返回百分比。虽然它没有被测试过所有可能的场景,但它可能会帮助您入门。

function similar(a,b) {
    var equivalency = 0;
    var minLength = (a.length > b.length) ? b.length : a.length;    
    var maxLength = (a.length < b.length) ? b.length : a.length;    
    for(var i = 0; i < minLength; i++) {
        if(a[i] == b[i]) {
            equivalency++;
        }
    }
    

    var weight = equivalency / maxLength;
    return (weight * 100) + "%";
}
alert(similar("test","tes"));   // 75%
alert(similar("test","test"));  // 100%
alert(similar("test","testt")); // 80%
alert(similar("test","tess"));  // 75%

25
问题在于,“atest”和“test”都返回0%,但我们知道这不是真的。 - peresisUser

14

要找到两个字符串之间的相似度,我们可以使用一两种方法,但我更倾向于使用 'Dice's Coefficient',在我的知识范围内,它比使用 'Levenshtein distance' 更好。

使用 npm 的 'string-similarity' 包,您将能够处理我上面所说的内容。

以下是一些简单的使用示例:

var stringSimilarity = require('string-similarity');

var similarity = stringSimilarity.compareTwoStrings('healed', 'sealed'); 

var matches = stringSimilarity.findBestMatch('healed', ['edward', 'sealed', 'theatre']);

欲了解更多,请访问上面给出的链接。谢谢。


1
欢迎提供解决方案的链接,但请确保您的答案即使没有链接也是有用的:添加链接周围的上下文,以便您的同行用户了解它是什么以及为什么存在,然后引用您链接到的页面中最相关的部分,以防目标页面不可用。仅仅是一个链接的答案可能会被删除 - David Buck

8

我快速写了一个可能足够满足您需求的例子:

function Compare(strA,strB){
    for(var result = 0, i = strA.length; i--;){
        if(typeof strB[i] == 'undefined' || strA[i] == strB[i]);
        else if(strA[i].toLowerCase() == strB[i].toLowerCase())
            result++;
        else
            result += 4;
    }
    return 1 - (result + 4*Math.abs(strA.length - strB.length))/(2*(strA.length+strB.length));
}

此功能会将大小写不同但相同的字符的权重降低为完全不同或缺失的字符的1/4。它返回一个介于0和1之间的数字,其中1表示字符串相同,0表示它们没有任何相似之处。例如:
Compare("Apple", "Apple")    // 1
Compare("Apples", "Apple")   // 0.8181818181818181
Compare("Apples", "apple")   // 0.7727272727272727
Compare("a", "A")            // 0.75
Compare("Apples", "appppp")  // 0.45833333333333337
Compare("a", "b")            // 0

11
不太准确:比较("Apple", "zApple") = 0.07,而比较("Apple", "Applez") = 0.84。 Translated: Not very accurate: Compare("Apple", "zApple") = 0.07, while Compare("Apple", "Applez") = 0.84. - user172163
3
@Kousha,这是一个位置相关的问题。 "Apple"和"zApple"只有一个字母相同(第二个p)。 - Paul
2
@Paulpro,逻辑上Apple和zApple有五个字母是相同的。这是你实现的问题。Apple、zApple和Applez是相似的。 - user172163
2
@Kousha,根据这个算法,zApple并不相似,因为它是位置相关的。这并不意味着算法是错误的。 - Paul
11
@Paulpro:这并不意味着你的算法是错误的,但它并不适合作为这个问题的好答案... - MarcoS

6

来自PHP.js库similar_text函数如何?它基于一个同名的PHP函数

function similar_text (first, second) {
    // Calculates the similarity between two strings  
    // discuss at: http://phpjs.org/functions/similar_text

    if (first === null || second === null || typeof first === 'undefined' || typeof second === 'undefined') {
        return 0;
    }

    first += '';
    second += '';

    var pos1 = 0,
        pos2 = 0,
        max = 0,
        firstLength = first.length,
        secondLength = second.length,
        p, q, l, sum;

    max = 0;

    for (p = 0; p < firstLength; p++) {
        for (q = 0; q < secondLength; q++) {
            for (l = 0;
            (p + l < firstLength) && (q + l < secondLength) && (first.charAt(p + l) === second.charAt(q + l)); l++);
            if (l > max) {
                max = l;
                pos1 = p;
                pos2 = q;
            }
        }
    }

    sum = max;

    if (sum) {
        if (pos1 && pos2) {
            sum += this.similar_text(first.substr(0, pos2), second.substr(0, pos2));
        }

        if ((pos1 + max < firstLength) && (pos2 + max < secondLength)) {
            sum += this.similar_text(first.substr(pos1 + max, firstLength - pos1 - max), second.substr(pos2 + max, secondLength - pos2 - max));
        }
    }

    return sum;
}

1
相似度是基于匹配字符返回的吗?它如何评估相似度? - QWERTY

2
fuzzyset - 一个用于 JavaScript 的模糊字符串集合。fuzzyset 是一种数据结构,它可以对数据执行类似全文搜索的操作,以确定可能的拼写错误和近似字符串匹配。请注意,这是 Python 库的 JavaScript 移植版。

2
在某种程度上,我喜欢Dice's coefficient的想法,它嵌入了string-similarity模块中。但是,我认为仅考虑bigrams而不考虑它们的重复性会丢失一些重要的数据。下面是一个版本,它也处理了重复性,并且我认为总体实现更简单。我没有尝试使用他们的API,只提供了一个函数,该函数在比较两个字符串之前进行了一些操作(删除非字母数字字符,将所有内容转换为小写,并压缩但不删除空格),该函数建立在一个比较它们而不进行操作的函数之上。将其包装回到他们的API中很容易,但我认为这样做没有太大必要。

const stringSimilarity = (a, b) =>
  _stringSimilarity (prep (a), prep (b))

const _stringSimilarity = (a, b) => {
  const bg1 = bigrams (a)
  const bg2 = bigrams (b)
  const c1 = count (bg1)
  const c2 = count (bg2)
  const combined = uniq ([... bg1, ... bg2]) 
    .reduce ((t, k) => t + (Math .min (c1 [k] || 0, c2 [k] || 0)), 0)
  return 2 * combined / (Math .max (bg1 .length + bg2 .length, 1))
}

const prep = (str) => // TODO: unicode support?
  str .toLowerCase () .replace (/[^\w\s]/g, ' ') .replace (/\s+/g, ' ')

const bigrams = (str) => 
  [...str] .slice (0, -1) .map ((c, i) => c + str [i + 1])

const count = (xs) => 
  xs .reduce ((a, x) => ((a [x] = (a [x] || 0) + 1), a), {})

const uniq = (xs) => 
  [... new Set (xs)]

console .log (stringSimilarity (
  'foobar', 
  'Foobar'
)) //=> 1

console .log (stringSimilarity (
  "healed", 
  "sealed"
))//=> 0.8

console .log (stringSimilarity (
  "Olive-green table for sale, in extremely good condition.",
  "For sale: table in very good  condition, olive green in colour."
)) //=> 0.7787610619469026

console .log (stringSimilarity (
  "Olive-green table for sale, in extremely good condition.",
  "For sale: green Subaru Impreza, 210,000 miles"
)) //=> 0.38636363636363635

console .log (stringSimilarity (
  "Olive-green table for sale, in extremely good condition.",
  "Wanted: mountain bike with at least 21 gears."
)) //=> 0.1702127659574468

console .log (stringSimilarity (
  "The rain in Spain falls mainly on the plain.",
  "The run in Spun falls munly on the plun.",
)) //=> 0.7560975609756098

console .log (stringSimilarity (
  "Fa la la la la, la la la la",
  "Fa la la la la, la la",
)) //=> 0.8636363636363636

console .log (stringSimilarity (
  "car crash",
  "carcrash",
)) //=> 0.8

console .log (stringSimilarity (
  "Now is the time for all good men to come to the aid of their party.",
  "Huh?",
)) //=> 0
.as-console-wrapper {max-height: 100% !important; top: 0}

一些测试用例来自于string-similarity,另外一些是我自己写的。它们显示出了与该软件包的一些显著差异,但没有出现异常情况。唯一需要指出的是,"car crash""carcrash"之间的差异,string-similarity将它们视为相同,而我报告的相似度为0.8。我的版本在所有橄榄绿色的测试用例中找到了更多的相似性,而不像string-similarity那样,在任何情况下都是相当任意的数字,我不确定这会有多大的区别;它们肯定会把它们放在相同的相对顺序中。

2

以下是 string-similarity 库与最佳答案(by @overloard1234)性能比较

根据@Tushar Walzade的建议使用string-similarity库,你会发现,例如

stringSimilatityLib.findBestMatch('KIA','Kia').bestMatch.rating 

将返回0.0

因此,看起来最好将其转换为小写进行比较。

数组的更好基本用法:

findBestMatch(str, strArr) {
   const lowerCaseArr = strArr.map(element => element.toLowerCase());//creating lower case array
   const match = stringSimilatityLib.findBestMatch(str.toLowerCase(), lowerCaseArr).bestMatch; //trying to find bestMatch
   if (match.rating > 0) {
      const foundIndex = lowerCaseArr.findIndex(x => x === match.target); //finding the index of found best case
      return strArr[foundIndex]; //returning initial value from array
   }
    return null;
},

性能

我还比较了这里的最佳答案(由@overloard1234制作)和string-similarity库(v4.0.4)。

您可以在此处找到结果:https://jsbench.me/szkzojoskq/1

Perf tests

结果:string-similarity速度大约是原来的两倍

只是为了好玩:string-similarity库的v2.0比最新的4.0.4慢了约2.2倍。因此,如果您仍在使用< 3.0,请更新它 :)


0

我使用了@overlord1234的函数,但是我更正了ь: '',因为英语单词中没有这个字母,接下来需要用return a[char] ?? char代替return a[char] || char


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接