Extracting key phrases from text (combinations of 1 to 4 words)

11
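What is the best way to extract key phrases from a block of text? I'm writing a keyword extraction tool, similar to this one. I found a few Python and Perl libraries for extracting n-grams, but I'm writing this in Node, so I need a JavaScript solution. If there is no existing JavaScript library, could someone explain how to do this so I can write it myself?
3 Answers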

19

I liked the idea, so I implemented it: see below (descriptive comments included).
Preview at: https://jsfiddle.net/WsKMx

/*@author Rob W, created on 16-17 September 2011, on request for Stackoverflow (https://dev59.com/Fmw05IYBdhLWcg3wsz0l)
 * Modified on 17 July 2012, fixed IE bug by replacing [,] with [null]
 * This script will calculate words. For the simplicity and efficiency,
 * there's only one loop through a block of text.
 * A 100% accuracy requires much more computing power, which is usually unnecessary
 **/


var text = "A quick brown fox jumps over the lazy old bartender who said 'Hi!' as a response to the visitor who presumably assaulted the maid's brother, because he didn't pay his debts in time. In time in time does really mean in time. Too late is too early? Nonsense! 'Too late is too early' does not make any sense.";

var atLeast = 2;       // Show results with at least .. occurrences
var numWords = 5;      // Show statistics for one to .. words
var ignoreCase = true; // Case-sensitivity
var REallowedChars = /[^a-zA-Z'\-]+/g;
 // RE pattern to select valid characters. Invalid characters are replaced with a whitespace

var i, j, k, textlen, len, s;
// Prepare key hash
var keys = [null]; //"keys[0] = null", a word boundary with length zero is empty
var results = [];
numWords++; //for human logic, we start counting at 1 instead of 0
for (i=1; i<=numWords; i++) {
    keys.push({});
}

// Remove all irrelevant characters
text = text.replace(REallowedChars, " ").replace(/^\s+/,"").replace(/\s+$/,"");

// Create a hash
if (ignoreCase) text = text.toLowerCase();
text = text.split(/\s+/);
for (i=0, textlen=text.length; i<textlen; i++) {
    s = text[i];
    keys[1][s] = (keys[1][s] || 0) + 1;
    for (j=2; j<=numWords; j++) {
        if(i+j <= textlen) {
            s += " " + text[i+j-1];
            keys[j][s] = (keys[j][s] || 0) + 1;
        } else break;
    }
}

// Prepares results for advanced analysis
for (var k=1; k<=numWords; k++) {
    results[k] = [];
    var key = keys[k];
    for (var i in key) {
        if(key[i] >= atLeast) results[k].push({"word":i, "count":key[i]});
    }
}

// Result parsing
var outputHTML = []; // Buffer data. This data is used to create a table using `.innerHTML`

// Sorts by count, highest first (the comparator returns y.count - x.count)
var f_sortDescending = function(x,y) {return y.count - x.count;};
for (k=1; k<numWords; k++) {
    results[k].sort(f_sortDescending);//sorts results
    
    // Customize your output. For example:
    var words = results[k];
    if (words.length) outputHTML.push('<td colSpan="3" class="num-words-header">'+k+' word'+(k==1?"":"s")+'</td>');
    for (i=0,len=words.length; i<len; i++) {
        
        //Characters have been validated. No fear for XSS
        outputHTML.push("<td>" + words[i].word + "</td><td>" +
           words[i].count + "</td><td>" +
           Math.round(words[i].count/textlen*10000)/100 + "%</td>");
           // textlen defined at the top
           // The relative occurrence has a precision of 2 digits.
    }
}
outputHTML = '<table id="wordAnalysis"><thead><tr>' +
              '<td>Phrase</td><td>Count</td><td>Relativity</td></tr>' +
              '</thead><tbody><tr>' +outputHTML.join("</tr><tr>")+
               "</tr></tbody></table>";
document.getElementById("RobW-sample").innerHTML = outputHTML;
/*
CSS:
#wordAnalysis td{padding:1px 3px 1px 5px}
.num-words-header{font-weight:bold;border-top:1px solid #000}

HTML:
<div id="RobW-sample"></div>
*/

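I've updated the code to fix a bug in IE8. The bug was reported by mail; I've pasted the mail and my reply (which includes the fix and a detailed explanation) here: http://pastebin.com/7Edx88Gp. - Rob W
It's beautiful that you're still helping people years later. - Carlos Argueta
It would be good to be able to exclude so-called stop words, e.g. the, a, they, is. - Andrew Anderson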
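
One possible way to act on the stop-word suggestion is to filter the phrase lists after counting. The sketch below is not part of Rob W's answer: `STOP_WORDS` and `filterStopWordPhrases` are hypothetical names, the word list is a tiny illustrative sample, and dropping only phrases that start or end with a stop word is an assumed policy. It expects entries shaped like the `{word, count}` objects stored in `results[k]`.

```javascript
// Hypothetical stop-word list: a small illustrative sample, not a complete one.
const STOP_WORDS = new Set(["the", "a", "an", "they", "is", "in", "to", "of", "and"]);

// Hypothetical helper: keeps only phrases that neither start nor end with a stop word.
// `entries` is an array of {word: "phrase text", count: number} objects.
function filterStopWordPhrases(entries) {
  return entries.filter(({ word }) => {
    const tokens = word.split(" ");
    const first = tokens[0];
    const last = tokens[tokens.length - 1];
    return !STOP_WORDS.has(first) && !STOP_WORDS.has(last);
  });
}

const sample = [
  { word: "in time", count: 4 },
  { word: "too late", count: 2 },
  { word: "the", count: 3 },
];
// Only "too late" survives: "in time" starts with a stop word, "the" is one.
console.log(filterStopWordPhrases(sample));
```

Checking only the first and last token keeps phrases such as "response to the visitor" intact while still removing fragments like "in time" or a bare "the"; a stricter filter could reject any phrase that contains a stop word.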

0
function ngrams(seq, n) {
  const to_return = [] // declared locally to avoid creating an implicit global
  for (let i=0; i<seq.length-(n-1); i++) {
      let cur = []
      for (let j=i; j<seq.length && j<=i+(n-1); j++) {
          cur.push(seq[j])
      }
      to_return.push(cur.join(''))
  }
  return to_return
}

> ngrams(['a', 'b', 'c'], 2)
['ab', 'bc']
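
The function above joins items without a separator, which suits characters. For the word-level phrases the question asks about, a variant that tokenizes the text and counts each phrase might look like the following. This is a minimal sketch under assumptions: `countWordNgrams` is a hypothetical name, and the tokenizer (lowercase, with letters, apostrophes, and hyphens treated as word characters) simply mirrors the character class used in the accepted answer.

```javascript
// Hypothetical helper: counts how often each n-word phrase occurs in `text`.
function countWordNgrams(text, n) {
  // Lowercase, then split on anything that is not a letter, apostrophe or hyphen.
  const words = text.toLowerCase().split(/[^a-z'\-]+/).filter(Boolean);
  const counts = new Map();
  // Slide a window of n words over the token list.
  for (let i = 0; i + n <= words.length; i++) {
    const phrase = words.slice(i, i + n).join(" ");
    counts.set(phrase, (counts.get(phrase) || 0) + 1);
  }
  return counts;
}

const bigrams = countWordNgrams("Too late is too early? Too late!", 2);
console.log(bigrams.get("too late")); // 2
```

Calling it once for each `n` from 1 to 4 would give the 1 to 4 word phrase counts mentioned in the title.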

0

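I don't know whether such a library exists in JavaScript, but the logic is:

  1. Split the text into an array
  2. Then sort and count

Or:

  1. Split the text into an array
  2. Create an auxiliary array
  3. Iterate over each item of the first array
  4. Check whether the current item exists in the auxiliary array
  5. If it does not exist, add it as a key with a count of 1
  6. Otherwise, increment the value whose key equals the current item. Hope this helps.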

Ivo Stoykov
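
The second list of steps above can be sketched roughly as follows. This is an illustration, not the answerer's code: `countWords` is a hypothetical name, a `Map` stands in for the "auxiliary array", and splitting on whitespace after lowercasing is an assumed tokenization. As the steps describe, it handles single words only.

```javascript
// Hypothetical helper: counts single words and returns [word, count] pairs,
// most frequent first.
function countWords(text) {
  // Step 1: split the text into an array of words.
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  // Step 2: the auxiliary structure (a Map here instead of an array).
  const counts = new Map();
  // Steps 3-6: add unseen words with a count of 1, otherwise increment.
  for (const w of words) {
    if (!counts.has(w)) {
      counts.set(w, 1);
    } else {
      counts.set(w, counts.get(w) + 1);
    }
  }
  // Sort by count, highest first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(countWords("in time in time does really mean in time"));
```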


This isn't what I'm looking for, because it can't extract multi-word n-grams... it only works for single words. - Carter Cole
1
Have a look at this link -> http://valuetype.wordpress.com/2011/08/24/keyword-density-with-javascript/. It's a word-count example, but it can easily be extended to 3 or 4 words. - i100
