A simple way to count occurrences in an array and get the top values (bag of words)

4

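Hi, I've been looking for a way to build a simple bag-of-words model in JavaScript and have spent some time going through examples, but most of them require installing jnode or browserify. I'm simply trying to read some text, split it, and get the most frequently used words in it, but I'm having trouble getting the text values back out of JavaScript's array object. So far I can only return the numbered index (the counts), not the words themselves: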

function bagOfWords(text) {
  text = text.toLowerCase(); // make everything lower case
  var bag = text.split(" "); // split on spaces

  // count duplicates
  var map = bag.reduce(function (prev, cur) {
    prev[cur] = (prev[cur] || 0) + 1;
    return prev;
  }, {});

  // keep only the counts, to find the top 10 possible tags
  // (this is where the words themselves get lost)
  var arr = Object.keys(map).map(function (key) { return map[key]; });
  arr.sort(function (a, b) { return a - b; }); // ascending numeric sort

  var top10 = []; // the final array storing the top 10 counts
  // walk backwards from the largest count; stop at 10 items or when the array runs out
  for (var i = arr.length - 1; i >= 0 && top10.length < 10; i--) {
    top10.push(arr[i]);
  }
  return top10;
}

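Is there a simpler way to use the reduce method to find, count, and look up the top 10 words, without iterating over the index and referring back to the original text input (and without creating a new sorted array)?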


1
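Don't use an array; use a map instead (not necessarily an ES6 Map): var map = {}, with the current word as the map key: var count = map[word]; if (count === undefined) count = 1; else count += 1; map[word] = count;. With this approach, however, you have to iterate over everything in the map to find the highest count. - Stephen P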
Ah, good idea, thanks for all your help! - D3181
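The suggestion in the comment above can be sketched roughly as follows. This is only an illustration: the helper names countWords and mostFrequent are made up here, and Object.create(null) is used instead of {} (an assumption, not part of the comment) so that words such as "constructor" do not collide with inherited properties.

```javascript
// Hypothetical helper: count words using a plain object as the map.
function countWords(words) {
  var map = Object.create(null); // no prototype, so no inherited keys
  words.forEach(function (word) {
    var count = map[word];
    if (count === undefined) count = 1; else count += 1;
    map[word] = count;
  });
  return map;
}

// Hypothetical helper: one pass over the map to find the highest count.
function mostFrequent(map) {
  var best = null;
  Object.keys(map).forEach(function (word) {
    if (best === null || map[word] > map[best]) best = word;
  });
  return best;
}

var counts = countWords('the cat and the hat and the bat'.split(' '));
console.log(mostFrequent(counts), counts.the);
```

As the comment notes, finding the highest count still means visiting every key once; getting the top 10 would need either a sort or ten such passes.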
4 Answers

2
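I don't know whether there is a good "reduce" solution to this problem, but I came up with an algorithm:
  1. Sort all the words and clone the array.
  2. Sort the sorted word list in reverse order of occurrence, using lastIndexOf() and indexOf() on the cloned array.
  3. Use filter() to remove duplicates from the new array.
  4. Use slice() to limit the filtered array to the first 10 words.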

Snippet:

function bagOfWords(text) {
  var bag = text.
              toLowerCase().
              split(' ').
              sort(),
      clone = bag.slice();  //needed because sort changes the array in place
          
  return bag.
           sort(function(a, b) { //sort in reverse order of occurrence
          return (clone.lastIndexOf(b) - clone.indexOf(b) + 1) -
                  (clone.lastIndexOf(a) - clone.indexOf(a) + 1);
        }).
           filter(function(word, idx) { //remove duplicates
             return bag.indexOf(word) === idx;
           }).
           slice(0, 10);  //first 10 elements
} //bagOfWords

console.log(bagOfWords('four eleven two eleven ten nine one six seven eleven nine ten seven four seven six eleven nine five ten seven six eleven nine seven three five ten eleven six nine two five seven ten eleven nine six three eight eight eleven nine ten eight three eight five eleven eight ten nine four four eight eleven ten five eight six seven eight nine ten ten eleven '));

console.log(bagOfWords('Four score and seven years ago our fathers brought forth on this continent a new nation conceived in Liberty and dedicated to the proposition that all men are created equal Now we are engaged in a great civil war testing whether that nation or any nation so conceived and so dedicated can long endure We are met on a great battle-field of that war We have come to dedicate a portion of that field as a final resting place for those who here gave their lives that that nation might live It is altogether fitting and proper that we should do this But in a larger sense we can not dedicate we can not consecrate we can not hallow this ground The brave men living and dead who struggled here have consecrated it far above our poor power to add or detract The world will little note nor long remember what we say here but it can never forget what they did here It is for us the living rather to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced It is rather for us to be here dedicated to the great task remaining before us that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion that we here highly resolve that these dead shall not have died in vain that this nation under God shall have a new birth of freedom and that government of the people by the people for the people shall not perish from the earth'));
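
The comparator in the snippet above relies on the fact that equal words sit next to each other in a sorted array, so the distance between the first and last index gives the count. A minimal sketch of just that step, using a hypothetical helper name countInSorted that is not part of the answer:

```javascript
// Hypothetical helper: occurrences of `word` in an already sorted array.
function countInSorted(sorted, word) {
  var first = sorted.indexOf(word);
  if (first === -1) return 0; // word not present
  return sorted.lastIndexOf(word) - first + 1;
}

var sorted = ['b', 'a', 'c', 'a', 'b', 'a'].sort(); // a a a b b c
console.log(countInSorted(sorted, 'a')); // 3
console.log(countInSorted(sorted, 'b')); // 2
console.log(countInSorted(sorted, 'c')); // 1
```

Each call scans the array, so using it inside a sort comparator repeats that scan many times; this is fine for short texts but may be slow for long ones.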


2
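You can use String.prototype.match() to extract the words, Array.prototype.some() to exclude duplicate objects from the result, and Array.prototype.slice() with the arguments 0, 10 to return the ten words that occur most often.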

var text = document.querySelector("div").textContent;

var res = text.match(/[a-z]+/ig).reduce((arr, word) => {
    return !arr.some(w => w.word === word) 
           ? [...arr, {
              word: word,
              len: text.match(new RegExp("\\b(" + word + ")\\b", "g")).length
             }] 
           : arr
}, [])
.sort((a, b) => {
    return b.len - a.len
});

console.log(res.slice(0, 10));
<div>
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam et ipsum eget purus maximus suscipit. Aliquam fringilla eros at lorem venenatis, et hendrerit neque ultrices. Suspendisse blandit, nulla eu hendrerit mattis, elit nibh blandit nibh, non scelerisque leo tellus placerat est. Phasellus dignissim velit metus. Sed quis urna et nunc hendrerit tempus quis eu neque. Vestibulum placerat massa eget sapien viverra fermentum. Aenean ac feugiat nibh, eu dignissim ligula. In hac habitasse platea dictumst. Nunc ipsum dolor, consectetur at justo eget, venenatis vulputate dui. Nulla facilisi. Suspendisse consequat pellentesque tincidunt. Nam aliquam mi a risus suscipit rutrum.

Donec porta enim at lectus scelerisque, non tristique ex interdum. Nam vehicula consequat feugiat. In dictum metus a porttitor efficitur. Praesent varius elit porta consectetur ornare. Mauris euismod ullamcorper arcu. Vivamus ante enim, mollis eget auctor quis, tristique blandit velit. Aliquam ut erat eu erat vehicula sodales. Vestibulum et lectus at neque sodales congue ut id nibh. Etiam congue ornare felis eget dictum. Donec quis nisl non arcu tincidunt iaculis.

Donec rutrum quam sit amet interdum mattis. Morbi eget fermentum dui. Morbi pulvinar nunc sed viverra sollicitudin. Praesent facilisis, quam ut malesuada lobortis, elit urna luctus nulla, sed condimentum dolor arcu id metus. Donec sit amet tempus massa. Nulla facilisi. Suspendisse egestas sollicitudin tempus. Fusce rutrum vel diam quis accumsan.

Etiam purus arcu, suscipit et fermentum vel, commodo a leo. Vestibulum varius purus felis, fringilla blandit lacus luctus varius. In tempus imperdiet risus ut imperdiet. Ut ut faucibus nunc. Vivamus augue orci, lobortis at enim non, faucibus pharetra est. Pellentesque ante arcu, rhoncus eu lectus nec, ornare molestie lorem. Suspendisse at cursus erat. Vivamus quis lacinia neque. Donec euismod neque eget purus faucibus hendrerit.

Fusce in ante placerat, aliquam mauris et, condimentum ligula. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris hendrerit egestas risus, at consequat metus interdum et. Proin ut tellus quis lorem auctor tempor. Mauris viverra ligula et finibus iaculis. Mauris quis enim a lorem bibendum cursus nec nec enim. Etiam porttitor ligula et erat sagittis vulputate. Fusce ornare mi quis ante faucibus mattis. Aliquam tristique libero sed magna dapibus, vitae sollicitudin lorem malesuada. Praesent dignissim malesuada tellus vitae facilisis. Nullam diam augue, tincidunt ut maximus non, convallis vel felis.
</div>


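Your example works very well too, so I'm having a hard time choosing the final answer to this question, since everyone has given a valid answer. Thank you for your help; I appreciate it and am amazed at the different solutions to the same problem. In this case I'm surprised and in awe of the way you used match, because you solved the problem in very few lines of code. - D3181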

1

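Do you have to use Array.prototype.reduce()? That method reduces all the elements of an array to a single value, which doesn't seem to fit your use case. If you simply want to count how often words occur, I like to use a dictionary.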

function bagOfWords(text, topCnt) {
  text= text.toLowerCase(); //make everything lower case
  var bag = text.split(" "); //remove blanks
  //Remove "." and possibly other punctuation?

  //build the word dictionary;
  var wordDict = {};
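  for (var idx = 0; idx < bag.length; idx++) {
    // own-property check, so inherited keys such as "constructor" are not matched
    if (Object.prototype.hasOwnProperty.call(wordDict, bag[idx])) {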
      wordDict[bag[idx]]++;
    }
    else {
      wordDict[bag[idx]] = 1;
    }
  }
  //get the duplicate free array
  var dupFree = [];
  for (var word in wordDict) {
    dupFree.push(word);
  }
  //find the top topCnt;
  //Custom sort method to sort the array based on the dict we created.
  var sorted = dupFree.sort(function(a, b) {
    if (wordDict[a] > wordDict[b]) {
      return -1;
    }
    if (wordDict[a] < wordDict[b]) {
      return 1;
    }
    return 0;
  });
  
  //Now we can just return the sliced array.
  //NOTE - if there is a tie, it would not return all the ties.
  //  For instance, if there were twenty words each with the same number of
  //  occurrences, this would not return all 20 of them. To do that, you
  //  would need another loop.
  return sorted.slice(0,topCnt);
}

    var lorem = "Lorem ipsum dolor sit amet consectetur adipiscing elit Duis gravida, lectus vel semper porttitor nulla nulla semper tortor et maximus quam orci a nibh Duis vel aliquet est Aliquam at elit libero Duis molestie nisi et lectus fringilla vulputate Integer in faucibus dolor Vivamus leo magna, interdum sit amet arcu et vulputate aliquam elit Pellentesque vel imperdiet nisi maximus malesuada eros Aenean sit amet turpis lorem Pellentesque in scelerisque ante Nunc sed dignissim ex Quisque neque risus feugiat a felis vitae blandit tristique mauris Etiam pharetra eleifend felis ac cursus Pellentesque ac magna nec lectus interdum lacinia Fusce imperdiet libero accumsan dolor consectetur, sed finibus justo ornare. Vivamus vehicula ornare metus quis fermentum sapien ullamcorper non Cras non odio interdum facilisis elit sit amet facilisis risus";
 console.log(bagOfWords(lorem,10));

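As I mentioned in the comments, there is definitely room for improvement. This should at least get you started. The magic here is using a dictionary to remove duplicates and count occurrences, and then using a custom sort function to put the array in the order you want.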
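
Two of the improvements hinted at in the code comments (punctuation and ties) could be sketched like this. It is only one possible variant, not the answer's code: the name topWordsWithTies is hypothetical, an ES6 Map stands in for the dictionary, and the regular expression assumes words consist of the ASCII letters a-z only.

```javascript
// Hypothetical variant: ignores punctuation and keeps ties at the cutoff.
function topWordsWithTies(text, topCnt) {
  // match() keeps only runs of letters, so "." and "," are dropped
  var words = text.toLowerCase().match(/[a-z]+/g) || [];

  // count occurrences in a Map (word -> count)
  var counts = new Map();
  words.forEach(function (word) {
    counts.set(word, (counts.get(word) || 0) + 1);
  });

  // unique words, most frequent first
  var sorted = Array.from(counts.keys()).sort(function (a, b) {
    return counts.get(b) - counts.get(a);
  });

  if (topCnt <= 0) return [];
  if (sorted.length <= topCnt) return sorted;

  // keep every word whose count reaches the count of the topCnt-th word
  var cutoff = counts.get(sorted[topCnt - 1]);
  return sorted.filter(function (word) {
    return counts.get(word) >= cutoff;
  });
}

console.log(topWordsWithTies('a a a b b c c d', 2));
```

With the sample input, b and c are tied for second place, so three words come back even though topCnt is 2.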

Check out MDN for everything you need to know about JavaScript functions. The site is a great resource. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/sort


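Thank you for your help. The example and link you provided were very helpful in solving this problem, especially your comments. The others' comments weren't broken down in as much detail as yours, which really helped me understand your process. - D3181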

1
Here is another algorithm:

function myCounter(bagWords) {  

    // Create an array of bag words from string
    var bagMap = bagWords.toLowerCase().split(' ');

    // Count duplicates
    var bagCount = bagMap.reduce( (countWords, word) => {
        countWords[word] = ++countWords[word] || 1;
        return countWords;
    }, {});

    // Build {word, count} pairs, sort by count (descending), keep the top 10
    return Object.keys(bagCount).map(function(el) {
        return { 
            word: el, 
            count: bagCount[el]
        };
    }).sort((a,b) => {      
        return b.count-a.count;
    }).slice(0,10); 

}

var bagWords = "Pat cat hat Tat bat rat pat Cat Hat tat Bat Rat test pat pat bat bat tap tac eat bat Ope ope Asd Eat dsa";
console.log(myCounter(bagWords));

Maybe it will help someone.
