How to count occurrences of array elements in MongoDB?

Question

I have a set of 10,000 txt files containing old Wikipedia articles. The articles were loaded into a MongoDB collection by a custom Java program.

The document for each article looks like this:

{ 
"_id" : ObjectID("....."),
"doc_id" : 335814,
"terms" : 
    [
          "2012", "2012", "adam", "knick", "basketball", ....
    ]
}

Now I want to count the occurrences of each word in the array, i.e. the so-called term frequency.

The resulting document should look like this:

{
"doc_id" : 335814,
"term_tf": [
      {term: "2012", tf: 2},
      {term: "adam", tf: 1},
      {term: "knick", tf: 1},
      {term: "basketball", tf: 1},
      .....
      ]
}

But so far, this is all I have managed to achieve:

db.stemmedTerms.aggregate([{$unwind: "$terms" }, {$group: {_id: {id: "$doc_id", term: "$terms"},  tf: {$sum : 1}}}], { allowDiskUse:true } );

{ "_id" : { "id" : 335814, "term" : "2012" }, "tf" : 2 }
{ "_id" : { "id" : 335814, "term" : "adam" }, "tf" : 1 }
{ "_id" : { "id" : 335814, "term" : "knick" }, "tf" : 1 }
{ "_id" : { "id" : 335814, "term" : "basketball" }, "tf" : 1 }

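But as you can see, the document structure does not fit my needs. I want the doc_id only once, followed by an array containing all the terms with their corresponding frequencies.

So I am looking for the opposite of the $unwind operation.

Thanks for your help.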

- s1m0on

You just need another $group in the pipeline to push the terms back into an array: https://docs.mongodb.org/manual/reference/operator/aggregation/push/ - Alex Blex

When I try to add another $group, the query fails with the following error: "BufBuilder attempted to grow() to 134217728 bytes, past the 64MB limit.", "code" : 13548. My aggregation pipeline is as follows:

db.stemmedTerms.aggregate([{$unwind: "$terms" }, {$group: {_id: {id: "$doc_id", term: "$terms"},  tf: {$sum : 1}}}, {$group: {_id: "$id", term_tf: {$push: {term: "$term", tf: "$tf"}}}}], {allowDiskUse:true});

- s1m0on

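Comments are not the best place for code snippets. Basically, an aggregation cannot return more than 64MB of data; you need to write the result to a collection using https://docs.mongodb.org/manual/reference/operator/aggregation/out. See the answer below. - Alex Blex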

1 Answer

- Alex Blex · Accepted Answer

With a second $group and $out, your pipeline should look like this:

db.stemmedTerms.aggregate([
    {$unwind: "$terms" }, 
    // count
    {$group: {
        _id: {id: "$doc_id", term: "$terms"},  
        tf: {$sum : 1}  
    }},
    // build array
    {$group: {
        _id: "$_id.id",  
        term_tf: {$push:  { term: "$_id.term", tf: "$tf" }}
    }},
    // write to new collection
    { $out : "occurences" }     
], 
{ allowDiskUse: true});
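
To make the stages easier to follow, here is a minimal sketch that imitates $unwind, the counting $group, and the $push $group in plain JavaScript, without a MongoDB server. The sample documents and the variable names (docs, unwound, counts, byDoc, result) are made up for illustration; this is not how MongoDB implements the pipeline, only a small in-memory analogue of it.

```javascript
// Plain-JavaScript imitation of the pipeline stages (no MongoDB needed).
// The sample documents below are hypothetical.
const docs = [
  { doc_id: 335814, terms: ["2012", "2012", "adam", "knick", "basketball"] },
  { doc_id: 1, terms: ["a", "b", "a"] },
];

// Stage 1 ($unwind): one record per (doc_id, term) pair.
const unwound = docs.flatMap(d =>
  d.terms.map(term => ({ doc_id: d.doc_id, term }))
);

// Stage 2 (first $group): count the records for each (doc_id, term) key.
const counts = new Map();
for (const { doc_id, term } of unwound) {
  const key = JSON.stringify([doc_id, term]);
  counts.set(key, (counts.get(key) || 0) + 1);
}

// Stage 3 (second $group with $push): collect the counts back per doc_id.
const byDoc = new Map();
for (const [key, tf] of counts) {
  const [doc_id, term] = JSON.parse(key);
  if (!byDoc.has(doc_id)) byDoc.set(doc_id, { doc_id, term_tf: [] });
  byDoc.get(doc_id).term_tf.push({ term, tf });
}

const result = [...byDoc.values()];
console.log(JSON.stringify(result, null, 2));
```

Note that after the first $group the document id and the term live inside _id, which is why the second stage refers to them as "$_id.id" and "$_id.term". The attempt in the comments used "$id" and "$term" instead; as far as I can tell those paths do not exist at that point, so every record would fall into a single group, which would explain the size error. Also, unlike this sketch, MongoDB does not promise any particular order for $group output.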