Find all duplicate documents in a MongoDB collection by a key field

Question

mongodb mapreduce duplicates aggregation-framework

62

Suppose I have a collection with some documents, something like this:

{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":1, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":2, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":3, "name" : "baz"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":4, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":5, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":6, "name" : "bar"}

I want to find all the duplicate entries in this collection by the "name" field. For example, "foo" appears twice and "bar" appears three times.

- frazman

You can use this solution to remove duplicates. - Somnath Muluk

5 Answers

18

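Note: this solution is the easiest to understand, but not the best.

You can use mapReduce to count how many times each value of a field occurs across the documents: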

var map = function(){
   if(this.name) {
        emit(this.name, 1);
   }
}

var reduce = function(key, values){
    return Array.sum(values);
}

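var res = db.collection.mapReduce(map, reduce, {out:{ inline : 1}});
// With inline output the results come back in res.results, not in a collection,
// so filter and sort the returned array directly.
res.results
    .filter(function(r){ return r.value > 1; })
    .sort(function(a, b){ return b.value - a.value; });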
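
To see what the map and reduce functions above compute without a running server, here is a minimal plain-JavaScript sketch. It is not MongoDB's implementation: `countByName` and the `sampleDocs` array are hypothetical, and the grouping that the server performs between the map and reduce steps is emulated with a `Map`.

```javascript
// Hypothetical in-memory stand-in for the collection (not real MongoDB data).
const sampleDocs = [
  { id: 1, name: "foo" },
  { id: 2, name: "bar" },
  { id: 3, name: "baz" },
  { id: 4, name: "foo" },
  { id: 5, name: "bar" },
  { id: 6, name: "bar" },
];

// Emulates map -> group -> reduce: emit (name, 1) per document,
// collect the emitted values per key, then sum each key's values.
function countByName(documents) {
  const emitted = new Map();
  for (const doc of documents) {
    if (doc.name) {
      // "map" step, equivalent to emit(this.name, 1)
      if (!emitted.has(doc.name)) emitted.set(doc.name, []);
      emitted.get(doc.name).push(1);
    }
  }
  // "reduce" step: sum the values collected for each key
  return Array.from(emitted, ([key, values]) => ({
    _id: key,
    value: values.reduce((a, b) => a + b, 0),
  }));
}

// Keep only names that occur more than once, most frequent first.
const duplicates = countByName(sampleDocs)
  .filter((r) => r.value > 1)
  .sort((a, b) => b.value - a.value);

console.log(duplicates);
```

With the six sample documents from the question, this leaves "bar" (3 occurrences) and "foo" (2 occurrences).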

- driangle

5

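For a generic Mongo solution, see the MongoDB cookbook recipe for finding duplicates using group. Note that aggregation is faster and more powerful than mapReduce, because it can return the _ids of the duplicate records.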

For PHP, the accepted answer (using mapReduce) is not that efficient. Instead, we can use the group method:

$connection = 'mongodb://localhost:27017';
$con        = new Mongo($connection); // mongo db connection

$db         = $con->test; // database 
$collection = $db->prb; // table

$keys       = array("name" => 1); // select the name field and group by it

// set initial values
$initial    = array("count" => 0);

// JavaScript function to perform
$reduce     = "function (obj, prev) { prev.count++; }";

$g          = $collection->group($keys, $initial, $reduce);

echo "<pre>";
print_r($g);

The output will look like this:

Array
(
    [retval] => Array
        (
            [0] => Array
                (
                    [name] => 
                    [count] => 1
                )

            [1] => Array
                (
                    [name] => MongoDB
                    [count] => 2
                )

        )

    [count] => 3
    [keys] => 2
    [ok] => 1
)

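The equivalent SQL query would be: SELECT name, COUNT(name) FROM prb GROUP BY name. Note that we still need to filter out the elements with a count of 1 (names that are not duplicated) from the array. For the canonical solution using group, refer to the MongoDB cookbook recipe for finding duplicates.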

- Prasanth Bendra

The MongoDB cookbook link is outdated and returns a 404 error. - udachny

4

The aggregation pipeline framework can be used to easily identify documents with duplicate key values:

// Desired unique index: 
// db.collection.createIndex({ firstField: 1, secondField: 1 }, { unique: true })

db.collection.aggregate([
  { $group: { 
    _id: { firstField: "$firstField", secondField: "$secondField" }, 
    uniqueIds: { $addToSet: "$_id" },
    count: { $sum: 1 } 
  }}, 
  { $match: { 
    count: { $gt: 1 } 
  }}
])
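
The pipeline above groups on a compound key. As a rough illustration of what that means, the plain-JavaScript sketch below builds one bucket per (firstField, secondField) pair. It runs in memory rather than against MongoDB; `findCompoundDuplicates` and the sample `rows` are hypothetical, and `JSON.stringify` is just one convenient way to turn the pair into a single map key.

```javascript
// Hypothetical sample rows standing in for documents in the collection.
const rows = [
  { _id: "a1", firstField: "x", secondField: 1 },
  { _id: "a2", firstField: "x", secondField: 1 },
  { _id: "a3", firstField: "x", secondField: 2 },
  { _id: "a4", firstField: "y", secondField: 1 },
];

function findCompoundDuplicates(documents) {
  const groups = new Map();
  for (const doc of documents) {
    // Serialize the pair so ("x", 1) and ("x", 2) land in different buckets.
    const key = JSON.stringify([doc.firstField, doc.secondField]);
    if (!groups.has(key)) {
      groups.set(key, {
        _id: { firstField: doc.firstField, secondField: doc.secondField },
        uniqueIds: [],
        count: 0,
      });
    }
    const group = groups.get(key);
    group.uniqueIds.push(doc._id); // plays the role of $addToSet (_id values are unique)
    group.count += 1;              // plays the role of { $sum: 1 }
  }
  // Plays the role of { $match: { count: { $gt: 1 } } }
  return Array.from(groups.values()).filter((g) => g.count > 1);
}

console.log(JSON.stringify(findCompoundDuplicates(rows)));
```

Only the pair ("x", 1) is shared by more than one row here, so it is the only group returned. These are the documents that would block creation of the unique index mentioned in the comment.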

Reference: there is some useful information on the official mLab blog:

https://blog.mlab.com/2014/03/finding-duplicate-keys-with-the-mongodb-aggregation-framework

- Krunal Shah

2

The top accepted answer here has this:

uniqueIds: { $addToSet: "$_id" },

That also returns a new field called uniqueIds with a list of ids. But what if you just want the field and its count? Then it would be this:

db.collection.aggregate([ 
  {$group: { _id: {name: "$name"}, 
             count: {$sum: 1} } }, 
  {$match: { count: {"$gt": 1} } } 
]);

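To explain this: if you come from SQL databases like MySQL and PostgreSQL, you are accustomed to aggregate functions (e.g. COUNT(), SUM(), MIN(), MAX()) that work with the GROUP BY statement, letting you find, for example, the total number of times a column value appears in a table.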

SELECT COUNT(*), my_type FROM table GROUP BY my_type;
+----------+-----------------+
| COUNT(*) | my_type         |
+----------+-----------------+
|        3 | Contact         |
|        1 | Practice        |
|        1 | Prospect        |
|        1 | Task            |
+----------+-----------------+

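As you can see, the output shows how many times each my_type value appears. To find duplicates in MongoDB, we tackle the problem in a similar way. MongoDB has aggregation operations, which group values from multiple documents together and can perform a variety of operations on the grouped data to return a single result. It is a similar concept to aggregate functions in SQL.

Assuming a collection called contacts, the initial setup looks as follows: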

db.contacts.aggregate([ ... ]);

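The aggregate function takes an array of aggregation operators (pipeline stages). In our case we need the $group operator, since the goal is to group the data by a field and count how many times each value of that field occurs.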

db.contacts.aggregate([  
    {$group: { 
        _id: {name: "$name"} 
        } 
    }
]);

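There is a small idiosyncrasy to this approach. The _id field is required when using the $group operator. In this case, we are grouping on the $name field. The key name inside _id can be anything, but we use name here because it is intuitive.

Running the aggregation with only the $group operator gives us a list of all the name values (regardless of whether they appear once or more than once in the collection):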

db.contacts.aggregate([  
  {$group: { 
    _id: {name: "$name"} 
    } 
  }
]);

{ "_id" : { "name" : "John" } }
{ "_id" : { "name" : "Joan" } }
{ "_id" : { "name" : "Stephen" } }
{ "_id" : { "name" : "Rod" } }
{ "_id" : { "name" : "Albert" } }
{ "_id" : { "name" : "Amanda" } }

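Notice how the aggregation works above. It takes the documents that have a name field and returns a new set of results with one entry per distinct name value.

But what we want to know is how many times each field value occurs. Inside $group we add a count field that uses the $sum operator to add the expression 1 to the running total for each document in the group. Together, $group and $sum therefore return, for each distinct value of the field (e.g. name), the number of documents that have it.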

db.contacts.aggregate([  
  {$group: { 
    _id: {name: "$name"},
    count: {$sum: 1}
    } 
  }
]);

{ "_id" : { "name" : "John" },  "count" : 1  }
{ "_id" : { "name" : "Joan" },  "count" : 3  }
{ "_id" : { "name" : "Stephen" },  "count" : 2 }
{ "_id" : { "name" : "Rod" },  "count" : 3 }
{ "_id" : { "name" : "Albert" },  "count" : 2 }
{ "_id" : { "name" : "Amanda" },  "count" : 1 }

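Since the goal is to find (and ultimately eliminate) duplicates, one extra step is required. To get only the groups with a count of more than one, we can use the $match operator to filter the results. Inside $match, we tell it to look at the count field and, using the $gt ("greater than") operator with the number 1, keep only counts greater than one.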

db.contacts.aggregate([ 
  {$group: { _id: {name: "$name"}, 
             count: {$sum: 1} } }, 
  {$match: { count: {"$gt": 1} } } 
]);

{ "_id" : { "name" : "Joan" },  "count" : 3  }
{ "_id" : { "name" : "Stephen" },  "count" : 2 }
{ "_id" : { "name" : "Rod" },  "count" : 3 }
{ "_id" : { "name" : "Albert" },  "count" : 2 }
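
The two stages walked through above can be mimicked in plain JavaScript, which may help make the "each stage feeds the next" idea concrete. This is only an in-memory sketch under assumptions: the `people` array and the helper names `groupByName` and `matchCountGreaterThan` are hypothetical, and unlike this sketch, MongoDB does not guarantee the order of $group output.

```javascript
// Hypothetical contacts; only the name field matters for this sketch.
const people = ["John", "Joan", "Stephen", "Joan", "Stephen", "Joan"]
  .map((name) => ({ name }));

// Stage 1, like { $group: { _id: { name: "$name" }, count: { $sum: 1 } } }
function groupByName(documents) {
  const counts = new Map();
  for (const doc of documents) {
    counts.set(doc.name, (counts.get(doc.name) || 0) + 1);
  }
  return Array.from(counts, ([name, count]) => ({ _id: { name }, count }));
}

// Stage 2, like { $match: { count: { $gt: threshold } } }
function matchCountGreaterThan(groups, threshold) {
  return groups.filter((g) => g.count > threshold);
}

// Each stage consumes the previous stage's output, as in a pipeline.
const repeated = matchCountGreaterThan(groupByName(people), 1);
console.log(repeated);
```

Here "John" occurs once and is dropped by the second stage, while "Joan" (3) and "Stephen" (2) remain.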

As a side note, if you are using MongoDB through an ORM such as Mongoid for Ruby, you might run into this error:

The 'cursor' option is required, except for aggregate with the explain argument

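This most likely means your ORM is out of date and is performing operations that MongoDB no longer supports. So either update your ORM or find a workaround. For Mongoid, this was my workaround: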

module Moped
  class Collection
    # Mongo 3.6 requires a `cursor` option be passed as part of aggregate queries.  This overrides
    # `Moped::Collection#aggregate` to include a cursor, which is not provided by Moped otherwise.
    #
    # Per the [MongoDB documentation](https://docs.mongodb.com/manual/reference/command/aggregate/):
    #
    #   Changed in version 3.6: MongoDB 3.6 removes the use of `aggregate` command *without* the `cursor` option unless
    #   the command includes the `explain` option. Unless you include the `explain` option, you must specify the
    #   `cursor` option.
    #
    #   To indicate a cursor with the default batch size, specify `cursor: {}`.
    #
    #   To indicate a cursor with a non-default batch size, use `cursor: { batchSize: <num> }`.
    #
    def aggregate(*pipeline)
      # Ordering of keys apparently matters to Mongo -- `aggregate` has to come before `cursor` here.
      extract_result(session.command(aggregate: name, pipeline: pipeline.flatten, cursor: {}))
    end

    private

    def extract_result(response)
      response.key?("cursor") ? response["cursor"]["firstBatch"] : response["result"]
    end
  end
end

- Daniel Viglione

- expert · Accepted Answer

The accepted answer is very slow on large collections, and it does not return the _ids of the duplicate records.

Aggregation is much faster and can return the _ids:

db.collection.aggregate([
  { $group: {
    _id: { name: "$name" },   // replace `name` here twice
    uniqueIds: { $addToSet: "$_id" },
    count: { $sum: 1 } 
  } }, 
  { $match: { 
    count: { $gte: 2 } 
  } },
  { $sort : { count : -1} },
  { $limit : 10 }
]);

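In the first stage of the aggregation pipeline, the $group operator aggregates documents by the name field and stores the _id value of each grouped record in uniqueIds. The $sum operator adds up the values passed to it, in this case the constant 1, thereby counting the number of grouped records into the count field.

In the second stage of the pipeline, we use $match to keep only the groups with a count of at least 2, i.e. the duplicates.

Then we sort so that the most frequent duplicates come first, and limit the results to the top 10.

This query outputs up to $limit records with duplicate names, along with their _ids. For example: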

{
  "_id" : {
    "name" : "Toothpick"
  },
  "uniqueIds" : [
    "xzuzJd2qatfJCSvkN",
    "9bpewBsKbrGBQexv4",
    "fi3Gscg9M64BQdArv"
  ],
  "count" : 3
},
{
  "_id" : {
    "name" : "Broom"
  },
  "uniqueIds" : [
    "3vwny3YEj2qBsmmhA",
    "gJeWGcuX6Wk69oFYD"
  ],
  "count" : 2
}
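
For readers who want to trace the four stages by hand, here is a minimal in-memory JavaScript sketch that reproduces the shape of the example output. It is not how MongoDB executes the pipeline: `topDuplicates` and the `items` array are hypothetical, the ids for "Toothpick" and "Broom" are reused from the output above, and the "Mop" entry is invented purely to show a non-duplicate being filtered out.

```javascript
// Hypothetical items; the "Mop" entry and its id are made up for illustration.
const items = [
  { _id: "3vwny3YEj2qBsmmhA", name: "Broom" },
  { _id: "xzuzJd2qatfJCSvkN", name: "Toothpick" },
  { _id: "9bpewBsKbrGBQexv4", name: "Toothpick" },
  { _id: "gJeWGcuX6Wk69oFYD", name: "Broom" },
  { _id: "fi3Gscg9M64BQdArv", name: "Toothpick" },
  { _id: "hypotheticalId0001", name: "Mop" },
];

function topDuplicates(documents, limit) {
  const groups = new Map();
  for (const doc of documents) {
    if (!groups.has(doc.name)) {
      groups.set(doc.name, {
        _id: { name: doc.name },
        uniqueIds: new Set(),
        count: 0,
      });
    }
    const group = groups.get(doc.name);
    group.uniqueIds.add(doc._id); // like $addToSet
    group.count += 1;             // like { $sum: 1 }
  }
  return Array.from(groups.values())
    .filter((g) => g.count >= 2)        // like $match with $gte: 2
    .sort((a, b) => b.count - a.count)  // like $sort: { count: -1 }
    .slice(0, limit)                    // like $limit
    .map((g) => ({ ...g, uniqueIds: Array.from(g.uniqueIds) }));
}

console.log(JSON.stringify(topDuplicates(items, 10), null, 2));
```

Calling `topDuplicates(items, 1)` instead would return only the "Toothpick" group, since it has the highest count.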