How to remove duplicates based on a key in MongoDB?

Question

62

I have a collection in MongoDB with around 3 million records. A sample record looks like this:

 { "_id" : ObjectId("50731xxxxxxxxxxxxxxxxxxxx"),
   "source_references" : [
                           {
                             "_id" : ObjectId("5045xxxxxxxxxxxxxx"),
                             "name" : "xxx",
                             "key" : 123
                           }
                         ]
 }

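There are many duplicate records in my collection that share the same source_references.key. (By duplicate I mean a duplicate source_references.key, not a duplicate _id.)

I want to remove duplicate records based on source_references.key. I am thinking of writing some PHP code to traverse each record and remove it if a record with the same key already exists.

Is there a way to remove the duplicates from the Mongo command line itself?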

- user1518659

8 Answers

71

This is the simplest query I have used, on MongoDB 3.2.

db.myCollection.find({}, {myCustomKey:1}).sort({_id:1}).forEach(function(doc){
    db.myCollection.remove({_id:{$gt:doc._id}, myCustomKey:doc.myCustomKey});
})

Before running this, create an index on myCustomKey to speed it up.

- Kanak Singhal

Yes, @sara. It removes all duplicates unless you specify a limit in the remove query. - Kanak Singhal

2

What if I need to match on multiple keys rather than just one? - eozzy

2

If you want to keep the newest record instead (deleting the older duplicates), change $gt to $lt: db.myCollection.find({}, {myCustomKey:1}).sort({_id:1}).forEach(function(doc){ db.myCollection.remove({_id:{$lt:doc._id}, myCustomKey:doc.myCustomKey}); }) - SteveO7

1

This seems to run very slowly. I found that this approach works better, although it took a lot of work to get running: https://dev59.com/4WYq5IYBdhLWcg3wzDrk#44522593 - wordsforthewise

@3zzy To match on multiple keys, add the other key to the projection. Then, inside the function, add the same key to the remove query. - vaer-k
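
A rough sketch of the multi-key variant described in the comment above. The field names keyA and keyB and the helpers projectionFor and removeFilterFor are hypothetical, and the sketch assumes the keys are top-level fields that exist on every document (a dotted path such as source_references.key would need different handling).

```javascript
// Sketch only: keyA and keyB are placeholder field names, not fields from the question.

// Build the projection passed to find(), e.g. {keyA: 1, keyB: 1}.
function projectionFor(keys) {
  var projection = {};
  keys.forEach(function (k) { projection[k] = 1; });
  return projection;
}

// Build the remove() filter: later documents (_id greater than doc._id)
// that share every one of the key values with doc.
function removeFilterFor(doc, keys) {
  var filter = { _id: { $gt: doc._id } };
  keys.forEach(function (k) { filter[k] = doc[k]; });
  return filter;
}

// In the mongo shell this would be used roughly as:
//   var keys = ["keyA", "keyB"];
//   db.myCollection.find({}, projectionFor(keys)).sort({_id: 1}).forEach(function (doc) {
//     db.myCollection.remove(removeFilterFor(doc, keys));
//   });
```

As with the single-key version, an index covering the keys should help; a compound index on both fields is the natural candidate.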

10

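While @Stennie's answer is valid, it is not the only way. In fact, the MongoDB manual asks you to be very cautious when doing this. There are two other options:

- Let MongoDB do the deduplication for you using Map Reduce
  - Another way
- Do it programmatically, which is less efficient.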
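
For the programmatic option, a minimal sketch could look like the following. It is not taken from the answer: duplicateIds is a hypothetical helper that makes a single pass over documents already loaded by the client, remembers which key values it has seen, and collects the _id of every later document with a repeated key. It assumes key values are primitives (numbers or strings) and that the set of distinct keys fits in memory.

```javascript
// Sketch only: returns the _id of every document whose key was already seen.
// docs  - array of documents, in the order that decides which one is kept (first wins)
// keyOf - function extracting the deduplication key from a document
function duplicateIds(docs, keyOf) {
  var seen = new Set();
  var toDelete = [];
  docs.forEach(function (doc) {
    var k = keyOf(doc);
    if (seen.has(k)) {
      toDelete.push(doc._id); // a document with this key was kept earlier
    } else {
      seen.add(k); // first occurrence: keep it
    }
  });
  return toDelete;
}

// Possible use from a driver or the shell, once the ids are known:
//   db.collection.deleteMany({ _id: { $in: ids } });
```

This moves every document across the network to the client, which is why it tends to be slower than letting the server group the duplicates.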

- Aravind Yarram

8

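This is a slightly more "manual" way of doing it:

Essentially, first get a list of all the unique keys you are interested in.

Then run a search for each of those keys, and delete whenever that search returns more than one document.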

    db.collection.distinct("key").forEach((num)=>{
      var i = 0;
      db.collection.find({key: num}).forEach((doc)=>{
        if (i)   db.collection.remove({key: num}, { justOne: true })
        i++
      })
    });

- Fernando

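I like this simple, straightforward approach, but I think it makes more sense to delete documents by _id rather than by key. So inside the if statement you could write: db.collection.remove({ _id: doc._id }) - salkcid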

7

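I had a similar requirement, but I wanted to retain the latest entry. The following query worked on my collection, which had millions of records and duplicates.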

/** Create an array to store the ids of all duplicate records */
var duplicates = [];

/** Start Aggregation pipeline*/
db.collection.aggregate([
  {
    $match: { /** Add any filter here. Add index for filter keys*/
      filterKey: {
        $exists: false
      }
    }
  },
  {
    $sort: { /** Sort so that the document you want to retain comes first */
      createdAt: -1
    }
  },
  {
    $group: {
      _id: {
        key1: "$key1", key2:"$key2" /** These are the keys which define the duplicate. Here document with same value for key1 and key2 will be considered duplicate*/
      },
      dups: {
        $push: {
          _id: "$_id"
        }
      },
      count: {
        $sum: 1
      }
    }
  },
  {
    $match: {
      count: {
        "$gt": 1
      }
    }
  }
],
{
  allowDiskUse: true
}).forEach(function(doc){
  doc.dups.shift();
  doc.dups.forEach(function(dupId){
    duplicates.push(dupId._id);
  })
})

/** Delete the duplicates*/
var i,j,temparray,chunk = 100000;
for (i=0,j=duplicates.length; i<j; i+=chunk) {
    temparray = duplicates.slice(i,i+chunk);
    db.collection.bulkWrite([{deleteMany:{"filter":{"_id":{"$in":temparray}}}}])
}
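
The two client-side steps of the script above (skip the first _id of each group, then delete the rest in batches) can also be sketched as plain functions, which makes them easy to check on small inputs before touching real data. The helper names idsToDelete and inBatches are hypothetical; the input shape is the one produced by the $group stage above.

```javascript
// Sketch only. groups: array of {dups: [{_id: ...}, ...]} where the first
// entry of each dups array is the document to keep.
function idsToDelete(groups) {
  var ids = [];
  groups.forEach(function (group) {
    // slice(1) skips the retained document without mutating the group
    group.dups.slice(1).forEach(function (dup) { ids.push(dup._id); });
  });
  return ids;
}

// Split a list of ids into consecutive batches of at most `size` elements.
function inBatches(ids, size) {
  var batches = [];
  for (var i = 0; i < ids.length; i += size) {
    batches.push(ids.slice(i, i + size));
  }
  return batches;
}

// Possible use in the shell, with `groups` collected from the aggregation:
//   inBatches(idsToDelete(groups), 100000).forEach(function (batch) {
//     db.collection.deleteMany({ _id: { $in: batch } });
//   });
```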

- Mayank Patel

4

Expanding on Fernando's answer: I found that it was taking too long, so I modified it.

var x = 0;
db.collection.distinct("field").forEach(fieldValue => {
  var i = 0;
  db.collection.find({ "field": fieldValue }).forEach(doc => {
    if (i) {
      db.collection.remove({ _id: doc._id });
    }
    i++;
    x += 1;
    if (x % 100 === 0) {
      print(x); // Every time we process 100 docs.
    }
  });
});

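The improvement is basically to delete by document _id, which should be faster, and to print the progress of the operation. You can change the reporting interval to whatever you want.

Also, indexing the field before the operation helps.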

- Computer's Guy

1

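Install with pip install mongo_remove_duplicate_indexes

1. Create a script in any language.
2. Iterate over your collection.
3. Create a new collection and create a new index in it with unique set to true. This index must be on the same field, with the same name, as the one you want to deduplicate in the original collection. For example, if you have a collection named gaming in which the field genre contains duplicates that you want to remove, create the new collection with db.createCollection("cname"), then create the index with db.cname.createIndex({'genre': 1}, {unique: true}). Now when you insert documents with the same genre, only the first is accepted; the others are rejected with a duplicate key error.
4. Insert the JSON values you read into the new collection and handle the error with exception handling, for example pymongo.errors.DuplicateKeyError.

Check out the source code of the mongo_remove_duplicate_indexes package for a better understanding.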
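
Step 4 can be sketched in shell-style JavaScript as well (the answer itself uses Python). The helper insertUnlessDuplicate is hypothetical. It assumes the insert function throws an error whose code property is 11000, MongoDB's duplicate key error code, when the unique index rejects a document; how the error is surfaced can differ between shells and drivers.

```javascript
// Sketch only: insert one document, treating a duplicate key rejection as
// "already copied" rather than as a failure.
var DUPLICATE_KEY = 11000; // MongoDB's duplicate key error code

// insertOne is supplied by the caller, e.g. function (d) { db.cname.insertOne(d); }
// Returns true if the document was inserted, false if it was a duplicate.
function insertUnlessDuplicate(doc, insertOne) {
  try {
    insertOne(doc);
    return true;
  } catch (err) {
    if (err && err.code === DUPLICATE_KEY) {
      return false; // the unique index already holds this key value
    }
    throw err; // anything else is a real failure and should not be hidden
  }
}

// Possible use in the shell, with the collections from the example above:
//   db.gaming.find().forEach(function (doc) {
//     insertUnlessDuplicate(doc, function (d) { db.cname.insertOne(d); });
//   });
```

Rethrowing unexpected errors matters here: silently swallowing every exception would make a failed copy look like a successful deduplication.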

- user7106300

1

If you have enough memory, you can do something like this in Scala:

cole.find().groupBy(_.customField).filter(_._2.size>1).map(_._2.tail).flatten.map(_.id)
.foreach(x=>cole.remove({id $eq x}))

- gilcu2

- Stennie · Accepted Answer

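This answer is obsolete: the dropDups option was removed in MongoDB 3.0, so a different approach will be required in most cases. For example, you could use aggregation as suggested in: MongoDB duplicate documents even after adding unique key.

If you are certain that source_references.key identifies duplicate records, you can ensure a unique index with the dropDups:true index creation option, available in MongoDB 2.6 or older: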

db.things.ensureIndex({'source_references.key' : 1}, {unique : true, dropDups : true})

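This will keep the first unique document for each source_references.key value, and drop any subsequent documents that would otherwise cause a duplicate key violation.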

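Important Note: Any documents missing the source_references.key field will be considered as having a null value, so subsequent documents missing the key field will be deleted. You can add the sparse:true index creation option so that the index only applies to documents with a source_references.key field.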
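
Combining the options mentioned above, the index call with sparse added might look like this sketch. The variable names keySpec and indexOptions are only for illustration, and dropDups still applies only to MongoDB 2.6 or older. The typeof guard is there because db exists only inside the mongo shell.

```javascript
// Index specification from the answer, with sparse added so that documents
// lacking source_references.key are left out of the unique index.
var keySpec = { "source_references.key": 1 };
var indexOptions = { unique: true, dropDups: true, sparse: true };

// `db` is only defined inside the mongo shell; elsewhere this block is skipped.
if (typeof db !== "undefined") {
  db.things.ensureIndex(keySpec, indexOptions);
}
```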

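Obvious caveat: Take a backup of your database, and try this in a staging environment first if you are concerned about unintended data loss.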