在Mongo中，使用数组中的值合并重复文档

Question

在Mongo中，使用数组中的值合并重复文档

7

我有一个大型文档集合，其格式如下：

{ "_id": "5a760191813a54000b8475f1", "orders": [{ "row": "3", "seat": "11" }, { "row": "3", "seat": "12" }], "product_id": "5a7628bedbcc42000aa7f614" },
{ "_id": "5a75f6f17abe45000a3ba05e", "orders": [{ "row": "3", "seat": "12" }, { "row": "3", "seat": "13" }], "product_id": "5a7628bedbcc42000aa7f614" },
{ "_id": "5a75ebdf813a54000b8475e7", "orders": [{ "row": "5", "seat": "16" }, { "row": "5", "seat": "15" }], "product_id": "5a75f711dbcc42000c459efc" }

我需要能够找到任何一个文档中，product_id 和 orders 数组中的项目是否存在重复。我似乎无法理解如何实现这一点。有什么建议吗？

- Malmoc

3

你所说的重复是指product_id相同且至少有一个共同订单吗？例如上面的例子中，由于{ "row": "3", "seat": "12" }，所以存在重复吗？你想在查询之后得到什么输出数据？例如，只需要重复项的“_id”即可吗？还是你想保留所有文档信息？ - Takis

1

订单中的所有元素必须被复制吗？ - mohammad Naimi

3个回答

5

您可以使用以下查询。$unwind订单数组，按订单行和产品进行$group，并收集匹配的id和数量。保留计数大于1的文档。$lookup拉取匹配的文档，使用$replaceRoot展开文档。

db.collection.aggregate([
  {
    "$unwind": "$orders"
  },
  {
    "$group": {
      "_id": {
        "order": "$orders",
        "product_id": "$product_id"
      },
      "count": {
        "$sum": 1
      },
      "doc_ids": {
        "$push": "$_id"
      }
    }
  },
  {
    "$match": {
      "count": {
        "$gt": 1
      }
    }
  },
  {
    "$lookup": {
      "from": "collection",
      "localField": "doc_ids",
      "foreignField": "_id",
      "as": "documents"
    }
  },
  {
    "$unwind": "$documents"
  },
  {
    "$replaceRoot": {
      "newRoot": "$documents"
    }
  }
])

https://mongoplayground.net/p/YbztEGttUMx

- s7vr

2

虽然这可以纯粹在Mongo中完成，但我不建议这样做，因为它非常非常非常浪费内存。你基本上必须在执行某些操作时将整个集合保持在内存中。

但是，我将展示此流程，因为我们将与第二种更可扩展的方法一起使用它。

我们想要根据订单和产品ID进行分组，但有两个问题阻碍了我们。

1.订单字段可能在所有文档中排序不同，因为Mongo不支持“嵌套”排序，我们必须展开数组，对其进行排序并恢复原始结构。（请注意，在此步骤中，您在内存中对整个集合进行排序）。如果您可以确保在订单数组中维护排序顺序，则可以跳过此管道的此步骤，这是此管道的一个痛点之一。

2.Mongo在对对象数组进行分组时不一致。完全透明地说，我不完全确定其中发生了什么，但我猜测出于效率考虑进行了一些“捷径”，这些“捷径”会影响稳定性。因此，我们的方法是将这些对象转换为字符串（将“行”和“座位”连接在一起）。

db.collection.aggregate([
  {
    "$unwind": "$orders"
  },
  {
    $sort: {
      "orders.row": 1,
      "orders.seat": 1
    }
  },
  {
    $group: {
      _id: "$_id",
      tmpOrders: {
        $push: {
          $concat: [
            "$orders.row",
            "$orders.seat"
          ]
        }
      },
      product_id: {
        $first: "$product_id"
      }
    }
  },
  {
    $group: {
      _id: {
        orders: "$tmpOrders",
        product: "$product_id"
      },
      dupIds: {
        $push: "$_id"
      }
    }
  },
  {
    $match: {
      "dupIds.0": {
        $exists: true
      }
    }
  },
  {
    $project: {
      _id: 0,
      dups: "$dupIds",
      
    }
  }
])

Mongo Playground

如我所说，这种方法不具备可扩展性，在大型集合上运行时间会非常长。因此，我建议利用索引并迭代product_id，并分别执行每个管道。

// wraps the native Promise, not required.
import Bluebird = require('bluebird');

// very fast with index.
const productIds = await collection.distinct('product_id')

await Bluebird.map(productIds, async (productId) => {
    const dups = await collection.aggregate([
        {
            $match: {
                product_id: productId
            }
        }
         ... same pipeline ...
    ])
    
    if (dups.length) {
        // logic required.
    }
    // can control concurrency based on db workload.
}, { concurrency: 5})

使用这种方法时，请确保已经建立了基于product_id的索引，以便它能够高效地工作。

- Tom Slabbaert

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Takis · Accepted Answer

我不知道您需要什么输出，但是这里有关于重复项的信息，也许您还想在重复项上添加unwind。

结果文档

产品ID
订单（找到重复项）
重复项（具有该订单作为重复项的文档）

对于您的数据，将打印

[{
  "duplicates": [
    "5a760191813a54000b8475f1",
    "5a75f6f17abe45000a3ba05e"
  ],
  "order": {
    "row": "3",
    "seat": "12"
  },
  "product_id": "5a7628bedbcc42000aa7f614"
}]

查询
(在驱动程序上运行，MongoPlayground无法保持字段顺序，可能会显示错误的结果)

aggregate(
[{"$unwind" : {"path" : "$orders"}},
 {
  "$group" : {
    "_id" : {
      "orders" : "$orders",
      "product_id" : "$product_id"
    },
    "duplicates" : {
      "$push" : "$_id"
    }
  }
 },
 {"$match" : {"$expr" : {"$gt" : [ {"$size" : "$duplicates"}, 1 ]}}},
 {
  "$project" : {
    "_id" : 0,
    "order" : "$_id.orders",
    "product_id" : "$_id.product_id",
    "duplicates" : 1
  }
 } 
])

数据 (我添加了更多的数据)

[
  {
    "_id": "5a760191813a54000b8475f1",
    "orders": [
      {
        "row": "3",
        "seat": "11"
      },
      {
        "row": "3",
        "seat": "12"
      }
    ],
    "product_id": "5a7628bedbcc42000aa7f614"
  },
  {
    "_id": "5a75f6f17abe45000a3ba05g",
    "orders": [
      {
        "row": "3",
        "seat": "12"
      },
      {
        "row": "3",
        "seat": "13"
      }
    ],
    "product_id": "5a7628bedbcc42000aa7f614"
  },
  {
    "_id": "5a75f6f17abe45000a3ba05e",
    "orders": [
      {
        "row": "3",
        "seat": "12"
      },
      {
        "row": "3",
        "seat": "13"
      }
    ],
    "product_id": "5a7628bedbcc42000aa7f614"
  },
  {
    "_id": "5a75ebdf813a54000b8475e7",
    "orders": [
      {
        "row": "5",
        "seat": "16"
      },
      {
        "row": "5",
        "seat": "15"
      }
    ],
    "product_id": "5a75f711dbcc42000c459efc"
  }
]

结果

[{
  "duplicates": [
    "5a75f6f17abe45000a3ba05g",
    "5a75f6f17abe45000a3ba05e"
  ],
  "order": {
    "row": "3",
    "seat": "13"
  },
  "product_id": "5a7628bedbcc42000aa7f614"
},
{
  "duplicates": [
    "5a760191813a54000b8475f1",
    "5a75f6f17abe45000a3ba05g",
    "5a75f6f17abe45000a3ba05e"
  ],
  "order": {
    "row": "3",
    "seat": "12"
  },
  "product_id": "5a7628bedbcc42000aa7f614"
}]