MongoDB group values by multiple fields

190
For example, I have these documents:

{
  "addr": "address1",
  "book": "book1"
},
{
  "addr": "address2",
  "book": "book1"
},
{
  "addr": "address1",
  "book": "book5"
},
{
  "addr": "address3",
  "book": "book9"
},
{
  "addr": "address2",
  "book": "book5"
},
{
  "addr": "address2",
  "book": "book1"
},
{
  "addr": "address1",
  "book": "book1"
},
{
  "addr": "address15",
  "book": "book1"
},
{
  "addr": "address9",
  "book": "book99"
},
{
  "addr": "address90",
  "book": "book33"
},
{
  "addr": "address4",
  "book": "book3"
},
{
  "addr": "address5",
  "book": "book1"
},
{
  "addr": "address77",
  "book": "book11"
},
{
  "addr": "address1",
  "book": "book1"
}

And so on.


How can I make a request that describes the top N addresses and the top M books per address?

Example of expected result:

address1 | book1: 5
         | book2: 10
         | book3: 50
         | total: 65
______________________
address2 | book1: 10
         | book2: 10
         | ...
         | bookM: 10
         | total: M*10
...
______________________
addressN | book1: 20
         | book2: 20
         | ...
         | bookM: 20
         | total: M*20

4 Answers

326

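TLDR Summary

In modern MongoDB releases you can brute-force this by applying $slice to the basic aggregation result. For "large" results, run parallel queries for each grouping instead (a demonstration listing is at the end of the answer), or wait for SERVER-9377 to be resolved, which would allow limiting the number of items that $push adds to an array.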

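db.books.aggregate([
    // Count each (addr, book) pair
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    // Sort so that $push below receives the most frequent books first;
    // without this, $slice keeps arbitrary books rather than the top ones
    { "$sort": { "bookCount": -1 } },
    { "$group": {
        "_id": "$_id.addr",
        "books": { 
            "$push": { 
                "book": "$_id.book",
                "count": "$bookCount"
            },
        },
        "count": { "$sum": "$bookCount" }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 2 },
    { "$project": {
        "books": { "$slice": [ "$books", 2 ] },
        "count": 1
    }}
])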
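
The pipeline above hard-codes 2 for both limits. As a minimal sketch, the same stages can be generated for arbitrary N and M. The helper name buildTopPipeline and its parameters are hypothetical, not part of MongoDB. It includes a $sort before the second $group so that $push sees each address's books in descending count order, which the final $slice relies on.

```javascript
// Hypothetical helper: builds a "top N addresses, top M books each" pipeline.
function buildTopPipeline(topN, topM) {
  return [
    // Count each (addr, book) pair
    { $group: { _id: { addr: "$addr", book: "$book" }, bookCount: { $sum: 1 } } },
    // Order the pairs so the most frequent books are pushed first
    { $sort: { bookCount: -1 } },
    // Collect the books of each address and total their counts
    { $group: {
        _id: "$_id.addr",
        books: { $push: { book: "$_id.book", count: "$bookCount" } },
        count: { $sum: "$bookCount" }
    }},
    // Keep only the N busiest addresses
    { $sort: { count: -1 } },
    { $limit: topN },
    // Keep only the first M books of each address
    { $project: { books: { $slice: ["$books", topM] }, count: 1 } }
  ];
}

// Usage, assuming a `books` collection as in the question:
// db.books.aggregate(buildTopPipeline(2, 2))
```

Note that the whole books array is still built for every address before it is trimmed, so this does not avoid the memory cost discussed for "large" results.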

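MongoDB 3.6 Preview

SERVER-9377 is still not resolved in this release, but $lookup gains a new "non-correlated" option that takes a "pipeline" expression as an argument instead of the "localField" and "foreignField" options. This allows a "self-join" with another pipeline expression, in which we can apply $limit in order to return the "top-n" results.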

db.books.aggregate([
  { "$group": {
    "_id": "$addr",
    "count": { "$sum": 1 }
  }},
  { "$sort": { "count": -1 } },
  { "$limit": 2 },
  { "$lookup": {
    "from": "books",
    "let": {
      "addr": "$_id"
    },
    "pipeline": [
      { "$match": { 
        "$expr": { "$eq": [ "$addr", "$$addr"] }
      }},
      { "$group": {
        "_id": "$book",
        "count": { "$sum": 1 }
      }},
      { "$sort": { "count": -1  } },
      { "$limit": 2 }
    ],
    "as": "books"
  }}
])

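The other addition here is of course the ability to interpolate a variable through $expr, using $match to select the matching items in the "join", but the general premise is a "pipeline within a pipeline" where the inner content can be filtered by matches from the parent. Since both are "pipelines" themselves, we can $limit each result separately.

This would be the next best option to running parallel queries, and it would actually be better if the $match in the "sub-pipeline" were allowed to use an index. So although it does not use the "limit to $push" that the referenced issue asks for, it actually delivers something that should work better.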


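Original Content

You seem to have stumbled upon the top "N" problem. In a way your problem is fairly easy to solve, though not with the exact limiting that you ask for: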

db.books.aggregate([
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.addr",
        "books": { 
            "$push": { 
                "book": "$_id.book",
                "count": "$bookCount"
            },
        },
        "count": { "$sum": "$bookCount" }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 2 }
])

Now that will give you a result like this:

{
    "result" : [
            {
                    "_id" : "address1",
                    "books" : [
                            {
                                    "book" : "book4",
                                    "count" : 1
                            },
                            {
                                    "book" : "book5",
                                    "count" : 1
                            },
                            {
                                    "book" : "book1",
                                    "count" : 3
                            }
                    ],
                    "count" : 5
            },
            {
                    "_id" : "address2",
                    "books" : [
                            {
                                    "book" : "book5",
                                    "count" : 1
                            },
                            {
                                    "book" : "book1",
                                    "count" : 2
                            }
                    ],
                    "count" : 3
            }
    ],
    "ok" : 1
}

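So this differs from what you are asking: while we do get the top results for the address values, the underlying "books" selection is not limited to the required number of results.

This turns out to be very difficult to do, and the complexity only grows with the number of items you need to match. To keep it simple, we can keep this at 2 matches at most: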

db.books.aggregate([
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.addr",
        "books": { 
            "$push": { 
                "book": "$_id.book",
                "count": "$bookCount"
            },
        },
        "count": { "$sum": "$bookCount" }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 2 },
    { "$unwind": "$books" },
    { "$sort": { "count": 1, "books.count": -1 } },
    { "$group": {
        "_id": "$_id",
        "books": { "$push": "$books" },
        "count": { "$first": "$count" }
    }},
    { "$project": {
        "_id": {
            "_id": "$_id",
            "books": "$books",
            "count": "$count"
        },
        "newBooks": "$books"
    }},
    { "$unwind": "$newBooks" },
    { "$group": {
      "_id": "$_id",
      "num1": { "$first": "$newBooks" }
    }},
    { "$project": {
        "_id": "$_id",
        "newBooks": "$_id.books",
        "num1": 1
    }},
    { "$unwind": "$newBooks" },
    { "$project": {
        "_id": "$_id",
        "num1": 1,
        "newBooks": 1,
        "seen": { "$eq": [
            "$num1",
            "$newBooks"
        ]}
    }},
    { "$match": { "seen": false } },
    { "$group":{
        "_id": "$_id._id",
        "num1": { "$first": "$num1" },
        "num2": { "$first": "$newBooks" },
        "count": { "$first": "$_id.count" }
    }},
    { "$project": {
        "num1": 1,
        "num2": 1,
        "count": 1,
        "type": { "$cond": [ 1, [true,false],0 ] }
    }},
    { "$unwind": "$type" },
    { "$project": {
        "books": { "$cond": [
            "$type",
            "$num1",
            "$num2"
        ]},
        "count": 1
    }},
    { "$group": {
        "_id": "$_id",
        "count": { "$first": "$count" },
        "books": { "$push": "$books" }
    }},
    { "$sort": { "count": -1 } }
])

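So that will actually give you the top 2 "books" from the top two "address" entries. But for my money, stay with the first form and then simply "slice" the elements of the returned array to take the first "N" elements.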
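
The client-side "slice" can be sketched as follows. This is a minimal illustration, assuming result documents shaped like the output of the first pipeline ({ _id, books: [{ book, count }], count }); sliceTopBooks is a hypothetical helper name. It sorts each books array by count first, because the order produced by $push is not otherwise guaranteed to be by count.

```javascript
// Hypothetical helper: trims each result document to its top M books.
function sliceTopBooks(docs, topM) {
  return docs.map(doc => ({
    ...doc,
    // Copy before sorting so the input array is not mutated
    books: [...doc.books]
      .sort((a, b) => b.count - a.count)
      .slice(0, topM)
  }));
}

// Sample input taken from the result listing above
const sample = [
  { _id: "address1", count: 5, books: [
    { book: "book4", count: 1 },
    { book: "book5", count: 1 },
    { book: "book1", count: 3 }
  ]}
];

const trimmed = sliceTopBooks(sample, 2);
console.log(JSON.stringify(trimmed, undefined, 2));
```

The address total in count is left untouched, so it still reflects all books at the address rather than only the ones kept.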


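Demonstration Code

The demonstration code is suitable for the current LTS versions of NodeJS, from the v8.x and v10.x releases. That is mostly for the async/await syntax, but nothing in the general flow depends on it, and it adapts with little alteration to plain promises or even to a plain callback implementation.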

index.js

const { MongoClient } = require('mongodb');
const fs = require('mz/fs');

const uri = 'mongodb://localhost:27017';

const log = data => console.log(JSON.stringify(data, undefined, 2));

(async function() {

  try {
    const client = await MongoClient.connect(uri);

    const db = client.db('bookDemo');
    const books = db.collection('books');

    let { version } = await db.command({ buildInfo: 1 });
    version = parseFloat(version.match(new RegExp(/(?:(?!-).)*/))[0]);

    // Clear and load books
    await books.deleteMany({});

    await books.insertMany(
      (await fs.readFile('books.json'))
        .toString()
        .replace(/\n$/,"")
        .split("\n")
        .map(JSON.parse)
    );

    if ( version >= 3.6 ) {

      // Non-correlated pipeline with limits
      let result = await books.aggregate([
        { "$group": {
          "_id": "$addr",
          "count": { "$sum": 1 }
        }},
        { "$sort": { "count": -1 } },
        { "$limit": 2 },
        { "$lookup": {
          "from": "books",
          "as": "books",
          "let": { "addr": "$_id" },
          "pipeline": [
            { "$match": {
              "$expr": { "$eq": [ "$addr", "$$addr" ] }
            }},
            { "$group": {
              "_id": "$book",
              "count": { "$sum": 1 },
            }},
            { "$sort": { "count": -1 } },
            { "$limit": 2 }
          ]
        }}
      ]).toArray();

      log({ result });
    }

    // Serial result processing with parallel fetch

    // First get top addr items
    let topaddr = await books.aggregate([
      { "$group": {
        "_id": "$addr",
        "count": { "$sum": 1 }
      }},
      { "$sort": { "count": -1 } },
      { "$limit": 2 }
    ]).toArray();

    // Run parallel top books for each addr
    let topbooks = await Promise.all(
      topaddr.map(({ _id: addr }) =>
        books.aggregate([
          { "$match": { addr } },
          { "$group": {
            "_id": "$book",
            "count": { "$sum": 1 }
          }},
          { "$sort": { "count": -1 } },
          { "$limit": 2 }
        ]).toArray()
      )
    );

    // Merge output
    topaddr = topaddr.map((d,i) => ({ ...d, books: topbooks[i] }));
    log({ topaddr });

    client.close();

  } catch(e) {
    console.error(e)
  } finally {
    process.exit()
  }

})()

books.json

{ "addr": "address1",  "book": "book1"  }
{ "addr": "address2",  "book": "book1"  }
{ "addr": "address1",  "book": "book5"  }
{ "addr": "address3",  "book": "book9"  }
{ "addr": "address2",  "book": "book5"  }
{ "addr": "address2",  "book": "book1"  }
{ "addr": "address1",  "book": "book1"  }
{ "addr": "address15", "book": "book1"  }
{ "addr": "address9",  "book": "book99" }
{ "addr": "address90", "book": "book33" }
{ "addr": "address4",  "book": "book3"  }
{ "addr": "address5",  "book": "book1"  }
{ "addr": "address77", "book": "book11" }
{ "addr": "address1",  "book": "book1"  }
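
For a quick sanity check of what either approach returns, the expected answer for a sample like this can be computed in plain JavaScript without a database. This is only a sketch: topAddresses is a hypothetical helper, the output shape ({ _id, count, books: [{ _id, count }] }) mirrors the demonstration listing, and ties are broken by first appearance, which is an assumption since the pipelines above leave the order of ties undefined.

```javascript
// Hypothetical in-memory reference for "top N addresses, top M books each".
function topAddresses(docs, topN, topM) {
  const byAddr = new Map(); // addr -> Map(book -> count)
  for (const { addr, book } of docs) {
    if (!byAddr.has(addr)) byAddr.set(addr, new Map());
    const counts = byAddr.get(addr);
    counts.set(book, (counts.get(book) || 0) + 1);
  }
  return [...byAddr.entries()]
    .map(([addr, counts]) => {
      // Books of this address, most frequent first
      const books = [...counts.entries()]
        .map(([book, count]) => ({ _id: book, count }))
        .sort((a, b) => b.count - a.count);
      return {
        _id: addr,
        // Total over all books, not only the top M
        count: books.reduce((sum, b) => sum + b.count, 0),
        books: books.slice(0, topM)
      };
    })
    .sort((a, b) => b.count - a.count)
    .slice(0, topN);
}

// A subset of the documents in books.json
const sampleDocs = [
  { addr: "address1", book: "book1" },
  { addr: "address2", book: "book1" },
  { addr: "address1", book: "book5" },
  { addr: "address3", book: "book9" },
  { addr: "address2", book: "book5" },
  { addr: "address2", book: "book1" },
  { addr: "address1", book: "book1" },
  { addr: "address1", book: "book1" }
];

console.log(JSON.stringify(topAddresses(sampleDocs, 2, 2), undefined, 2));
```

For this subset the top two addresses are address1 (4 documents) and address2 (3 documents), each with book1 ahead of book5.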

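Apparently in MongoDB 5.0, sub-pipelines within $lookup can use indexes for matching under specific conditions ($eq/$lt/$lte/$gt/$gte operators; no multikey indexes; cannot compare to an array or undefined; no more than one field path). - saltire
If I'm not mistaken, the TLDR will not work without a $sort before the second $group, since there is no guarantee that these are the top M books. - nimrod serok
SERVER-9377 has been resolved as Fixed. - Walter Tross
When listing the keys, you do not need to use an object; an array of "$key_name" values is enough. - meridius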

78

Use an aggregation like the one below:

[
{$group: {_id : {book : '$book',address:'$addr'}, total:{$sum :1}}},
{$project : {book : '$_id.book', address : '$_id.address', total : '$total', _id : 0}}
]

It will give you a result like the following:

        {
            "total" : 1,
            "book" : "book33",
            "address" : "address90"
        }, 
        {
            "total" : 1,
            "book" : "book5",
            "address" : "address1"
        }, 
        {
            "total" : 1,
            "book" : "book99",
            "address" : "address9"
        }, 
        {
            "total" : 1,
            "book" : "book1",
            "address" : "address5"
        }, 
        {
            "total" : 1,
            "book" : "book5",
            "address" : "address2"
        }, 
        {
            "total" : 1,
            "book" : "book3",
            "address" : "address4"
        }, 
        {
            "total" : 1,
            "book" : "book11",
            "address" : "address77"
        }, 
        {
            "total" : 1,
            "book" : "book9",
            "address" : "address3"
        }, 
        {
            "total" : 1,
            "book" : "book1",
            "address" : "address15"
        }, 
        {
            "total" : 2,
            "book" : "book1",
            "address" : "address2"
        }, 
        {
            "total" : 3,
            "book" : "book1",
            "address" : "address1"
        }

I did not quite get your expected result format, so feel free to modify this to the one you need.
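
One way to get from these flat rows to a per-address layout is to regroup them on the client. The following is a minimal sketch, assuming rows shaped like the output above ({ total, book, address }); nestByAddress is a hypothetical helper name, and it does not apply any top N or top M limit.

```javascript
// Hypothetical helper: regroups flat { total, book, address } rows by address.
function nestByAddress(rows) {
  const grouped = {};
  for (const { address, book, total } of rows) {
    if (!grouped[address]) {
      grouped[address] = { address, books: [], total: 0 };
    }
    grouped[address].books.push({ book, count: total });
    grouped[address].total += total;
  }
  // Addresses with the most books first
  return Object.values(grouped).sort((a, b) => b.total - a.total);
}

// Three of the rows from the result above
const rows = [
  { total: 1, book: "book5", address: "address1" },
  { total: 2, book: "book1", address: "address2" },
  { total: 3, book: "book1", address: "address1" }
];

const nested = nestByAddress(rows);
console.log(JSON.stringify(nested, undefined, 2));
```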


1
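This only solves part of the problem and does not do the "top" for the two groupings. - WiredPrairie
Additionally to the comment from @WiredPrairie, I cannot see how this solves any part of the question that was presented: "the top N addresses and the top N books per address". - Neil Lunn
If you can help with a related MongoDB question, please see: https://stackoverflow.com/questions/61067856/calculate-dwell-time-between-2-statuses-of-a-field - newdeveloper
The title is "MongoDB group values by multiple fields", which is exactly what I was interested in. Using {$group: {_id: {book: '$book', address: '$addr'}, ...} answers that. Thanks :) - Bálint Sass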

15

The query below will provide exactly the same result as given in the desired response:

db.books.aggregate([
    {
        $group: {
            _id: { addresses: "$addr", books: "$book" },
            num: { $sum :1 }
        }
    },
    {
        $group: {
            _id: "$_id.addresses",
            bookCounts: { $push: { bookName: "$_id.books",count: "$num" } }
        }
    },
    {
        $project: {
            _id: 1,
            bookCounts:1,
            "totalBookAtAddress": {
                "$sum": "$bookCounts.count"
            }
        }
    }

]) 

The response will look like the following:

/* 1 */
{
    "_id" : "address4",
    "bookCounts" : [
        {
            "bookName" : "book3",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 2 */
{
    "_id" : "address90",
    "bookCounts" : [
        {
            "bookName" : "book33",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 3 */
{
    "_id" : "address15",
    "bookCounts" : [
        {
            "bookName" : "book1",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 4 */
{
    "_id" : "address3",
    "bookCounts" : [
        {
            "bookName" : "book9",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 5 */
{
    "_id" : "address5",
    "bookCounts" : [
        {
            "bookName" : "book1",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 6 */
{
    "_id" : "address1",
    "bookCounts" : [
        {
            "bookName" : "book1",
            "count" : 3
        },
        {
            "bookName" : "book5",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 4
},

/* 7 */
{
    "_id" : "address2",
    "bookCounts" : [
        {
            "bookName" : "book1",
            "count" : 2
        },
        {
            "bookName" : "book5",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 3
},

/* 8 */
{
    "_id" : "address77",
    "bookCounts" : [
        {
            "bookName" : "book11",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
},

/* 9 */
{
    "_id" : "address9",
    "bookCounts" : [
        {
            "bookName" : "book99",
            "count" : 1
        }
    ],
    "totalBookAtAddress" : 1
}

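Is it possible to sort the elements in the "bookCounts" list of each group? This answer really helped me aggregate some data, but I want to sort each group's data by date. - Pila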

7

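Since MongoDB 3.6 this is easily done using $group, $slice, $limit and $sort (the final $set stage shown below needs 4.2 or later; on earlier versions its equivalent $addFields can be used):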

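  1. $group the books and count them
  2. $sort them by count, so that they are pushed in sorted order
  3. $group by address, $push the relevant books, and $sum the total count per address
  4. $sort by the total count per address
  5. $limit the address results to topN
  6. $slice to limit the number of books in each array to topM
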
db.collection.aggregate([
  {$group: {_id: {book: "$book",  addr: "$addr"}, count: {$sum: 1}}},
  {$sort: {"_id.addr": 1, count: -1}},
  {$group: {
      _id: "$_id.addr", totalCount: {$sum: "$count"}, 
      books: {$push: {book: "$_id.book", count: "$count"}}
    }
  },
  {$sort: {totalCount: -1}},
  {$limit: topN},
  {$set: {addr: "$_id", _id: "$$REMOVE", books: {$slice: ["$books", 0, topM]}}}
])

See how it works on the playground example-v3.4.

Since MongoDB 5.2 there is a $topN accumulator, which simplifies this further:

db.collection.aggregate([
  {$group: {_id: {book: "$book",  addr: "$addr"}, count: {$sum: 1}}},
  {$group: {
      _id: "$_id.addr",
      totalCount: {$sum: "$count"},
      books: {$topN: {output: {book: "$_id.book", count: "$count"},
                      sortBy: {count: -1},
                      n: topM
      }}
  }},
  {$sort: {totalCount: -1}},
  {$limit: topN},
  {$project: {addr: "$_id", _id: 0, books: 1, totalCount: 1}}
])

See how it works on the playground example-v5.2.

