按组选择最常见的 MySQL SELECT

9

如何在MySQL中获取每个标记中最常出现的类别?理想情况下,我希望模拟一个聚合函数来计算列的众数

SELECT 
  t.tag 
  , s.category 
FROM tags t 
LEFT JOIN stuff s 
USING (id) 
ORDER BY tag;

+------------------+----------+
| tag              | category |
+------------------+----------+
| automotive       |        8 |
| ba               |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |        8 |
| bamboo           |       10 |
| bamboo           |        8 |
| bamboo           |        9 |
| bamboo           |        8 |
| bamboo           |       10 |
| bamboo           |        8 |
| bamboo           |        9 |
| bamboo           |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| banana tree      |        8 |
| bath             |        9 |
+-----------------------------+

只是想提一下,几年后更聪明了 - 不要像这样组织标签,这是反模式。使用many2many表来定义标签和项目之间的关系。话虽如此,我仍然希望MySQL中有一个MODE聚合函数。 - Stephen Fuhry
在问题中没有必要将stuff缩写为s,这样会让人更难以发现category属于s表而不是t表,因为它们都是单个字符。 - EoghanM
5个回答

4
SELECT t1.*
FROM (SELECT tag, category, COUNT(*) AS count
      FROM tags INNER JOIN stuff USING (id)
      GROUP BY tag, category) t1
LEFT OUTER JOIN 
     (SELECT tag, category, COUNT(*) AS count
      FROM tags INNER JOIN stuff USING (id)
      GROUP BY tag, category) t2
  ON (t1.tag = t2.tag AND (t1.count < t2.count 
      OR t1.count = t2.count AND t1.category < t2.category))
WHERE t2.tag IS NULL
ORDER BY t1.count DESC;

我同意这对于单个SQL查询来说有些过分了。在子查询中使用GROUP BY让我感到不舒服。你可以通过使用视图使它看起来更简单:

CREATE VIEW count_per_category AS
    SELECT tag, category, COUNT(*) AS count
    FROM tags INNER JOIN stuff USING (id)
    GROUP BY tag, category;

SELECT t1.*
FROM count_per_category t1
LEFT OUTER JOIN count_per_category t2
  ON (t1.tag = t2.tag AND (t1.count < t2.count 
      OR t1.count = t2.count AND t1.category < t2.category))
WHERE t2.tag IS NULL
ORDER BY t1.count DESC;

但它基本上在幕后执行相同的工作。

您评论说可以在应用程序代码中轻松执行类似的操作。那么为什么不这样做呢?执行更简单的查询以获取每个类别的计数:

SELECT tag, category, COUNT(*) AS count
FROM tags INNER JOIN stuff USING (id)
GROUP BY tag, category;

并在应用程序代码中对结果进行排序。


我一直在尝试让它正常运行,但困难重重。似乎最好制作一个聚合函数MOST_FREQUENT()。我要看看是否在我的技能水平范围内。 - Stephen Fuhry
抱歉,我误解了你的架构。我仔细查看并模拟了一个测试数据库,以确保查询有效。请尝试上面编辑过的版本。 - Bill Karwin
那似乎可以工作。但是有点难以接受,而且有两个子查询而不是一个。我希望有一个内置的聚合函数MEAN()或类似的东西 :-P。我可能可以在5分钟内使用C语言编写它。 - Stephen Fuhry
请看下面的分隔线后面的附加内容。 - Bill Karwin

4
SELECT  tag, category
FROM    (
        SELECT  @tag <> tag AS _new,
                @tag := tag AS tag,
                category, COUNT(*) AS cnt
        FROM    (
                SELECT  @tag := ''
                ) vars,
                stuff
        GROUP BY
                tag, category
        ORDER BY
                tag, cnt DESC
        ) q
WHERE   _new

根据您的数据,这将返回以下结果:
'automotive',  8
'ba',          8
'bamboo',      8
'bananatree',  8
'bath',        9

这里是测试脚本:

CREATE TABLE stuff (tag VARCHAR(20) NOT NULL, category INT NOT NULL);

INSERT
INTO    stuff
VALUES
('automotive',8),
('ba',8),
('bamboo',8),
('bamboo',8),
('bamboo',8),
('bamboo',8),
('bamboo',8),
('bamboo',10),
('bamboo',8),
('bamboo',9),
('bamboo',8),
('bamboo',10),
('bamboo',8),
('bamboo',9),
('bamboo',8),
('bananatree',8),
('bananatree',8),
('bananatree',8),
('bananatree',8),
('bath',9);

这应该是正确的答案。它避免了通过查询自我加入组,并且应该更有效率。它执行单个 group by 并迭代它,标记我们需要的行,然后过滤它们。聪明! - Kjell Rilbe

3

在子查询中使用LIMIT很容易实现。MySQL是否仍然有不允许在子查询中使用LIMIT的限制?下面的示例是使用PostgreSQL。

=> select tag, (select category from stuff z where z.tag = s.tag group by tag, category order by count(*) DESC limit 1) AS category, (select count(*) from stuff z where z.tag = s.tag group by tag, category order by count(*) DESC limit 1) AS num_items from stuff s group by tag;
    tag     | category | num_items 
------------+----------+-----------
 ba         |        8 |         1
 automotive |        8 |         1
 bananatree |        8 |         4
 bath       |        9 |         1
 bamboo     |        8 |         9
(5 rows)

如果需要计数,则第三列是必需的。


1
这是针对简单情况的:
SELECT action, COUNT(action) AS ActionCount FROM log GROUP BY action ORDER BY ActionCount DESC;

0
这是一个有点取巧的方法,利用了 max 聚合函数,因为 MySQL 中没有模式聚合函数(或窗口函数等)可以实现这一点:
SELECT  
  tag, 
  convert(substring(max(concat(lpad(c, 20, '0'), category)), 21), int) 
        AS most_frequent_category 
FROM (
    SELECT tag, category, count(*) AS c
    FROM tags INNER JOIN stuff using (id) 
    GROUP BY tag, category
) as grouped_cats 
GROUP BY tag;

基本上,它利用了我们可以找到每个单独类别计数的词汇最大值这一事实。

使用命名类别更容易理解:

create temporary table tags (id int auto_increment primary key, tag character varying(20));
create temporary table stuff (id int, category character varying(20));
insert into tags (tag) values ('automotive'), ('ba'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('banana tree'), ('banana tree'), ('banana tree'), ('banana tree'), ('bath');
insert into stuff (id, category) values (1, 'cat-8'), (2, 'cat-8'), (3, 'cat-8'), (4, 'cat-8'), (5, 'cat-8'), (6, 'cat-8'), (7, 'cat-8'), (8, 'cat-10'), (9, 'cat-8'), (10, 'cat-9'), (11, 'cat-8'), (12, 'cat-10'), (13, 'cat-8'), (14, 'cat-9'), (15, 'cat-8'), (16, 'cat-8'), (17, 'cat-8'), (18, 'cat-8'), (19, 'cat-8'), (20, 'cat-9');

在这种情况下,我们不应该对most_frequent_category列进行整数转换:

SELECT 
  tag, 
  substring(max(concat(lpad(c, 20, '0'), category)), 21) AS most_frequent_category 
FROM (
    SELECT tag, category, count(*) AS c
    FROM tags INNER JOIN stuff using (id) 
    GROUP BY tag, category
) as grouped_cats 
GROUP BY tag;

+-------------+------------------------+
| tag         | most_frequent_category |
+-------------+------------------------+
| automotive  | cat-8                  |
| ba          | cat-8                  |
| bamboo      | cat-8                  |
| banana tree | cat-8                  |
| bath        | cat-9                  |
+-------------+------------------------+

为了更深入地了解正在发生的事情,以下是 grouped_cats 内部查询的样子(我添加了 order by tag, c desc):

+-------------+----------+---+
| tag         | category | c |
+-------------+----------+---+
| automotive  | cat-8    | 1 |
| ba          | cat-8    | 1 |
| bamboo      | cat-8    | 9 |
| bamboo      | cat-10   | 2 |
| bamboo      | cat-9    | 2 |
| banana tree | cat-8    | 4 |
| bath        | cat-9    | 1 |
+-------------+----------+---+

我们可以看到如果省略substring部分,count(*)列的最大值将带着它所属的类别一起移动:

SELECT 
  tag, 
  max(concat(lpad(c, 20, '0'), category)) AS xmost_frequent_category
FROM (
    SELECT tag, category, count(*) AS c
    FROM tags INNER JOIN stuff using (id) 
    GROUP BY tag, category
) as grouped_cats 
GROUP BY tag;

+-------------+---------------------------+
| tag         | xmost_frequent_category   |
+-------------+---------------------------+
| automotive  | 00000000000000000001cat-8 |
| ba          | 00000000000000000001cat-8 |
| bamboo      | 00000000000000000009cat-8 |
| banana tree | 00000000000000000004cat-8 |
| bath        | 00000000000000000001cat-9 |
+-------------+---------------------------+

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接