Optimizing a MySQL full outer join for a massive amount of data

We have the following MySQL tables (simplified to get straight to the point):
CREATE TABLE `MONTH_RAW_EVENTS` (
  `idEvent` int(11) unsigned NOT NULL,
  `city` varchar(45) NOT NULL,
  `country` varchar(45) NOT NULL,
  `ts` datetime NOT NULL,
  `idClient` varchar(45) NOT NULL,
  `event_category` varchar(45) NOT NULL,
  -- ... bunch of other fields
  PRIMARY KEY (`idEvent`),
  KEY `idx_city` (`city`),
  KEY `idx_country` (`country`),
  KEY `idClient` (`idClient`)
) ENGINE=InnoDB;

CREATE TABLE `compilation_table` (
  `idClient` int(11) unsigned DEFAULT NULL,
  `city` varchar(200) DEFAULT NULL,
  `month` int(2) DEFAULT NULL,
  `year` int(4) DEFAULT NULL,
  `events_profile` int(10) unsigned NOT NULL DEFAULT '0',
  `events_others` int(10) unsigned NOT NULL DEFAULT '0',
  `events_total` int(10) unsigned NOT NULL DEFAULT '0',
  KEY `idx_month` (`month`),
  KEY `idx_year` (`year`),
  KEY `idx_idClient` (`idClient`),
  KEY `idx_city` (`city`)
) ENGINE=InnoDB;

MONTH_RAW_EVENTS contains the actions users perform on the website. It has about 20 million rows and is almost 4GB in size.

compilation_table holds a monthly summary per client and city. We use it to display the website statistics in real time.

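We process the statistics only once a month (from the first table into the second), and we are trying to optimize the query that does this. At the moment we process everything in PHP, which takes a very long time.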

Here is the query we came up with. It seems to do the job on small data sets, but on the full data set it takes more than 6 hours.

INSERT INTO compilation_table (idClient,city,month,year,events_profile,events_others)


    SELECT  IFNULL(OTHERS.idClient,CLIPROFILE.idClient) as idClient,
            IF(IFNULL(OTHERS.city,CLIPROFILE.city)='','Others',IFNULL(OTHERS.city,CLIPROFILE.city)) as city,
            01,2014,
            IFNULL(CLIPROFILE.cnt,0) as events_profile,
            IFNULL(OTHERS.cnt,0) as events_others

    FROM
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`!='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as OTHERS
 LEFT JOIN 
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as CLIPROFILE 
    ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient

 UNION

    SELECT  IFNULL(OTHERS.idClient,CLIPROFILE.idClient) as idClient,
            IF(IFNULL(OTHERS.city,CLIPROFILE.city)='','Others',IFNULL(OTHERS.city,CLIPROFILE.city)) as city,
            01,2014,
            IFNULL(CLIPROFILE.cnt,0) as events_profile,
            IFNULL(OTHERS.cnt,0) as events_others           
    FROM
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`!='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as OTHERS
 RIGHT JOIN 
    (
        SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt 
        FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
        AND `event_category`='CLIENT PROFILE'
        GROUP BY idClient,city
    ) as CLIPROFILE 
    ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient

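We want to do a FULL OUTER JOIN in MySQL, so the basic structure of the query follows the one proposed here.
How can we optimize the query? We have tried different indexes and tuned various parameters, but after 8 hours it still had not finished.
The MySQL server is a dedicated Percona MySQL 5.5 machine with 2 CPUs, 2GB of RAM and an SSD. We optimized the server configuration using the Percona tools.
Any help would be much appreciated.
Thanks.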
1 Answer


You are using UNION, which forces DISTINCT processing (duplicate elimination) over the combined result.

It is usually better to rewrite a Full Join as a Left Join plus the non-matching rows of a Right Join (provided it is a proper 1:n join), combined with UNION ALL:

OTHERS LEFT JOIN CLIPROFILE 
ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient
union all
OTHERS RIGHT JOIN CLIPROFILE 
ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient
WHERE OTHERS.idClient IS NULL 
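
Spelled out against the tables in the question, the rewrite could look roughly like the sketch below. It has not been run against the real data. It assumes each (idClient, city) pair appears at most once in each derived table, so the two branches can never return the same row and UNION ALL loses nothing compared with UNION. The derived tables are copied unchanged from the question.

```sql
INSERT INTO compilation_table (idClient, city, month, year, events_profile, events_others)
-- Branch 1: every OTHERS row, with the matching CLIENT PROFILE count if there is one.
SELECT OTHERS.idClient,
       IF(OTHERS.city = '', 'Others', OTHERS.city) AS city,
       1, 2014,
       IFNULL(CLIPROFILE.cnt, 0) AS events_profile,
       OTHERS.cnt AS events_others
FROM (
    SELECT idClient, CONCAT(city, ', ', country) AS city, COUNT(*) AS cnt
    FROM MONTH_RAW_EVENTS
    WHERE ts > '2014-01-01 00:00:00' AND ts <= '2014-01-31 23:59:59'
      AND event_category != 'CLIENT PROFILE'
    GROUP BY idClient, city
) AS OTHERS
LEFT JOIN (
    SELECT idClient, CONCAT(city, ', ', country) AS city, COUNT(*) AS cnt
    FROM MONTH_RAW_EVENTS
    WHERE ts > '2014-01-01 00:00:00' AND ts <= '2014-01-31 23:59:59'
      AND event_category = 'CLIENT PROFILE'
    GROUP BY idClient, city
) AS CLIPROFILE
  ON CLIPROFILE.city = OTHERS.city AND CLIPROFILE.idClient = OTHERS.idClient

UNION ALL

-- Branch 2: only the CLIPROFILE rows that have no partner in OTHERS (anti-join).
SELECT CLIPROFILE.idClient,
       IF(CLIPROFILE.city = '', 'Others', CLIPROFILE.city) AS city,
       1, 2014,
       CLIPROFILE.cnt AS events_profile,
       0 AS events_others
FROM (
    SELECT idClient, CONCAT(city, ', ', country) AS city, COUNT(*) AS cnt
    FROM MONTH_RAW_EVENTS
    WHERE ts > '2014-01-01 00:00:00' AND ts <= '2014-01-31 23:59:59'
      AND event_category != 'CLIENT PROFILE'
    GROUP BY idClient, city
) AS OTHERS
RIGHT JOIN (
    SELECT idClient, CONCAT(city, ', ', country) AS city, COUNT(*) AS cnt
    FROM MONTH_RAW_EVENTS
    WHERE ts > '2014-01-01 00:00:00' AND ts <= '2014-01-31 23:59:59'
      AND event_category = 'CLIENT PROFILE'
    GROUP BY idClient, city
) AS CLIPROFILE
  ON CLIPROFILE.city = OTHERS.city AND CLIPROFILE.idClient = OTHERS.idClient
WHERE OTHERS.idClient IS NULL;
```

Because the second branch keeps only unmatched rows, its OTHERS columns are always NULL there, which is why events_others can be written as a constant 0 instead of going through IFNULL.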

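Additionally, you could materialize the results of the derived tables in temporary tables before joining them, so that each calculation is done only once (I don't know whether MySQL's optimizer is smart enough to do that automatically).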
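
A minimal sketch of that materialization step follows. The table names tmp_others and tmp_cliprofile and the index name idx_client_city are hypothetical, and the index on the join columns is an assumption about what helps the join, not something stated in the answer. MySQL cannot refer to the same TEMPORARY table more than once in a single query, so the two branches are written here as two separate INSERT statements instead of one UNION ALL. With ordinary (non-temporary) tables, a single UNION ALL statement would work as well.

```sql
-- Hypothetical helper tables; each aggregate is computed exactly once.
CREATE TEMPORARY TABLE tmp_others (KEY idx_client_city (idClient, city)) ENGINE=MyISAM
SELECT idClient, CONCAT(city, ', ', country) AS city, COUNT(*) AS cnt
FROM MONTH_RAW_EVENTS
WHERE ts > '2014-01-01 00:00:00' AND ts <= '2014-01-31 23:59:59'
  AND event_category != 'CLIENT PROFILE'
GROUP BY idClient, city;

CREATE TEMPORARY TABLE tmp_cliprofile (KEY idx_client_city (idClient, city)) ENGINE=MyISAM
SELECT idClient, CONCAT(city, ', ', country) AS city, COUNT(*) AS cnt
FROM MONTH_RAW_EVENTS
WHERE ts > '2014-01-01 00:00:00' AND ts <= '2014-01-31 23:59:59'
  AND event_category = 'CLIENT PROFILE'
GROUP BY idClient, city;

-- Step 1: all "other" rows, plus the matching profile count where one exists.
INSERT INTO compilation_table (idClient, city, month, year, events_profile, events_others)
SELECT o.idClient,
       IF(o.city = '', 'Others', o.city),
       1, 2014,
       IFNULL(p.cnt, 0),
       o.cnt
FROM tmp_others AS o
LEFT JOIN tmp_cliprofile AS p
  ON p.idClient = o.idClient AND p.city = o.city;

-- Step 2: profile rows with no partner in tmp_others (anti-join).
INSERT INTO compilation_table (idClient, city, month, year, events_profile, events_others)
SELECT p.idClient,
       IF(p.city = '', 'Others', p.city),
       1, 2014,
       p.cnt,
       0
FROM tmp_cliprofile AS p
LEFT JOIN tmp_others AS o
  ON o.idClient = p.idClient AND o.city = p.city
WHERE o.idClient IS NULL;

DROP TEMPORARY TABLE tmp_others, tmp_cliprofile;
```

MyISAM is used here only because the asker later reported that MEMORY tables did not fit on the 2GB server; the engine choice is otherwise open.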
Also, it might be more efficient to group by city and country and join on them as separate columns, applying CONCAT(city,', ',country) as city only in the outer step.
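
One way to read that suggestion, shown for the Left Join branch only (the non-matching branch would change in the same way). This is a sketch, not the answerer's exact query. Grouping by country as well as city is an assumption about the intended result, since the query in the question lists only idClient and city in its GROUP BY.

```sql
SELECT OTHERS.idClient,
       -- The string is built once per output row instead of once per grouped row and join comparison.
       CONCAT(OTHERS.city, ', ', OTHERS.country) AS city,
       1, 2014,
       IFNULL(CLIPROFILE.cnt, 0) AS events_profile,
       OTHERS.cnt AS events_others
FROM (
    SELECT idClient, city, country, COUNT(*) AS cnt
    FROM MONTH_RAW_EVENTS
    WHERE ts > '2014-01-01 00:00:00' AND ts <= '2014-01-31 23:59:59'
      AND event_category != 'CLIENT PROFILE'
    GROUP BY idClient, city, country
) AS OTHERS
LEFT JOIN (
    SELECT idClient, city, country, COUNT(*) AS cnt
    FROM MONTH_RAW_EVENTS
    WHERE ts > '2014-01-01 00:00:00' AND ts <= '2014-01-31 23:59:59'
      AND event_category = 'CLIENT PROFILE'
    GROUP BY idClient, city, country
) AS CLIPROFILE
  ON CLIPROFILE.idClient = OTHERS.idClient
 AND CLIPROFILE.city = OTHERS.city
 AND CLIPROFILE.country = OTHERS.country;
```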

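Awesome, man, we have now brought the whole process down to 10 minutes instead of 10 hours :) I think creating those temporary tables helped as well. We first tried MEMORY tables, but they were too big to fit on our 2GB server, so we ended up using MyISAM, and now it works perfectly! - jmserra
I know this post is a bit old, but thanks @dnoeth for the tip: the performance improvement was amazing. - Luca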

Content provided by Stack Overflow.