优化大型关键词表?

3

I have a huge table like

CREATE TABLE IF NOT EXISTS `object_search` (
  `keyword` varchar(40) COLLATE latin1_german1_ci NOT NULL,
  `object_id` int(10) unsigned NOT NULL,
  PRIMARY KEY (`keyword`,`media_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;

本数据集中包含大约3900万行数据(占用超过1 GB的空间),其中记录了1百万个对象表(object_id指向该表)的索引数据。

现在,如果要使用类似以下查询来搜索数据:

SELECT object_id, COUNT(object_id) AS hits
FROM object_search
WHERE keyword = 'woman' OR keyword = 'house'
GROUP BY object_id
HAVING hits = 2

已经比在object表的keywords字段上执行LIKE搜索快得多,但仍需花费最多1分钟。

它的解释如下:

+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
| id | select_type | table  | type | possible_keys | key     | key_len | ref   | rows   | filtered | Extra                    |
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
|  1 | SIMPLE      | search | ref  | PRIMARY       | PRIMARY | 42      | const | 345180 |   100.00 | Using where; Using index |
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+

完整的解释包括连接了objectobject_colorobject_locale表,而上述查询在子查询中运行以避免开销,如下所示:

+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
| id | select_type | table             | type   | possible_keys | key       | key_len | ref              | rows   | filtered | Extra                           |
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
|  1 | PRIMARY     | <derived2>        | ALL    | NULL          | NULL      | NULL    | NULL             | 182544 |   100.00 | Using temporary; Using filesort |
|  1 | PRIMARY     | object_color      | eq_ref | object_id     | object_id | 4       | search.object_id |      1 |   100.00 |                                 |
|  1 | PRIMARY     | locale            | eq_ref | object_id     | object_id | 4       | search.object_id |      1 |   100.00 |                                 |
|  1 | PRIMARY     | object            | eq_ref | PRIMARY       | PRIMARY   | 4       | search.object_id |      1 |   100.00 |                                 |
|  2 | DERIVED     | search            | ref    | PRIMARY       | PRIMARY   | 42      |                  | 345180 |   100.00 | Using where; Using index        |
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+

我的首要目标是能够在1到2秒内快速浏览此内容。

那么,是否有其他技术可以提高关键词的搜索速度呢?


更新 2013-08-06:

应用了大部分 Neville K 的建议后,我现在有以下设置:

CREATE TABLE `object_search_keyword` (
  `keyword_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `keyword` varchar(64) COLLATE latin1_german1_ci NOT NULL,
  PRIMARY KEY (`keyword_id`),
  FULLTEXT KEY `keyword_ft` (`keyword`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;

CREATE TABLE `object_search` (
  `keyword_id` int(10) unsigned NOT NULL,
  `object_id` int(10) unsigned NOT NULL,
  PRIMARY KEY (`keyword_id`,`media_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

新查询的解释如下:
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
| id | select_type | table          | type     | possible_keys      | key        | key_len | ref                       | rows    | filtered | Extra                                        |
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
|  1 | PRIMARY     | <derived2>     | ALL      | NULL               | NULL       | NULL    | NULL                      |   24381 |   100.00 | Using temporary; Using filesort              |
|  1 | PRIMARY     | object_color   | eq_ref   | object_id          | object_id  | 4       | object_search.object_id   |       1 |   100.00 |                                              |
|  1 | PRIMARY     | object         | eq_ref   | PRIMARY            | PRIMARY    | 4       | object_search.object_id   |       1 |   100.00 |                                              |
|  1 | PRIMARY     | locale         | eq_ref   | object_id          | object_id  | 4       | object_search.object_id   |       1 |   100.00 |                                              |
|  2 | DERIVED     | <derived4>     | system   | NULL               | NULL       | NULL    | NULL                      |       1 |   100.00 |                                              |
|  2 | DERIVED     | <derived3>     | ALL      | NULL               | NULL       | NULL    | NULL                      |   24381 |   100.00 |                                              |
|  4 | DERIVED     | NULL           | NULL     | NULL               | NULL       | NULL    | NULL                      |    NULL |     NULL | No tables used                               |
|  3 | DERIVED     | object_keyword | fulltext | PRIMARY,keyword_ft | keyword_ft | 0       |                           |       1 |   100.00 | Using where; Using temporary; Using filesort |
|  3 | DERIVED     | object_search  | ref      | PRIMARY            | PRIMARY    | 4       | object_keyword.keyword_id | 2190225 |   100.00 | Using index                                  |
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+

许多派生查询都来自于将比较子查询嵌套到另一个仅计算返回行数的子查询中的关键字:
SELECT SQL_NO_CACHE object.object_id, ..., @rn AS numrows
FROM (
    SELECT *, @rn := @rn + 1
    FROM (
        SELECT SQL_NO_CACHE search.object_id, COUNT(turbo.object_id) AS hits
        FROM object_keyword AS kwd
        INNER JOIN object_search AS search ON (kwd.keyword_id = search.keyword_id)
        WHERE MATCH (kwd.keyword) AGAINST ('+(woman) +(house)')
        GROUP BY search.object_id HAVING hits = 2
    ) AS numrowswrapper
    CROSS JOIN (SELECT @rn := 0) CONST
) AS turbo
INNER JOIN object AS object ON (search.object_id = object.object_id)
LEFT JOIN object_color AS object_color ON (search.object_id = object_color.object_id)
LEFT JOIN object_locale AS locale ON (search.object_id = locale.object_id)
ORDER BY timestamp_upload DESC

上述查询实际上需要运行约6秒,因为它搜索了两个关键字。我搜索的关键字越多,搜索速度就越快。

有什么方法可以进一步优化吗?


更新于2013年08月07日

阻塞的问题似乎几乎肯定是附加的ORDER BY语句。如果没有它,查询将在不到一秒钟内执行。

那么,有没有更快地排序结果的方法?欢迎任何建议,即使是需要在其他地方进行后处理的hackish方法。


2013年08月07日更新当天晚些时候

好的女士们先生们,将WHEREORDER BY语句嵌套在另一个子查询层中,以免让它干扰它不需要的表,大致将其性能再次提高了一倍:

SELECT wowrapper.*, locale.title
FROM (
    SELECT SQL_NO_CACHE object.object_id, ..., @rn AS numrows
    FROM (
        SELECT *, @rn := @rn + 1
        FROM (
            SELECT SQL_NO_CACHE search.media_id, COUNT(search.media_id) AS hits
            FROM object_keyword AS kwd
            INNER JOIN object_search AS search ON (kwd.keyword_id = search.keyword_id)
            WHERE MATCH (kwd.keyword) AGAINST ('+(frau)')
            GROUP BY search.media_id HAVING hits = 1
        ) AS numrowswrapper
        CROSS JOIN (SELECT @rn := 0) CONST
    ) AS search 
    INNER JOIN object AS object ON (search.object_id = object.object_id) 
    LEFT JOIN object_color AS color ON (search.object_id = color.object_id)
    WHERE 1
    ORDER BY object.object_id DESC
) AS wowrapper 
LEFT JOIN object_locale AS locale ON (jfwrapper.object_id = locale.object_id) 
LIMIT 0,48

以前搜索一个关键词(大约200k结果)需要12秒,现在只需6秒,搜索两个关键词(60k结果)需要6秒,现在只需3.5秒。

虽然这已经是巨大的改进,但是还有机会再提高吗?


2013-08-08早更新

撤销了最后一次嵌套查询的变化,因为它实际上减慢了其他类型的变化……我现在尝试使用不同的表布局和MyISAM索引来创建一个专用搜索表,其中包含一个组合关键字字段(以逗号分隔在TEXT字段中)。


2013-08-08更新

好吧,简单的全文索引并不是很有用。

回到之前的设置,唯一阻碍的就是 ORDER BY(因为它需要使用临时表和文件排序)。如果没有它,搜索可以在不到一秒钟内完成!

所以,最终问题是:
如何优化 ORDER BY 语句以更快地运行,可能通过消除临时表的使用来实现?


你能否发布查询的 EXPLAIN 结果? - Vatev
我刚刚运行了一个 OPTIMIZE TABLE,现在它只需要10到30秒。 - Cobra_Fast
很难回答你的更新问题,因为你并没有拿苹果和苹果进行比较 - 在使用模式更改和全文搜索的情况下,你在原始问题中发布的查询是否运行得更快或更慢? - Neville Kuyt
@NevilleK 它运行得更快,并且作为最内层的子查询仍然存在于更新后的问题中。 - Cobra_Fast
3个回答

1

全文搜索比使用标准的SQL字符串比较功能要快得多。

其次,如果您的关键词存在高度冗余,可以考虑“多对多”实现:

Keywords
--------
keyword_id
keyword

keyword_object
-------------
keyword_id
object_id

objects
-------
object_id
......

如果将字符串比对从3900万行缩减到100K行(大约是英语词典的大小),你可能会看到明显的改善,因为查询只需要执行100K次字符串比对,而在整数keyword_id和object_id字段上进行连接应该比执行3900万次字符串比对要快得多。

这样不会让 keyword_object 表留下 3900 万行数据吗? - Cobra_Fast
1
是的,它可以 - 但你当前查询中昂贵的部分几乎肯定是字符串比较。我已经更新了答案来解释。 - Neville Kuyt
好的,我现在已经对这个解决方案进行了一些实验,我现在可以可靠地在10-20秒内获得结果,这是一个显着的改进,但仍然不如我想要的快。此外,我的关键字表最终有了1M行,而不是100K。让我感到疑惑的一个想法是,谷歌是如何在不到十分之一秒的时间内扫描数万亿条记录的。 - Cobra_Fast
从常规索引切换到全文索引,并调整查询以利用它,似乎没有任何区别。MATCH (kwd.keyword) AGAINST ('+%s' IN BOOLEAN MODE)。仍然需要大约10秒钟的时间。 - Cobra_Fast

0

最好的解决方案是使用FULLTEXT搜索,但您可能需要一个MyISAM表。您可以设置一个镜像表,并使用一些事件和触发器更新它,或者如果您有一个从服务器复制的从属,您可以将其表更改为MyISAM并用于搜索。

对于这个查询,我能想到的唯一办法就是将其重写为:

SELECT s1.object_id
FROM object_search s1
JOIN object_search s2 ON s2.object_id = s1.object_id AND s2.key_word = 'word2'
JOIN object_search s3 ON s3.object_id = s1.object_id AND s3.key_word = 'word3'
....
WHERE s1.key_word = 'word1'

我不确定这种方法是否会更快。

另外,您需要在object_id上建立索引(假设您的主键是(key_word, object_id))。


0
如果您很少进行INSERT操作,而经常进行SELECT操作,那么您可以针对读取优化数据,即重新计算每个关键字的object_ids数量,并直接将其存储在数据库中。这样,SELECT操作将非常快速,但是INSERT操作可能需要几秒钟的时间。

喜欢每个关键词只出现一次,并将object_id列表分配给它?那该怎么做呢? - Cobra_Fast

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接