Does MySQL 8.0 hash join have a major flaw?

2017-11-25 10:11:52 | 277 views | 猪猪侠

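Xu Chunyang published an article claiming that hash join in MySQL 8.0 has a major flaw.

The article's core point is this: when joining multiple tables (say, three), the optimizer simply puts the table with the fewest rows first as the driving table and the largest table last, which can produce a Cartesian product with an enormous result set and may even exhaust CPU and disk space.

I ran a test of my own to look into this.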

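1. Preparing the test environment with the TPC-H tools

The TPC-H tools can be downloaded from http://www.tpc.org/tpch/default5.asp. They do not support MySQL out of the box and need some manual adjustments; see https://imysql.com/2012/12/21/tpch-for-mysql-manual.html.

In this test I set the Scale Factor to 10: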

[root@yejr.run dbgen]# ./dbgen -s 10 && ls -l *tbl
-rw-r--r-- 1 root root  244847642 Apr 14 09:52 customer.tbl
-rw-r--r-- 1 root root 7775727688 Apr 14 09:52 lineitem.tbl
-rw-r--r-- 1 root root       2224 Apr 14 09:52 nation.tbl
-rw-r--r-- 1 root root 1749195031 Apr 14 09:52 orders.tbl
-rw-r--r-- 1 root root  243336157 Apr 14 09:52 part.tbl
-rw-r--r-- 1 root root 1204850769 Apr 14 09:52 partsupp.tbl
-rw-r--r-- 1 root root        389 Apr 14 09:52 region.tbl
-rw-r--r-- 1 root root   14176368 Apr 14 09:52 supplier.tbl

2. Creating the test tables and loading the test data

The data volume of each table is as follows:

+----------+------------+----------+----------------+-------------+--------------+
| Name     | Row_format | Rows     | Avg_row_length | Data_length | Index_length |
+----------+------------+----------+----------------+-------------+--------------+
| customer | Dynamic    |  1476605 |            197 |   291258368 |            0 |
| lineitem | Dynamic    | 59431418 |            152 |  9035579392 |            0 |
| nation   | Dynamic    |       25 |            655 |       16384 |            0 |
| orders   | Dynamic    | 14442405 |            137 |  1992294400 |            0 |
| part     | Dynamic    |  1980917 |            165 |   327991296 |            0 |
| partsupp | Dynamic    |  9464104 |            199 |  1885339648 |            0 |
| region   | Dynamic    |        5 |           3276 |       16384 |            0 |
| supplier | Dynamic    |    99517 |            184 |    18366464 |            0 |
+----------+------------+----------+----------------+-------------+--------------+

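Note: do not add any indexes, including primary keys, to the test tables; Index_length is 0 for every table above.

3. Running the test SQL

The MySQL version used in this test is 8.0.19: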

[root@yejr.run]> \s
...
Server version:         8.0.19-commercial MySQL Enterprise Server - Commercial
...

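Xu ran into the problem with Q5 from TPC-H, so this test uses the same SQL.

However, since the focus here is hash join, I removed the GROUP BY and ORDER BY clauses.

Let's look at the execution plan first. Every table gets a full table scan, which is scary...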

[root@yejr.run]> desc select count(*)
-> from
->     customer,
->     orders,
->     lineitem,
->     supplier,
->     nation,
->     region
-> where
->     c_custkey = o_custkey
->     and l_orderkey = o_orderkey
->     and l_suppkey = s_suppkey
->     and c_nationkey = s_nationkey
->     and s_nationkey = n_nationkey
->     and n_regionkey = r_regionkey
->     and r_name = "AMERICA"
->     and o_orderdate >= date "1993-01-01"
->     and o_orderdate < date "1993-01-01" + interval "1" year;
+----------+------+----------+----------+----------------------------------------------------+
| table    | type | rows     | filtered | Extra                                              |
+----------+------+----------+----------+----------------------------------------------------+
| region   | ALL  |        5 |    20.00 | Using where                                        |
| nation   | ALL  |       25 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
| supplier | ALL  |    98705 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
| customer | ALL  |  1485216 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
| orders   | ALL  | 14932433 |     1.11 | Using where; Using join buffer (Block Nested Loop) |
| lineitem | ALL  | 59386314 |     1.00 | Using where; Using join buffer (Block Nested Loop) |
+----------+------+----------+----------+----------------------------------------------------+

Now the same plan with format=tree (quite a sight...)

*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)
-> Inner hash join (lineitem.L_SUPPKEY = supplier.S_SUPPKEY), (lineitem.L_ORDERKEY = orders.O_ORDERKEY)  (cost=40107736685515472896.00 rows=4010763818487343104)
    -> Table scan on lineitem  (cost=0.07 rows=59386314)
    -> Hash
        -> Inner hash join (orders.O_CUSTKEY = customer.C_CUSTKEY)  (cost=60799566599072.12 rows=6753683238538)
            -> Filter: ((orders.O_ORDERDATE >= DATE"1993-01-01") and (orders.O_ORDERDATE < ((DATE"1993-01-01" + interval "1" year))))  (cost=0.16 rows=165883)
                -> Table scan on orders  (cost=0.16 rows=14932433)
            -> Hash
                -> Inner hash join (customer.C_NATIONKEY = nation.N_NATIONKEY)  (cost=3664985889.79 rows=3664956624)
                    -> Table scan on customer  (cost=0.79 rows=1485216)
                    -> Hash
                        -> Inner hash join (supplier.S_NATIONKEY = nation.N_NATIONKEY)  (cost=24976.50 rows=24676)
                            -> Table scan on supplier  (cost=513.52 rows=98705)
                            -> Hash
                                -> Inner hash join (nation.N_REGIONKEY = region.R_REGIONKEY)  (cost=3.50 rows=3)
                                    -> Table scan on nation  (cost=0.50 rows=25)
                                    -> Hash
                                        -> Filter: (region.R_NAME = "AMERICA")  (cost=0.75 rows=1)
                                            -> Table scan on region  (cost=0.75 rows=5)

It does look as though the smallest table is placed first and the largest last.
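
One way to see why such a plan can blow up: in the tree above, customer is joined to the supplier/nation/region result only on the nation key, so every customer pairs with every supplier of the same nation. The following is a toy sketch of that effect. The table sizes and the helper name `equi_join_rows` are made up for illustration; they are not the TPC-H data and not anything MySQL exposes.

```python
from collections import Counter

def equi_join_rows(left_keys, right_keys):
    """Count the rows an equi-join emits, given each side's join-key values."""
    right_counts = Counter(right_keys)
    return sum(right_counts[key] for key in left_keys)

# Made-up sizes for illustration only (not the TPC-H row counts).
NATIONS = 5
supplier_nation = [i % NATIONS for i in range(1_000)]    # 1,000 suppliers
customer_nation = [i % NATIONS for i in range(10_000)]   # 10,000 customers

rows = equi_join_rows(supplier_nation, customer_nation)
print(rows)  # 2000000: each supplier matches the 2,000 customers of its nation
```

Two inputs of 1,000 and 10,000 rows already yield two million intermediate rows, because the low-cardinality nation key is the only join condition between them.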

Before running it, let's take a look at how the manual describes hash join. One passage reads:

Memory usage by hash joins can be controlled using the join_buffer_size
system variable; a hash join cannot use more memory than this amount. 
When the memory required for a hash join exceeds the amount available, 
MySQL handles this by using files on disk. If this happens, you should 
be aware that the join may not succeed if a hash join cannot fit into 
memory and it creates more files than set for open_files_limit. To avoid 
such problems, make either of the following changes:

- Increase join_buffer_size so that the hash join does not spill over to disk.
- Increase open_files_limit.

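In short, when join_buffer_size is not large enough, the hash join spills a large number of files to disk (the hash table is split into many small files on disk, which are then read back into memory one at a time to perform the hash join), so the manual recommends either increasing join_buffer_size or raising open_files_limit.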
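
The split-into-chunks idea can be sketched as a generic partitioned hash join. This is not MySQL's implementation: `max_build_rows` is a hypothetical stand-in for join_buffer_size, the function names are invented, and in-memory lists stand in for the chunk files on disk.

```python
from collections import defaultdict

def in_memory_hash_join(build_rows, probe_rows, build_key, probe_key):
    """Classic hash join: hash the build side, then probe it row by row."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    return [(b, p) for p in probe_rows for b in table.get(p[probe_key], [])]

def partitioned_hash_join(build_rows, probe_rows, build_key, probe_key,
                          max_build_rows, num_partitions=4):
    """Hash join that partitions both inputs when the build side is too big.

    max_build_rows is a hypothetical stand-in for join_buffer_size; the
    partition lists stand in for the chunk files a real engine writes to disk.
    """
    if len(build_rows) <= max_build_rows:
        return in_memory_hash_join(build_rows, probe_rows, build_key, probe_key)

    build_parts = [[] for _ in range(num_partitions)]
    probe_parts = [[] for _ in range(num_partitions)]
    # Rows with equal keys land in the same partition number on both sides.
    for row in build_rows:
        build_parts[hash(row[build_key]) % num_partitions].append(row)
    for row in probe_rows:
        probe_parts[hash(row[probe_key]) % num_partitions].append(row)

    result = []
    # Join one pair of partitions at a time. A real engine would re-check the
    # size of each partition; this sketch assumes each one now fits in memory.
    for b_part, p_part in zip(build_parts, probe_parts):
        result.extend(in_memory_hash_join(b_part, p_part, build_key, probe_key))
    return result

customers = [{"c_custkey": k} for k in range(3)]
orders = [{"o_orderkey": i, "o_custkey": i % 3} for i in range(10)]
joined = partitioned_hash_join(customers, orders, "c_custkey", "o_custkey",
                               max_build_rows=1)
print(len(joined))  # 10: every order finds exactly one customer
```

The more the build side exceeds the buffer, the more chunk files are needed, which is why the manual mentions open_files_limit alongside join_buffer_size.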

So before the real run, I raised join_buffer_size to 1GB and also checked a few other parameters:

[root@yejr.run]> select @@join_buffer_size,  @@tmp_table_size,  @@innodb_buffer_pool_size;
+--------------------+------------------+---------------------------+
| @@join_buffer_size | @@tmp_table_size | @@innodb_buffer_pool_size |
+--------------------+------------------+---------------------------+
|         1073741824 |         16777216 |               10737418240 |
+--------------------+------------------+---------------------------+

To be safe, I also set join_buffer_size with the SET_VAR hint (new in 8.0) when executing the SQL. Here we go.
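
The hinted statement itself is not shown here, so the following is only a sketch of what such a hint could look like. It is a small Python helper that places a `/*+ SET_VAR(...) */` optimizer hint right after the leading SELECT; the helper name `with_set_var` and the sample query are assumptions for illustration.

```python
def with_set_var(select_sql: str, **session_vars: int) -> str:
    """Insert a /*+ SET_VAR(name=value) ... */ hint after the leading SELECT.

    Hypothetical helper for illustration; it only handles statements that
    start with SELECT and does no validation of the variable names.
    """
    stripped = select_sql.lstrip()
    if stripped[:6].upper() != "SELECT":
        raise ValueError("expected a statement starting with SELECT")
    hints = " ".join(f"SET_VAR({name}={value})"
                     for name, value in session_vars.items())
    return f"SELECT /*+ {hints} */{stripped[6:]}"

# 1 GB, the same value the session variable was raised to.
query = with_set_var("select count(*) from orders",
                     join_buffer_size=1024 ** 3)
print(query)
# SELECT /*+ SET_VAR(join_buffer_size=1073741824) */ count(*) from orders
```

A hint of this kind changes the variable for that one statement only, so the session-level setting is left untouched.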

Fortunately the SQL finished successfully in the end, taking 2911 seconds in total.

# Query_time: 2911.426483  Lock_time: 0.000251 Rows_sent: 1  Rows_examined: 76586082

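Of course, the cost of running this SQL was very high: it generated a large number of (invisible) temporary files on disk.

Every few seconds I totalled the size of all temporary files and watched the remaining disk space.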

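The /data partition started with 373GB available, and at its peak this SQL ate up nearly 180GB (the two df readings below differ by about 178GB). Scary.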

# At the start
/dev/vdb       524032000 132967368 391064632  26% /data

# At peak
/dev/vdb       524032000 319732288 204299712  62% /data
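
The two readings above can be turned into a disk-usage figure with a little arithmetic. The sketch below assumes that df reported sizes in 1K blocks (its default without -h), which is consistent with the 373GB figure quoted for available space; the variable names are my own.

```python
KIB_PER_GIB = 1024 * 1024

def kib_to_gib(kib: int) -> float:
    """Convert a count of 1K blocks (as printed by df) to GiB."""
    return kib / KIB_PER_GIB

# Values copied from the two df readings (assumed to be 1K blocks).
used_start_kib = 132967368
used_peak_kib = 319732288
avail_start_kib = 391064632

extra_kib = used_peak_kib - used_start_kib
print(round(kib_to_gib(avail_start_kib), 1))  # 372.9 GiB free at the start
print(round(kib_to_gib(extra_kib), 1))        # 178.1 GiB consumed at peak
```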

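CPU load, judging from monitoring, was acceptable, peaking at about 38.4%.

4. Additional tests

In the test above the optimizer changed the driving order on its own. Let's see what happens with straight_join: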

[root@yejr.run]> EXPLAIN select STRAIGHT_JOIN count(*)
from
    customer straight_join 
    orders  straight_join 
    lineitem  straight_join 
    supplier  straight_join 
    nation  straight_join 
    region
where
    c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and l_suppkey = s_suppkey
    and c_nationkey = s_nationkey
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = "AMERICA"
    and o_orderdate >= date "1993-01-01"
    and o_orderdate < date "1993-01-01" + interval "1" year;
+----------+----------+----------+----------------------------------------------------+
| table    | rows     | filtered | Extra                                              |
+----------+----------+----------+----------------------------------------------------+
| customer |  1485216 |   100.00 | NULL                                               |
| orders   | 14932433 |     1.11 | Using where; Using join buffer (Block Nested Loop) |
| lineitem | 59386314 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
| supplier |    98705 |     1.00 | Using where; Using join buffer (Block Nested Loop) |
| nation   |       25 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
| region   |        5 |    20.00 | Using where; Using join buffer (Block Nested Loop) |
+----------+----------+----------+----------------------------------------------------+

# In format=tree mode
| -> Aggregate: count(0)
    -> Inner hash join (region.R_REGIONKEY = nation.N_REGIONKEY)  (cost=204565289351994015744.00 rows=8021527039324357632)
        -> Filter: (region.R_NAME = "AMERICA")  (cost=0.00 rows=1)
            -> Table scan on region  (cost=0.00 rows=5)
        -> Hash
            -> Inner hash join (nation.N_NATIONKEY = customer.C_NATIONKEY)  (cost=200554431911464173568.00 rows=-9223372036854775808)
                -> Table scan on nation  (cost=0.00 rows=25)
                -> Hash
                    -> Inner hash join (supplier.S_NATIONKEY = customer.C_NATIONKEY), (supplier.S_SUPPKEY = lineitem.L_SUPPKEY)  (cost=160446786739199049728.00 rows=-9223372036854775808)
                        -> Table scan on supplier  (cost=0.00 rows=98705)
                        -> Hash
                            -> Inner hash join (lineitem.L_ORDERKEY = orders.O_ORDERKEY)  (cost=16253562153466286.00 rows=16253535510797654)
                                -> Table scan on lineitem  (cost=0.01 rows=59386314)
                                -> Hash
                                    -> Inner hash join (orders.O_CUSTKEY = customer.C_CUSTKEY)  (cost=24638698342.46 rows=2736915995)
                                        -> Filter: ((orders.O_ORDERDATE >= DATE"1993-01-01") and (orders.O_ORDERDATE < ((DATE"1993-01-01" + interval "1" year))))  (cost=0.94 rows=165883)
                                            -> Table scan on orders  (cost=0.94 rows=14932433)
                                        -> Hash
                                            -> Table scan on customer  (cost=153126.35 rows=1485216)

And the actual execution time:

[root@yejr.run]> select STRAIGHT_JOIN count(*)
...
+----------+
| count(*) |
+----------+
|    72033 |
+----------+
1 row in set (4 min 12.31 sec)

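While this SQL was running, only a handful of temporary files were created, with an almost negligible impact.

The reason it is faster this time is that the orders table is processed second and has a WHERE condition attached, so the data volume shrinks after filtering (about 15 million rows in the full table, 2.27 million after filtering), which shortens the overall execution time.

straight_join saved the day.

In addition, during testing I also ran a full join across just three tables. Here is the execution plan: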

[root@yejr.run]> desc select count(*) from orders o , lineitem l, partsupp ps where
o.O_CUSTKEY = l.L_SUPPKEY and l.L_PARTKEY = ps.PS_AVAILQTY;
+-------+----------+----------+----------------------------------------------------+
| table | rows     | filtered | Extra                                              |
+-------+----------+----------+----------------------------------------------------+
| ps    |  7697248 |   100.00 | NULL                                               |
| l     | 59386314 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
| o     | 14932433 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
+-------+----------+----------+----------------------------------------------------+

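In this execution plan the problem Xu described does not appear: the optimizer does not simply order the tables from smallest to largest (lineitem, the largest table, is joined second rather than last).

This SQL took 304 seconds, which is not bad.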

# Query_time: 304.889654  Lock_time: 0.000178 Rows_sent: 1  Rows_examined: 82986052

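Final thoughts

In my article from a few days ago, 《MySQL没前途了吗？》 ("Does MySQL Have No Future?"), I already said that MySQL is currently not suited to OLAP workloads, even with hash join, since the scenarios where hash join applies are quite limited.

The tables in this test have no indexes at all, which is a very extreme scenario and one that should not be allowed to occur.

Also, when it is already clear that hash join will be used, one should intervene manually and increase join_buffer_size ahead of time to reduce the temporary files generated during execution.

Of course, if a multi-table JOIN does not behave as expected, STRAIGHT_JOIN can be used to force the driving order, which also sidesteps the problem.

That said, MySQL's performance in OLAP-leaning scenarios still has plenty of room for improvement, and I am cautiously optimistic about it. What if it simply absorbed ClickHouse, for instance :)

I am not entirely confident about this article; after all, I am no source-code guru. If I have misunderstood anything, please leave a comment to correct me. I would be most grateful.

SQL optimization expert Zheng Songhua (郑松华) also contributed to this article. My thanks to both teachers.

End of article.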



Article link: https://www.lsjlt.com/news/5806.html (please cite the source link when reposting)
