优享资讯 | PostgreSQL 提升子连接与 Oracle 子查询非嵌套

掘金后端 ( ) • 2024-05-05 16:58

查询优化器对子查询一般采用嵌套执行的方式，即父查询中的每一行，都要执行一次子查询，这样子查询会执行很多次，效率非常低。例如 exists、not exists 逐行取出经行匹配处理，项目中使用子查询的地方非常多，如何写出高效的sql，掌握子查询的优化是非常有必要的。

一、需要了解的概念：

PostgreSQL数据库基于子查询所在的位置和作用的不同，将子查询细分成了两类，一类称为子连接（SubLink），另一类称为子查询（SubQuery）。如何区分子连接和子查询？ 通常而言，如果它是以范围表的方式存在的，那么就称为子查询。

explain
select e1.*
from emp e1,
     (select * from emp where deptno = 10) e2
where e1.empno = e2.empno

Hash Join  (cost=1.21..2.40 rows=3 width=41)
  Hash Cond: (e1.empno = emp.empno)
  ->  Seq Scan on emp e1  (cost=0.00..1.14 rows=14 width=41)
  ->  Hash  (cost=1.18..1.18 rows=3 width=5)
        ->  Seq Scan on emp  (cost=0.00..1.18 rows=3 width=5)
              Filter: (deptno = '10'::numeric)

如果它以表达式的方式存在，那么就称为子连接。

-- 子链接1
explain
select e.empno, (select avg(sal) from emp e1 where e.deptno = e1.deptno)
from emp e

Seq Scan on emp e  (cost=0.00..17.94 rows=14 width=37)
  SubPlan 1
    ->  Aggregate  (cost=1.19..1.20 rows=1 width=32)
          ->  Seq Scan on emp e1  (cost=0.00..1.18 rows=5 width=5)
                Filter: (e.deptno = deptno)


-- 子链接2
explain
select *
from emp e1
where sal in (select sal from emp where e1.deptno = 10)

Seq Scan on emp e1  (cost=0.00..9.43 rows=7 width=41)
  Filter: (SubPlan 1)
  SubPlan 1
    ->  Result  (cost=0.00..1.14 rows=14 width=5)
          One-Time Filter: (e1.deptno = '10'::numeric)
          ->  Seq Scan on emp  (cost=0.00..1.14 rows=14 width=5)

一般情况下，子连接的执行效率往往是比子查询低很多，因为子连接是逐行处理，会产生filter，而PostgreSQL 优化器会在某些情况下对子连接进行提升为子查询，从而对filter操作进行消除，提升查询效率。

二、in 子连接

** in 子连接提升子查询写法：**

explain select * from emp where deptno in (select deptno from dept);

Hash Join  (cost=1.09..2.31 rows=14 width=41)
  Hash Cond: (emp.deptno = dept.deptno)
  ->  Seq Scan on emp  (cost=0.00..1.14 rows=14 width=41)
  ->  Hash  (cost=1.04..1.04 rows=4 width=5)
        ->  Seq Scan on dept  (cost=0.00..1.04 rows=4 width=5)

可以被提升，优化器相会内部重写成内连接。

explain select emp.* from emp inner join dept on emp.deptno = dept.deptno;

Hash Join  (cost=1.09..2.31 rows=14 width=41)
  Hash Cond: (emp.deptno = dept.deptno)
  ->  Seq Scan on emp  (cost=0.00..1.14 rows=14 width=41)
  ->  Hash  (cost=1.04..1.04 rows=4 width=5)
        ->  Seq Scan on dept  (cost=0.00..1.04 rows=4 width=5)

表明此 in 子连接被优化，优化后采用hash join算法。

in 子连接无法提升子查询写法：

-- 子连接包含谓词过滤写法，无法提升子查询
explain select * from emp e1 where sal in (select sal from emp where e1.deptno = 10);

Seq Scan on emp e1  (cost=0.00..9.43 rows=7 width=41)
  Filter: (SubPlan 1)
  SubPlan 1
    ->  Result  (cost=0.00..1.14 rows=14 width=5)
          One-Time Filter: (e1.deptno = '10'::numeric)
          ->  Seq Scan on emp  (cost=0.00..1.14 rows=14 width=5)

-- 使用 not in 也是无法提升子查询，not in与 <> all含义相同

explain select * from emp e1 where sal not in (select sal from emp );

Seq Scan on emp e1  (cost=1.18..2.35 rows=7 width=41)
  Filter: (NOT (hashed SubPlan 1))
  SubPlan 1
    ->  Seq Scan on emp  (cost=0.00..1.14 rows=14 width=5)

表明此 in 子连接未被优化，无法消除filter操作，只能逐行处理。

三、exists 子连接

** exists 子连接提升子查询写法：**

explain
select e.* from emp e where exists(select * from emp e2 where e.empno = e2.empno);

Hash Join  (cost=1.32..2.50 rows=14 width=41)
  Hash Cond: (e.empno = e2.empno)
  ->  Seq Scan on emp e  (cost=0.00..1.14 rows=14 width=41)
  ->  Hash  (cost=1.14..1.14 rows=14 width=5)
        ->  Seq Scan on emp e2  (cost=0.00..1.14 rows=14 width=5)


-- 当加入where e.deptno = 10 条件在子链接时，仍然支持上拉
explain
select e.*
from emp e
where exists(select * from emp e2 where e.deptno = 10)

Nested Loop Semi Join  (cost=0.00..2.45 rows=3 width=41)
  ->  Seq Scan on emp e  (cost=0.00..1.18 rows=3 width=41)
        Filter: (deptno = '10'::numeric)
  ->  Materialize  (cost=0.00..1.21 rows=14 width=0)
        ->  Seq Scan on emp e2  (cost=0.00..1.14 rows=14 width=0)

exists 子连接无法提升子查询写法：

explain
select e.*
from emp e
where exists(select sum(sal) from emp e2 where e.empno = e2.empno);

Seq Scan on emp e  (cost=0.00..17.80 rows=7 width=41)
  Filter: (SubPlan 1)
  SubPlan 1
    ->  Aggregate  (cost=1.18..1.19 rows=1 width=32)
          ->  Seq Scan on emp e2  (cost=0.00..1.18 rows=1 width=5)
                Filter: (e.empno = empno)


explain
select e.*
from emp e
where exists(select * from emp e2 where e2.deptno = 10);

Result  (cost=0.39..1.53 rows=14 width=41)
  One-Time Filter: $0
  InitPlan 1 (returns $0)
    ->  Seq Scan on emp e2  (cost=0.00..1.18 rows=3 width=0)
          Filter: (deptno = '10'::numeric)
  ->  Seq Scan on emp e  (cost=0.39..1.53 rows=14 width=41)


explain
select e.* from emp e where not exists(select 1 from emp e2 )

Result  (cost=0.08..1.22 rows=14 width=41)
  One-Time Filter: (NOT $0)
  InitPlan 1 (returns $0)
    ->  Seq Scan on emp e2  (cost=0.00..1.14 rows=14 width=0)
  ->  Seq Scan on emp e  (cost=0.08..1.22 rows=14 width=41)

in和exists都存在不被优化的可能，对于in和exists的选择，当父查询结果集小于子查询结果集则选择exists，如果父查询结果集大于子查询结果集选择in。

四、所有的all子链接都不支持上拉

explain
select * from emp where sal > all (select sal from emp e2);

Seq Scan on emp  (cost=0.00..9.89 rows=7 width=41)
  Filter: (SubPlan 1)
  SubPlan 1
    ->  Materialize  (cost=0.00..1.21 rows=14 width=5)
          ->  Seq Scan on emp e2  (cost=0.00..1.14 rows=14 width=5)


explain
select * from emp where sal = all (select sal from emp e2);

Seq Scan on emp  (cost=0.00..9.89 rows=7 width=41)
  Filter: (SubPlan 1)
  SubPlan 1
    ->  Materialize  (cost=0.00..1.21 rows=14 width=5)
          ->  Seq Scan on emp e2  (cost=0.00..1.14 rows=14 width=5)


explain
select * from emp where sal < all (select sal from emp e2);

Seq Scan on emp  (cost=0.00..9.89 rows=7 width=41)
  Filter: (SubPlan 1)
  SubPlan 1
    ->  Materialize  (cost=0.00..1.21 rows=14 width=5)
          ->  Seq Scan on emp e2  (cost=0.00..1.14 rows=14 width=5)

关于all的查询都都是以子连接的形式，不会上拉。 some和any是等效的，这里不做演示了。

五、join与子查询固化或rewrite

join或子查询的优化，属于优化器优化JOIN的范畴。
当用户的QUERY涉及到多个JOIN对象，或者涉及到多个子查询时，优化器可以选择是否改变当前的SQL，产生更多的plan选择更优的执行计划。
postgresql.conf文件中：
#from_collapse_limit = 8

当from列表的对象少于from_collapse_limit时，优化器可以将子查询提升到上层进行JOIN，从而可能选择到更优的执行计划。
#join_collapse_limit = 8               # 1 disables collapsing of explicit
                                       # JOIN clauses

当使用显示的JOIN时（除了full join），例如a join b join c join d，优化器可以重排JOIN的顺序，以产生更多的PLAN选择更优的执行计划。
如果join_collapse_limit=1，则不重排，使用SQL写法提供的顺序。
如果用户要固化JOIN顺序，请使用显示的JOIN，同时将join_collapse_limit设置为1。
如果用户不打算提升子查询，同样的，将from_collapse_limit 设置为1即可。

六、等价改写

子查询中没有group by子句，也没有聚集函数，则可使用下面的等价转换
val>all(select...)  to val>max(select...)
val<all(select...) to val<min(select...)
val>any(select...) to val>min(select...)
val<any(select...) to val<max(select...)
val>=all(select...) to val>=max(select...)
val<=all(select...) to val<=min(select...)
val>=any(select...) to val>=min(select...)
val<=any(select...) to val<=max(select...)
通常，聚集函数min(),max()的执行效率要比any、all效率高

七、ORACLE 的子查询非嵌套

子查询非嵌套（subquery Unnesting），查询转换技巧中强烈要求掌握的技能。 ** 当where子查询中有in、not in、exits、not exists等，CBO会尝试将子查询展开（unnest）,从而消除FILTER。这个过程叫作子查询非嵌套。子查询非嵌套的目的就是消除FILTER。** 当子查询语句含有exists或not exists时，子查询中有固化子查询关键词（union/union all / start connect by/rownum /cube / rollup）,那么执行计划就很容易产生FILTER。

八、结束语

1、postgresql子查询的优化思路，子查询不用执行多次 2、优化器可以根据统计信息来选择不同的连接方法和不同的连接顺序 3、子查询中的连接条件，过滤条件分别变成了父查询的连接条件、过滤条件、优化器可以对这些条件进行下推、提高执行效率 4、将子查询优化为表连接后，子查询只需要执行一次、而优化器可以根据统计信息来选择不同的连接方式和连接顺序、子查询的连接条件和过滤条件分别变成父查询的条件。 5、这些查询中all是完全不支持上拉子子链接的，in和exists存在不被优化的可能。 6、not exists虽然没有被上拉，但是被优化为只执行一次，相对于not in稍好。 7、可使用等价改写的方式优化8.可根据配置文件，固化子查询，以及表的连接顺序。 8、无论是PostgreSQL中的子连接提升查询还是ORACLE中的子查询非嵌套主要的目的就是优化子查询，消除filter。