Correlated semi-join subqueries and PostgreSQL

The work on subquery optimizations goes on, and we’re already curious how the assortment of new 6.0 subquery optimizations compares to what is found in other DBMSes. MySQL’s new strategies are not DBMS theory breakthroughs, similar ideas have been implemented in other DBMSes, but when you take this list of strategies and put them into this kind of optimizer, the result has no direct equivalents in any other DBMS (or at least we believe so).

The first thing we did was to take a look at PostgreSQL as it is easily available and seems to have at least decent subquery handling (or even better than decent, I have not heard much complaints). And the first interesting difference was handling of correlated subqueries. With exception of materialization, MySQL’s new subquery strategies do not care if the subquery is correlated or not.

For example, let’s look at table pullout: First let’s try an uncorrelated subquery:

mysql> explain select * from tbl1 where col1 in (select primary_key from tbl2 where col2 < 20);
+----+-------------+-------+------+---------------+-----------+---------+------------------+------+-------------+
| id | select_type | table | type | possible_keys | key       | key_len | ref              | rows | Extra       |
+----+-------------+-------+------+---------------+-----------+---------+------------------+------+-------------+
|  1 | PRIMARY     | tbl2  | ALL  | PRIMARY       | NULL      | NULL    | NULL             |  100 | Using where |
|  1 | PRIMARY     | tbl1  | ref  | tbl1_col1     | tbl1_col1 | 5       | tbl2.primary_key |    1 |             |
+----+-------------+-------+------+---------------+-----------+---------+------------------+------+-------------+
mysql> show warnings\G
…
Message: select `test2`.`tbl1`.`col1` AS `col1`,`test2`.`tbl1`.`col2` AS `col2` from (`test2`.`tbl2`) join `test2`.`tbl1` 
where ((`test2`.`tbl1`.`col1` = `test2`.`tbl2`.`primary_key`) and (`test2`.`tbl2`.`col2` < 20))

and we see that it has been converted into an inner join (indicated by the part marked in red). Now let’s make the subquery correlated:

mysql> explain select * from tbl1 where col1 in (select primary_key from tbl2 where
    -> col2 < 20 and tbl2.col2 = tbl1.col2);
+----+-------------+-------+--------+---------------+---------+---------+-----------+------+----------------------------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref       | rows | Extra                            |
+----+-------------+-------+--------+---------------+---------+---------+-----------+------+----------------------------------+
|  1 | PRIMARY     | tbl1  | range  | col1,col2     | col2    | 5       | NULL      |   20 | Using index condition; Using MRR |
|  1 | PRIMARY     | tbl2  | eq_ref | PRIMARY       | PRIMARY | 4       | tbl1.col1 |    1 | Using where                      |
+----+-------------+-------+--------+---------------+---------+---------+-----------+------+----------------------------------+
mysql> show warnings\G
…
Message: select `test2`.`tbl1`.`col1` AS `col1`,`test2`.`tbl1`.`col2` AS `col2` from (`test2`.`tbl2`) join `test2`.`tbl1`
where ((`test2`.`tbl2`.`primary_key` = `test2`.`tbl1`.`col1`) and (`test2`.`tbl2`.`col2` = `test2`.`tbl1`.`col2`) and
 (`test2`.`tbl1`.`col2` < 20))

The query execution plans are different (because of the extra equality) but we see that subquery was converted into join in both cases.

Now let’s try PostgreSQL: first, uncorrelated subquery:

psergey=# explain select * from tbl1 where col1 in (select primary_key from tbl2 where col2 < 20);
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Merge IN Join  (cost=0.00..100.11 rows=1 width=8)
   Merge Cond: (tbl1.col1 = tbl2.primary_key)
   ->  Index Scan using tbl1_col1 on tbl1  (cost=0.00..56.66 rows=600 width=8)
   ->  Index Scan using tbl2_pkey on tbl2  (cost=0.00..503.25 rows=12 width=4)
         Filter: (col2 < 20)

Ok, a decent plan. Now, let’s make the subquery correlated:

psergey=# explain select * from tbl1 where col1 in (select primary_key from tbl2 where
psergey(# col2 < 20 and tbl2.col2 = tbl1.col2);
                          QUERY PLAN
--------------------------------------------------------------
 Seq Scan on tbl1  (cost=0.00..60014.25 rows=300 width=8)
   Filter: (subplan)
   SubPlan
     ->  Seq Scan on tbl2  (cost=0.00..200.00 rows=1 width=4)
           Filter: ((col2 < 20) AND (col2 = $0))

and we can see that PostgreSQL starts using the “naive” approach, where the subquery predicate is evaluated using a direct, straightforward fashion and is not used for any optimizations.

So the first impression is that MySQL kinda wins this round. But wait, if you look at user subquery case analysis here, you’ll see that the vast majority of semi-join subqueries are uncorrelated. I wonder if PG folks at some point in the past have made a conscious decision not to bother optimizing correlated subqueries.

For uncorrelated subqueries, it’s different. Our first impression is that PostgreSQL has a powerful tool - hash join and its subquery variants, which it uses to get decent performance in wide variety of cases. MySQL does not have hash join (Hi Timour? Kaamos?), but the new subquery optimizations can have kinda-similar behavior. It is not clear yet whether they’ll be a good subquery hash join substitute. I intend to post the analysis here once I figure it out. Stay tuned.

Posted in subqueries, benchmarks on March 24th, 2008 by spetrunia | |

8 Responses to ' Correlated semi-join subqueries and PostgreSQL '

Subscribe to comments with RSS or TrackBack to ' Correlated semi-join subqueries and PostgreSQL '.

  1. G said,

    on March 24th, 2008 at 1:07 pm

    I take it you can only do this when you can prove the columns being selected in the subquery are all not NULL?

  2. Scott Marlowe said,

    on March 25th, 2008 at 5:54 am

    What does explain analyze select … say for postgresql? What version of postgresql? And were the tables analyzed before the query?

    There were some recent optimizations made for correlated subqueries (some are still in CVS I believe so you’d need a recent snapshot to really see if that helps as well).

    Oh, and that NULL thing G mentioned. Were the pgsql tables joined on not null fields? I wonder how well each db would work on correlated subqueries with NULLable fields…

  3. Gurjeet Singh said,

    on March 26th, 2008 at 9:27 am

    Thats surprising (about Postgres)!!! Can you post/point me to the schema and data?

  4. sergey said,

    on March 26th, 2008 at 3:47 pm

    G> I take it you can only do this when you can prove the columns being selected in the subquery are all not NULL?

    Not necessarily. When the subquery is an AND-part of the WHERE clause (which is a practically important case), we don’t care whether the subquery predicate returns NULL or FALSE and so can perform the rewrite no matter if subquery’s select list is NULL-able or not.

    I’ve tried the correlated example with all columns defined as NOT NULL and got the same query plan (albeit with different cost).

  5. sergey said,

    on March 26th, 2008 at 3:49 pm

    Scott, Gurjeet: I’ll collect the info and post it here.

    Scott, if it’s easy for you, could you point to the recent PG optimizations you’ve referred to?


  6. on March 28th, 2008 at 9:30 pm

    […] Here’s a useful technical post from Sergey Petrunia on correlated semi-join subqueries and PostgreSQL […]

  7. what do braces cost said,

    on April 29th, 2013 at 4:25 pm

    what do braces cost…

    Sergey Petrunia’s blog » Correlated semi-join subqueries and PostgreSQL…

  8. simply click the next web page said,

    on May 13th, 2013 at 1:03 am

    simply click the next web page…

    Sergey Petrunia’s blog » Correlated semi-join subqueries and PostgreSQL…

Leave a reply

You must be logged in to post a comment.