More MySQL 6.0 news: next subquery optimizations WL pushed

Three days ago I’ve finally managed to push the code for WL#3985 “Subquery optimization: smart choice between semi-join and materialization” into MySQL 6.0. I missed the clone-off date so it won’t be in the upcoming MySQL 6.0.9 release, the only way to get it before the next 6.0.10 release is from the lp:mysql-server/6.0 bazaar repository.

What’s new in the push

Before WL#3985, 6.0’s subquery optimization had these three deficiencies:

  1. For semi-join (see cheatsheet for definition) subqueries, you had to make a choice between having the optimizer use materialization or all other strategies. The default behavior was not to use materialization, you could only get it by setting a server variable to disable all other strategies.
  2. The choice among other strategies (FirstMatch, DuplicateWeedout, LooseScan) wasn’t very intelligent - roughly speaking, the optimizer would first pick a join order as if there were only inner joins, and then remember that some of them are actually semi-joins and try to find how it can resolve semi-joins with the picked join order.
  3. Materialization only worked in the outer-to-inner fashion, that is, if you got a query like
    select * from people where name in (select owner from aircraft)
    it would still scan the people and make lookups into a temporary table of aircraft owners. It was not possible to make it scan the temptable of aircraft owners and make lookups into people.

WL#3985 fully addresses #1 and #2, and partially addresses #3. That is, now

  • Semi-join subqueries can use Materialization in an inner-to-outer fashion
  • Join optimizer is aware of existence of semi-joins and makes a fully automatic, cost-based choice between FirstMatch, DuplicateWeedout, LooseScan, inner-to-outer and outer-to-inner variants of Materialization.

This is expected to be a considerable improvement. The most common class of subqueries,

SELECT ... WHERE expr IN (SELECT ... w/o GROUPing/UNIONs/etc) AND ...

is now covered by a reasonably complete set of execution strategies and the optimizer is expected to have the capability to choose a good strategy for every case.

Possible gotchas, and we’re looking for input

I can’t state that the subquery optimizer does have the capability to pick a good plan because we haven’t done much experiments with the subquery cost model yet. We intend to do some benchmarking, but will also very much appreciate any input on how does the subquery optimizer behave for real-world queries. The code should be reasonably stable now - there are only three known problems, all of which are not very popular edge cases:

  • LEFT JOINs. You may get wrong query results when the subquery or parent subquery use left joins.
  • “Grandparent” correlation. A query with a semi-join child subquery which has a semi-join grandchild subquery which refers to a column in the top-level select may produce wrong query plans/results under certain circumstances.
  • Different datatypes. You may get wrong query results of queries that have col1 IN (SELECT col2) where col1 and col2 are of different types (which should not happen too often in practice)

If you have subqueries with LEFT JOINs, please let us know also, because so far all LEFT JOIN+subquery cases we have were generated by the random query generator, certain properties of MySQL codebase make it difficult to make outer joins work with semi-joins, and if we don’t get any real-world LEFT JOIN examples, chances are we will disable subquery optimizations if there’s LEFT JOIN in the parent select, or in the subquery, or in either case.

Posted in subqueries on January 1st, 2009 by spetrunia | | 3 Comments

Slides: New subquery optimizations in MySQL 6.0 (new revision)

I’m now at MySQL Conference and have just finished my New Subquery Optimizations in MySQL 6.0 talk. If you’re eager to get the slides, here is the link to the pdf file and a mirror.

The talk has the same name as this MySQL University session but the slides have been thoroughly re-worked, and also there is new content:

  • More detailed explanation of subquery handling in 4.1/5.x
  • More reliable/justified benchmark numbers
  • Observations about subquery demographics
  • An attempt at comparative analysis of how MySQL’s subquery strategies compare to PostgreSQL’s
    • And a slide #28 that explains why we’re looking at PostgreSQL all the time

Have a nice viewing. And, if you get subqueries that should be fast but aren’t, we’re now actively looking at every subquery performance bug we get and will appreciate any input.

Posted in subqueries, how-it-works, benchmarks on April 16th, 2008 by spetrunia | | 0 Comments

Correlated semi-join subqueries and PostgreSQL

The work on subquery optimizations goes on, and we’re already curious how the assortment of new 6.0 subquery optimizations compares to what is found in other DBMSes. MySQL’s new strategies are not DBMS theory breakthroughs, similar ideas have been implemented in other DBMSes, but when you take this list of strategies and put them into this kind of optimizer, the result has no direct equivalents in any other DBMS (or at least we believe so).

The first thing we did was to take a look at PostgreSQL as it is easily available and seems to have at least decent subquery handling (or even better than decent, I have not heard much complaints). And the first interesting difference was handling of correlated subqueries. With exception of materialization, MySQL’s new subquery strategies do not care if the subquery is correlated or not.

For example, let’s look at table pullout: First let’s try an uncorrelated subquery:

mysql> explain select * from tbl1 where col1 in (select primary_key from tbl2 where col2 < 20);
| id | select_type | table | type | possible_keys | key       | key_len | ref              | rows | Extra       |
|  1 | PRIMARY     | tbl2  | ALL  | PRIMARY       | NULL      | NULL    | NULL             |  100 | Using where |
|  1 | PRIMARY     | tbl1  | ref  | tbl1_col1     | tbl1_col1 | 5       | tbl2.primary_key |    1 |             |
mysql> show warnings\G
Message: select `test2`.`tbl1`.`col1` AS `col1`,`test2`.`tbl1`.`col2` AS `col2` from (`test2`.`tbl2`) join `test2`.`tbl1` 
where ((`test2`.`tbl1`.`col1` = `test2`.`tbl2`.`primary_key`) and (`test2`.`tbl2`.`col2` < 20))

and we see that it has been converted into an inner join (indicated by the part marked in red). Now let’s make the subquery correlated:

mysql> explain select * from tbl1 where col1 in (select primary_key from tbl2 where
    -> col2 < 20 and tbl2.col2 = tbl1.col2);
| id | select_type | table | type   | possible_keys | key     | key_len | ref       | rows | Extra                            |
|  1 | PRIMARY     | tbl1  | range  | col1,col2     | col2    | 5       | NULL      |   20 | Using index condition; Using MRR |
|  1 | PRIMARY     | tbl2  | eq_ref | PRIMARY       | PRIMARY | 4       | tbl1.col1 |    1 | Using where                      |
mysql> show warnings\G
Message: select `test2`.`tbl1`.`col1` AS `col1`,`test2`.`tbl1`.`col2` AS `col2` from (`test2`.`tbl2`) join `test2`.`tbl1`
where ((`test2`.`tbl2`.`primary_key` = `test2`.`tbl1`.`col1`) and (`test2`.`tbl2`.`col2` = `test2`.`tbl1`.`col2`) and
 (`test2`.`tbl1`.`col2` < 20))

The query execution plans are different (because of the extra equality) but we see that subquery was converted into join in both cases.

Now let’s try PostgreSQL: first, uncorrelated subquery:

psergey=# explain select * from tbl1 where col1 in (select primary_key from tbl2 where col2 < 20);
                                  QUERY PLAN
 Merge IN Join  (cost=0.00..100.11 rows=1 width=8)
   Merge Cond: (tbl1.col1 = tbl2.primary_key)
   ->  Index Scan using tbl1_col1 on tbl1  (cost=0.00..56.66 rows=600 width=8)
   ->  Index Scan using tbl2_pkey on tbl2  (cost=0.00..503.25 rows=12 width=4)
         Filter: (col2 < 20)

Ok, a decent plan. Now, let’s make the subquery correlated:

psergey=# explain select * from tbl1 where col1 in (select primary_key from tbl2 where
psergey(# col2 < 20 and tbl2.col2 = tbl1.col2);
                          QUERY PLAN
 Seq Scan on tbl1  (cost=0.00..60014.25 rows=300 width=8)
   Filter: (subplan)
     ->  Seq Scan on tbl2  (cost=0.00..200.00 rows=1 width=4)
           Filter: ((col2 < 20) AND (col2 = $0))

and we can see that PostgreSQL starts using the “naive” approach, where the subquery predicate is evaluated using a direct, straightforward fashion and is not used for any optimizations.

So the first impression is that MySQL kinda wins this round. But wait, if you look at user subquery case analysis here, you’ll see that the vast majority of semi-join subqueries are uncorrelated. I wonder if PG folks at some point in the past have made a conscious decision not to bother optimizing correlated subqueries.

For uncorrelated subqueries, it’s different. Our first impression is that PostgreSQL has a powerful tool - hash join and its subquery variants, which it uses to get decent performance in wide variety of cases. MySQL does not have hash join (Hi Timour? Kaamos?), but the new subquery optimizations can have kinda-similar behavior. It is not clear yet whether they’ll be a good subquery hash join substitute. I intend to post the analysis here once I figure it out. Stay tuned.

Posted in subqueries, benchmarks on March 24th, 2008 by spetrunia | | 7 Comments

Slides: New subquery optimizations in MySQL 6.0

A bunch of (hopefully) self-explanatory slides about new subquery optimizations in MySQL 6.0 is available here (UPDATE: here’s a working link). The slides are from this MySQL University session, so there was an audio stream but there were some difficulties with it and it is not available now.

MySQL Conference & Expo 2008