No, Sheeri, MySQL 5.6 does not optimize subqueries away

Sheeri wrote a blog post that claims that “IN Subqueries in MySQL 5.6 Are Optimized Away” and uses that as a basis to conclude that subquery optimizations in MySQL 5.6 are superior to MariaDB’s.
The claim is incorrect (and so is the conclusion). As a person who has coded both of the mentioned FirstMatch and semi-join materialization, I think I need to write about this.

Sheeri wrote:

  1. “So MariaDB recognizes the subquery and optimizes it. But it is still optimized as a subquery”
  2. “In MySQL 5.6, the subquery is actually optimized away”

The first statement is somewhat true. The second one is not. I’ll try to explain. The example subquery Sheeri used was:

SELECT title FROM film WHERE film_id IN (SELECT film_id FROM film_actor)

Its meaning is “find films that have actors in table film_actor”. It is not possible to “optimize away” the subquery here. Not more than it’s possible to take the expression “2*3″ and optimize away the “*3″ part of it. The subquery affects the result of the whole query. You can’t remove it.

What the optimizer (both in MariaDB and in MySQL 5.6) does here is to convert the subquery into a semi-join. Semi-join is basically a “subquery in the WHERE clause” (read the link for more details), and it gives the optimizer more possible choices.
Semi-join can be seen in EXPLAIN EXTENDED. In both MariaDB and MySQL, one can see:

... select `film`.`title` AS `title`
    from `sakila`.`film` semi join (`sakila`.`film_actor`) ...

But what about different query plans? They do not show superiority of one optimizer over the other. As indicated in documentation, MariaDB supports the FirstMatch strategy that was the chosen by MySQL 5.6. Also, MySQL 5.6 supports semi-join materialization strategy that was picked by MariaDB. I suspect, different query plans were chosen because MariaDB and MySQL use different cost formulas. It is not possible to tell whose formulas are better, because both query plans finish in 0.01 seconds, and the tables are very small.

Which means, this example doesn’t allow one to conclude that MySQL’s subquery optimizations are superior to MariaDB (or vice versa) QED.

Addendum. Query plan used by MySQL will read *exactly* the same rows from the tables that MySQL 5.1/5.5 (which have no subquery optimizations) would read. The best you can deduce here is that MySQL 5.6’s subquery optimizations are not making things worse.

Posted in Uncategorized on January 30th, 2013 by spetrunia | | 6 Comments

MySQL@FOSDEM 2013: Call for papers closes in a few days!

This is just a reminder: Call for papers for MySQL & Friends devroom at FOSDEM 2013 is open until December 21st, which is this Friday. Please hurry up and submit your proposal. Possible topics include everything related to MySQL.

Posted in Uncategorized on December 18th, 2012 by spetrunia | | 0 Comments

New optimization in MariaDB 10.0: EXISTS-to-IN subquery rewrite

MariaDB 10.0 has got another new feature: Sanja Byelkin has pushed EXISTS-to-IN rewrite feature. It is an optimization targeted at EXISTS subqueries. The idea behind it is simple. EXISTS subqueries often have form:

EXISTS (SELECTFROMWHERE outer_col=inner_col AND inner_where)

where outer_col=inner_col is the only place where the subquery has references to outside. In this case, the subquery can be converted into an uncorrelated IN:

outer_col IN (SELECT inner_col FROMWHERE inner_where)

The conversion opens new opportunities for the optimizer. Correlated EXISTS subquery has only one execution strategy. Uncorrelated IN subquery has two:

  1. re-run the subquery every time the subquery is evaluated (the same as in EXISTS)
  2. Materialize the subquery output into a temporary table. Then, evaluation of subquery predicate will be reduced to making a lookup in that temporary table

The optimizer is able to make a cost-based choice whether to do #1 or #2. But wait, there is more: if the subquery is at top level of the WHERE clause, like


then the subquery is converted into a semi-join, which has even more execution strategies, including those that will use the subquery to “drive” access to tables in the parent select. In the above example, EXISTS-to-IN allows the optimizer to first access the books, and then for each book access its author. When your database has a few books and lots of people, this execution order can give a big speedup.

Now, a little context. EXISTS->IN rewrite is useful, but is not a revolutionary breakthrough. PostgreSQL has a similar optimization. They have introduced it in release 8.4.1, which happened in July, 2009. IIRC MySQL/Sun also had a task for adding EXISTS->IN rewrite, also around 2009. I don’t know what the fate of that task was. It is definitely not in MySQL 5.6, and google doesn’t find the worklog entry, I suppose because it has been made private. MariaDB’s implementation was developed from scratch here at Monty Program Ab. Now, when we have it in MariaDB 10.0, I’m wondering what are the other important subquery cases that we have not yet covered.

Posted in Uncategorized on December 18th, 2012 by spetrunia | | 0 Comments

Looking at MySQL 5.6’s optimizer: EXPLAIN UPDATE

MySQL 5.6 adds support for EXPLAIN UPDATE. This is a useful feature, so we want to have it in MariaDB 10.0, too. Besides that, MariaDB 10.0 has SHOW EXPLAIN feature, and we want it work for UPDATE commands, too.

Now, a bit of code history. Why didn’t MySQL have EXPLAIN UPDATE from the start, like other database systems? To the uninformed, lack of EXPLAIN UPDATE looks like simple lazyness. After all, everyone who has read a database textbook can imagine that the code should have this form:

run_update_query(UpdateQuery q) {
   QueryPlan qp= optimize_query(q);

and adding EXPLAIN UPDATE is a matter of adding another function:

run_explain_update(UpdateQuery q) {
   QueryPlan qp= optimize_query(q);

print_update_query_plan(QueryPlan qp)
  // print the plan for UPDATE.

Seems like a task for an intern. The problem is that MySQL’s code is not structured this way. There is no point in time where all decisions about
how to run the UPDATE command have been made and stored in a certain data structure, but the query execution didn’t start yet. The code basically is structured like this:

  typename part_of_query_plan;

  …do something…
    if (all done)
  …do something …

  if (…)
    typename another_part_of_query_plan;
    …do something else…
    another_part_of_query_plan= …

  typename yet_another_part_of_query_pan= …;

It is not trivial to pull out all of query plan choices out of this. Oracle’s optimizer team had two possible options:

  1. Re-write UPDATE handling code to use the textbook approach.
  2. Keep the current code structure, and inject “if (running_explain) {...}” at many locations.

#1 would be a minor revolution. It would introduce new code that is run for every UPDATE query. New code may have bugs. It may cause query plans to change, and not always for the better.
#2 is conservative. It would keep the old structure in place, and would require less work. The result won’t be very impressive, though - there will be a single piece of code that handles both UPDATE and EXPLAIN UPDATE, with lots of “if (running_explain) {...}” all over it.

I guess, the choice depends on your deadlines, what other changes are there, etc. Oracle’s team choose to do #2. However, when I tried playing with it, I’ve found

  • a query plan that has changed since 5.5 (BUG#67638)
  • a wrong query plan - EXPLAIN doesn’t match the execution (BUG#67637)

I’m not sure if BUG#67638 is a problem. Maybe, it is expected because of the changes in the cost model. However, if the change was expected anyway, why did they choose to use the conservative solution for EXPLAIN UPDATE? And if they did choose a conservative solution for EXPLAIN UPDATE, why do we still get bugs like BUG#67637?

The questions are not just of curiosity. We at MariaDB need to apply the patch, and make it work with SHOW EXPLAIN. Do we wait for Oracle to fix the above bugs, or fix them ourselves? Do we stick to their EXPLAIN UPDATE implementation (and keep applying their fixes), or forget it and roll our own? Decisions, decisions…

Posted in Uncategorized on November 20th, 2012 by spetrunia | | 0 Comments

Extended keys: First in MariaDB 5.5, now in mysql-trunk, too

One of the optimizations we have introduced in MariaDB 5.5 is Extended keys. The idea behind it is rather simple. Inside InnoDB, every secondary index has an invisible suffix of primary key columns. That is, when you create an index:

ALTER TABLE innodb_tbl ADD INDEX (column1, column2);

you’re actually creating this

ALTER TABLE innodb_tbl ADD INDEX (column1, column2, primary_key_column1, …, primary_key_columnN);

The index is extended with primary key columns. SHOW KEYS does not show these extra key parts, but they are there.

Traditionally, MySQL optimizer was half-aware of these extra columns. It knew that doing an index-only scan on InnoDB’s secondary key would read the primary key columns also, and used this property. On the other hand, the optimizer was not able to use the extra columns to do a ref or range access, or to resolve an ORDER BY clause (correction: the optimizer did take extra columns into account when resolving ORDER BY clauses). If you had a statement like:

SELECTFROM innodb_tbl WHERE column1=’foo’ AND ‘column2=’bar’ AND primary_key_column1=’baz’ 

the optimizer was only able to use two key parts for ref access. Extended keys optimization in MariaDB 5.5 removed this kind of limitations. MariaDB 5.5 is able to use the extra index columns for any purpose MySQL/MariaDB can use an index for.

So, what’s the news? The news is this commit. It seems, Oracle has decided to follow and also support extended keys. The email subject is “bzr push into mysql-trunk“, I suppose it means that the feature has been pushed into whatever will be the next version after MySQL 5.6 (mysql-trunk tree is not publicly available, so there’s a lot of guessing).

Looking at their patch, I see two differences from MariaDB’s version:

  1. optimizer_switch flag is named extended_keys in MariaDB and extended_secondary_keys in MySQL-trunk
  2. they inform the optimizer that the extended key is a UNIQUE key

The first one is trivial. The second one is a bit puzzling. It is true that primary key columns are unique, hence adding them to a secondary index makes the extended secondary index unique, too. But what is the benefit of knowing this? Index Unique-ness can only be used when you’ve got values for each of the columns. But if you’ve got values for each of extended key columns (column1, column2, primary_key_column1, …, primary_key_columnN), you’ve also got value for each of primary key columns (primary_key_column1, …, primary_key_columnN). Why would you want to use the secondary index then? The primary key is also usable, and it is shorter.

It would be nice to know what Sergey Glukhov (author of Oracle’s implementation) had in mind when he was coding this. Or, we’ll have to figure on our own, and add the extended-key-is-unique feature to MariaDB’s implementation.

Posted in Uncategorized on November 10th, 2012 by spetrunia | | 0 Comments

More on cost-based choice between subquery Materialization and IN->EXISTS

In my previous post, I shared my finding that MySQL 5.6.7 does not make a cost-based choice between Materialization and IN-to-EXISTS strategies for subqueries.

It turns out I was wrong. As Guilhem Bichot has blogged here, he has implemented cost-based choice between Materialization and IN->EXISTS in MySQL 5.6.7. Igor Babaev also wrote about the topic, and covered the reasons I didn’t see the feature - it isn’t mentioned in the documentation, development process at Oracle is quite closed, and the feature didn’t work for a basic example that I have tried.

Let’s try to focus on the technical part of it. Looking at the source code, I see that MySQL and MariaDB’s implementations try to achieve a similar effect, but they are different - neither one is a copy of the other. This makes it interesting to do a comparison.

First, there is the example from my previous blog post:

MySQL [test]> explain select col1, col1 in (select col2 from one_k_rows_tbl) from one_row_tbl;
| id | select_type | table          | type | possible_keys | key  | key_len | ref  | rows | Extra |
|  1 | PRIMARY     | one_row_tbl    | ALL  | NULL          | NULL | NULL    | NULL |    1 | NULL  |
|  2 | SUBQUERY    | one_k_rows_tbl | ALL  | NULL          | NULL | NULL    | NULL | 1000 | NULL  |

Here, the upper select reads only one row, which means that the subquery predicate will need to be evaluated one time. Now, if you need to execute the subquery one time, the question is which of the following is cheaper:

  1. just execute it (IN->EXISTS), or
  2. execute it, and dump its output into a temporary table with a unique key (Materialization)

Apparently, the the first should be cheaper, yet MySQL 5.6.7 chooses the second. I’ve filed BUG#67511 about this, and posted there some ideas about the reason why the poor plan was chosen.

The second difference was found by Igor Babaev through experiment, and I confirmed it by reading the code. Suppose, there is a join, and a subquery:

  FROM Country, City
  WHERE City.Country = Country.Code AND
        Country.SurfaceArea < 3000 AND Country.SurfaceArea > 10 AND
        (Country.Name IN (select Language from CountryLanguage where Percentage > 50) OR LIKE ‘%Island%’);

and query plan is as follows (this is output from MySQL 5.6):

| id | select_type | table           | type  | possible_keys       | key        | key_len | ref             | rows | Extra                    |
|  1 | PRIMARY     | Country         | ALL   | PRIMARY,SurfaceArea | NULL       | NULL    | NULL            |  207 | Using where              |
|  1 | PRIMARY     | City            | ref   | Country             | Country    | 3       | j1.Country.Code |    7 | NULL                     |
|  2 | SUBQUERY    | CountryLanguage | range | Percentage,Language | Percentage | 4       | NULL            |  168 | Using where; Using index |

Here, the subquery is attached to table Country, that is, it is executed 207 times. MariaDB will calculate the number correctly. MySQL on the other hand, will assume the subquery is evaluated as many times as many record combinations are there in the parent select, in this example 207 * 7 = 1449 times. As the above EXPLAIN shows, MySQL 5.6.7 will use Materialization (reasonable choice if you expect 1449 subquery re-executions). MariaDB will use a correct estimate of 207 subquery re-executions and will pick IN->EXISTS. Query plans are already different. The tables in this example are small, so there’s no measurable difference in execution speed. I suppose one could build a bigger example, where the speed difference can be actually observed.

Finally, the 3rd difference is that

  • MariaDB will optimize the subquery before the outer query
  • MySQL will optimize the subquery after the outer query

This doesn’t have any apparent user-visible effects currently, yet Timour Katchaounov has spent substantial amount of time to make MariaDB optimize the subquery first. Why did he do it? I’ll cover in subsequent blog posts.

Posted in Uncategorized on November 8th, 2012 by spetrunia | | 0 Comments

Comparison of subquery optimizations in MySQL 5.6 and MariaDB 5.5

MySQL 5.6 is now RC, I suppose this means that all features that were intended to be in release are pushed, so it’s time to take a look and see what’s going to be in MySQL 5.6 GA.

I decided to look at subquery optimizations and compare them to what we’ve got in MariaDB 5.3/5.5. In case you don’t know, subquery optimizations in MySQL 5.6 and MariaDB come from a common ancestor - MySQL 6.0 alpha, which was released in spring 2009 and then abandoned because there were too many unstable features pushed into it.

Then, both MariaDB team and Oracle salvaged subquery code out of the 6.0 mess, and worked to get it in shape for release. MariaDB released its results in GA quality in April 2012 as MariaDB 5.3, which was quickly followed by MariaDB 5.5.

Inside MariaDB, we’ve considered 6.0-alpha’s feature set to be incomplete, and released only after we’ve added these three features:

We’ve even had a subquery optimizations map to make sure we cover all kinds of subqueries.

Now, I’m looking through MySQL 5.6.7 and 5.6.x changelogs, and the only improvement over original set of MySQL 6.0’s optimizations seems to be this part in 5.6.4 changelog:

The optimizer detects and optimizes away these useless query parts within IN/ALL/SOME/EXISTS subqueries:

  • GROUP BY, if there is no HAVING clause and no aggregate functions
  • ORDER BY, which has no effect because LIMIT is not supported in these subqueries

This is a nice-to-have feature. The problem is that this feature does not cover the gaps in the set of MySQL 5.6’s subquery optimizations. To see what I’m talking about, let’s take the “Cost-based choice between Materialization and IN->EXISTS strategies” feature, and see why we absolutely had to have it in MariaDB 5.3.

Let’s consider a query with an uncorrelated subquery:
select col1, col1 in (select col2 from one_k_rows_tbl) from ten_rows_tbl

In MySQL 5.1, the only way to execute it was to use IN->EXISTS conversion. EXPLAIN in MySQL 5.1 looks like this:

| id   | select_type        | table          | type | possible_keys | key  | key_len | ref  | rows | Extra       |
|    1 | PRIMARY            | ten_rows_tbl   | ALL  | NULL          | NULL | NULL    | NULL |   10 |             |
|    2 | DEPENDENT SUBQUERY | one_k_rows_tbl | ALL  | NULL          | NULL | NULL    | NULL | 1344 | Using where |

“DEPENDENT SUBQUERY” means that the subselect is re-executed for each record of the parent select. The original subquery is not correlated, but IN->EXISTS transformation converts the subquery from

col1 in (select col2 from one_k_rows_table)


exists (select 1 from one_k_rows_table where col2=ten_rows_table.col1 LIMIT 1)

The conversion makes the subquery correlated, but in return the subquery gets the “col2=col1″ IN-equality in the WHERE, which can make a big difference.

MySQL 6.0 has added Materialization strategy, and both MySQL 5.6 and MariaDB have got it from there. In MySQL 5.6, the EXPLAIN looks like this:

MySQL [test]> explain select col1, col1 in (select col2 from one_k_rows_tbl) from ten_rows_tbl;
| id | select_type | table          | type | possible_keys | key  | key_len | ref  | rows | Extra |
|  1 | PRIMARY     | ten_rows_tbl   | ALL  | NULL          | NULL | NULL    | NULL |   10 | NULL  |
|  2 | SUBQUERY    | one_k_rows_tbl | ALL  | NULL          | NULL | NULL    | NULL | 1000 | NULL  |

It looks like uncorrelated subquery, but actually it is Materialization. MariaDB’s EXPLAIN output is very similar:

MariaDB [test]> explain select col1, col1 in (select col2 from one_k_rows_tbl) from ten_rows_tbl;
| id   | select_type  | table          | type | possible_keys | key  | key_len | ref  | rows | Extra |
|    1 | PRIMARY      | ten_rows_tbl   | ALL  | NULL          | NULL | NULL    | NULL |   10 |       |
|    2 | MATERIALIZED | one_k_rows_tbl | ALL  | NULL          | NULL | NULL    | NULL | 1344 |       |

except that we have decided to print MATERIALIZED in select_type column to avoid possible confusion with regular uncorrelated subqueries.

Now, let’s see the gap. MySQL 6.0 had support for subquery Materialization, but it lacked ability to make a cost-based choice whether to use Materialization. It used it whenever possible. Let’s change our query so that the outer select expects to read only one row:

MariaDB [test]> explain select col1, col1 in (select col2 from one_k_rows_tbl) from one_row_tbl;
| id   | select_type        | table          | type | possible_keys | key  | key_len | ref  | rows | Extra       |
|    1 | PRIMARY            | one_row_tbl    | ALL  | NULL          | NULL | NULL    | NULL |    1 |             |
|    2 | DEPENDENT SUBQUERY | one_k_rows_tbl | ALL  | NULL          | NULL | NULL    | NULL | 1344 | Using where |

As you can see, MariaDB has decided to use the good oldDEPENDENT SUBQUERY, that is, IN->EXISTS strategy. Why? It figured that the upper select expects to read only one row, which means the subquery is expected to be evaluated only once. With Materialization strategy, one needs to pay a big upfront cost (materialization) in order to have cheap subquery re-execution. This pays off when the subquery is executed many times. This doesn’t pay off when you need to execute the subquery once, or just a few times. MariaDB was able to recognize this and made a cost-based decision.

Now, let’s run this query on MySQL 5.6:

MySQL [test]> explain select col1, col1 in (select col2 from one_k_rows_tbl) from one_row_tbl;
| id | select_type | table          | type | possible_keys | key  | key_len | ref  | rows | Extra |
|  1 | PRIMARY     | one_row_tbl    | ALL  | NULL          | NULL | NULL    | NULL |    1 | NULL  |
|  2 | SUBQUERY    | one_k_rows_tbl | ALL  | NULL          | NULL | NULL    | NULL | 1000 | NULL  |

It still uses Materialization. Neither changelogs, nor the source code have any mention of cost-based choice, so I assume they haven’t made any improvements over MySQL 6.0 code. As I’ve mentioned, Materialization is a startup-heavy strategy, so one can expect MySQL 5.6 to show regressions for queries where the subquery is executed only a few times.

UPDATE It turned out I was wrong about what’s included in MySQL 5.6. See this post for details.

Posted in Uncategorized on October 9th, 2012 by spetrunia | | 0 Comments

SHOW EXPLAIN and skeletons in EXPLAIN’s closet

I believe I’m nearly done with the SHOW EXPLAIN feature. The idea is that if you’ve got some long-running query $LONG_QUERY running in connection $LONG_QUERY_CONN, you should be able to create another connection, run SHOW EXPLAIN FOR $LONG_QUERY_CONN and see what query plan is being used to run the $LONG_QUERY.

How is this different from just running explain $LONG_QUERY? First, you don’t need to replicate the exact environment that $LONG_QUERY was run in: you don’t need to figure out what values it had for @@optimizer_switch and other settings, what were the contents of its temporary tables, if any, what did InnoDB told the optimizer about the table statistics, etc.

Another, indirect benefit is that we will now be able to produce query’s EXPLAIN at arbitrary point in time, which should make it possible to store it in the slow query log, or support EXPLAIN ANALYZE command.

To the uninformed, coding SHOW EXPLAIN feature may seem trivial: one will only need to walk the query plan and print it out. In reality, it is not as easy: first, you need to take into account that MySQL/MariaDB has a habit of optimizing parts of query plan lazily, also, it is eager to free parts of query plan that are no longer needed. Then, there are slight differences between data structures used by EXPLAINs and regular SELECTs.

It also turns out EXPLAIN had some skeletons in its closet. We have been testing SHOW EXPLAIN by comparing its output with EXPLAIN output, and I have already found two cases where EXPLAIN’s output doesn’t match the query execution. Which is scary, because I personally used to trust EXPLAIN, and everybody I worked with trusted it, too.

I’ve fixed (in MariaDB) one of the wrong EXPLAINs, MDEV-410, since it relates to subqueries and we have a lot of expertise in that code. The second one, MDEV-326, remains unfixed, because it’s in the sorting/grouping code and I’m afraid I can open a big can of worms if I touch it.

Posted in Uncategorized on July 27th, 2012 by spetrunia | | 0 Comments

Please try your subqueries on MariaDB

MariaDB 5.3 is now GA, and MariaDB 5.5 is RC. One of the primary features in these releases is all-round coverage with subquery optimizations. Practically every kind of subquery available in SQL has got some new optimization.

We do a lot of testing, so these new optimizations should be now reasonably stable. What is missing is performance testing with real-world queries on real-world data. I expect most of the time you will see a speedup, however, there can also be cases where the new version will be slower. New optimizations make query plan search space much bigger, this means the new optimizer will have lots of room to make errors where previously was none.

Back at MySQL Ab, I could use bugs/support cases to do some analysis of how real-world queries are affected by the new optimizations.

Now, we don’t have access to that data anymore, and so are asking for help: If you’ve got some queries with subqueries, please try running them on the latest MariaDB 5.3 Stable or MariaDB 5.5 RC and let us know of the results.

Posted in Uncategorized on April 4th, 2012 by spetrunia | | 3 Comments

Statistics counters for Multi Range Read

MariaDB 5.3 has now three statistics counters for Multi Range Read optimization:

MariaDB [test]> show status like 'Handler_mrr%';
| Variable_name                 | Value |
| Handler_mrr_extra_key_sorts   | 0     |
| Handler_mrr_extra_rowid_sorts | 0     |
| Handler_mrr_init              | 0     |
3 rows in set (0.08 sec)

I’ve just added the first two. The reason for having them is as follows: the point of MRR is to provide speedup over regular execution by doing reads in disk order. In order to make reads in disk order, MRR needs buffer space where it accumulates and sorts read requests. If there are too many read requests to fit into the buffer, MRR will make multiple accumulate-sort-read passes.

Doing multiple passes allows MRR to operate when having limited buffer space, but the speedup will be not as great as with one big disk-ordered read sweep. The purpose of Handler_mrr_extra_key_sorts and Handler_mrr_extra_rowid_sorts is to count the additional accumulate-sort-read passes, so you’re able to tell if you will benefit from increasing your @@mrr_buffer_size and @@join_buffer_size settings.

There are two counters, _extra_key_sorts and _extra_rowid_sorts, because MariaDB has two places where it will do sorting:

  1. sort rowids before reading table records
  2. sort key values before making a bunch of index lookups

MRR code will try to distribute buffer space between them in an optimal way. The decision is a guess based on the available statistics, and can be wrong. Having both counters will allow us to check how the guess will work in practice.

p.s. if you could not make any sense of anything above, try reading Multi Range Read page in our knowlegebase. We have just put there a hopefully-readable explanation of what MRR is.

Posted in Uncategorized on January 27th, 2012 by spetrunia | | 0 Comments

« Previous PageNext Page »