Looking at MySQL 5.6’s optimizer: EXPLAIN UPDATE

MySQL 5.6 adds support for EXPLAIN UPDATE. This is a useful feature, so we want to have it in MariaDB 10.0, too. Besides that, MariaDB 10.0 has SHOW EXPLAIN feature, and we want it work for UPDATE commands, too.

Now, a bit of code history. Why didn’t MySQL have EXPLAIN UPDATE from the start, like other database systems? To the uninformed, lack of EXPLAIN UPDATE looks like simple lazyness. After all, everyone who has read a database textbook can imagine that the code should have this form:

run_update_query(UpdateQuery q) {
   QueryPlan qp= optimize_query(q);
   run_query_plan(qp);
}

and adding EXPLAIN UPDATE is a matter of adding another function:

run_explain_update(UpdateQuery q) {
   QueryPlan qp= optimize_query(q);
   print_update_query_plan(qp);
}

print_update_query_plan(QueryPlan qp)
{
  // print the plan for UPDATE.
}

Seems like a task for an intern. The problem is that MySQL’s code is not structured this way. There is no point in time where all decisions about
how to run the UPDATE command have been made and stored in a certain data structure, but the query execution didn’t start yet. The code basically is structured like this:

mysql_update()
{
  typename part_of_query_plan;

  …do something…
  part_of_query_plan=…;
    if (all done)
      return;
  …do something …

  if (…)
  {
    typename another_part_of_query_plan;
    …do something else…
    another_part_of_query_plan= …
  }

  typename yet_another_part_of_query_pan= …;
  …
}

It is not trivial to pull out all of query plan choices out of this. Oracle’s optimizer team had two possible options:

  1. Re-write UPDATE handling code to use the textbook approach.
  2. Keep the current code structure, and inject “if (running_explain) {...}” at many locations.

#1 would be a minor revolution. It would introduce new code that is run for every UPDATE query. New code may have bugs. It may cause query plans to change, and not always for the better.
#2 is conservative. It would keep the old structure in place, and would require less work. The result won’t be very impressive, though - there will be a single piece of code that handles both UPDATE and EXPLAIN UPDATE, with lots of “if (running_explain) {...}” all over it.

I guess, the choice depends on your deadlines, what other changes are there, etc. Oracle’s team choose to do #2. However, when I tried playing with it, I’ve found

  • a query plan that has changed since 5.5 (BUG#67638)
  • a wrong query plan - EXPLAIN doesn’t match the execution (BUG#67637)

I’m not sure if BUG#67638 is a problem. Maybe, it is expected because of the changes in the cost model. However, if the change was expected anyway, why did they choose to use the conservative solution for EXPLAIN UPDATE? And if they did choose a conservative solution for EXPLAIN UPDATE, why do we still get bugs like BUG#67637?

The questions are not just of curiosity. We at MariaDB need to apply the patch, and make it work with SHOW EXPLAIN. Do we wait for Oracle to fix the above bugs, or fix them ourselves? Do we stick to their EXPLAIN UPDATE implementation (and keep applying their fixes), or forget it and roll our own? Decisions, decisions…

Posted in Uncategorized on November 20th, 2012 by spetrunia | | 0 Comments

Extended keys: First in MariaDB 5.5, now in mysql-trunk, too

One of the optimizations we have introduced in MariaDB 5.5 is Extended keys. The idea behind it is rather simple. Inside InnoDB, every secondary index has an invisible suffix of primary key columns. That is, when you create an index:

ALTER TABLE innodb_tbl ADD INDEX (column1, column2);

you’re actually creating this

ALTER TABLE innodb_tbl ADD INDEX (column1, column2, primary_key_column1, …, primary_key_columnN);

The index is extended with primary key columns. SHOW KEYS does not show these extra key parts, but they are there.

Traditionally, MySQL optimizer was half-aware of these extra columns. It knew that doing an index-only scan on InnoDB’s secondary key would read the primary key columns also, and used this property. On the other hand, the optimizer was not able to use the extra columns to do a ref or range access, or to resolve an ORDER BY clause (correction: the optimizer did take extra columns into account when resolving ORDER BY clauses). If you had a statement like:

SELECTFROM innodb_tbl WHERE column1=’foo’ AND ‘column2=’bar’ AND primary_key_column1=’baz’ 

the optimizer was only able to use two key parts for ref access. Extended keys optimization in MariaDB 5.5 removed this kind of limitations. MariaDB 5.5 is able to use the extra index columns for any purpose MySQL/MariaDB can use an index for.

So, what’s the news? The news is this commit. It seems, Oracle has decided to follow and also support extended keys. The email subject is “bzr push into mysql-trunk“, I suppose it means that the feature has been pushed into whatever will be the next version after MySQL 5.6 (mysql-trunk tree is not publicly available, so there’s a lot of guessing).

Looking at their patch, I see two differences from MariaDB’s version:

  1. optimizer_switch flag is named extended_keys in MariaDB and extended_secondary_keys in MySQL-trunk
  2. they inform the optimizer that the extended key is a UNIQUE key

The first one is trivial. The second one is a bit puzzling. It is true that primary key columns are unique, hence adding them to a secondary index makes the extended secondary index unique, too. But what is the benefit of knowing this? Index Unique-ness can only be used when you’ve got values for each of the columns. But if you’ve got values for each of extended key columns (column1, column2, primary_key_column1, …, primary_key_columnN), you’ve also got value for each of primary key columns (primary_key_column1, …, primary_key_columnN). Why would you want to use the secondary index then? The primary key is also usable, and it is shorter.

It would be nice to know what Sergey Glukhov (author of Oracle’s implementation) had in mind when he was coding this. Or, we’ll have to figure on our own, and add the extended-key-is-unique feature to MariaDB’s implementation.

Posted in Uncategorized on November 10th, 2012 by spetrunia | | 0 Comments

More on cost-based choice between subquery Materialization and IN->EXISTS

In my previous post, I shared my finding that MySQL 5.6.7 does not make a cost-based choice between Materialization and IN-to-EXISTS strategies for subqueries.

It turns out I was wrong. As Guilhem Bichot has blogged here, he has implemented cost-based choice between Materialization and IN->EXISTS in MySQL 5.6.7. Igor Babaev also wrote about the topic, and covered the reasons I didn’t see the feature - it isn’t mentioned in the documentation, development process at Oracle is quite closed, and the feature didn’t work for a basic example that I have tried.

Let’s try to focus on the technical part of it. Looking at the source code, I see that MySQL and MariaDB’s implementations try to achieve a similar effect, but they are different - neither one is a copy of the other. This makes it interesting to do a comparison.

First, there is the example from my previous blog post:

MySQL [test]> explain select col1, col1 in (select col2 from one_k_rows_tbl) from one_row_tbl;
+----+-------------+----------------+------+---------------+------+---------+------+------+-------+
| id | select_type | table          | type | possible_keys | key  | key_len | ref  | rows | Extra |
+----+-------------+----------------+------+---------------+------+---------+------+------+-------+
|  1 | PRIMARY     | one_row_tbl    | ALL  | NULL          | NULL | NULL    | NULL |    1 | NULL  |
|  2 | SUBQUERY    | one_k_rows_tbl | ALL  | NULL          | NULL | NULL    | NULL | 1000 | NULL  |
+----+-------------+----------------+------+---------------+------+---------+------+------+-------+

Here, the upper select reads only one row, which means that the subquery predicate will need to be evaluated one time. Now, if you need to execute the subquery one time, the question is which of the following is cheaper:

  1. just execute it (IN->EXISTS), or
  2. execute it, and dump its output into a temporary table with a unique key (Materialization)

Apparently, the the first should be cheaper, yet MySQL 5.6.7 chooses the second. I’ve filed BUG#67511 about this, and posted there some ideas about the reason why the poor plan was chosen.

The second difference was found by Igor Babaev through experiment, and I confirmed it by reading the code. Suppose, there is a join, and a subquery:

SELECT *
  FROM Country, City
  WHERE City.Country = Country.Code AND
        Country.SurfaceArea < 3000 AND Country.SurfaceArea > 10 AND
        (Country.Name IN (select Language from CountryLanguage where Percentage > 50) OR
         Country.name LIKE ‘%Island%’);

and query plan is as follows (this is output from MySQL 5.6):

+----+-------------+-----------------+-------+---------------------+------------+---------+-----------------+------+--------------------------+
| id | select_type | table           | type  | possible_keys       | key        | key_len | ref             | rows | Extra                    |
+----+-------------+-----------------+-------+---------------------+------------+---------+-----------------+------+--------------------------+
|  1 | PRIMARY     | Country         | ALL   | PRIMARY,SurfaceArea | NULL       | NULL    | NULL            |  207 | Using where              |
|  1 | PRIMARY     | City            | ref   | Country             | Country    | 3       | j1.Country.Code |    7 | NULL                     |
|  2 | SUBQUERY    | CountryLanguage | range | Percentage,Language | Percentage | 4       | NULL            |  168 | Using where; Using index |
+----+-------------+-----------------+-------+---------------------+------------+---------+-----------------+------+--------------------------+

Here, the subquery is attached to table Country, that is, it is executed 207 times. MariaDB will calculate the number correctly. MySQL on the other hand, will assume the subquery is evaluated as many times as many record combinations are there in the parent select, in this example 207 * 7 = 1449 times. As the above EXPLAIN shows, MySQL 5.6.7 will use Materialization (reasonable choice if you expect 1449 subquery re-executions). MariaDB will use a correct estimate of 207 subquery re-executions and will pick IN->EXISTS. Query plans are already different. The tables in this example are small, so there’s no measurable difference in execution speed. I suppose one could build a bigger example, where the speed difference can be actually observed.

Finally, the 3rd difference is that

  • MariaDB will optimize the subquery before the outer query
  • MySQL will optimize the subquery after the outer query

This doesn’t have any apparent user-visible effects currently, yet Timour Katchaounov has spent substantial amount of time to make MariaDB optimize the subquery first. Why did he do it? I’ll cover in subsequent blog posts.

Posted in Uncategorized on November 8th, 2012 by spetrunia | | 0 Comments