An interesting case in ORDER BY LIMIT optimization

Recently, I was asked about an interesting case in ORDER BY … LIMIT optimization. Consider a table

create table tbl (
  …
  KEY key1(col1, col2),
  PRIMARY KEY (pk)
) engine=InnoDB;

Consider queries like:

  select * from tbl where col1='foo' and col2=123 order by pk limit 1;
  select * from tbl where col1='bar' and col2=123 order by pk limit 1;

These run nearly instantly. But, if one combines these two queries with col1='foo' and col1='bar' into one query with col1 IN ('foo','bar'):

  select * from tbl where col1 IN ('foo','bar') and col2=123 order by pk limit 1;

then the query is orders of magnitude slower than either of the queries with col1=const.

The first thing to note when investigating this is that the table uses the InnoDB engine, which has the extended_keys feature. This means the index

  KEY key1(col1, col2)

is actually

  KEY key1(col1, col2, pk)

Once you have that, together with col1='foo' AND col2=123 in the WHERE clause, the optimizer is able to see that index `key1` produces records ordered by the `pk` column, i.e. in the order required by the ORDER BY clause. This makes it possible to satisfy the LIMIT 1 part by reading just one row.

Now, if we change col1='foo' into col1 IN('foo','bar'), we will still be able to use index `key1`, but the rows we read will not be ordered by `pk`. They will come in two ordered batches:

  'bar', 123, pkX
  'bar', 123, pkX+1
  'bar', 123, pkX+2
  ...
  'foo', 123, pkY
  'foo', 123, pkY+1
  'foo', 123, pkY+2

The query has ORDER BY pk LIMIT 1, but, since the rowset is not ordered by pk, the optimizer will have to read all of the rows, sort them, and find the row with the least value of `pk`.

Now, wouldn't it be great if the optimizer were aware that the index scan returns two ordered batches? It would then need to read no more than #LIMIT rows from each batch. I can think of two possible execution strategies:

  1. Run something similar to the index_merge strategy: start an index scan on col1='foo' and another on col1='bar', then merge the two ordered streams until we've found #LIMIT rows (see the sketch after this list). This approach works well when you're merging a few streams. If there are a lot of streams, the overhead of starting concurrent index scans will start to show up.
  2. Use the same index cursor to read #LIMIT rows from the first batch, then from the second, and so forth. Merge these ordered streams using filesort's merge pass or a priority queue. This approach reads more rows than the first one, but we don't have to create another index cursor.
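
To make strategy 1 a bit more concrete, here is a minimal sketch (not MySQL code; the streams are just sorted arrays standing in for per-constant index cursors) of merging pk-ordered batches with a priority queue and stopping after #LIMIT rows:

  #include <cstdint>
  #include <functional>
  #include <iostream>
  #include <queue>
  #include <tuple>
  #include <vector>

  // Sketch of strategy 1: each col1=const value yields a stream of rows that
  // is already ordered by pk (thanks to the extended key (col1, col2, pk)).
  // Merge the streams with a min-heap and stop after LIMIT rows.
  int main() {
    std::vector<std::vector<int64_t>> streams = {
        {100, 105, 110, 300},  // pk values for col1='bar', col2=123
        {7, 8, 9, 200, 500}};  // pk values for col1='foo', col2=123
    const size_t limit = 1;

    // Heap entries: (next pk value, stream number, position within stream).
    using Entry = std::tuple<int64_t, size_t, size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (size_t i = 0; i < streams.size(); ++i)
      if (!streams[i].empty()) heap.emplace(streams[i][0], i, 0);

    size_t produced = 0;
    while (!heap.empty() && produced < limit) {
      auto [pk, stream, pos] = heap.top();
      heap.pop();
      std::cout << "row with pk=" << pk << "\n";
      ++produced;
      if (pos + 1 < streams[stream].size())
        heap.emplace(streams[stream][pos + 1], stream, pos + 1);
    }
    return 0;
  }

With LIMIT 1 this touches at most one row per stream, which is exactly what the individual col1=const queries already achieve.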

Now, the question is whether this kind of query is frequent enough to justify implementing this optimization.

Posted in ideas, mysql, mariadb on May 14th, 2014 by spetrunia | 0 Comments

A proposal for method of delivering optimizer bug fixes

Working on query optimizer bugs can be a rather frustrating experience. First, as soon as some query doesn't run as fast as it theoretically could, people will consider it a bug. On one hand that's great, you get a constant stream of user input, but on the other hand you end up with a whole pile of "bugs" which you can't hope to finish.

What's more frustrating is that even if you manage to create a fix for an optimizer bug, there's a chance it won't be allowed into the next GA (currently 5.0.70) or approaching-GA (currently 5.1.30) release (GA is our term for "stable" or "release").

The reason behind this is that most optimizer bugfixes cause the optimizer to pick different query plans, and there's no way to guarantee that the fix will be a change for the better for absolutely everyone. Experience shows that it is possible to have a query that hits two optimizer bugs/deficiencies at once in such a way that they cancel each other out, so that fixing just one of them causes problems. A more common scenario is when the optimizer makes the best choice but it just doesn't have all the info. The top five unknowns are

  • data distributions
  • correlations between data columns
  • correlations between data value and physical record order
  • highly selective conditions on non-indexed columns
  • hot/cold caches

None of those can be easily checked, so we're very conservative and have the "no query plan changes in GA versions" rule.

The problem is that our GA releases turn out not to be very frequent, and one may have to wait a looong time before the fix makes it into an official GA release. Bazaar and its ease of creating publicly-accessible branches have rectified the situation a bit, but most users want a binary, and we also don't want to end up having to maintain 2^N branches after N optimizer bugfixes.

The proposal

This aims at query optimizer ("here's a query which uses a non-optimal plan"-type) bugfixes that affect a small amount of code in a small number of places.

  • We’ll put the fix into both GA and next-after-GA versions.
  • For the next-after-GA version, just put the fix in, do not support the old behavior. That's the only feasible long-term option; we can't afford to support all the behaviors we've had at some point in the past.
  • For the GA version, make it possible to switch new behavior on and off. The default should be the old behavior (so we only put one “if” into the old execution path. Can one hope that *that* won’t break anything?).

The mechanism to turn on the new behavior will be a server command-line option, something like --with-bugfix=NNNN. It's possible to turn on multiple bugfixes by using the option several times:

mysqld --with-bugfix=37642  --with-bugfix=13356

or, in my.cnf syntax:

[mysqld]
...
with-bugfix=13356
with-bugfix=27432
...

The code of GA versions doesn't change much, so it should be tolerable to have, say, twenty "if (bugfix_nnn) {…} else {…}" branches. The mysqld binary only needs to know the numbers of the bugs it has switchable fixes for. If it is invoked with --with-bugfix=N where N is not a bug number it knows, it should issue a warning, something like this:

[Warning] This version doesn't have ability to switch fix BUG#NNNN , see
[Warning]   http://bugs.mysql.com/check-version.php?binary_version=X.Y.Z&bug=NNNN.

Visiting the printed URL takes you to the bugs database, which has information about which fixes appeared in which versions. It can tell you whether your binary already has the fix for BUG#NNNN integrated into the code or whether you need to upgrade, and in the latter case, which is the first version that has the needed bugfix.
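
For illustration, here is a minimal sketch of what the switch could look like inside the server. The names (known_switchable_fixes, enabled_fixes, bugfix_enabled, handle_with_bugfix_option) are invented for this sketch and are not actual mysqld code:

  #include <cstdio>
  #include <set>

  // Hypothetical registry of bug numbers this binary has switchable fixes for.
  static const std::set<int> known_switchable_fixes = {13356, 27432, 37642};
  // Fixes the user enabled via --with-bugfix=NNNN.
  static std::set<int> enabled_fixes;

  // Called once per --with-bugfix=NNNN option.
  void handle_with_bugfix_option(int bug_no) {
    if (known_switchable_fixes.count(bug_no)) {
      enabled_fixes.insert(bug_no);
    } else {
      fprintf(stderr,
              "[Warning] This version doesn't have ability to switch fix "
              "BUG#%d\n", bug_no);
    }
  }

  // Used at the few places in the optimizer where the behavior differs,
  // i.e. the "if (bugfix_nnn) {...} else {...}" branches mentioned above.
  bool bugfix_enabled(int bug_no) { return enabled_fixes.count(bug_no) != 0; }

  int main() {
    handle_with_bugfix_option(37642);  // known fix: gets enabled
    handle_with_bugfix_option(99999);  // unknown: prints the warning
    if (bugfix_enabled(37642))
      printf("using new (fixed) behavior for BUG#37642\n");
    else
      printf("using old behavior\n");
    return 0;
  }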

-end of proposal-

Any comments or feedback on this scheme are welcome. Their impact will be greater if they arrive before September 17. We're having a developer meeting on Sept 17-24, and I'll try to get this discussed and a decision made about it.

Posted in ideas, bugfixes on September 9th, 2008 by spetrunia | 17 Comments

Pluggable storage engine interface needs to support table name resolution hooks

I've started some attempts at coding the ha_trace storage engine I mentioned earlier. One of the first things that became apparent was that I need a way to put a hook into the table name resolution code so I can wrap tables into ha_trace objects.

The need to intercept table name resolution and do something other than looking at the .frm files is not unique to the call trace engine:

  • Remote storage engines would also benefit:
    • NDB has a whole chunk of code that ships .frm files from one connected mysqld instance to another. It doesn't hook into name resolution; it ships table definitions proactively, which could be a nuisance if you use a new mysqld node just to connect and run a few queries
    • Federated relies on the user to issue a CREATE TABLE statement that matches the remote table's DDL. There are no checks that the table definitions match; you can do "interesting" things like omitting columns or having index definitions that don't match the indexes on the remote table. I think this has little or no value. I'd prefer an approach where I provide the remote server/database/table name and the server figures out the DDL of the remote table itself.
  • Storage engines that have their own data dictionary could relieve MySQL of the need to use .frm files. I remember a discussion about somebody who had lots (tens of thousands iirc) of tables and was running into issues because of the need to open all those .frm files.

I think the above justifies adding name resolution hooks into the storage engine interface. I don't have enough time/cause to implement it myself; my intent is to do tracing, not name resolution. If you think you'll also benefit from the name resolution hook, please post a note here or at internals@mysql.com; perhaps we can collect enough voices to get it implemented.
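
To make the request more concrete, here is a purely hypothetical sketch of what such a hook could look like; neither the types nor the hook exist in the real storage engine interface, and the names are invented:

  // Purely hypothetical sketch of a name-resolution hook; nothing like this
  // exists in the real handlerton, and the names are made up.
  struct TableDefinition;  // stands in for whatever a .frm file describes

  struct EngineNameResolution {
    // Called before the server looks for db_name/table_name.frm on disk.
    // An engine may return its own definition: ha_trace could return the
    // underlying table's definition, a remote engine could fetch it from the
    // remote server, an engine with its own data dictionary could skip .frm
    // files entirely. Returning nullptr means "fall back to the .frm lookup".
    virtual TableDefinition *resolve_table(const char *db_name,
                                           const char *table_name) = 0;
    virtual ~EngineNameResolution() = default;
  };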

Posted in ideas on December 19th, 2007 by spetrunia | 0 Comments

An idea: create ha_trace tracing storage engine

Our experience in solving query optimizer problems shows that a good chunk of optimization problems come down to an incorrect choice of join order. The most frequent causes of these problems are

  1. The table engine returns index statistics or records-in-range numbers that are very far from reality;
  2. The WHERE clause has high-selectivity conditions for one table and low-selectivity conditions for the other. The optimizer is not aware of this and chooses a poor join order;
  3. Bugs or shortcomings in the optimizer or the table engine.

At the moment, investigating these kinds of problems is hard and time-consuming:

  • There is no easy way to find out what the storage engine has returned to the optimizer from records-in-range calls. One has to manually repeat the steps taken by equality propagation, constant table detection and other query rewrites, construct the table's condition and run the corresponding single-table EXPLAIN;
  • You have to construct and run separate queries to check whether the table/index statistics match the actual distribution of the data examined by the query (far-off statistics values for skewed data distributions are a frequent problem);
  • There is no easy way to obtain information about the selectivity of the table's condition. You either need a debugger or, in some cases, you can manage by doing non-trivial calculations over the values of the Handler_read_XXX counters.

A completely different problem is that some storage engines are now provided by parties other than MySQL, and we'll need to be able to find out who is at fault when something doesn't work with a 3rd-party engine.

Having spent some time thinking about both of the above, I’ve got an idea that both can be resolved by a single solution - write a tracing storage engine.

The ha_trace engine

The engine will be used this way:

  1. You take the tables that are involved in the problematic query and “wrap” them in ha_trace tables:
    CREATE TABLE trace_tbl1 (... copy the DDL ...) ENGINE=TRACE UNDERLYING_TBL='tbl1' 

    The trace tables look and can be queried exactly like their underlying tables

  2. You run the problematic query against the wrapper tables. The trace engine intercepts all storage engine calls and either just records them or accumulates some summary information.
  3. You obtain the trace results by querying some special $trace$results table.

The trace engine can do many practically useful things:

Statistics checker mode
This mode will check whether the engine statistics numbers are any good. ha_trace will intercept statistics and record retrieval API calls and compile a report on how big the differences were between the statistics and the observed table data. It could also build a histogram of the observed data, which would give a clue about whether running ANALYZE TABLE would help.
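
As a toy illustration of the interception idea (this is not the real handler API; the Engine interface and method names are made up), the tracer forwards an estimate call to the wrapped engine, counts the rows the scan actually returns, and reports the discrepancy:

  #include <cstdio>

  // Made-up minimal "engine" interface; the real storage engine API is much
  // larger, this only illustrates the wrap-and-intercept idea.
  struct Engine {
    virtual long estimate_rows_in_range(long min_key, long max_key) = 0;
    virtual bool next_row_in_range(long min_key, long max_key, long *row) = 0;
    virtual ~Engine() {}
  };

  // The tracing wrapper: forwards every call to the underlying engine and
  // remembers what was estimated vs. what was actually observed.
  struct TraceEngine : Engine {
    Engine *underlying;
    long last_estimate = 0, rows_actually_read = 0;
    explicit TraceEngine(Engine *e) : underlying(e) {}

    long estimate_rows_in_range(long min_key, long max_key) override {
      last_estimate = underlying->estimate_rows_in_range(min_key, max_key);
      return last_estimate;
    }
    bool next_row_in_range(long min_key, long max_key, long *row) override {
      bool found = underlying->next_row_in_range(min_key, max_key, row);
      if (found) rows_actually_read++;
      return found;
    }
    void report() const {
      printf("estimated %ld row(s), actually read %ld\n",
             last_estimate, rows_actually_read);
    }
  };

  // A fake underlying engine over a plain array, just to run the example.
  struct ArrayEngine : Engine {
    long data[10] = {1, 2, 3, 5, 8, 13, 21, 34, 55, 89};
    int pos = 0;
    long estimate_rows_in_range(long, long) override { return 3; }  // far off
    bool next_row_in_range(long min_key, long max_key, long *row) override {
      while (pos < 10) {
        long v = data[pos++];
        if (v >= min_key && v <= max_key) { *row = v; return true; }
      }
      return false;
    }
  };

  int main() {
    ArrayEngine real;
    TraceEngine trace(&real);
    long row;
    trace.estimate_rows_in_range(5, 60);
    while (trace.next_row_in_range(5, 60, &row)) {}
    trace.report();  // prints: estimated 3 row(s), actually read 6
    return 0;
  }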

Call trace mode
This is similar to what ltrace(1) does. The advantage over ltrace is that ha_trace knows the meaning of the calls and can produce more meaningful output. In particular, it can provide smarter call filtering and compression of repetitive call patterns.

Call trace logs can be used to check whether the performed calls are compliant with the storage engine API and to find out who the violator is, the SQL layer or the engine.

Query trace mode
This is similar to call trace but more high level. It will provide a view of how the query was executed, giving information about how many rows were scanned in each table and so forth. We'll need to use some kind of compression to reduce the size of the collected data. I'm thinking of building a regular expression/automaton with counters on the edges, so the collected data looks like this:

  t1.rnd_init(TRUE /*do sequential scan*/)= 0
  t2.index_init(key1);
  1000 * {
      t1.rnd_next()= 0
      t2.index_read(...)= {0 - 700 times, EOF=300 times}
      20 (min=1,max=200) * { t2.index_next_same() }
   }
   t2.index_end()=0;
   t1.index_end()=0;

If you haven’t already figured out how the EXPLAIN of this query looks, here it is:

explain select * from t1, t2 where t2.key=t1.col;
+----+-------------+-------+------+---------------+------+---------+---------+------+-------+
| id | select_type | table | type | possible_keys | key  | key_len | ref     | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+---------+------+-------+
|  1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL    | 1000 |       |
|  1 | SIMPLE      | t2    | ref  | a             | a    | 5       | t1.key1 |   22 |       |
+----+-------------+-------+------+---------------+------+---------+---------+------+-------+

The trace shows a lot of useful information:
* In 300 of 1000 cases, table t2 had no matching records
* When there were matches, there were on average 20 matches
* There was an off-the-scale case when there were 200 matching rows

And you can just see all this, without doing arithmetic exercises over 'Handler_read_%' status variables.
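
Going back to the compression idea mentioned above, here is a toy sketch (not ha_trace code) of the simplest useful step: collapsing runs of identical consecutive calls into "N * { call }" lines. The real thing would have to recognize repeating multi-call patterns, which is what the automaton-with-counters idea is about.

  #include <iostream>
  #include <string>
  #include <vector>

  int main() {
    // A recorded sequence of handler calls (as strings, for illustration).
    std::vector<std::string> calls = {
        "t1.rnd_next()= 0", "t2.index_read(...)= 0",
        "t2.index_next_same()= 0", "t2.index_next_same()= 0",
        "t2.index_next_same()= 0", "t1.rnd_next()= 0",
        "t2.index_read(...)= EOF"};

    // Collapse each run of identical consecutive calls into one line.
    for (size_t i = 0; i < calls.size();) {
      size_t j = i;
      while (j < calls.size() && calls[j] == calls[i]) ++j;
      if (j - i > 1)
        std::cout << (j - i) << " * { " << calls[i] << " }\n";
      else
        std::cout << calls[i] << "\n";
      i = j;
    }
    return 0;
  }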

Whether this is the right solution

A wrapper engine is the natural solution for troubleshooting problems between storage engines and the SQL layer (think of ltrace, strace, ODBC trace, etc). It is a reasonably good query trace solution too:

  • Per-table and per-query tracing is a fine level of granularity that is a big step ahead of counters
  • ha_trace adds exactly zero overhead when it is not enabled
  • Works for all current and future storage engines
  • ha_trace is probably the only tracing solution that doesn’t require making changes all over the code.

.
.
.
Having written all this in one sitting (well, also some subway standing), I’ll now try to spend the remainder of this weekend in a less geeky way.

Posted in ideas on July 7th, 2007 by spetrunia | 4 Comments

Nested-loops join speedup idea promoted to WL task

The idea for speeding up nested-loops join a little that I mentioned earlier has now been promoted into a worklog entry. It is now known as WL#3724 "Short-Cutting Join Execution: Speeding up star queries" and its text is publicly available at MySQLForge.

At the moment there is only a short description, but hopefully Martin Hansson (the assigned developer) will add more content there.

Posted in ideas, joins on April 18th, 2007 by spetrunia | 0 Comments

An idea how to speed up nested-loops join a little

Working on subquery optimizations, I got an idea for how to speed up join execution a little. Read on.

The idea

Consider a query:

select * from t1, t2, t3 where t2.key=t1.col1 and t3.key=t1.col2

Suppose the join order is t1, t2, t3, i.e. the EXPLAIN is like

+-------+------+---------------+------+---------+--------------+-..
| table | type | possible_keys | key  | key_len | ref          |
+-------+------+---------------+------+---------+--------------+-..
| t1    | ALL  | NULL          | NULL | NULL    | NULL         |
| t2    | ref  | key           | key  | 5       | test.t1.col1 |
| t3    | ref  | key           | key  | 5       | test.t1.col2 |
+-------+------+---------------+------+---------+--------------+-..

The important property is that access to t3 is independent of access to t2. MySQL's nested-loops join algorithm will run this as shown in the swimlane diagram below:

NL execution diagram

Here we assume that

  • table t2 has 4 rows such that t2.key=t1.col1
  • table t3 doesn’t have any rows such that t3.key=t1.col2

As soon as we make the first index lookup in table t3 (the one with the blue border), we see that there will be no matching row combinations for this row of t1. Nevertheless, the MySQL executor will proceed to examine different rows in table t2 (marked with a red border). This is redundant and can be easily avoided.

The implementation

The executor part is easy: just make the nested-loops join code "jump back" in cases like the one illustrated. If we don't find a match and the table's access method/selection condition does not depend on the immediately preceding table(s), go back to the last table that the table access does depend on. I don't have a ready term for this; the working name is "short-cutting".
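
Here is a toy sketch of the idea on the example above (made-up table structures, not the actual executor code): t2 is matched on t1.col1 and t3 on t1.col2, so when t3 turns out to have no match for the current t1 row, the inner loop jumps straight back to t1 instead of walking the remaining t2 matches.

  #include <cstdio>
  #include <vector>

  struct Row1 { int col1, col2; };

  int main() {
    // Join order t1, t2, t3, with t2.key=t1.col1 and t3.key=t1.col2.
    std::vector<Row1> t1 = {{1, 10}, {2, 20}};
    std::vector<int> t2 = {1, 1, 1, 1, 2};  // t2.key values
    std::vector<int> t3 = {20};             // t3.key values: nothing matches 10

    const bool short_cutting = true;
    long rows_examined = 0, combinations = 0;

    for (const Row1 &r1 : t1) {
      for (int k2 : t2) {
        ++rows_examined;
        if (k2 != r1.col1) continue;         // condition t2.key = t1.col1
        bool any_t3_match = false;
        for (int k3 : t3) {
          ++rows_examined;
          if (k3 == r1.col2) {               // condition t3.key = t1.col2
            any_t3_match = true;
            ++combinations;
          }
        }
        // Short-cutting: t3's condition depends only on t1, so if t3 had no
        // match, examining more t2 rows for this t1 row cannot help. Jump
        // back to t1 instead of continuing the scan of t2.
        if (!any_t3_match && short_cutting) break;
      }
    }
    printf("row combinations: %ld, rows examined: %ld\n",
           combinations, rows_examined);
    return 0;
  }

With the toy data above, the short-cut version examines 8 rows instead of the 15 the plain nested loop examines (set short_cutting to false to compare), while producing the same single row combination.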

The optimizer part is (as usual) more complicated. One way is to take the easy route: let the join optimizer (the part of the code that chooses the join order) remain unaware of short-cutting. Once the join order is produced, set up the executor to perform short-cutting where appropriate.

The bad part is that the join optimizer then doesn't account for short-cutting when comparing the costs of various join orders, which may lead it to choose non-optimal join orders. However, taking short-cutting into account won't be easy:

  • First, we’ll need to generate selection conditions for the join orders we consider (in other words, figure out if the EXPLAIN will have “Using WHERE”). Our current way of doing this will probably be too expensive to be done for each considered join order.
  • Second, we'll need to get an estimate of how often short-cutting will occur. That is, we'll need to know the probability that, for some *arbitrary* values of columns from preceding tables, table T will have no rows that satisfy the given condition COND(preceding_tables, T). This estimate is likely to end up being some wild guess like "let's use 0.5".

At the moment the easy route seems like the way to go.

2007-02-11: Fixed typos in EXPLAIN output

Posted in ideas on January 29th, 2007 by spetrunia | 5 Comments