Priority queue optimization for filesort is now visible in MariaDB 10.0

TL;DR: Priority queue optimization for filesort with small LIMIT is now visible in MariaDB: there is a status variable and you can also see it in the slow query log (KB page link).

The longer version:
One of the new optimizations in MySQL 5.6 is the ability to use a priority queue instead of sorting for ORDER BY … LIMIT queries. The optimization was ported into MariaDB 10.0 long ago, but we still get questions about whether and when it will be ported. I suppose the reason is that, apart from the query speed, you can’t see this optimization: neither EXPLAIN, nor EXPLAIN FORMAT=JSON, nor PERFORMANCE_SCHEMA, nor status variables give any indication of whether filesort used a priority queue or the regular quicksort+merge algorithm.
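The idea behind the optimization can be sketched in a few lines of Python (a conceptual model only, not the server's actual implementation): to answer ORDER BY … LIMIT n, keep a bounded heap of the n best rows seen so far, instead of sorting the whole input.

```python
import heapq

def order_by_limit(rows, key, n):
    """Return the n smallest rows by `key` without sorting all of `rows`.

    A max-heap of size n holds the current top-n candidates; each new row
    is compared against the worst candidate, so memory stays O(n) and the
    input is read only once.
    """
    heap = []  # max-heap emulated by negating the key
    for row in rows:
        item = (-key(row), row)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item > heap[0]:  # row's key is smaller than the current worst
            heapq.heapreplace(heap, item)
    # Return rows in ascending key order, like ORDER BY ... LIMIT n
    return [row for _, row in sorted(heap, key=lambda t: -t[0])]

# e.g. SELECT * FROM t ORDER BY col LIMIT 3
print(order_by_limit([5, 1, 9, 3, 7, 2], key=lambda r: r, n=3))  # [1, 2, 3]
```

The point of the trade-off: quicksort+merge touches (and possibly spills to disk) all N rows, while the priority queue only ever keeps LIMIT rows in memory.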

In MySQL 5.6 there is only one way to check whether filesort used a priority queue. You need to enable the optimizer trace (SET optimizer_trace='enabled=on'), and then run the query (not EXPLAIN, but the query itself). Then you can look into the optimizer trace and find something like this:

          "filesort_priority_queue_optimization": {
            "limit": 10,
            "rows_estimate": 198717,
            "row_size": 215,
            "memory_available": 262144,
            "chosen": true
          }
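Roughly speaking, the fields in the trace feed a simple feasibility check, sketched below (a simplified model; the server's real decision also weighs estimated costs):

```python
def pq_applicable(limit, row_size, memory_available):
    """Simplified model of the trace's decision: the priority queue needs
    room for limit+1 rows of row_size bytes in the sort buffer.
    (The server's actual check also compares estimated costs.)"""
    return (limit + 1) * row_size <= memory_available

# The numbers from the trace excerpt above:
print(pq_applicable(limit=10, row_size=215, memory_available=262144))  # True
```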

MariaDB doesn’t support optimizer_trace at the moment. Even if it did, I think it would be wrong to require users to dig into the optimizer trace to find out which query plan was picked.

The natural place to show the optimization would be the EXPLAIN output. We could show something like “Using filesort (priority queue)”. That was my initial intent, but after looking into the source code, it turned out to be difficult to do: the logic that chooses between quicksort+merge and the priority queue is buried deep inside the query execution code. (As if the mess caused by late optimizations of ORDER BY and UNIONs didn’t teach anybody on the MySQL team anything.)

As for query execution, there are two facilities where one can record execution-time details about the query plan: status variables and the slow query log.

Status variables

We’ve added a Sort_priority_queue_sorts status variable. Now the list of sort-related status variables is:

MariaDB [test]> show status like 'Sort%';
| Variable_name             | Value |
| Sort_merge_passes         | 0     |
| Sort_priority_queue_sorts | 1     |
| Sort_range                | 0     |
| Sort_rows                 | 11    |
| Sort_scan                 | 1     |

(Sort_range + Sort_scan) gives the total number of sorts. Sort_priority_queue_sorts gives the number of sorts that were done using a priority queue.

Slow query log

Percona’s Extended statistics in the slow query log feature shows Filesort/Filesort_on_disk fields. We thought adding information about priority queue use there would be appropriate. Now slow query log entries look like this:

# Time: 140714 18:30:39
# User@Host: root[root] @ localhost []
# Thread_id: 3  Schema: test  QC_hit: No
# Query_time: 0.053857  Lock_time: 0.000188  Rows_sent: 11  Rows_examined: 100011
# Full_scan: Yes  Full_join: No  Tmp_table: No  Tmp_table_on_disk: No
# Filesort: Yes  Filesort_on_disk: No  Merge_passes: 0  Priority_queue: Yes
SET timestamp=1405348239;
select * from t1 where col1 between 10 and 20 order by col2 limit 100;

pt-query-digest is able to parse slow query logs with the new field.


What about PERFORMANCE_SCHEMA? After all, it is the most powerful tool for tracking query execution. It has “absorbed” some status variables into the events_statements_history table. For sorting, it has these columns:

| SORT_MERGE_PASSES       | bigint(20) unsigned              | NO   |     | NULL    |       |
| SORT_RANGE              | bigint(20) unsigned              | NO   |     | NULL    |       |
| SORT_ROWS               | bigint(20) unsigned              | NO   |     | NULL    |       |
| SORT_SCAN               | bigint(20) unsigned              | NO   |     | NULL    |       |

Should we add a SORT_PRIORITY_QUEUE_SORTS column there? We didn’t add it to 10.0 for now because of compatibility concerns: some tools may rely on the structure of PERFORMANCE_SCHEMA tables. Also, PERFORMANCE_SCHEMA table definitions are stored on disk, and one would have to run mysql_fix_privilege_tables after a minor upgrade, which is not good.

Posted in EXPLAIN, mysql, mariadb on July 14th, 2014 by spetrunia | | 0 Comments

New feature in MariaDB 10.1: ANALYZE statement

Last week, yours truly pushed a new feature into the MariaDB 10.1 tree: the ANALYZE statement.

The idea of this feature is to make it easy to compare the query plan with the actual query execution. The ANALYZE statement runs the statement and produces EXPLAIN-like output, in which the optimizer’s estimates are followed by the numbers observed when running the query. The output looks like this:



  • Next to rows there is an r_rows column, which shows how many records were actually read from the table.
  • Next to filtered there is an r_filtered column, which shows what fraction of records remained after the part of the WHERE condition attached to the table was checked.
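To make the relationship between the two counters concrete, here is a toy model (not server code): if a scan reads R rows and F of them pass the attached condition, ANALYZE would report r_rows = R and r_filtered = 100 * F / R.

```python
def analyze_scan(rows, attached_cond):
    """Toy model of ANALYZE's per-table counters: count rows read from the
    table (r_rows) and the percentage surviving the attached WHERE part
    (r_filtered)."""
    r_rows = 0
    passed = 0
    for row in rows:
        r_rows += 1
        if attached_cond(row):
            passed += 1
    r_filtered = 100.0 * passed / r_rows if r_rows else 100.0
    return r_rows, r_filtered

# 1000 rows scanned; the condition col < 250 keeps a quarter of them
rows = [{"col": i} for i in range(1000)]
print(analyze_scan(rows, lambda r: r["col"] < 250))  # (1000, 25.0)
```

A large gap between the optimizer's filtered estimate and the observed r_filtered is exactly the kind of discrepancy ANALYZE is meant to expose.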

I think this should explain the feature. If you want more details, please refer to the KB article on the ANALYZE statement. It also discusses the meaning of the EXPLAIN output above.

Technical details and further plans

ANALYZE currently uses its own counters. Counting is done for all queries, including non-ANALYZE ones. This should be okay (i.e. have no visible overhead) as long as counting just increments integer variables in the query plan, without doing any atomic operations or making syscalls.

The upside of this approach is that it’s now trivial to make EXPLAIN in the slow query log also print ANALYZE output. When a query runs slowly, you will be able to see exactly where the optimizer was wrong.

The downside is that getting more data will not be as easy. So far, the most requested numbers beyond r_rows and r_filtered have been r_time (the amount of time spent reading the table) and r_io (the amount of IO done on the table). Counting the time spent reading each row will impose CPU overhead; this is a known problem. Counting IO is just incrementing a counter, but it will require interaction between the ANALYZE code and the storage engine code, which adds complexity.

There is the PERFORMANCE_SCHEMA, where others have already spent a lot of effort on counting wait time and IO. It’s tempting to reuse it. The problem is, P_S collects the wrong data: its counters are global, while ANALYZE needs to count IO for each table reference separately. Consider a self-join: from P_S’s point of view, both scans read from the same table; from ANALYZE’s point of view, they are reads from two different table references. I’m currently not sure whether ANALYZE should or could rely on PERFORMANCE_SCHEMA.
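The self-join point can be made concrete with a sketch (hypothetical names, not actual server structures): ANALYZE needs one counter per table *reference* in the plan, whereas a counter keyed only by table name would merge both sides of a self-join.

```python
# Two references to the same table, as in: SELECT ... FROM t1 AS a, t1 AS b
class TableRef:
    """One table reference in a query plan, carrying its own ANALYZE counter."""
    def __init__(self, table_name, alias):
        self.table_name = table_name
        self.alias = alias
        self.r_rows = 0          # rows read through this reference

    def read_row(self):
        self.r_rows += 1

a = TableRef("t1", "a")
b = TableRef("t1", "b")
for _ in range(100):
    a.read_row()                 # outer side: scans 100 rows
    for _ in range(3):
        b.read_row()             # inner side: 3 lookups per outer row

# Per-reference counters stay distinct even though the table is the same:
print(a.r_rows, b.r_rows)        # 100 300
# A counter keyed only by table name could only report the sum, 400.
```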

A totally different angle is that the tabular EXPLAIN output doesn’t leave room for much extra data (for example, how many rows were there before and after GROUP BY?). Here the solution is clear, I think: support EXPLAIN FORMAT=JSON, and then add ANALYZE FORMAT=JSON, where we can provide lots of detail.

Posted in EXPLAIN, mysql, mariadb on June 30th, 2014 by spetrunia | | 3 Comments

Engineering behind EXPLAIN FORMAT=JSON (or lack thereof)

MySQL 5.6 has added support for EXPLAIN FORMAT=JSON. The basic use case for the feature is that one can look at the JSON output and see more details about the query plan. More advanced or specific use cases are difficult, though. The problem is that you can’t predict what EXPLAIN FORMAT=JSON will produce: there is no documentation or any kind of convention regarding the contents of the JSON document you will get.

To make sure I’m not missing something, I looked at MySQL Workbench. MySQL Workbench has a feature called Visual Explain. If you want to use it, prepare to see this a lot:


In Workbench 6.1.4 you get it for (almost?) any query with a subquery. In Workbench 6.1.6 (released last week), some subqueries work, but it’s still easy to hit a query whose EXPLAIN JSON output confuses Workbench.

Looking at the source code, this seems to be just the start of it. The code in MySQL Server is not explicitly concerned with having the output of EXPLAIN FORMAT=JSON conform to any convention. Workbench, in turn, has a rather ad-hoc “parser” that walks over the JSON tree and has arbitrary expectations about which nodes should appear in various parts of the document. When these two meet, bugs are a certainty. I suspect the real fun will start after a few releases of the Server (fixing things and adding new features) and of Workbench (trying to catch up with the new server while supporting the old ones).

My personal interest in all this is that we want to support EXPLAIN JSON in MariaDB. The MariaDB optimizer has extra features, so we will have to extend EXPLAIN JSON, and I was looking for a way to do it compatibly. However, the current state of EXPLAIN JSON in MySQL doesn’t give one a chance.

Posted in EXPLAIN, mysql, mariadb on May 23rd, 2014 by spetrunia | | 1 Comments


MariaDB 10.0 has the SHOW EXPLAIN feature. It was coded by yours truly and first introduced about a year ago. The latest release of MySQL 5.7 introduced EXPLAIN FOR CONNECTION, which looks rather similar to MariaDB’s SHOW EXPLAIN. I was wondering how the two compare.

The basics

The usage scenarios are similar. In both cases, you first need the connection id of a running query. It is typically obtained by running SHOW PROCESSLIST:

MariaDB [(none)]> show processlist;
| Id | User | Host      | db       | Command | Time | State        | Info                        | Progress |
|  2 | root | localhost | dbt3sf10 | Query   |    2 | Sending data | select count(*) from orders |    0.000 |
|  3 | root | localhost | NULL     | Query   |    0 | init         | show processlist            |    0.000 |

Then, in MariaDB, you run:

MariaDB [(none)]> show explain for 2;
| id   | select_type | table  | type  | possible_keys | key           | key_len | ref  | rows     | Extra       |
|    1 | SIMPLE      | orders | index | NULL          | i_o_orderdate | 4       | NULL | 14856637 | Using index |

and you get the EXPLAIN for the query that connection #2 is running. SHOW EXPLAIN always generates a warning; the warning contains the text of the query the EXPLAIN is for:

MariaDB [(none)]> show warnings;
| Level | Code | Message                     |
| Note  | 1003 | select count(*) from orders |

The idea behind this is that without the warning, it would be difficult to tell which query the EXPLAIN belongs to: the query you saw in SHOW PROCESSLIST may have finished and another one started.

SHOW EXPLAIN works for any query that EXPLAIN works for (starting from MariaDB 10.0.5, EXPLAIN UPDATE/DELETE/INSERT are supported). If you attempt to run SHOW EXPLAIN on a connection that is not running an EXPLAINable statement, you will get:

MariaDB [(none)]> show explain for 2;
ERROR 1933 (HY000): Target is not running an EXPLAINable command

Now, let’s take a look at MySQL’s EXPLAIN FOR CONNECTION:

MySQL [(none)]> explain for connection 1;
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
|  1 | SIMPLE      | orders | ALL  | NULL          | NULL | NULL    | NULL | 922880 | Using where |

This looks very similar to SHOW EXPLAIN, but there is no warning with the query text. The only way out I can see is to run SHOW PROCESSLIST again, find your query there, and look at the “Time” column: if the value is big enough, the query you see in the SHOW PROCESSLIST output was already running when you ran EXPLAIN FOR CONNECTION.

If the target connection is not running a query, you will get nothing:

MySQL [(none)]> explain  for connection 1;
Query OK, 0 rows affected (0.00 sec)

More details

Unlike SHOW EXPLAIN, EXPLAIN FOR CONNECTION supports flags. That is, you can run EXPLAIN [EXTENDED|PARTITIONS|FORMAT=JSON] FOR CONNECTION <id>. However, EXPLAIN EXTENDED will not print the warning with the query after rewrites, and EXPLAIN FORMAT=JSON will not print attached_condition entries. I think these are caused by limitations of the EXPLAIN FOR CONNECTION code.

There are cases where EXPLAIN FOR CONNECTION will produce output like this:

MySQL [(none)]> explain for connection 1;
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                |
|  1 | SIMPLE      | NULL  | NULL | NULL          | NULL | NULL    | NULL | NULL | Plan isn’t ready yet |

In some cases, a part of the plan is not ready:

MySQL [(none)]> explain for connection 1;
| id | select_type | table    | type | possible_keys | key  | key_len | ref  | rows   | Extra                |
|  1 | PRIMARY     | NULL     | NULL | NULL          | NULL | NULL    | NULL |   NULL | Plan isn’t ready yet |
|  2 | SUBQUERY    | lineitem | ALL  | NULL          | NULL | NULL    | NULL | 974084 | NULL                 |

SHOW EXPLAIN in MariaDB used to produce similar partial query plans, but since release 10.0.5 it always shows a full query plan. It would be interesting to discuss the reasons for this, but that discussion won’t fit into this blog post.

Another interesting question is whether MySQL 5.7’s EXPLAIN FOR CONNECTION allows printing the query plan into the slow query log. Before MariaDB 10.0.5, the SHOW EXPLAIN code didn’t allow this. Changes in MariaDB 10.0.5 made printing the EXPLAIN at the end of the query easy, and now MariaDB can save EXPLAINs in the slow query log. Will MySQL 5.7 follow and also add such a feature?

Posted in Uncategorized, EXPLAIN, mysql on November 14th, 2013 by spetrunia | | 0 Comments

EXPLAIN: "Using join cache" renamed to "Using join buffer"

As soon as I finished writing this post about “Using join cache”, it was apparent that “Using join cache” is poor wording. First, the corresponding server variable is called @@join_buffer_size, not join cache size, and second, there is really no cache involved.

We’ve had a discussion about what this should be called. Some sources use the term Block nested-loops join, but we’ve settled on “Using join buffer”. Another change is that we’ve decided to move the note one line down, to the table that “does the buffering”. As a result, what was this:

mysql> explain select * from t1, t2 where t1.col < 10 and t2.col < 'bar';
| id | select_type | table | type  |     | Extra                         |
|  1 | SIMPLE      | t1    | range |     | Using where; Using join cache |
|  1 | SIMPLE      | t2    | range |     | Using where                   |

became this:

mysql> explain select * from t1, t2 where t1.col < 10 and t2.col < 'bar';
| id | select_type | table | type  |     | Extra                          |
|  1 | SIMPLE      | t1    | range |     | Using where                    |
|  1 | SIMPLE      | t2    | range |     | Using where; Using join buffer |

The change was pushed into the 5.1.19 tree. MySQL 5.1.18 is the only version that shows “Using join cache”. I’ve also updated the original explanation post to show the final variants of the EXPLAIN output.

Posted in EXPLAIN on June 10th, 2007 by spetrunia | | 3 Comments

Use of join buffer is now visible in EXPLAIN

* Update: s/Using join cache/Using join buffer/; changed to show the final variants of EXPLAIN output, as described here
* Update: s/join_buff_size/join_buffer_size/

Starting from 5.1.18, EXPLAIN output may show “Using join buffer”, as in this example:

mysql> explain select * from t1, t2 where t1.col < 10 and t2.col < 'bar';
| id | select_type | table | type  |     | Extra                          |
|  1 | SIMPLE      | t1    | range |     | Using where                    |
|  1 | SIMPLE      | t2    | range |     | Using where; Using join buffer |

The join buffer is actually not a new feature. It has been available in MySQL at least since version 4.0, but for all this time it has remained invisible and undocumented. The only thing that indicated its presence was the @@join_buffer_size server variable.

We’re trying to gradually make EXPLAIN show more information. Georgi Kodinov was fixing BUG#27531 and used the occasion to make join buffering show up in EXPLAIN output.

If you already know how MySQL’s join buffering works, that’s all the news. If not, the remainder of this post gives a hopefully readable explanation of how join buffering works and when it is used.

How join buffering works

Let’s start with the regular Nested Loops Join. Suppose we have the join query

select * from t1, t2, t3 where t2.key1=t1.col1 and t3.key1<40;

and the query plan is as shown in this EXPLAIN output:

    | table | type  | possible_keys | key  | key_len | ref          | rows | Extra       |
    | t1    | ALL   | NULL          | NULL | NULL    | NULL         |   10 |             |
    | t2    | ref   | key1          | key1 | 5       | test.t1.col1 |    2 | Using where |
    | t3    | range | key1          | key1 | 5       | NULL         |   40 | Using where |

When no join buffering is used, the query will be executed as follows:

  for each record t1rec in table t1
    for each record t2rec in t2 such that t2rec.key1=t1rec.col1
      for each record t3rec in t3 such that t3rec.key1<40
        pass the (t1rec, t2rec, t3rec) row combination to output;

Graphically the execution flow can be depicted as follows (yellow are the table scans, blue are the table rows):


From the code and the picture we see that:

  • Table t2 is scanned several times, but each scan accesses a different part of the table
  • Table t3 is scanned many times, and all the scans are identical

It is apparent that the second and subsequent scans of table t3 bring no new information, so they can be removed. We do not have to re-scan t3 for every row combination from tables t1 and t2. Instead, we can accumulate a batch of such row combinations and then do one t3 scan for all of them. This is what join buffering is.

In pseudo-code, the execution will look as follows:

  for each record t1rec in table t1
    for each record t2rec in t2 such that t2rec.key1=t1rec.col1
      put (t1rec, t2rec) into the buffer
      if (buffer is full)
        flush_buffer();
  flush_buffer();   /* process the last, partially filled buffer */

  flush_buffer() {
    for each record t3rec in t3 such that t3rec.key1<40
      for each record (t1rec, t2rec) in the buffer
        pass the (t1rec, t2rec, t3rec) row combination to output;
    empty the buffer;
  }

And graphically it will look as follows:


The EXPLAIN output will be as follows:

explain select * from t1,t2,t3 where t2.key1 = t1.col1 and t3.key1<40;
    | table | type  | possible_keys | key  | key_len | ref          | rows | Extra                          |
    | t1    | ALL   | NULL          | NULL | NULL    | NULL         |   10 |                                |
    | t2    | ref   | key1          | key1 | 5       | test.t1.col1 |    2 | Using where                    |
    | t3    | range | key1          | key1 | 5       | NULL         |   40 | Using where; Using join buffer |

In this example join buffering is used for one table, but it can be used for several tables as well. MySQL uses join buffering whenever it can: access to some table tbl_x will be buffered if

  • The SELECT does not have an ORDER BY clause
  • We’re not at the top-level “select” of a multi-table UPDATE
  • tbl_x is accessed using an “independent” access method: ALL, index, range, or index_merge.
  • tbl_x is not on the inner side of an outer join

The server variable @@join_buffer_size specifies how much memory MySQL should allocate for each buffer. That is, if two tables use buffering, MySQL will allocate two buffers of @@join_buffer_size bytes each.
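The pseudocode above can be turned into runnable Python (an illustration of the algorithm, not the server's implementation): rows from the outer side accumulate in a fixed-size buffer, and the inner table is scanned once per buffer-load instead of once per outer row.

```python
def bnl_join(outer_rows, inner_rows, join_cond, buffer_size):
    """Block nested-loop join: buffer outer rows and scan the inner table
    once per full buffer, instead of once per outer row."""
    result = []
    buf = []
    inner_scans = 0

    def flush():
        nonlocal inner_scans
        if not buf:
            return
        inner_scans += 1          # one scan of the inner table per flush
        for inner in inner_rows:
            for outer in buf:
                if join_cond(outer, inner):
                    result.append((outer, inner))
        buf.clear()

    for outer in outer_rows:
        buf.append(outer)
        if len(buf) == buffer_size:
            flush()
    flush()                        # don't forget the last, partial buffer
    return result, inner_scans

rows, scans = bnl_join(range(10), range(5), lambda o, i: o % 5 == i,
                       buffer_size=4)
print(len(rows), scans)  # 10 matches, 3 inner scans instead of 10
```

With a bigger @@join_buffer_size-style buffer, fewer flushes happen and the inner table is scanned fewer times, which is exactly why the optimization pays off.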

Posted in EXPLAIN, how-it-works on May 16th, 2007 by spetrunia | | 6 Comments