EXPLAIN FORMAT=JSON vs regular EXPLAIN

Valeriy has mentioned that MySQL manual has no documentation about EXPLAIN FORMAT=JSON feature. There was a bug filed for this since September, 2012. I had spent some time looking at EXPLAIN FORMAT=JSON, so I thought I would share my findings.

The first blog post about EXPLAIN FORMAT=JSON was made by Sheeri. She listed these features of FORMAT=JSON:

  • attached_condition and implicit casts
  • used_key_parts field

In addition to those, I was able to find

  • Pushed index condition information
  • More information about sorting/grouping
  • Subquery attachment points information

Let’s look at these in greater detail:

Pushed index condition information

This is very similar to conditions attached to tables - instead of “Using index condition” one can see the pushed condition itself:

explain format=json select * from Country where Name like 'p%' and Population > 3*1000*1000;
  "query_block": {
    "select_id": 1,
    "table": {
      "table_name": "Country",
      "access_type": "range",
      "possible_keys": [
        "Name",
        "Population"
      ],
      "key": "Name",
      "used_key_parts": [
        "Name"
      ],
      "key_length": "52",
      "rows": 12,
      "filtered": 100,
      "index_condition": "(`world`.`Country`.`Name` like 'p%')",
       “attached_condition”: “(`world`.`Country`.`Population` > (((3 * 1000) * 1000)))” 
    }
  }

This is very useful. I wish the condition was printed without the over-quoting of every identifier.

More information about sorting/grouping

The old “Using temporary; Using filesort” was replaced with ordering_operation and grouping_operation nodes. The nodes always have using_filesort attribute, and may optionally have using_temporary_table attribute.

explain format=json select * from City order by Population desc limit 1;
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    “ordering_operation”: {
      “using_filesort”: true,
      “table”: {
        “table_name”: “City”,
        “access_type”: “ALL”,
        “rows”: 4249,
        “filtered”: 100
      }
    }
  }
}

This provides slightly more information than the old “Using temporary; Using filesort” line. In the following example, one can see that “Using temporary” is done to handle GROUP BY, while filesort is used for GROUP BY:

explain format=json
select sum(population) as city_ppl, Country
from City
group by Country 
order by city_ppl desc;
*************************** 1. row ***************************
EXPLAIN: {
  “query_block”: {
    “select_id”: 1,
    “ordering_operation”: {
      “using_filesort”: true,
      “grouping_operation”: {
        “using_temporary_table”: true,
        “using_filesort”: false,
        “table”: {
          “table_name”: “City”,
          “access_type”: “ALL”,
          “rows”: 4409,
          “filtered”: 100
        }
      }
    }
  }
}

It is also possible to encounter ordering_operation with using_filesort:false, which means that the optimizer relies on the query plan to produce the data in required ordering.

Subquery attachment points

EXPLAIN FORMAT=JSON shows where the subqueries are attached to queries. In tabular form, you see something like this:

+----+--------------------+-----------+------+---------------+------+---------+------+------+-------------+
| id | select_type        | table     | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+--------------------+-----------+------+---------------+------+---------+------+------+-------------+
|  1 | PRIMARY            | Country   | ALL  | Name          | NULL | NULL    | NULL |  165 | Using where |
|  3 | SUBQUERY           | satellite | ALL  | NULL          | NULL | NULL    | NULL | 1041 | NULL        |
|  2 | DEPENDENT SUBQUERY | City      | ALL  | NULL          | NULL | NULL    | NULL | 4409 | Using where |
+----+--------------------+-----------+------+---------------+------+---------+------+------+-------------+

and it is not apparent how subqueries are attached to queries. For basic queries, this is not a problem - just look at the query. However, if the query has VIEWs or FROM clause subqueries, it may be difficult to recognize the query plan structure. In any case, with EXPLAIN FORMAT=JSON, subqueries look like this:

*************************** 1. row ***************************
EXPLAIN: {
  “query_block”: {
    “select_id”: 1,
    “table”: {
      “table_name”: “Country”,
      “access_type”: “range”,
      “possible_keys”: [
        “Name”
      ],
      “key”: “Name”,
      “used_key_parts”: [
        “Name”
      ],
      “key_length”: “52″,
      “rows”: 2,
      “filtered”: 100,
      “index_condition”: “(`world`.`Country`.`Name` in (’France’,'Germany’))”
    },
    “select_list_subqueries”: [
      {
        “dependent”: true,
        “cacheable”: false,
        “query_block”: {
          “select_id”: 2,
          “table”: {
            “table_name”: “City”,
            “access_type”: “ALL”,
            “rows”: 4409,
            “filtered”: 100,
            “attached_condition”: “(`world`.`City`.`Country` = `world`.`Country`.`Code`)”
          }
        }
      }
    ]
  }
}

which makes it more apparent where the subquery is attached and how often it is re-executed.

My personal impressions

EXPLAIN FORMAT=JSON looks like a project that was started with good intents but wasn’t finished. I’ve found a couple of bugs by just playing around (1, 2). A number of things look out of place:

  • Range checked for each record” still displays “index map: 0xNNNN”, leaving one to do binary arithmetic with index numbers to figure out which indexes will be checked
  • We’ve got used_key_parts. Is there any need to print key_length anymore?
  • index_merge is still described with notation like key=”intersect(key1,key2)”, and key_length=”5,5″ (where’s the used_key_parts for it?). This notation was the only way to fit the data into tabular format. With JSON, it would be natural to represent individual index scans as separate JSON objects
  • If a subquery is part of WHERE clause, attached_condition will print the whole subquery. That is, subquery’s WHERE condition is printed both in the parent query and in the subquery. This is redundant and annoying.
  • Items with default value are generally not printed. For example, possible_keys=NULL is not. However, using_filesort:false is printed, and so is filtered:100, which means no filtering happened. Why?

Overall, I don’t quite understand what EXPLAIN FORMAT=JSON was targeted at. If EXPLAIN FORMAT=JSON was targeted at humans, I would expect a more user-friendly output. Don’t print redundant information, don’t do redundant quoting, do a better job at indenting the JSON output.
If EXPLAIN FORMAT=JSON was targeted at machines, I would expect it to be fully documented. How can one expect somebody to write programs to process a data format that has no documentation?

Posted in Uncategorized on July 23rd, 2013 by spetrunia | | 0 Comments

MySQL 5.6, InnoDB statistics, and testcases for optimizer

One of the new features in MySQL 5.6 is persistent table statistics. The word “persistent” makes you think that table statistics will be very stable from now on.
This is generally true, but there are some exceptions. There is a particular case where InnoDB statistics may change with no warning, even when there is no user activity on the server. Consider the following pattern (which nearly all bug or feature testcases follow):

CREATE TABLE t1 (...) ENGINE=INNODB;
INSERT INTO t1 VALUES (1),(2),(3),(4),(5);
EXPLAIN SELECT * FROM t1;
EXPLAIN SELECT * FROM t1;

With MySQL 5.6, the following is possible

  • the first EXPLAIN will show that the optimizer expects table t1 to have 6 rows
  • the second EXPLAIN will show that the optimizer expects table t1 to have 5 rows

For me, this was a big surprise. After all, table t1 never had 6 rows. If there was some rounding which rounded up 5 to 6, why did the number go back down to 5 again?

I debugged it, and discovered that

  1. Multi-row INSERT will increment #records-in-table statistics as many times as many rows it has inserted
  2. There is also dict_stats_thread() in storage/innobase/dict/dict0stats_bg.cc. which re-calculates table statistics in the background.

And what happens is:

  1. INSERT inserts a record
  2. dict_stats_thread() comes and re-calculates the statistics (the new record is counted)
  3. INSERT increments the statistics by one (because it has inserted a record on step#1)

This way, right after INSERT has finished, the table statistics will report more records. The number will go back down within a second (because dict_stats_thread() will re-calculate statistics again). This slight and temporary inaccuracy in statistics should not be a problem for production environments. This is a big deal for test cases in mysql-test suite, though. Test cases are special:

  1. Test datasets are often small. Even an off-by-one difference may cause a different query plan to be chosen. For optimizer bugs, this means that the testcase might be no longer hitting the bug it was made for.
  2. Test cases often include EXPLAIN output in .result file. EXPLAIN output has “rows” column, and mysql-test-run framwork will interpret off-by-one difference as a test failure

I’ll repeat myself: the above two properties mean that with MySQL 5.6, it is not possible to write optimizer+InnoDB testcases the way we used to write them before. I’ve observed it in practice - testcases will fail sporadically, depending on how quickly your system is running.

But wait, what about Oracle? How did they manage to develop MySQL 5.6 with this? I was really interested, too, so I dug in the revision history. Oracle developers’ approach was to run ANALYZE TABLE after filling the table. It took them 33 patches to put ANALYZE TABLE everywhere where it was needed (examples: 1, 2, 3, 4, …) but they did it.

The take-away is: if you’re working on a testcase for an optimizer bug in MySQL 5.6 - put an ANALYZE TABLE statement after all the INSERTs but before the SELECTs. This will make your testcase stable.

For the record, we in MariaDB 10.0 did not take the 33 patches with ANALYZE TABLE statements. For now, we have decided to run mysql-test testsuite with --skip-innodb-stats-persistent option. Judging from the EXPLAIN outputs in the tests, that has achieved the same goal (same query plans), and the goal was achieved with just one patch. :-)

Posted in Uncategorized on July 17th, 2013 by spetrunia | | 0 Comments