Fun MySQL fact of the day: collation filtering

Yesterday, we considered some cases where MySQL may indicate Using where; Using index in the Extra field of a query plan. Today, we're going to look at one more case that existed up until MySQL 5.7, and while I haven't made a habit of digging up too many bugs as fun facts, this one was a shocker to me, personally, and is something you'll want to keep in mind.

In MySQL (in general, really), a collation is a set of rules that governs the way character sets are evaluated. For example, a character set can have a case-insensitive collation that considers 'a' and 'A' as equal (which is the out-of-the-box default) or one that considers 'Ö' and 'OE' equal. Another type of collation is padding: PAD SPACE vs. NO PAD. These properties specify the equality of 2 string types by either ignoring or not ignoring the white-space trailing a string. For example, in a PAD SPACE collation, 'a' equals 'a ', but not in a NO PAD collation. Until MySQL 8.0.0, all collations were PAD SPACE, and you can demonstrating it by running SELECT 'a ' = 'a';.

Over the last couple days, we've seen a couple situations where the MySQL query optimiser may or may not decide to do additional filtering on a query's results. This decision is based on a set of rules dictating when MySQL can "eliminate" a specific condition that MySQL knows is guaranteed to be true. For example, if a query with WHERE a=1 AND b=1 is run against a covering index in InnoDB, (a,b), MySQL knows that it doesn't have to do any extra filtering (InnoDB won't lie) and can drop both conditions on a=1 and b=1 thereby preventing any supplemental filtering. Now, consider a similar SELECT ... WHERE a='a ': in a collation with PAD SPACE, we expect all records starting with a and ending zero or more spaces. But, until MySQL 5.7, the optimiser assumed otherwise and excluded all string types from condition elimination and always filtered them (bug #66983).

Now, when I said this was a bug, well, it was when all you have are PAD SPACE collations. But for NOPAD collations, this isn't a bug: when you use NOPAD on a CHAR type field, you will continue to get supplemental filtering due to the way storage engines may choose to optimise CHAR storage. Thankfully, it's all fixed up the best it can be and you just have to remember the trade-offs of collation when designing your table.