Fun MySQL fact of the day: duplicate weedout

Last week, we started looking at some of MySQL's semi-join optimisations for subqueries. We've already covered materialisiation, FirstMatch, LooseScan, and table pullout optimisations, but there's still one more to clover: the DuplicateWeedout optimisation, and then thistle be the last topic we discuss about about subqueries.

If you remember from last week, one of the guarantees that MySQL makes when resolving queries with subqueries is that you will not receive duplicate rows from the outer table as a result of multiple matches in a subquery. But because, as we've discussed, MySQL uses a nested loop join (or some variant thereof) to resolve a subquery, this takes extra work. With both LooseScan and FirstMatch, MySQL is able to "short-circuit" the nested loop join; however, this may not always be the cheapest approach. And this is when MySQL may choose to use the DuplicateWeedout optimisation.

When MySQL deems that the DuplicateWeedout semi-join optimisation is the "cheapest" way to answer a query with a subquery, MySQL will create a temporary "weedout" table with a single, unique (hence indexed) VARBINARY column, rowids, using the default storage engine. Then, after all table joins for a given row in the outer table(s) have been resolved, MySQL will calculate a unique hash of the row IDs for the outer table and check for its presence in the temporary "weedout" table. If the row has already been seen before, the current row is dropped and execution continues. Otherwise, the new row ID is inserted into the "weedout" table and the row is returned (either to the client or to the next applicable join). You'll know when MySQL chooses the DuplicateWeedout strategy when you see Start temporary and End temporary in the Extra of your EXPLAIN.

Now, if you spent the weekend like me, pulling out weeds, after plantain even more ideas about weeds in your head, you may, in fact, be going nutsedge. But don't worry: I hear you thinking, "kudzu please stop?". And I will. Until tomorrow.