2019-05-17 23:06 UTC

Fun MySQL fact of the day: loose scan semi-join

Yesterday, we considered MySQL's FirstMatch semi-join optimisation, and while it's pretty simple and fairly straight-forward, some queries may just not benefit from it.

Let's go back to our on-going example relation of employee->company and imagine a company table with tens of millions of companies. Let's also imagine that each company may hire many tens to many tens-of-thousands of employees, of which likely less than, say, 10% have the same name. Now, let's further imagine we have need to find all companies that employ at least 1 individual named "Jane". Because we want to have a unique list of companies, one possible way of writing this query is as follows:

EXPLAIN SELECT id, name
FROM company
WHERE id IN (
    SELECT company_id
    FROM employee
    WHERE full_name = 'JANE'
);

For sanity's sake, let's say the employee table has a secondary index, (full_name,company_id). Now, without any type of semi-join optimisation, we have a pretty good idea that MySQL will do a full table scan on company and then look for any employee named "Jane" in the employee table for each row in company. Naturally, this will be slow, and even if we go back to the FirstMatch optimisation we discussed yesterday, MySQL will still need to perform a full table scan on company. This even holds true for table pullouts and materialization. There must be a better way! And there is: the LooseScan optimisation, which will invert the query, first looking for all employees named "Jane" and then, for each match, join the company table by primary key using company_id. Once MySQL finds a match, it will skip to the next "Jane" at the next company_id, skipping over all other "Jane"s at the current company.

Now, if you've used MySQL for a while, you may have heard the term "Loose Index Scan" before. It isn't something new to MySQL, but historically, it had been used only when performing a GROUP BY. While these two optimisations do both use a loose scan over a grouping of keys, when you're using a subquery, you'll know MySQL has chosen a LoseScan semi-join optimisation when the Extra column of your EXPLAIN shows LoseScan. This is not true for a GROUP BY, but that's a fun fact for another day.