Optimization for branched WHERE

Question

I have a query that executes against a table that can grow to millions of rows. The query comes out of a QA tool that we use that is outside the standard functionality of the DB (as far as what is indexed and how and why). The query is:

SELECT id FROM thisTable t
WHERE col = 'val'
AND ((not exists (SELECT 1 FROM thisTable WHERE refid = t.id) and refbool = 0) or refbool = 1)
ORDER BY newid()

Basically, let's say the table has id, refid, refbool, and col columns. So you could have data as follows:

  id  |  refid  |  refbool  |  col
------------------------------------
   1  |   NULL  |    0      |  val
   2  |   NULL  |    0      |  val
   3  |   NULL  |    0      |  val
   4  |    2    |    1      |  val
   5  |   NULL  |    0      |  val
   6  |    1    |    1      |  val

The query should never pick the rows for id in (1, 2) because they are referred to by other rows. It should only grab rows where refbool = 1, OR refbool = 0 AND that row's id is not any other row's refid. This statement is horribly non-performant, but I'm not sure what a better query would look like for this. Assume that no indexes, views, stored procedures, or other underlying machinations can be added - it must be a query.

The overall query is significantly larger, JOINS to two additional tables and gathers quite a bit of data. However, I've narrowed it does to this particular bit as commenting out this line takes the query execution time from 16s to <1s.

I'm also reordering the rows by newid() as I need to randomly select a sample item. Removing the ORDER BY also make the query significantly faster even leaving the third row in. It appears that the two operations combined causes the slowness. I've tried designing a CTE, but have failed to increase performance doing so.

I've looked at the execution plan. There are indices that would be added that would improve this query. However, performance of internal QA tools do not take precedence over performance in client production environments, and making changes to the structure in a QA environment for the utility in relation to indices, etc. invalidates it's usefulness as a QA environment as it will likely perform differently than a production environment.

I could certainly write a query that would perform worse than my current query by changing the logic of the query itself. I'm sure we all could. I'm asking to apply that sort of reasoning to improve the performance of the query instead.

Is refid always null when refbool equals 0? In other words, does your sample data faithfully reproduce actual data? — mustaccio, Aug 06 '20 at 20:45
That’s actually a good question. I believe it is the case, though that makes me wonder why the Boolean column exists at all. I’d have to dig into the sprocs to be absolutely certain. — Jesse Williams, Aug 06 '20 at 20:50

score 5 · Accepted Answer · answered Aug 09 '20 at 05:47

An execution plan was not included, but the typical issue with queries like this (sorting aside) is the optimizer choosing a nested loops anti semi join without a good supporting index. It can also be a rogue top (1), or a poorly-performing transformation to a semi-join with nested start-up filters and an anti-semi join.

Regardless, there are two usual workarounds:

Rewrite the OR manually as a UNION (or, if guaranteed disjoint, as a UNION ALL).
Rewrite the NOT EXISTS as a left join filtering the preserved side for NULL.

The following incorporates both:

DECLARE @thisTable table
(
    id integer PRIMARY KEY,
    refid integer NULL,
    refbool bit NOT NULL,
    col varchar(10) NOT NULL
);
INSERT @thisTable
    (id, refid, refbool, col)
VALUES
    (1, NULL, 0, 'val'),
    (2, NULL, 0, 'val'),
    (3, NULL, 0, 'val'),
    (4,  2  , 1, 'val'),
    (5, NULL, 0, 'val'),
    (6,  1  , 1, 'val');

SELECT
    U.id
FROM 
(
    -- T.refbool = 1
    SELECT T.id 
    FROM @thisTable AS T
    WHERE 
        T.col = 'val'
        AND T.refbool = 1

    -- Or (disjoint)
    UNION ALL

    -- T.refbool = 0 and not exists
    SELECT T.id 
    FROM @thisTable AS T
    LEFT JOIN @thisTable AS T2
        ON T2.refid = T.id
    WHERE 
        T.col = 'val'
        AND T.refbool = 0
        AND T2.id IS NULL
) AS U
ORDER BY 
    CHECKSUM(NEWID());

db<>fiddle online demo

For more alternatives on random ordering see the existing Q & A:

What is the best way to get a random ordering?.

Don't just try the top answer.

score 3 · Answer 2 · answered Dec 21 '20 at 00:32

There is one other option that Paul White has not considered. This is that the optimizer does not consider a BIT field to be constrained to 0 or 1, and therefore this can force it into a concatenated nested loops in order to conform with the OR predicate, when a hash or merge may work better.

A better option may be as follows:

SELECT id FROM thisTable t
WHERE col = 'val'
AND not exists (SELECT 1 FROM thisTable t2 WHERE t2.refid = t.id AND t1.refbool = 0)

What this does is rephrase what you are trying to do. You originally wrote: Give me all rows in table where either refbool=1, or refbool=0 and no matching rows in subquery.

Now you have: Give me all rows in table where there is no matching row in subquery for which refbool in the outer table=0.

This usually results in a regular anti-join, with a startup join predicate, and may get a hash or merge. Given that a BIT can only be 0 or 1, if refbool is 1 then the right side of the anti-join never returns a row, therefore the left side will return. Equally, if refbool=0 then the right side may or may not return a row, thereby preventing a row coming out of the left side.

This ends up with the same result as the original query. But it only works correctly if refbool is a BIT field and NOT NULL, or otherwise constrained such that the side of the OR in which the anti-join is NOT executed can be elided. This also works with semi-joins in the opposing fashion.

This is a long-standing bug-bear of mine, that in many cases SQL Server will not reason about the possibilities remaining after taking account of a predicate. This includes not just bit columns, but also columns with check constraints. It appears a lot when using a filtered index, and the predicate in the query is an inequality, here is an example: Query Plan. The same thing happens in an equality predicate where an anti-join strategy against the filtered index woud have been appropriate. But as you have seen, it's not just a limitation in filtered indexes (of which there are many).

Optimization for branched WHERE

2 Answers2