Getting Started with Adobe After Effects - Part 6: Motion Blur
First Time? You can support us by signing up. It takes only 5 seconds. Click here to sign up. If you already have an account, click here to login.
Loading

SQLServer Quiz 2010 - SQL Server Query Optimizer – Friend or Foe?

  • SQL Server Query Optimizer – Friend or Foe?

    How many of you think SQL Server Query Optimizer is a truly amazing piece of software? I do. It does an excellent work of interpreting your queries into a working query and returns a resultset back to you.

    “Wait a minute! What do you mean by interpreting?”. That’s a good question and I will come up with an example where the Query Optimizer tries hard but doesn’t come up with a good execution plan. The Query Optimizer has many rules of how to make up the best execution it can within a reasonable timeframe.

    There are many examples where the Query Optimizer doesn’t produce the very best plan and in most cases, the reason is that you haven’t given enough information. After this lesson, you will understand that giving more information than reasonable needed, the Query Optimizer can make shortcuts and come to a much better plan than before.

    This example is about using NOT IN predicate, and by just adding one more seemingly redundant predicate filter can make a huge difference.

    First of all, we have to setup a test environment. For all of the following queries, use TempDB.

    Now run this code

    -- Prepare Guesses table
    CREATE TABLE    dbo.Guesses
    (
        GuessID INT IDENTITY(1, 1)NOT NULL,
        PeopleID INT NOT NULL,
        Low INT NOT NULL,
        High INT NOT NULL,
        CONSTRAINT PK_Guesses PRIMARY KEY NONCLUSTERED
        (
        	GuessID
        )
    )
    GO
    ALTER TABLE     dbo.Guesses
    WITH CHECK
    ADD CONSTRAINT  CK_Sorted
    CHECK           (
    Low < High
    )
    GO
    
    CREATE UNIQUE NONCLUSTERED INDEX UX_Guesses ON dbo.Guesses(PeopleID, Low, High)
    GO
    
    CREATE NONCLUSTERED INDEX IX_Guesses ON dbo.Guesses(PeopleID) INCLUDE(Low, High)
    GO
    
    -- Populate Guesses table
    ;WITH cteSource(PeopleID, Low, High)
    AS (
            SELECT      1 +ABS(CHECKSUM(NEWID()))% 100000 AS PeopleID,
    1 +ABS(CHECKSUM(NEWID()))% 10000 AS Low,
    1 +ABS(CHECKSUM(NEWID()))% 10000 AS High
            FROM        master..spt_values AS v1
    INNER JOIN  master..spt_values AS v2 ON v2.type='P'
    AND v2.number BETWEEN 1 AND 100
    INNER JOIN  master..spt_values AS v3 ON v3.type='P'
    AND v3.number BETWEEN 1 AND 100
            WHERE       v1.type='P'
    AND v1.number BETWEEN 1 AND 100
    )
    INSERT          dbo.Guesses
    (
        PeopleID,
        Low,
        High
    )
    SELECT DISTINCT PeopleID,Low,High
        FROM cteSource
        WHERE Low < High
        ORDER BY Low,PeopleID
    GO
    
    -- Prepare Result table
    CREATE TABLE    dbo.Result
    (
        Value INT NOT NULL,
          PRIMARY KEY CLUSTERED
        (
        	Value
        )
    )
    GO
    
    -- Populate Result table
    INSERT  dbo.Result
    (
    Value
    )
    SELECT  1 +ABS(CHECKSUM(NEWID()))% 10000 AS Value
        FROM    master..spt_values
        WHERE   type='P'
        AND number BETWEEN 1 AND 20
    GO
    
    DBCC DBREINDEX('dbo.Result')
    GO
    
    -- Remove dual hits
    DELETE
    FROM    dbo.Guesses
    WHERE EXISTS(SELECT * FROM dbo.Result WHERE Value = Low)
    AND EXISTS(SELECT * FROM dbo.Result WHERE Value = High)
    GO
    
    DBCC DBREINDEX('dbo.Guesses')
    GO
    

    The goal of the query is to return all records (about 2,000 of 500,000 in my tests) that has one (and only one) hit for Low and High value. For example, if Low is 50 and High is 75, this record should only be returned if 50 is present in dbo.Result table and 75 is not present in dbo.Result table. Or the opposite, 75 is present in the dbo.Result table and 50 is not present in the dbo.Result table.

    If both 50 and 75 is present in the dbo.Result table, the record should not be returned. And of course the record should not be returned if both 50 and 75 is missing in the dbo.Result table.

    The original poster wrote a query (Query 1) that looked like this

    SELECT  PeopleID,
    Low,
    High
    FROM    dbo.Guesses
    WHERE   (
    Low IN(SELECT Value FROM dbo.Result)
    AND High NOT IN(SELECT Value FROM dbo.Result)
    )
    OR
    (
    Low NOT IN(SELECT Value FROM dbo.Result)
    AND High IN(SELECT Value FROM dbo.Result)
    )
    

    The query is quite understandable and it looks like you have given the Query Optimizer enough information, right? Now, let’s look at the metrics taken from both SQL Profiler and STATISTICS IO.

        CPU4,244
        Duration2,354
        Reads2,995,846
    
        Table 'Result'. Scan count 0, logical reads 2994284.
        Table 'Guesses'. Scan count 3, logical reads 1562.
        Table 'Worktable'. Scan count 0, logical reads 0.
    

    Wow! 2.5 seconds (on a dual core processor) and almost 3 million reads. No wonder original poster wanted a better query. And he did post an alternative, which looked like this (Query 2)

    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   Low IN(SELECT Value FROM dbo.Result)
    AND High NOT IN(SELECT Value FROM dbo.Result)
    
    UNION
    
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   Low NOT IN(SELECT Value FROM dbo.Result)
    AND High IN(SELECT Value FROM dbo.Result)
    

    Even if he now forces the table to be scanned twice we do get much better metric! See here

        CPU125
        Duration192
        Reads3,164
    
        Table 'Result'. Scan count 4, logical reads 8.
        Table 'Worktable'. Scan count 0, logical reads 0.
        Table 'Guesses'. Scan count 6, logical reads 3124.
        Table 'Worktable'. Scan count 0, logical reads 0.
    

    The query runs 12 times faster and uses only about 1/1000th the number of reads! This is important, not only from a performance viewpoint, but also from a cost perspective. If the database is resided in a cloud you probably pay per CPU second, per datastorage in MB and sometimes also per disk utilization (reads).

    At this point the original poster turned for help and wanted to see if there was another way to write the query to make if more efficient or to avoid the double scanning.

    As a starting point I took Query 1 as is and just added a seemingly redundant predicate filter.


    The only thing I did was to add a “not equal to“ filter for the outer value used in the NOT IN. Messy? It can look like it but I will try to explain it to you.

    Let’s pretend you have a value (High) of 5 the query is written the way that you want to see that 5 is not present in the IN clause. Well, if 5 is present in the IN clause the filter fails. What the “not equal to” does, is to exclude the obvious value of 5 and to see if there are other values in the IN list that has a match. So here is my first attempt (Query 3)

    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   (
                Low IN(SELECT Value FROM dbo.Result)
    AND High NOT IN(SELECT Value FROM dbo.Result WHERE Value <> High)
    )
    OR
    (
                Low NOT IN(SELECT Value FROM dbo.Result WHERE Value <> Low)
    AND High IN(SELECT Value FROM dbo.Result)
    )
    

    Look closely at the query. It is an exact replica of original poster Query 1, with the exception of the “not equal to” added filter “Value <> Low” and “Value <> High”. What did this do to the metrics? It turns out we got almost as good metrics as Query 2!

        CPU155
        Duration251
        Reads4,874
    
        Table 'Result'. Scan count 2, logical reads 4.
        Table 'Worktable'. Scan count 0, logical reads 0.
        Table 'Guesses'. Scan count 6, logical reads 4826.
        Table 'Worktable'. Scan count 0, logical reads 0.
    

    And still 9 times faster with a 1/600th the number of reads. It’s not as good as Query 2, but much better than Query 1.

    At this point I looked at Query 2 and tried to come up with a query without the UNION using the same technique for Query 3. Here is Query 4.

    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   Low IN(SELECT Value FROM dbo.Result WHERE Value <> High)
    OR High IN(SELECT Value FROM dbo.Result WHERE Value <> Low)
    

    It turns out Query 4 has the same metrics as Query 3!

        CPU151
        Duration258
        Reads4,874
    
        Table 'Result'. Scan count 2, logical reads 4.
        Table 'Worktable'. Scan count 0, logical reads 0.
        Table 'Guesses'. Scan count 6, logical reads 4826.
        Table 'Worktable'. Scan count 0, logical reads 0.
        

    At this point I am starting to look at the presence of worktables. They seem not necessary because we have no sorts nor any group by. The main reason for the existence of worktables at this stage is that we have not chosen proper index strategy. So for remedy this, let’s create two additional indexes like this

    CREATE NONCLUSTERED INDEX IX_Low ON dbo.Guesses(Low) INCLUDE(GuessID, PeopleID, High)
    GO
    
    CREATE NONCLUSTERED INDEX IX_HighON dbo.Guesses(High) INCLUDE(GuessID, PeopleID, Low)
    GO
    

    To make a long story short, let’s now see the new metrics for all four queries.

    --  Query 1
        CPU4,259
        Duration2,363
        Reads2,995,846
    
        Table 'Result'. Scan count 0, logical reads 2994284.
        Table 'Guesses'. Scan count 3, logical reads 1562.
        Table 'Worktable'. Scan count 0, logical reads 0.
    
    -- Query 2
        CPU16
        Duration165
        Reads133
    
        Table 'Worktable'. Scan count 0, logical reads 0.
        Table 'Guesses'. Scan count 40, logical reads 125.
        Table 'Result'. Scan count 4, logical reads 8.
    
    -- Query 3
        CPU0
        Duration155
        Reads129
    
        Table 'Guesses'. Scan count 40, logical reads 125.
        Table 'Result'. Scan count 2, logical reads 4.
    
    -- Query 4
        CPU219
        Duration384
        Reads2,909
    
        Table 'Guesses'. Scan count 2, logical reads 2905.
        Table 'Result'. Scan count 2, logical reads 4.
    

    As you can see, the worktables are now eliminated for Query 3 and Query 4. And now Query 3 behaves better than Query 2. We’ll see the execution plans soon.

    Query 1 is still behaving real bad. The new indexes made no difference at all. Query 2 is still very good and Query 3 became even better. Not by much, but still better. Query 4 is still acceptable.

    execution plans before the two new indexes.

    execution plans after the two new indexes.

    As you can see, Query 3 and Query 4 uses a “bushy plan” look. See more here about bush plan as explained by MVP Itzik Ben-Gan, http://www.windowsitpro.com/article/database-administration/T-SQL-Deep-Dives-Creating-Queries-That-Work-and-Perform-Well/6.aspx

    So by adding a seemingly redundant piece of information you can help the Query Optimizer to make a better decision!

    How come adding the ‘not equal to filter’ made such a difference in performance for Query3 vs Query1?

    Posted on 10-08-2010 00:00 |
    Peso
    99 · 2% · 527

16  Answers  

Subscribe to Notifications
Previous 1 | 2 Next
  • Score
    6

    I believe the answer to this lies in the contradiction removal of the optimizer.

    Consider these two queries based on the same data

    SELECT  PeopleID,
    Low,
    High
    FROM    dbo.Guesses
    WHERE  High NOT IN(SELECT Value FROM dbo.Result)
    
    go
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE  High NOT IN(SELECT Value FROM dbo.Result WHERE Value <> High)
    

    With the Second (using Value <> High) , the execution plan contains no mention on the Result table at all.

    So if we break down the 'double-negative' of the predicate, we can see that the not in condition will always be True since the only value that could cause it to be true (HIGH) is explicitly not being returned.

    The rest is down to extrapolation.

    Replied on Oct 8 2010 4:29AM  . 
    dave ballantyne
    111 · 1% · 462
  • Score
    7

    Its magic... :)

    Query 1 and Query 3 are actually have different meanings... In general they will not return the same data. But in this context, they will give same results. Because we deleted dual hits........ By removing dual hits, we ensured that there will not be any rows in Guesses table, where both Low and High of a row exists in results table.

    In query1, the where condition is looks like this..

    ( Low IN(SELECT Value FROM dbo.Result) AND High NOT IN(SELECT Value FROM dbo.Result)) 
    OR
    (Low NOT IN (SELECT Value FROM dbo.Result) AND High IN(SELECT Value FROM dbo.Result))
    

    But this is not required, as we deleted dual hit rows. So, the below condition enough is going to be give us the required output..

    Low IN(SELECT Value FROM dbo.Result) or High IN(SELECT Value FROM dbo.Result)
    

    In query1, the other 2 conditions are making the query engine to believe that there are dual hit rows, its performing clustered index seek for all 4 conditions. By adding the 2dummy conditions, Value <> High and Value <> Low , which makes the below conditions useless....

    High NOT IN (SELECT Value FROM dbo.Result where Value <> High)
    Low NOTIN (SELECT Value FROM dbo.Result where Value <> Low)
    

    adding that 2 conditions , will makes the subquery always returns empty rows... so, the query will become like

    Low IN(SELECT Value FROM dbo.Result) or High IN(SELECT Value FROM dbo.Result)
    

    Optimizer here will understand that they are 2 dummy conditions and it will not evaluate that 2 conditions, which evaluates other 2 conditions only............... So, this query performs fast.

    In summary, Initiaally, out table has data in such a way that dual hits never going to occur. But in query1, we will try to check for dual hit rows, so, it evaluate for all 4 conditions.

    In query3, we gave the hint to optimizer by adding that 2 Not equal statements, inturn making them dummy, so, optimizer will evalualte 2 conditions only..

    This is Magic... But if we didn't delete the dual hit rows initially, The query1 and query3 will give different results.

    Now coming to question of handling nulls between Not In and Not Exists

    Basic difference is "Not In" Clause will exclude nulls from Source Table, while "Not exists" will include NULLs. One disadvantage of "Not in" clause is, if the table we are checking for values contains null values, it will not return any row as output.

    See this example:

    create table Vals(ID int)
    insert into Vals values(10),(20),(null)
    create table Lookups(ID int)
    insert into Lookups values (20),(30),(null)
    -- First query returns No results
    select * from Vals v where Id not in (select ID from Lookups)
    -- Second query returns 2 records
    select * from Vals v where not exists (select * from Lookups where ID = v.ID)
    

    The Reason for this behaviour is as follows.

    While evaluating "Not In " query, Engine will do the "Left Anti Semi Join", with Match predicate "SourceValue is null or DestinationValue is null or sourcevalue = destinationvalue"

    Converting into algorithm,(This is not the way query executed, just for understanding)

    foreach input in source table
    begin
     foreach input is destination table
     begin 
      declare @Counter int = 0 
      if(source input is null or destination input is null or source input = destination input)
      begin
       set @Counter = @Counter + 1;
       break;  // Go out of the outer for loop.
      end
     end
     if(@Counter = 0)
      // add the result to output rowset. 
    end
    

    Coming to execution,

    In First Iteration, 10 is compared with (20,30,null), as destination input is null, it will increases counter and go out.

    as the counter is 1, 10 will not be added to output resultset.

    In First Iteration, 20 is compared with (20,30,null), as 20 is matches with 20, it will increases counter and go out.

    as the counter is 1, 20 will not be added to output resultset.

    In First Iteration, null is compared with (20,30,null), as source input is null, it will increases counter and go out.

    as the counter is 1, null will not be added to output resultset.

    So, First query didn't return any results.

    Now coming to second query, main difference is Match predicate is only "source value = destination value", while follows the same Left Anti Semi join

    Converting into algorithm,(This is not the way query executed, just for understanding)

    foreach input in source table
    begin
     foreach input is destination table
     begin 
      declare @Counter int = 0 
      if(source input = destination input)
      begin
       set @Counter = @Counter + 1;
       break;  // Go out of the outer for loop.
      end
     end
     if(@Counter = 0)
      // add the result to output rowset. 
    end
    

    Coming to execution,

    In First Iteration, 10 is compared with (20,30,null), as No value is matches with 10, it will finish inner for loop and go out.

    as the counter is 0, 10 will be added to output resultset.

    In First Iteration, 20 is compared with (20,30,null), as 20 is matches with 20, it will increases counter and go out.

    as the counter is 1, 20 will not be added to output resultset.

    In First Iteration, null is compared with (20,30,null), as NULL is not equals to null(but remember null is null. but null is not

    equals null),it will finish inner for loop and go out.

    as the counter is 0, null will be added to output resultset.

    So, Second query returns 2 rows. (10 and null)

    Finally I will tell its our friend............ :) we are fooling it.............. (ofcourse, some times its fools us also). Friends always fool each other.. :)

    Replied on Oct 8 2010 5:17AM  . 
    Ramireddy
    2 · 41% · 12972
  • Score
    5

    How come adding the ‘not equal to filter’ made such a difference in performance for Query3 vs Query1?

    The very simple answer is that Query1 and Query3 are not equivalent. It is easy to be fast if you dont need to be correct...

    Query3 looks like this:

    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   (
                Low IN(SELECT Value FROM dbo.Result)
                AND 
                High NOT IN(SELECT Value FROM dbo.Result WHERE Value <> High)
            )
            OR
            (
                Low NOT IN(SELECT Value FROM dbo.Result WHERE Value <> Low)
                AND 
                High IN(SELECT Value FROM dbo.Result)
            )
    

    This is actually eqivalent to

    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   
            Low IN(SELECT Value FROM dbo.Result)
            OR
            High IN(SELECT Value FROM dbo.Result)
    

    which is a much easier query than Query1

    The only reason Query1 and Query3 returns the same data in this case is because you have deleted "duplicate hits" from the guesses table. If you remove the deletion of duplicate hits you will see clearly that Query1 and Query3 are not equivalent. The number of returned rows will be different.

    The interesting thing is that the optimizer is good enough to realize that the condition High NOT IN(SELECT Value FROM dbo.Result WHERE Value <> High) is always true and can be removed.

    To see that this is the case look at the inverted condition:

    High IN(SELECT Value FROM dbo.Result WHERE Value <> High)
    

    It is obvious that High can not be present in the set of values returned from Result WHERE Value<>High, so this condition is always false.

    This means that NOT High IN(SELECT Value FROM dbo.Result WHERE Value <> High) is always true.

    A better question is actually why Query2 is so much more efficient than Query1. These queries are completely equivalent so ideally they should result in the same execution plan.

    Replied on Oct 8 2010 11:07AM  . 
    Stefan_G
    22 · 9% · 2734
  • Score
    1

    Query 1 and 3 are not identical, since Q1 will exclude duplicate hits whereas D3 will include duplicates, if present.

    Since the two queries are not identical in terms of result set under various conditions (e.g. including duplicates) and just return the same data because of the forced removal of duplicates the difference is not caused by the ‘not equal to filter’ but rather by the easier query (no double check for duplicates).

    Replied on Oct 8 2010 3:41PM  . 
    lmu92
    0 · 0% · 0
  • Score
    3

    My best guess is that when you put in the additional clause inside the inner query, the “Not IN” clause need not run as inner query will never return any value that is equal to the outer predicate. Normally, “Not In” clause is very expensive as it has to completely run the inner query and then compare with the value of the outer query to see if that specific record should be returned or not. It is not very efficient set operation to get this process run one by one record.

    Replied on Oct 8 2010 3:45PM  . 
    rpathak
    299 · 0% · 145
  • Score
    5

    Explain the difference between NOT EXISTS and NOT IN regarding NULL values.

    NOT EXISTS will not be affected by the existance of NULL values in the result of the inner query because NOT EXISTS cares about the existance of at least one row that satisfy the conditions (equality in this case).

    On the other hand NOT IN will be affected by the existance of NULL values in the result of the inner query because NOT IN cares about assuring that none of the values that returned equal the outer value, and in case of existance of NULL value then we can't assure that, and the the result would be either false or unknown for all outer query rows, hence empty result.

    We can put that under test through this example:

    SELECT T1.value
      FROM (VALUES(1)) AS T1(value)
     WHERE value NOT IN(SELECT T2.value
                          FROM (VALUES(2), (NULL)) AS T2(value))
    
    SELECT T1.value
      FROM (VALUES(1)) AS T1(value)
     WHERE NOT EXISTS(SELECT T2.value
                        FROM (VALUES(2), (NULL)) AS T2(value)
                       WHERE T2.value = T1.value)
    

    The results confirm that the first query (NOT IN) returned empty result, while the the second query (NOT EXISTS) returned one row of value 1.

    How come adding the 'not equal to filter' made such a difference in performance for Query3 vs Query1?

    Adding the 'not equal to filter' to the inner query of NOT IN makes that inner query useless because it will not be possible to get a value that will affects on the result of the outer query. Even NULL value will not affects on the result. To but this under test we can modify the previous script to be:

    SELECT T1.value
      FROM (VALUES(1)) AS T1(value)
     WHERE value NOT IN(SELECT T2.value
                          FROM (VALUES(2), (NULL)) AS T2(value)
                         WHERE T2.value <> T1.value)
    

    The results confirm that adding the 'not equal to filter' changed the result from an empty result to one row result of 1 value (same result of NOT EXISTS) because NOT IN was removed by the optimizer.

    So as you said SQL Server Query Optimizer is a truly amazing piece of software.

    Replied on Oct 8 2010 5:22PM  . 
    Muhammad Al Pasha
    27 · 6% · 1920
  • Score
    9

    Question #2 (In answer screen): Explain the difference between NOT EXISTS and NOT IN regarding NULL values.

    Answering this first becuase the answer is shorter.

    NOT IN evaluates its correlation on the single column of the SELECT clause and additional optional coluns in its WHERE clause. NOT EXISTS evaluates correlation only in the subquery's WHERE clause and the SELECT clause is completely ignored. NULLs being spewed in the NOT IN()'s SELECT are evaluated as <>, and NULLs fail the check. With NOT EXISTS(), it don't matter what gets spewed in the SELECT because it its ignored, but the correlation in the WHERE will still be affected by NULLs (comparing with them yields 'UNKNOWN' which is not 'TRUE' or 'FALSE' that = or <> expects)

    Question #1 (Question details screen): How come adding the ‘not equal to filter’ made such a difference in performance for Query3 vs Query1?

    Query # 1 activated a short-cut code branch in the optimizer. It caused it to jumpto conclusions when it saw the valid use of a fully covering (clustered) fully distinct (primary key) index. This is actually a bad decision in this scenario. The code semantics in Query #3 spoiled that short-cut so the optimizer went into full optimization with that query and produced a better plan.

    Proof of the amount of effort spent in optimization can be obtained by running this and looking at the 'SQL Server parse and compile time' output above the '(## row(s) affected)'. Query #3's parse and compile time is triple that of Query #1.

    SET STATISTICS TIME ON
    
    DBCC FREEPROCCACHE
    GO
    
    --Query 1
    SELECT PeopleID, Low, High
    FROM dbo.Guesses
    WHERE
        (
            Low IN(SELECT Value FROM dbo.Result)
            AND High NOT IN(SELECT Value FROM dbo.Result)
        )
        OR (
            Low NOT IN(SELECT Value FROM dbo.Result)
            AND High IN(SELECT Value FROM dbo.Result)
        )
    OPTION (MAXDOP 1)
    
    DBCC FREEPROCCACHE
    GO
    --Query 3
    SELECT PeopleID, Low, High
    FROM dbo.Guesses
    WHERE
        (
            Low IN(SELECT Value FROM dbo.Result)
            AND High NOT IN(SELECT Value FROM dbo.Result WHERE Value <> High)
        )
        OR
        (
            Low NOT IN(SELECT Value FROM dbo.Result WHERE Value <> Low)
            AND High IN(SELECT Value FROM dbo.Result)
        )
    OPTION (MAXDOP 1)
    
    Replied on Oct 8 2010 6:15PM  . 
    Jesse Roberge
    81 · 2% · 707
  • Score
    8

    I added another version of the query and ran 5 of them first without adding extra indexes:

    set statistics io on
    --Query 1
    SELECT  PeopleID,
    Low,
    High
    FROM    dbo.Guesses
    WHERE   (
    Low IN(SELECT Value FROM dbo.Result)
    AND High NOT IN(SELECT Value FROM dbo.Result)
    )
    OR
    (
    Low NOT IN(SELECT Value FROM dbo.Result)
    AND High IN(SELECT Value FROM dbo.Result)
    )
    print 'Number of records Query1: ' + cast(@@Rowcount as varchar(20))
    --Query2
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   Low IN(SELECT Value FROM dbo.Result)
    AND High NOT IN(SELECT Value FROM dbo.Result)
    
    UNION
    
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   Low NOT IN(SELECT Value FROM dbo.Result)
    AND High IN(SELECT Value FROM dbo.Result)
    print 'Number of records Query2: ' + cast(@@Rowcount as varchar(20))
    --Query3
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   (
                Low IN(SELECT Value FROM dbo.Result)
    AND High NOT IN(SELECT Value FROM dbo.Result WHERE Value <> High)
    )
    OR
    (
                Low NOT IN(SELECT Value FROM dbo.Result WHERE Value <> Low)
    AND High IN(SELECT Value FROM dbo.Result)
    )
    print 'Number of records Query3: ' + cast(@@Rowcount as varchar(20))
    --Query 4
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   Low IN(SELECT Value FROM dbo.Result WHERE Value <> High)
    OR High IN(SELECT Value FROM dbo.Result WHERE Value <> Low)
    print 'Number of records Query4: ' + cast(@@Rowcount as varchar(20))
    -- Naomi's versions
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses G
    inner Join dbo.Result R
    on G.Low = R.Value and not exists (select 1 from dbo.Result R1 where G.High = R1.Value) 
    UNION ALL 
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses G
    inner Join dbo.Result R
    on G.High = R.Value and not exists (select 1 from dbo.Result R1 where G.Low = R1.Value)
    
    print 'Number of records Naomi''s version: ' + cast(@@Rowcount as varchar(20))
    

    And I got these results:

    Table 'Result'. Scan count 0, logical reads 2994632, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Guesses'. Scan count 9, logical reads 1574, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Number of records Query1: 1986
    Table 'Result'. Scan count 4, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Guesses'. Scan count 18, logical reads 3148, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Number of records Query2: 1986
    Table 'Result'. Scan count 2, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Guesses'. Scan count 18, logical reads 3094, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Number of records Query3: 1987
    Table 'Result'. Scan count 2, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Guesses'. Scan count 18, logical reads 3094, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Number of records Query4: 1987
    Table 'Result'. Scan count 4, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Guesses'. Scan count 18, logical reads 3148, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Number of records Naomi's version: 1986
    

    As you see from the above, query 3 and 4 returned wrong number of records (I commented out the dual hits removal), so they should not be even considered - they are not equivalent of the original query. So, I eliminated queries 3 and 4 from the consideration (as not valid) and ran another test:

    set statistics io on
    --Query 1
    
    SELECT  PeopleID,
    Low,
    High
    FROM    dbo.Guesses
    WHERE   (
    Low IN(SELECT Value FROM dbo.Result)
    AND High NOT IN(SELECT Value FROM dbo.Result)
    )
    OR
    (
    Low NOT IN(SELECT Value FROM dbo.Result)
    AND High IN(SELECT Value FROM dbo.Result)
    )
    print 'Number of records Query1: ' + cast(@@Rowcount as varchar(20))
    --Query2
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   Low IN(SELECT Value FROM dbo.Result)
    AND High NOT IN(SELECT Value FROM dbo.Result)
    
    UNION
    
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   Low NOT IN(SELECT Value FROM dbo.Result)
    AND High IN(SELECT Value FROM dbo.Result)
    print 'Number of records Query2: ' + cast(@@Rowcount as varchar(20))
    /*
    --Query3
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   (
                Low IN(SELECT Value FROM dbo.Result)
    AND High NOT IN(SELECT Value FROM dbo.Result WHERE Value <> High)
    )
    OR
    (
                Low NOT IN(SELECT Value FROM dbo.Result WHERE Value <> Low)
    AND High IN(SELECT Value FROM dbo.Result)
    )
    print 'Number of records Query3: ' + cast(@@Rowcount as varchar(20))
    --Query 4
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   Low IN(SELECT Value FROM dbo.Result WHERE Value <> High)
    OR High IN(SELECT Value FROM dbo.Result WHERE Value <> Low)
    print 'Number of records Query4: ' + cast(@@Rowcount as varchar(20))
    */
    -- Naomi's versions
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses G
    inner Join dbo.Result R
    on G.Low = R.Value and not exists (select 1 from dbo.Result R1 where G.High = R1.Value) 
    UNION ALL 
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses G
    inner Join dbo.Result R
    on G.High = R.Value and not exists (select 1 from dbo.Result R1 where G.Low = R1.Value)
    print 'Number of records Naomi''s version: ' + cast(@@Rowcount as varchar(20))
    

    And I got the following results.

    Table 'Result'. Scan count 0, logical reads 2994632, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Guesses'. Scan count 9, logical reads 1574, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Number of records Query1: 1986
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Guesses'. Scan count 40, logical reads 126, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Result'. Scan count 4, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Number of records Query2: 1986
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Guesses'. Scan count 40, logical reads 126, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Result'. Scan count 4, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Number of records Naomi's version: 1986
    

    As we see, the original query still performed very poor, but the other 2 queries benefited from the new indexes greatly.

    To get the idea which version (version 2 or my new version) performed better, I ran only last 2 queries: and got 55% vs. 45% result (so my query performed slightly better).

    Finally, there is no need to use UNION since we can use UNION ALL in the second query. I re-wrote the query to use UNION ALL and also added an exists version and now I got identical results for all 3 queries:

    --Query 1
    
    SELECT  PeopleID,
    Low,
    High
    FROM    dbo.Guesses
    WHERE   (
    Low IN(SELECT Value FROM dbo.Result)
    AND High NOT IN(SELECT Value FROM dbo.Result)
    )
    OR
    (
    Low NOT IN(SELECT Value FROM dbo.Result)
    AND High IN(SELECT Value FROM dbo.Result)
    )
    print 'Number of records Query1: ' + cast(@@Rowcount as varchar(20))
    --Query2
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   Low IN(SELECT Value FROM dbo.Result)
    AND High NOT IN(SELECT Value FROM dbo.Result)
    
    UNION ALL
    
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   Low NOT IN(SELECT Value FROM dbo.Result)
    AND High IN(SELECT Value FROM dbo.Result)
    print 'Number of records Query2: ' + cast(@@Rowcount as varchar(20))
    /*
    --Query3
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   (
                Low IN(SELECT Value FROM dbo.Result)
    AND High NOT IN(SELECT Value FROM dbo.Result WHERE Value <> High)
    )
    OR
    (
                Low NOT IN(SELECT Value FROM dbo.Result WHERE Value <> Low)
    AND High IN(SELECT Value FROM dbo.Result)
    )
    print 'Number of records Query3: ' + cast(@@Rowcount as varchar(20))
    --Query 4
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   Low IN(SELECT Value FROM dbo.Result WHERE Value <> High)
    OR High IN(SELECT Value FROM dbo.Result WHERE Value <> Low)
    print 'Number of records Query4: ' + cast(@@Rowcount as varchar(20))
    */
    -- Naomi's versions
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses G
    inner Join dbo.Result R
    on G.Low = R.Value and not exists (select 1 from dbo.Result R1 where G.High = R1.Value) 
    UNION ALL 
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses G
    inner Join dbo.Result R
    on G.High = R.Value and not exists (select 1 from dbo.Result R1 where G.Low = R1.Value)
    print 'Number of records Naomi''s version: ' + cast(@@Rowcount as varchar(20))
    
    
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   exists (select 1 FROM dbo.Result where Value = Guesses.Low)
    AND not exists(SELECT 1 FROM dbo.Result where Value = Guesses.High )
    
    UNION ALL
    
    SELECT  PeopleID,
            Low,
            High
    FROM    dbo.Guesses
    WHERE   exists (select 1 FROM dbo.Result where Value = Guesses.High)
    AND not exists(SELECT 1 FROM dbo.Result where Value = Guesses.Low )
    print 'Number of records Query5: ' + cast(@@Rowcount as varchar(20))
    

    So, the important lesson we learned from this exercise - always analyze results, always try different variations of the query, always compare the resulting output and verify number of records. Finally, don't use OR with NOT IN as it doesn't perform good.

    Replied on Oct 10 2010 1:31AM  . 
    Naomi
    33 · 6% · 1774
  • Score
    7

    Explain the difference between NOT EXISTS and NOT IN regarding NULL values.

    Quoting Books Online:

    Any null values returned by subquery or expression that are compared to test_expression using IN or NOT IN return UNKNOWN. Using null values in together with IN or NOT IN can produce unexpected results.

    In contrast, NOT EXISTS will return TRUE when evaluating NULL. Consider the following queries:

    select 1 where exists(select null)
    select 1 where not exists(select null)
    

    The first returns a result (as EXISTS evaluates to TRUE) whereas the second does not (for the same reason). You could almost go so far as to say that they behave exactly as exact opposites as it relates to the handling of NULL.

    How come adding the ‘not equal to filter’ made such a difference in performance for Query3 vs Query1?

    From where I sit, it's not the fact of the 'not equal to filter' that makes the difference in Query3; rather, it’s that the evaluating values come from the outer query - making this a correlated subquery. By definition, this means that it is "evaluated by executing the subquery once and substituting the resulting value or values into the WHERE clause of the outer query" (Books Online). In the execution plan, this manifests itself as a single table scan of Guesses followed by a hash match with the Results table (it actually does this twice, once each for high and low respectively). By comparison, Query1 performs an index seek on Guesses followed by a nested loop . . . each executing ~500K times (and it does this 4 times over in the entirety of the plan). The cost of these operators executing this number of times ends up being about 52X more expensive than the table scans of Query3.

    Replied on Oct 10 2010 9:45PM  . 
    Scott Epperly
    157 · 1% · 301
  • Score
    7

    Very Interesting Question. SQL Server Query Optimizer it really helps us in performance tuning. I do agree to the fact that it’s an excellent tool.

    NOT EXISTS vs NOT IN regarding NULL values

    • "NOT IN" Clause will exclude NULLs from Source Table, in contrast to it "Not exists" will include NULLs.
    • If the source table we are checking for values contains null values, "NOT IN " will not return any rows as output. This need to be understood while working with null values and NOT IN.
    • NOT EXISTS will return TRUE when dealing with NULL values

    How come adding the ‘not equal to filter’ made such a difference in performance for Query3 vs Query1?

    It is not due to adding 'not equal to filter' that makes the performance difference in Query3. The values that come from the outer query made this as correlated sub query. Hence In the execution plan, it has as a single table scan with the Results table. The cost of the operators executing in Query1 (as per Query Execution Plan assessment) is more expensive than the table scans of Query3. This is due to Query1 performs an index seek on query estimation that is taking more time to execute.

    I can conclude that it’s our friend. We need to understand our friend and go into his shoes. If you do not understand or communicate your friend or partner it becomes a foe.

    Replied on Oct 11 2010 3:27AM  . 
    Vamshi
    131 · 1% · 376
Previous 1 | 2 Next

Your Answer


Sign Up or Login to post an answer.
Please note that your answer will not be considered as part of the contest because the competition is over. You can still post an answer to share your knowledge with the community.

Copyright © Rivera Informatic Private Ltd.