This question was asked by Erland sommarskog in SQL Server Quiz 2010, conducted by BeyondRelational.com.
Suppose, create a table like below,
CREATE TABLE quiz (
ident INT IDENTITY PRIMARY KEY,
colname1 VARCHAR(127) COLLATE Latin1_General_CI_AS NOT NULL,
colname2 VARCHAR(127) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
tblname VARCHAR(127) NOT NULL,
filler CHAR(100) NOT NULL DEFAULT ' '
INSERT quiz(colname1, colname2, tblname)
FROM sys.columns a
CROSS JOIN sys.columns b
CREATE INDEX name1_ix ON quiz(colname1)
CREATE INDEX name2_ix ON quiz(colname2)
Now run the below 2 queries and check the performance difference. Fourth query will bombed compared to other queries....... What is the reason behind that??
SELECT COUNT(DISTINCT tblname) FROM quiz WHERE colname1 = 'object_id'
SELECT COUNT(DISTINCT tblname) FROM quiz WHERE colname2 = 'object_id'
SELECT COUNT(DISTINCT tblname) FROM quiz WHERE colname1 = N'object_id'
SELECT COUNT(DISTINCT tblname) FROM quiz WHERE colname2 = N'object_id'
Checking execution plans reveals, Fourth query is using clustered index scan, while the others are using index seeks. Microsoft connect for this issue is below.
But their reply is not up to mark. They attributed this behaviour to size. Their reasoning is as 8-bit string compared to 16-bit string(Unicode input), it needs to scan. But however,3rd query using seek, which is also an 8-bit string compared to 16-bit string. So, its not about size. Its about collations.
However, Microsoft documented the below point in one of their articles.
if you are using the SQL collation "SQL_Latin1_General_CP1_CI_AS", the non-Unicode string 'a-c' is less than the string 'ab' because the hyphen ("-") is sorted as a separate character that comes before "b". However, if you convert these strings to Unicode and you perform the same comparison, the Unicode string N'a-c' is considered to be greater than N'ab' because the Unicode sorting rules use a "word sort" that ignores the hyphen.
Point to note here is SQL Collations, when they are comparing Unicode strings, they will ignore the special characters.
To find out the reason why the index scan is happening, here we need to understand how indexes will work.
Suppose, a table consists of an Integer column with 1 to 80 numbers, When an index is created on that column, index will creates all the Numbers in ascending order (Descending order, if explicitly specified). it will be organized like below.
suppose we are searching for 45, First the optimizer, will visit Page in Level1 and reads the entires, as Page 3 starts with 41 and Page 4 starts with 61, it will go to Page 3 and searches the records until it finds 45 or finds any number which is greater than that number. If it finds 45, it will show that record, otherwise, it will indicate as record doesn’t exists.
Now coming to our Quiz table, it also has an index on both ColName1 and ColName2, assume you inserted 3 records with ColName2 values ”a-a”,”a--b”,”a---c”
As this column has an index, these will be arranged in ascending order. As “-“ comes before characters such as a,b,c, these will be arranged in below order
Suppose, you are searching for the Non-Unicode word “a--b”, Optimizer, will first examine “a---c”. Since, “a--b” is greater than “a---c”, it will go to next record. Next record is “a--b”, so it compare that record and as both are same, it will mark it as successful.
Now If you are searching for the Unicode word N’a--b’, Optimizer, will first examine the word “a---c” and compare with input N’a--b’. Here the datatype is Unicode and column is of SQL collation, it will apply the rule, “if any special characters are there, ignore those.” So, the comparision now becomes “ac” and “ab”. As “ac” is greater than “ab”, It will stop the procedure and mark it as unsuccessful search. But actually the record exists. So, If it follows the seek procedure, it will not get the accurate results. So, Unicode word search on SQL collation column, always results in index scan.