I have posted a blog post about Different ways to find DISTINCT values where I have shown six different ways to find DISTINCT values. Here is the analysis of how long each method takes.
Create a temporary table named #sales_details
create table #sales_details(item_id int not null)
Populate one million data into this table
insert into #sales_details (item_id)
select
top 1000000
checksum(newid())%10000
from
sys.objects as so1 cross join
sys.objects as so2 cross join
sys.objects as so3 cross join
sys.objects as so4
Test various methods to find DISTINCT values
SET STATISTICS TIME ON
print '======================================================'
Print 'Method 1 starts'
print '======================================================'
select distinct item_id from #sales_details
print '======================================================'
Print 'Method 2 starts'
print '======================================================'
select item_id from #sales_details
Group by item_id
print '======================================================'
Print 'Method 3 starts'
print '======================================================'
select item_id from #sales_details
UNION
select item_id from #sales_details
print '======================================================'
Print 'Method 4 starts'
print '======================================================'
select item_id from #sales_details
UNION
select item_id from #sales_details where 1=0
print '======================================================'
Print 'Method 5 starts'
print '======================================================'
select item_id from #sales_details
UNION
select 0 where 1=0
print '======================================================'
Print 'Method 6 starts'
print '======================================================'
select item_id from
(
select row_number() over (partition by item_id order by item_id) as sno,* from #sales_details
) as t
where sno=1
print '======================================================'
Print 'Method 7 starts' Pointed by Dwain camps
print '======================================================'
;WITH RecursiveCTE
AS (
SELECT data = MIN(T.item_id)
FROM #sales_details T
UNION ALL
SELECT R.item_id
FROM (
-- A cunning way to use TOP in the recursive part of a CTE :)
SELECT T.item_id,
rn = ROW_NUMBER() OVER (ORDER BY T.item_id)
FROM #sales_details T
JOIN RecursiveCTE R
ON R.data < T.item_id
) R
WHERE R.rn = 1
)
SELECT *
FROM RecursiveCTE
OPTION (MAXRECURSION 0);
SET STATISTICS TIME OFF
Here is the output shown in tabular format
| Method # |
Query |
CPU Time |
Elapsed Time |
| 1 |
select distinct item_id from #sales_details |
421 |
531 |
| 2 |
select item_id from #sales_details Group by item_id |
312 |
462 |
| 3 |
select item_id from #sales_details UNION select item_id from #sales_details |
749 |
1125 |
| 4 |
select item_id from #sales_details UNION select item_id from #sales_details where 1=0 |
405 |
518 |
| 5 |
select item_id from #sales_details UNION select 0 where 1=0 |
328 |
460 |
| 6 |
select item_id from ( select row_number() over (partition by item_id order by item_id) as sno,* from #sales_details ) as t where sno=1 |
343 |
398 |
| 7 |
Paul White's CTE method
;WITH RecursiveCTE AS ( SELECT data = MIN(T.item_id) FROM #sales_details T UNION ALL SELECT R.item_id FROM ( -- A cunning way to use TOP in the recursive part of a CTE :) SELECT T.item_id, rn = ROW_NUMBER() OVER (ORDER BY T.item_id) FROM #sales_details T JOIN RecursiveCTE R ON R.data < T.item_id ) R WHERE R.rn = 1 ) SELECT * FROM RecursiveCTE OPTION (MAXRECURSION 0); |
468 with non clustered index
765 with clustered index
0 with clustered index with data ranges between 0 and 99 |
527 with non clustered index
2231 with clustered index
22 with clustered index with data ranges between 0 and 99
|
The performance of the method 7 that involves in recursive CTE depends on the index as well as the nature of the data. Without index, it is running for several minutes. With index it is running fast. But if there are lesser number of distinct values, it runs very fast. For example if we use checksum(newid())%100 in place of checksum(newid())%10000, method 7 outperform every other methods