Getting Started with Adobe After Effects - Part 6: Motion Blur


Upload Image Close it
Select File

I have around 6+ years of experience in SQL Server development and in Business Intelligence area. I am working as Datawarehouse developer. My blog space includes all technical areas related to SQL Server and MSBI (SSIS, SSRS and SSAS).
Browse by Tags · View All
SSIS 7
xml 3
dynamic 3
Union All 3
SSRS 2
Data flow task 2
Buffer 2
Time 2
Date 2
varchar(max) 1

Archive · View All
February 2013 3
September 2009 3
June 2013 2
January 2013 2
November 2012 2
July 2012 2
July 2009 2
April 2013 1
December 2012 1
February 2010 1

Distinct counts using Aggregate functions with OVER clause

Jul 6 2012 12:00AM by Divya Agrawal Shah   

We can use many aggregate functions with the OVER clause, however, we cannot use Distinct Counts in the OVER clause. I had a problem in which i have to count distinct records based on different groupings on columns.

Let me illustrate the problem. I have a table named @student as shown below:

DECLARE @student TABLE(Gender CHAR(1),Class char(3),AttDt int, StudentId int)

INSERT INTO @student
VALUES ('M','I',201206,1),
('M','I',201205,1),
('M','II',201205,1),
('M','I',201206,2),
('F','II',201206,2),
('F','II',201205,3),
('F','I',201205,4),
('F','I',201206,4),
('F','I',201206,5)

SELECT * 
FROM @student
ORDER BY Gender,Class,AttDt

Gender	Class	AttDt	StudentId
-----------------------------------
F	I  	201205	4
F	I  	201206	4
F	I  	201206	5
F	II 	201205	3
F	II 	201206	2
M	I  	201205	1
M	I  	201206	1
M	I  	201206	2
M	II 	201205	1

I have to count distinct student id's based on different groupings of columns such as

  • Gender
  • Gender, Class
  • Gender, Class, AttDt

The normal approach to this problem will be using group by for each set of groupings and then joining the results as shown in the T-sql query below:

;WITH CTE_student as
(
	SELECT *                                                                    
	FROM @student
),
Cte_Gender as
(
	SELECT Gender, COUNT(DISTINCT Studentid) Studentid
	FROM cte_student
	GROUP BY Gender
),
Cte_GenderClass as
(
	SELECT Gender,Class,COUNT(DISTINCT Studentid) Studentid
	FROM cte_student
	GROUP BY Gender,Class
),
cte_GenderClassDt as
(
	SELECT Gender,Class,AttDt,COUNT(DISTINCT Studentid) Studentid
	FROM cte_student
	GROUP BY Gender,Class,AttDt
)
SELECT	t1.Gender,
		t1.Class,
		t1.AttDt,
                t1.Studentid,
		t1.studentid GenderClassDtcnt,
		t2.studentid GenderClasscnt,
		t3.studentid Gendercnt
FROM cte_GenderClassDt t1
LEFT JOIN cte_GenderClass t2 
    ON t1.Gender = t2.Gender and t1.Class = t2.Class
LEFT JOIN cte_Gender  t3 	
    ON t1.Gender = t3.Gender	
ORDER BY t1.Gender,t1.Class,t1.AttDt

Here, in this query as the grouping sets grow, the query will grow and will be difficult to maintain. I overcome this by using Aggregate function with OVER clause.  Since Count Distinct cannot be used directly with OVER clause, i have taken the rank first and then selected the MAX Id from the set. The solution query is shown below:

SELECT Gender,
		Class,
		AttDt,
		StudentId,
		MAX(RankbyGender) OVER(PARTITION BY Gender) Gendercnt,
		MAX(RankbyClassGender) OVER(PARTITION BY Gender,Class) GenderClasscnt,
		MAX(RankbyClassGenderDt) OVER(PARTITION BY Gender,Class,AttDt ) GenderClassDtcnt
 FROM
 (
 SELECT *,
		RankbyGender = DENSE_RANK() OVER(PARTITION BY Gender ORDER BY StudentID) ,   
		RankbyClassGender = DENSE_RANK() OVER(PARTITION BY Gender,Class ORDER BY StudentID) ,                                                                       
		RankbyClassGenderDt = DENSE_RANK() OVER(PARTITION BY Gender,Class,AttDt ORDER BY StudentID)                                                                        
 FROM @student
 )t
 ORDER BY Gender,Class,AttDt

The output of the query is:

Gender	Class	AttDt	StudentId	Gendercnt	GenderClasscnt	GenderClassDtcnt
--------------------------------------------------------------------------------------------
F	I  	201205	4	4	2	1
F	I  	201206	4	4	2	2
F	I  	201206	5	4	2	2
F	II 	201205	3	4	2	1
F	II 	201206	2	4	2	1
M	I  	201205	1	2	2	1
M	I  	201206	1	2	2	2
M	I  	201206	2	2	2	2
M	II 	201205	1	2	1	1
This problem can be solved by many other ways. I have demonstrated a simple way which has helped me to reduce the size of my stored procedure. Please let me know your feedback on this post, and if you have used any other way of handling this problem please post your comment on it.

Tags: Aggregate function, OVER clause, Distinct count, groups, group by


Divya Agrawal Shah
89 · 2% · 629
11
 
0
Lifesaver
 
 
 
0
Incorrect



Submit

13  Comments  

  • Welcome back to blogging! :-)

    commented on Jul 6 2012 6:00AM
    Jacob Sebastian
    1 · 100% · 32235
  • Thanks sir, will try to remain active now!

    commented on Jul 8 2012 11:25PM
    Divya Agrawal Shah
    89 · 2% · 629
  • Thanks Joe for the information. I think if you have the clustered index on the column, on which we are ordering for DENSE Rank, the performance will improve a lot. I have tested the query for large resultset, but i did not find any duplicate records concern. Make sure, you have a unique set of combination for columns in partition and column in order by clause. Let me know if i can help you.

    commented on Jul 10 2012 11:47PM
    Divya Agrawal Shah
    89 · 2% · 629
  • Excellent artice. I didn't try this out, but can we expect better performance with aggregate functions? Keep writing!

    commented on Jul 12 2012 11:38PM
    renmathew
    2824 · 0% · 4
  • Thanks Renju.

    commented on Jul 12 2012 11:46PM
    Divya Agrawal Shah
    89 · 2% · 629
  • very informative article..thanks for sharing...

    commented on Jul 13 2012 2:14AM
    adamgorge
    147 · 1% · 340
  • Hi Divya

    There's a more efficient pattern for dealing with the lack of support for DISTINCT within window functions. In your scenario it is dependent upon the grain of the data so, before I post it, can you explain your sample data a little more? The student with id 2 appears in the sample set twice, once as a male and once as a female. Is this intended?

    commented on Jul 16 2012 3:05AM
    a.diniz
    316 · 0% · 139
  • Yes @@a.diniz. Basically i want the distinct count of students, with different groupings on columns.

    commented on Jul 16 2012 3:30AM
    Divya Agrawal Shah
    89 · 2% · 629
  • Understood, but what is a distinct student? StudentId = 2 has a record for both a male and a female implying that StudentId does not logically relate to a single person. Anyway, putting aside the particulars of the sample data, here's my basic algorithm for handling DISTINCT within window functions:

     ;WITH cte AS (
         SELECT 
            PartitionColumn,
            TargetForDistinctColumn,
            distinctFlag = CASE WHEN ROW_NUMBER() 
                                     OVER(PARTITION BY PartitionColumn, TargetForDistinctColumn
                                          ORDER BY TargetForDistinctColumn) = 1
                                THEN 1 END         
         FROM table
      )
      SELECT
         PartitionColumn,
         TargetForDistinctColumn,
         COUNT(distinctFlag) OVER(PARTITION BY PartitionColumn)
      FROM cte
    

    The idea is that if a value x is distinct within a partition {p}, x is the only member of the sub-partition {p, x}. For counting distinct values of x, we can therefore ignore all but the first instance of x within a partition {p, x} (assuming {p, x} is sorted by ascending values of x). The count of 'unignored' instances then yields the DISTINCT COUNT.

    This algorithm produces an almost identical execution plan to Divya's. Note however that the DENSERANK iterator is slighlty more expensive than ROWNUMBER so across datasets with larger partition sizes, the COUNT(ROWNUMBER()=1) algorithm will have the edge over the MAX(DENSERANK()) one.

    commented on Jul 16 2012 5:46AM
    a.diniz
    316 · 0% · 139
  • Erm... Apologies all. While the algorithm is sound, it actually performs a little worse than Divya's MAX(DENSERANK()) algorithm when the data volumes increase! What I said about the cost of ROWNUMBER vs DENSE_RANK is true but introducing a sub-partition purely to identify distinct values causes an additional SORT operator within the plan. It is the overhead of this sort operator which makes my algorithm more expensive.

    commented on Jul 16 2012 10:37AM
    a.diniz
    316 · 0% · 139
  • Interestingly, while the code is more verbose, the traditional CTE solution is dramatically more efficient than both my (rubbish) solution and Divya's nifty MAX(DENSE_RANK) one. And what's more, the traditional solution becomes increasingly more efficient with the number of grouping sets (provided you are able to create a indexes on the base table).

    The reason for this is that, under the hood, window functions require sorted input (which is why you typically see SORT operators next to the SEGMENT / SEQUENCE PROJECT operators in the execution plan when a window function is used). In our scenario, each of the 3 DENSE_RANK windows impose a different sort order over the same resultset. This results in 3 (costly) SORT operators. We could make the base table a clustered index ordered by Gender, Class, AttDt and StudentId and recompile the plan to remove one SORT operator but we can't remove them all. This is because in our query, the window functions are operating over the same resultset (which, therefore, can have only one order) and make use of just one index.

    The traditional CTE solution, on the other hand, can utilise several appropriately defined indexes as each CTE defines its own resltset.

    In tests against a student table of ~180,000 records, I've found the cost of the CTE solution to be ~10% that of the MAX(DENSERANK()). CPU times: 625ms (for CTE) vs 3900ms (for MAX(DENSERANK()) Logical Reads: 1400 (for CTE) vs 731,000 (for MAX(DENSE_RANK())

    Obviously, this is at the cost of maintaining the required indexes...

    commented on Jul 16 2012 11:47AM
    a.diniz
    316 · 0% · 139
  • Interestingly, while the code is more verbose, the traditional CTE solution is dramatically more efficient than both my (rubbish) solution and Divya's nifty MAX(DENSE_RANK) one. And what's more, the traditional solution becomes increasingly more efficient with the number of grouping sets (provided you are able to create a indexes on the base table).

    The reason for this is that, under the hood, window functions require sorted input (which is why you typically see SORT operators next to the SEGMENT / SEQUENCE PROJECT operators in the execution plan when a window function is used). In our scenario, each of the 3 DENSE_RANK windows impose a different sort order over the same resultset. This results in 3 (costly) SORT operators. We could make the base table a clustered index ordered by Gender, Class, AttDt and StudentId and recompile the plan to remove one SORT operator but we can't remove them all. This is because in our query, the window functions are operating over the same resultset (which, therefore, can have only one order) and make use of just one index.

    The traditional CTE solution, on the other hand, can utilise several appropriately defined indexes as each CTE defines its own resltset.

    In tests against a student table of ~180,000 records, I've found the cost of the CTE solution to be ~10% that of the MAX(DENSERANK()). CPU times: 625ms (for CTE) vs 3900ms (for MAX(DENSERANK()) Logical Reads: 1400 (for CTE) vs 731,000 (for MAX(DENSE_RANK())

    commented on Jul 17 2012 2:53AM
    a.diniz
    316 · 0% · 139
  • Thanks for comparing the solutions.

    commented on Jul 17 2012 3:25AM
    Divya Agrawal Shah
    89 · 2% · 629

Your Comment


Sign Up or Login to post a comment.

"Distinct counts using Aggregate functions with OVER clause" rated 5 out of 5 by 11 readers
Distinct counts using Aggregate functions with OVER clause , 5.0 out of 5 based on 11 ratings
    Copyright © Rivera Informatic Private Ltd Contact us      Privacy Policy      Terms of use      Report Abuse      Advertising      [ZULU1097]