We can use many aggregate functions with the OVER clause, however, we cannot use Distinct Counts in the OVER clause. I had a problem in which i have to count distinct records based on different groupings on columns.
Let me illustrate the problem. I have a table named @student as shown below:
DECLARE @student TABLE(Gender CHAR(1),Class char(3),AttDt int, StudentId int)
INSERT INTO @student
VALUES ('M','I',201206,1),
('M','I',201205,1),
('M','II',201205,1),
('M','I',201206,2),
('F','II',201206,2),
('F','II',201205,3),
('F','I',201205,4),
('F','I',201206,4),
('F','I',201206,5)
SELECT *
FROM @student
ORDER BY Gender,Class,AttDt
Gender Class AttDt StudentId
-----------------------------------
F I 201205 4
F I 201206 4
F I 201206 5
F II 201205 3
F II 201206 2
M I 201205 1
M I 201206 1
M I 201206 2
M II 201205 1
I have to count distinct student id's based on different groupings of columns such as
- Gender
- Gender, Class
- Gender, Class, AttDt
The normal approach to this problem will be using group by for each set of groupings and then joining the results as shown in the T-sql query below:
;WITH CTE_student as
(
SELECT *
FROM @student
),
Cte_Gender as
(
SELECT Gender, COUNT(DISTINCT Studentid) Studentid
FROM cte_student
GROUP BY Gender
),
Cte_GenderClass as
(
SELECT Gender,Class,COUNT(DISTINCT Studentid) Studentid
FROM cte_student
GROUP BY Gender,Class
),
cte_GenderClassDt as
(
SELECT Gender,Class,AttDt,COUNT(DISTINCT Studentid) Studentid
FROM cte_student
GROUP BY Gender,Class,AttDt
)
SELECT t1.Gender,
t1.Class,
t1.AttDt,
t1.Studentid,
t1.studentid GenderClassDtcnt,
t2.studentid GenderClasscnt,
t3.studentid Gendercnt
FROM cte_GenderClassDt t1
LEFT JOIN cte_GenderClass t2
ON t1.Gender = t2.Gender and t1.Class = t2.Class
LEFT JOIN cte_Gender t3
ON t1.Gender = t3.Gender
ORDER BY t1.Gender,t1.Class,t1.AttDt
Here, in this query as the grouping sets grow, the query will grow and will be difficult to maintain. I overcome this by using Aggregate function with OVER clause. Since Count Distinct cannot be used directly with OVER clause, i have taken the rank first and then selected the MAX Id from the set. The solution query is shown below:
SELECT Gender,
Class,
AttDt,
StudentId,
MAX(RankbyGender) OVER(PARTITION BY Gender) Gendercnt,
MAX(RankbyClassGender) OVER(PARTITION BY Gender,Class) GenderClasscnt,
MAX(RankbyClassGenderDt) OVER(PARTITION BY Gender,Class,AttDt ) GenderClassDtcnt
FROM
(
SELECT *,
RankbyGender = DENSE_RANK() OVER(PARTITION BY Gender ORDER BY StudentID) ,
RankbyClassGender = DENSE_RANK() OVER(PARTITION BY Gender,Class ORDER BY StudentID) ,
RankbyClassGenderDt = DENSE_RANK() OVER(PARTITION BY Gender,Class,AttDt ORDER BY StudentID)
FROM @student
)t
ORDER BY Gender,Class,AttDt
The output of the query is:
Gender Class AttDt StudentId Gendercnt GenderClasscnt GenderClassDtcnt
--------------------------------------------------------------------------------------------
F I 201205 4 4 2 1
F I 201206 4 4 2 2
F I 201206 5 4 2 2
F II 201205 3 4 2 1
F II 201206 2 4 2 1
M I 201205 1 2 2 1
M I 201206 1 2 2 2
M I 201206 2 2 2 2
M II 201205 1 2 1 1
This problem can be solved by many other ways. I have demonstrated a simple way which has helped me to reduce the size of my stored procedure. Please let me know your feedback on this post, and if you have used any other way of handling this problem please post your comment on it.