Friday, September 9, 2011

SQL Avoiding Subquery Using OVER()

I came across a cool way to avoid using subqueries when you need to get different aggregates in the same result set. Typically, you use a subquery with the aggregate you want, and then select that value for the other aggregates you want to work with. While this doesn't really cause any problems, sometimes it clutters up the code, and as it turns out, can be slightly more costly than other methods.

The code below shows what I mean. I have an example of how you might think to go about doing it, which doesn't work, a method that utilizes a subquery, and a nifty trick you can do using an over() clause with the aggregates.


if OBJECT_ID('tempdb.dbo.#test') > 0 drop table #test
create table #test
(
value int
)

insert into #test (value)
SELECT 1
union all
SELECT 1
union all
SELECT 2
union all
SELECT 5

SELECT * FROM #test



Scenario: we want to see what percent of the time a given value occurs in a set. To do this, we want to divide the count of each value by the count of total values. In the above test data, 1 occurs 50% of the time (2/4) whereas 2 and 5 occur 25% of the time (1/4, 1/4). This is a pretty simple scenario to solve, but there are a few ways to go about doing it. Pay special attention to the [TotalCount] column (aka the denominator of the [Percent] column.


--This query is in valid because the aggregates aren't part of the GROUP BY
SELECT
[Count] = COUNT(1),
[TotalCount] = SUM(COUNT(1)),
[Percent] = COUNT(1) * 100.0 / SUM(COUNT(1))
FROM #test
GROUP BY value

--Subquery option
--This is the way I initially thought of doing it
SELECT
[Count] = COUNT(1),
[TotalCount] = (SELECT COUNT(1) FROM #test),
[Percent] = COUNT(1) * 100.0 / (SELECT COUNT(1) FROM #test)
FROM #test
GROUP BY value

--OVER() option
--This is the cool way that I saw a co-worker do today
--by using an empty OVER() statement, you can avoid using the subquery and actually save some performance costs
SELECT
[Count] = COUNT(1),
[TotalCount] = SUM(count(1)) OVER(),
[Percent] = COUNT(1) * 100.0 / SUM(COUNT(1)) OVER()
FROM #test
GROUP BY value


The over() clause allows you to apply the aggregate over a different window as you see fit, which in this case ends up being over the entire set. The end result is the same but with much less code, and faster.



The relative cost of the subquery method is 59% compared to 41% using the over() clause, which is pretty significant. Toy around with the over() clause and using different aggregate functions. This primarily works because you have to hit #test once for the overall query, and again for the subquery. The method with the over() clause only hits the #test table once, and most of the cost comes from a sort performed by the over(). You can generate some useful figures to work with without having to really delve into any subqueries.

No comments: