T-SQL analytic functions in SQL Server can help solve problems quickly
Like other mainstream commercial database systems, SQL Server supports analytic functions in Transact-SQL
to depict complex analytical tasks. With the help of these analytic functions, we can
Premium Access
Register now for unlimited access to our premium content across our network of over 70 information Technology web sites.
By submitting you agree to receive email from TechTarget and its partners. If you reside outside of the United States, you consent to having your personal data transferred to and processed in the United States.
Privacy
This was first published in September 2012
perform
common analyses, such as ranking, percentiles, moving averages and cumulative sums that can be
expressed concisely in a single SQL statement.
Before analytic functions, complex analytical tasks were performed using self-joins, correlated
subqueries, temporary tables or some combination of all three. That was inefficient and consumed a
significant amount of system resources. Expressing queries with analytic functions simplifies
complex tasks by eliminating programming self-joins and correlated subqueries and using fewer
temporary tables.
The first batch of Transact-SQL analytic functions came with the release of SQL Server 2005,
which included a variety of ranking functions, such as ROW_NUMBER, RANK, DENSE_RANK and NTILE. SQL
Server 2012 introduced eight more Transact-SQL analytic
functions: PERCENT_RANK, CUME_DIST, PERCENTILE_CONT, PERCENTILE_DISC, LEAD, LAG, FIRST_VALUE
and LAST_VALUE.
Ranking analytic functions
Ranking functions return the ranking value for rows within a partition.
There are four ranking functions:
- Row Number, which assigns a row number to each row in the result set.
- Rank, which returns the rank value for each row in the result set, but with gaps.
- Dense Rank, which is the same as the Rank function, but also displays the rank value for each
row in the result set and without gaps in the sequence.
- NTILE, which partitions the ranks into a specific numbered group.
For example, if you have a table with 30 values, you might use NTILE (3) to number the first 10
rows as group 1, the middle 10 rows as group 2 and the last 10 rows as group 3.
For each ranking function, we must specify an OVER clause, which determines the partitioning and
ordering of the rowset before the ranking function is applied.
Here is the general syntax of ranking functions:
FUNCTION (Argument1,…[n])
OVER ([PARTITION BY value_expression,…[n])
<<Order_by_clause>>)
An example of this would be if you were to analyze the following query and its results, which
would demonstrate the use of ranking functions based on the SalesQuota column.
SELECT [LastName]
,[FirstName]
,[SalesQuota]
,ROW_NUMBER() OVER
(ORDER BY [SalesQuota]) AS [Row_Number()]
,RANK() OVER (ORDER BY
[SalesQuota]) AS [RANK()]
,DENSE_RANK() OVER
(ORDER BY [SalesQuota]) AS [DENSE_RANK()]
,NTILE(10) OVER (ORDER
BY [SalesQuota]) AS [NTILE(4)]
FROM [AdventureWorks2012].[Sales].[vSalesPerson]
We can use the optional PARTITION BY clause, which divides the rows based on the value
expression. Then the rows are ranked in the order specified. For example, I'll rewrite the above
query as follows to partition the result set by the CountryRegionName column.
SELECT [LastName], [FirstName], [SalesQuota],
[CountryRegionName]
,ROW_NUMBER() OVER
(PARTITION BY [CountryRegionName]
ORDER BY [SalesQuota]) AS [Row_Number()]
,RANK() OVER
(PARTITION BY [CountryRegionName]
ORDER BY [SalesQuota]) AS [RANK()]
,DENSE_RANK() OVER
(PARTITION BY [CountryRegionName]
ORDER BY [SalesQuota]) AS [DENSE_RANK()]
,NTILE(4) OVER
(PARTITION BY [CountryRegionName]
ORDER BY [SalesQuota]) AS [NTILE(4)]
FROM
[AdventureWorks2012].[Sales].[vSalesPerson]
PERCENT_RANK
Use the PERCENT_RANK function to evaluate the relative position of each row in a query result
set or partition. SQL Server uses the following formula to calculate the value for the PERCENT_RANK
column:
(rank() – 1) / (total rows in a query result set
or partition – 1)
For example, the following example uses the PERCENT_RANK function to compute the rank of the
salesperson's sales quota within a country as a percentage. The PARTITION BY clause is specified
for partitioning of the rows in the result set by country region name, and the ORDER BY clause
orders the rows in each partition.
SELECT [LastName], [FirstName], [SalesQuota],
[CountryRegionName]
,RANK() OVER
(PARTITION BY [CountryRegionName]
ORDER BY [SalesQuota]) AS [RANK()]
,PERCENT_RANK() OVER
(PARTITION BY [CountryRegionName]
ORDER BY [SalesQuota]) AS [PERCENT_RANK()]
FROM
[AdventureWorks2012].[Sales].[vSalesPerson]
WHERE [SalesQuota] IS NOT NULL
CUME_DIST
The CUME_DIST function evaluates cumulative distribution value in a group of values in a given
result set or partition. SQL Server uses the following formula to calculate the value for the
CUME_DIST column:
(values less than or equal to the current value in
the group)
/ (total row in a query result set or
partition)
The following example uses the CUME_DIST function to calculate the sales quota percentile for
each salesperson's sales quota within a given country. The value returned by the CUME_DIST function
represents the percentage of salespeople who have a sales quota less than or equal to that of the
current salespeople in the same country.
SELECT [LastName], [FirstName], [SalesQuota],
[CountryRegionName]
,RANK() OVER
(PARTITION BY [CountryRegionName]
ORDER BY [SalesQuota]) AS [RANK()]
,CUME_DIST() OVER
(PARTITION BY [CountryRegionName]
ORDER BY [SalesQuota]) AS [CUME_DIST()]
FROM
[AdventureWorks2012].[Sales].[vSalesPerson]
WHERE [SalesQuota] IS NOT NULL
PERCENTILE_CONT and PERCENTILE_DESC
The PERCENTILE_CONT function uses a discrete distribution model to calculate a percentile. It
accepts the percentile value (that is, the desired CUME_DIST value) and a sort specification to
return the value that would fall within that percentile value. The PERCENTILE_DESC function works
the same as a PERCENTILE_CONT function: It returns the smallest value whose percentile is greater
than or equal to the given percentile.
Here is the general syntax of PERCENTILE_CONT and PERCENTILE_DESC:
FUNCTION (CUME_DIST)
WITHIN GROUP
(<<Order_by_clause>>)
OVER ([PARTITION BY value_expression,…[n])
As yet another example, execute the following query, which uses PERCENTILE_CONT and
PERCENTILE_DISC to find the median employee salary in each business entity. These functions do not
always return the same value because PERCENTILE_CONT estimates the correct value, which may not
exist in the data set, while PERCENTILE_DISC always gives an actual value for the set.
SELECT TOP 15
[BusinessEntityID], [Rate]
,PERCENTILE_CONT (0.5) WITHIN GROUP (ORDER BY
[Rate])
OVER (PARTITION BY
[BusinessEntityID]) AS [PERCENTILE_CONT (0.5)]
,PERCENTILE_DISC (0.5)
WITHIN GROUP (ORDER BY [Rate])
OVER (PARTITION BY
[BusinessEntityID]) AS [PERCENTILE_DISC (0.5)]
,CUME_DIST() OVER
(PARTITION BY [BusinessEntityID]
ORDER BY [Rate]) AS [CUME_DIST()]
FROM
[AdventureWorks2012].[HumanResources].[EmployeePayHistory]
LEAD and LAG
The LEAD function provides access to the row that follows the current row. LAG is the opposite
of LEAD: It provides access to the previous row instead of the row that follows the current row.
Here is the general syntax:
FUNCTION (scalar_expression [ ,offset ] , [ default
])
OVER ([PARTITION BY value_expression,…[n])
<<Order_by_clause>>)
The following example uses the LEAD and LAG functions to compare year-to-date sales among
AdventureWorks employees partitioned by sales territory.
SELECT [FirstName]
,[TerritoryGroup]
,[SalesYTD]
,LEAD([SalesYTD]) OVER
(PARTITION BY [TerritoryGroup]
ORDER BY [SalesYTD]) AS [Next_Lower_SalesYTD (LEAD
())]
,LAG([SalesYTD]) OVER
(PARTITION BY [TerritoryGroup]
ORDER BY [SalesYTD]) AS [Prev_Higer_SalesYTD (LAG
())]
FROM
[AdventureWorks2012].[Sales].[vSalesPerson]
WHERE [TerritoryGroup] IS NOT NULL
FIRST_VALUE and LAST_VALUE
FIRST_VALUE returns the first value of the result set or partition, whereas LAST_VALUE returns
the last value of the result set or partition. If the last value in the set is null, the function
returns NULL unless you specify IGNORE NULLS.
When not specified, the rows range clause defaults to “RANGE BETWEEN UNBOUNDED PRECEDING AND
CURRENT ROW”, which sometime returns an unexpected value. This is because the last value in the
window is fixed. For proper results, specify row range as either “RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING” or “RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING”.
Here is the syntax:
FUNCTION (scalar_expression [,offset] ,
[default])
OVER ([PARTITION BY value_expression,…[n])
<<Order_by_clause>>
<<Rows_range_clause>>)
The following example uses the FIRST_VALUE and LAST_VALUE functions to return highest and lowest
year-to-date figures for each sales territory.
SELECT DISTINCT [TerritoryGroup]
,FIRST_VALUE([SalesYTD])
OVER (PARTITION
BY [TerritoryGroup]
ORDER BY [SalesYTD]
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
ROW)
AS [Highest_SalesYTD (FIRST_VALUE)]
,LAST_VALUE([SalesYTD])
OVER (PARTITION
BY [TerritoryGroup]
ORDER BY [SalesYTD]
RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
AS [Lowest_SalesYTD (FIRST_VALUE)]
FROM
[AdventureWorks2012].[Sales].[vSalesPerson]
WHERE [TerritoryGroup] IS NOT NULL
Disclaimer:
Our Tips Exchange is a forum for you to share technical advice and expertise with your peers and to learn from other enterprise IT professionals. TechTarget provides the infrastructure to facilitate this sharing of information. However, we cannot guarantee the accuracy or validity of the material submitted. You agree that your use of the Ask The Expert services and your reliance on any questions, answers, information or other materials received through this Web site is at your own risk.
Join the conversationComment
Share
Comments
Results
Contribute to the conversation