Fabiano Amorim
The Query Optimizer: Date Correlation Optimisation
01 October 2009

In SQL Server 2005, a feature was introduced that was hardly noticed, but which might make a great difference to anyone doing queries involving temporal data. For anyone doing Data Warehousing, timetabling, or time-based pricing, this could speed up your queries considerably. Who better to introduce this than Query Optimizer expert, Fabiano Amorim?

Since SQL Server 2005, there has been an option, DATE_CORRELATION_OPTIMIZATION, which can be SET at the database level. It is OFF by default. Personally, I have never seen anyone use this feature, but I believe that this is only because there is so little material around that refers to, or explains, it.
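
Before we go any further: if you want to check whether a database already has this option enabled, the is_date_correlation_on column of sys.databases reports it. This little check is my own addition, not part of the original walkthrough:

-- 1 = DATE_CORRELATION_OPTIMIZATION is ON for that database
SELECT name, is_date_correlation_on
  FROM sys.databases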

It is very important that the developer or DBA knows the type of application and database query that would benefit from this feature, since it can produce striking increases in performance; but, at this point, you are probably wondering, ‘what does this feature actually do?’

Tying Dates together

When enabled, DATE_CORRELATION_OPTIMIZATION allows the Query Optimizer to improve the performance of certain queries by gathering statistical information about the DATETIME columns of two tables that are linked by a one-column foreign key. This collected data is generally known as correlation statistics (not to be confused with ordinary query optimization statistics), because it helps the Query Optimizer to identify a correlation between the two DATETIME columns. With this information, the Query Optimizer can make a better decision about the execution plan, and about which data can be filtered out to avoid unnecessary page reads.

Let’s try an example: we have one table with some orders (Pedidos) and a column called OrderDate (Data_Pedido), and another table called OrderItems (Items) with a column called DeliveryDate (Data_Entrega), where the delivery date is not equal to the order date but falls within a range of dates after the order. In other words, the two dates correlate, but are not equal. As an example of this, imagine that I’ve made a purchase at submarino.com.br, buying a lot of different products, from books to a plush toy. Perhaps I ought to explain that my wife likes penguins, so I buy one. Anyway, despite the fact that I only made one order, I received each product on a different day, because each product came from a different place.

In this example, we have a table called orders with the date of the order, and a table called OrderItems with the delivery date of each product. We can be fairly confident in saying that the date of order is always very close to the delivery date of each item. That represents a clear correlation between the OrderDate and the DeliveryDate.

I hope that this explains what is meant by a correlation between the two DATETIME columns. So let’s get back to the optimization.

Suppose you have a query that joins the Orders and OrderItems tables, and you filter on one of the DATETIME columns. SQL Server can modify your query to make it better without you seeing any change; all internally and automatically. How? Let’s start with an example.

First of all, we’ll create the metadata (the tables) needed to follow the example I've described.

The names of the tables are in Portuguese, but I’m sure this will not be a problem for you.

IF OBJECT_ID('Pedidos') IS NOT NULL
BEGIN
  DROP TABLE Items
  DROP TABLE Pedidos
END
GO
CREATE TABLE Pedidos(ID_Pedido   Integer Identity(1,1),
                     Data_Pedido DateTime NOT NULL, -- The correlated DATETIME columns cannot accept null values
                     Valor       Numeric(18,2),
                     CONSTRAINT xpk_Pedidos PRIMARY KEY (ID_Pedido))
GO
CREATE TABLE Items(ID_Pedido    Integer,
                   ID_Produto   Integer,
                   Data_Entrega DateTime NOT NULL, -- The correlated DATETIME columns cannot accept null values
                   Quantidade   Integer,
                   CONSTRAINT xpk_Items PRIMARY KEY NONCLUSTERED(ID_Pedido, ID_Produto))
GO
-- At least one of the DATETIME columns must belong to a clustered index
CREATE CLUSTERED INDEX ix_Data_Entrega ON Items(Data_Entrega)
GO

-- There must be a foreign key relationship between the tables that contain the correlated dates
ALTER TABLE Items ADD CONSTRAINT fk_Items_Pedidos FOREIGN KEY(ID_Pedido) REFERENCES Pedidos(ID_Pedido)
GO
DECLARE @i Integer
SET @i = 0
WHILE @i < 10000
BEGIN
  INSERT INTO Pedidos(Data_Pedido,
                      Valor)
  VALUES(GetDate() - ABS(CheckSum(NEWID()) / 10000000),
         ABS(CheckSum(NEWID()) / 1000000))
  SET @i = @i + 1
END
GO
INSERT INTO Items(ID_Pedido, ID_Produto, Data_Entrega, Quantidade)
SELECT ID_Pedido,
       ABS(CheckSum(NEWID()) / 10000000),
       Data_Pedido + ABS(CheckSum(NEWID()) / 100000000),
       ABS(CheckSum(NEWID()) / 10000000)
  FROM Pedidos
GO
INSERT INTO Items(ID_Pedido, ID_Produto, Data_Entrega, Quantidade)
SELECT ID_Pedido,
       ABS(CheckSum(NEWID()) / 10000),
       Data_Pedido + ABS(CheckSum(NEWID()) / 100000000),
       ABS(CheckSum(NEWID()) / 10000000)
  FROM Pedidos
GO

After the tables are created and populated, we have a structure close to the one described above. Let’s look at a sample of the data.
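
Since the sample data is generated randomly, your values will differ from mine; the following query (my own addition, using the tables just created) shows a few joined rows so you can see the correlation between Data_Pedido and Data_Entrega:

-- Shows that each item's delivery date falls shortly after its order date
SELECT TOP 10 P.ID_Pedido, P.Data_Pedido, I.ID_Produto, I.Data_Entrega
  FROM Pedidos AS P
 INNER JOIN Items AS I
    ON I.ID_Pedido = P.ID_Pedido
 ORDER BY P.ID_Pedido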

Now, suppose a developer writes the following query.

SELECT *
  FROM Pedidos
INNER JOIN Items
    ON Pedidos.ID_Pedido = Items.ID_Pedido
WHERE Pedidos.Data_Pedido BETWEEN '20090301' AND '20090305'

Translating: I want all the data from the orders, and their items, placed between 2009/03/01 and 2009/03/05. For this query, the Query Optimizer created the following execution plan:

SELECT *    FROM Pedidos   INNER JOIN Items      ON Pedidos.ID_Pedido = Items.ID_Pedido   WHERE Pedidos.Data_Pedido BETWEEN '20090301' AND '20090305'
  |--Hash Match(Inner Join, HASH:([Performance].[dbo].[Pedidos].[ID_Pedido])=([Performance].[dbo].[Items].[ID_Pedido]))
       |--Clustered Index Scan(OBJECT:([Performance].[dbo].[Pedidos].[xpk_Pedidos]), WHERE:([Performance].[dbo].[Pedidos].[Data_Pedido]>='2009-03-01' AND [Performance].[dbo].[Pedidos].[Data_Pedido]<='2009-03-05'))
       |--Clustered Index Scan(OBJECT:([Performance].[dbo].[Items].[ix_Data_Entrega]))

In the text execution plan, we can easily see that the filter specified in the WHERE clause was applied as the data was read from the index xpk_Pedidos, and that the column that received the filter was Data_Pedido, exactly as specified in the WHERE clause.
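
If you want to capture text plans like the one above yourself, one way (this is my assumption about how the plans in this article were produced; the graphical plan shows the same information) is SET SHOWPLAN_TEXT, which returns the plan without executing the query:

SET SHOWPLAN_TEXT ON
GO
-- The statement below is compiled but not executed; only the plan text is returned
SELECT *
  FROM Pedidos
INNER JOIN Items
    ON Pedidos.ID_Pedido = Items.ID_Pedido
WHERE Pedidos.Data_Pedido BETWEEN '20090301' AND '20090305'
GO
SET SHOWPLAN_TEXT OFF
GO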

The first thing we could do to optimize this query is to create an index on the Data_Pedido column. You may be thinking that this would be the first thing anyone would do with the query. OK, you are right; so let’s just create this index and continue with the optimization.

CREATE INDEX ix_teste ON Pedidos(Data_Pedido) INCLUDE(Valor)

After creating the index, we have the following plan:

SELECT *    FROM Pedidos   INNER JOIN Items ON Pedidos.ID_Pedido = Items.ID_Pedido   WHERE Pedidos.Data_Pedido BETWEEN '20090301' AND '20090305'
  |--Hash Match(Inner Join, HASH:([Performance].[dbo].[Pedidos].[ID_Pedido])=([Performance].[dbo].[Items].[ID_Pedido]))
       |--Index Seek(OBJECT:([dbo].[Pedidos].[ix_teste]), SEEK:([dbo].[Pedidos].[Data_Pedido] >= '2009-03-01' AND [dbo].[Pedidos].[Data_Pedido] <= '2009-03-05') ORDERED FORWARD)
       |--Clustered Index Scan(OBJECT:([Performance].[dbo].[Items].[ix_Data_Entrega]))

Now the query is a little better: instead of performing a Clustered Index Scan to read the data from the Pedidos table, SQL Server performs an Index Seek on the index we have just created (and because a nonclustered index always carries the clustering key ID_Pedido, the INCLUDE(Valor) makes it cover the whole SELECT). Now, what else could we do to improve this further? It is time to see the date correlation optimization at work.

As we can see, there is still a bottleneck in the execution plan when reading the data from the Items table: according to the execution plan, this read represents 49% of the entire cost, and it uses the Clustered Index Scan operator to read the data from the table. If we could do something to get an Index Seek operator instead, then maybe the query could be improved. Is there something we can do to make the Query Optimizer use the Index Seek operator?
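
To put a number on that cost, you can compare the logical reads on Items before and after each change. This measurement sketch is my own addition; the reads are reported on the Messages tab:

SET STATISTICS IO ON
SELECT *
  FROM Pedidos
INNER JOIN Items
    ON Pedidos.ID_Pedido = Items.ID_Pedido
WHERE Pedidos.Data_Pedido BETWEEN '20090301' AND '20090305'
SET STATISTICS IO OFF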

Knowing the database, the metadata, and the business rules, we could modify the query to add a filter on the Items table. But to do that we need to be certain of what we are doing, because this change must not alter the result of the query. One way to do this would be:

  1. Based on the orders placed between 2009/03/01 and 2009/03/05, find the minimum and maximum delivery dates of their items (a query sketch for this step follows the list).
  2. With that information, add a new filter to the query on the Items.Data_Entrega column.
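
A minimal sketch of step 1, using the tables created above (my own addition; the hard-coded dates in the next query would come from this kind of check, or from knowledge of the business rules):

DECLARE @MinEntrega DateTime, @MaxEntrega DateTime
SELECT @MinEntrega = MIN(Items.Data_Entrega),
       @MaxEntrega = MAX(Items.Data_Entrega)
  FROM Pedidos
 INNER JOIN Items
    ON Pedidos.ID_Pedido = Items.ID_Pedido
 WHERE Pedidos.Data_Pedido BETWEEN '20090301' AND '20090305'
-- These two values become the new filter on Items.Data_Entrega in step 2
SELECT @MinEntrega AS Min_Data_Entrega, @MaxEntrega AS Max_Data_Entrega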

After that, we have the following query:

SELECT *
  FROM Pedidos
INNER JOIN Items
    ON Pedidos.ID_Pedido = Items.ID_Pedido
WHERE Pedidos.Data_Pedido BETWEEN '20090301' AND '20090305'
   AND Items.Data_Entrega BETWEEN '20090301' AND '20090325 23:59:59.000'

This way we can force a filter on the Items table, so that the plan can seek on the ix_Data_Entrega index instead of scanning it.

But I now have a question: how do we know the values needed in steps 1 and 2 above? Is there a way to know these values without searching the tables? The problem is that these values are not fixed; they vary with the range of order dates.

What about letting the SQL Server Query Optimizer figure that out for you? All you need to do is enable DATE_CORRELATION_OPTIMIZATION; the Query Optimizer will then identify the correlation for you and apply the filter on the Data_Entrega column itself. Wow, pretty smart, isn’t it? Let me see if that is really true…

ALTER DATABASE <YOURDATABASE> SET DATE_CORRELATION_OPTIMIZATION ON;
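
Once the option is ON, SQL Server keeps the correlation statistics in internal indexed views (Peso's comment below and the follow-up "internals" article discuss them). In my tests their names start with '_MPStats'; treat that prefix as an observation rather than documented behaviour. A quick way to look for them:

-- The [_] escapes the underscore, which is otherwise a LIKE wildcard
SELECT name, create_date
  FROM sys.views
 WHERE name LIKE '[_]MPStats%'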

Let’s now run the query without the Data_Entrega filter again.

SELECT *    FROM Pedidos   INNER JOIN Items      ON Pedidos.ID_Pedido = Items.ID_Pedido   WHERE Pedidos.Data_Pedido BETWEEN '20090301' AND '20090305'
  |--Hash Match(Inner Join, HASH:([dbo].[Pedidos].[ID_Pedido])=([dbo].[Items].[ID_Pedido]))
       |--Index Seek(OBJECT:([dbo].[Pedidos].[ix_teste]), SEEK:([dbo].[Pedidos].[Data_Pedido] >= '2009-03-01' AND [dbo].[Pedidos].[Data_Pedido] <= '2009-03-05') ORDERED FORWARD)
       |--Clustered Index Seek(OBJECT:([dbo].[Items].[ix_Data_Entrega]), SEEK:([dbo].[Items].[Data_Entrega] >= '2009-02-28' AND [dbo].[Items].[Data_Entrega] < '2009-04-29') ORDERED FORWARD)

Look: this time, even though we have not specified the Data_Entrega filter, the Query Optimizer has built it for us.

Summary

DATE_CORRELATION_OPTIMIZATION is a very good way of improving the performance of queries that join on DATETIME values which are related but not necessarily equal, yet it is almost never used outside specialized data-warehousing and BI applications. This may be because it requires good analysis, and prior thought, about when and where it can be used.

All the following conditions must be met for this feature to work:

  • The database SET options must be set in the following way: ANSI_NULLS, ANSI_PADDING, ANSI_WARNINGS, ARITHABORT, CONCAT_NULL_YIELDS_NULL, and QUOTED_IDENTIFIER must all be ON, and NUMERIC_ROUNDABORT must be OFF (a script that applies these is sketched after this list).
  • There must be a single-column foreign key relationship between the tables.
  • The tables must both have DATETIME columns that are defined NOT NULL.
  • At least one of the DATETIME columns must be the key column of a clustered index (if the index key is composite, it must be the first key), or it must be the partitioning column, if it is a partitioned table.
  • Both tables must be owned by the same user.
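
Because the correlation statistics live in indexed views, the SET option requirements above are the same ones that indexed views demand. As a convenience, here is a session-level sketch of those settings (my addition; the option names come straight from the list above):

SET ANSI_NULLS ON
SET ANSI_PADDING ON
SET ANSI_WARNINGS ON
SET ARITHABORT ON
SET CONCAT_NULL_YIELDS_NULL ON
SET QUOTED_IDENTIFIER ON
SET NUMERIC_ROUNDABORT OFF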

If you can meet these conditions and you are dealing with temporal data, then you could be in for a pleasant surprise.




Author profile: Fabiano Amorim

Fabiano is fascinated by the SQL Server Query Processor and the way it works to optimize queries, procedures and functions. He graduated as a Technical Processor from Colégio Bezerra de Menezes, SP, Brazil, and has worked for several years with SQL Server, focusing on creating data warehouses and optimizing T-SQL code for many companies in Brazil and Argentina. Fabiano is a SQL Server MVP, an MCP for SQL Server 2000, and an MCTS and MCITP Database Developer for SQL Server 2005 and 2008. He is also actively involved in the SQL Server community through forums such as MSDN and TechNet Brazil, writes articles for Simple-Talk and SQL Server Magazine Brazil, and presents online webcasts and in-person events for Microsoft Brazil. His blog is at http://blogfabiano.com


Have Your Say

Subject: Very good !!!
Posted by: laerte (view profile)
Posted on: Thursday, October 01, 2009 at 9:54 AM
Message: Great, very great article my friend !!! This feature is something a few DBA's know. I even learned from you. Success!!

Subject: Great!
Posted by: Cláudio Cabral (not signed in)
Posted on: Thursday, October 01, 2009 at 11:46 AM
Message: Congratulations Fabiano. You give the meaning to Most Valuable Professional.

Subject: congrats!
Posted by: ferreira (view profile)
Posted on: Friday, October 02, 2009 at 8:50 PM
Message: great article Fabiano! in fact few people knows about this feature

Subject: Good Article
Posted by: Jack Corbett (not signed in)
Posted on: Monday, October 05, 2009 at 8:28 AM
Message: This is the first time that I have heard of this feature. Great article.

Subject: Views
Posted by: Peso (view profile)
Posted on: Monday, October 05, 2009 at 1:14 PM
Message: You maybe should have mentioned the tons of views that are automatically created...

Subject: Great Article!
Posted by: Alexandre Lopes (view profile)
Posted on: Tuesday, October 06, 2009 at 10:04 AM
Message: Fabiano, excellent article. Glad to see a dedicated professional and competent as you write to one of the references in SQL Server. This is the first of many excellent articles that you will write to the Simple-Talk.

Subject: Answer
Posted by: mcflyamorim (view profile)
Posted on: Tuesday, October 06, 2009 at 10:32 AM
Message: Peso,

I'm writing one new article called, "Date Correlation Optimization Internals".
I'll talk about the views and the behavior of the Correlation Optimization.

Thanks

Subject: Internals
Posted by: mcflyamorim (view profile)
Posted on: Monday, October 26, 2009 at 7:57 AM
Message: Fellows, the second part of this article can be read here!

http://www.simple-talk.com/sql/t-sql-programming/data-correlation-optimization-internals/

Subject: Wow! Didn't even know this existed
Posted by: MByrdTX (view profile)
Posted on: Monday, October 26, 2009 at 9:04 AM
Message: I've read follow-on article and have 2 questions.
1. What kind of load is put on the server to create all the system views? Would you do the alter database on a production server in a 24x7 environment.
2. Most of our tables in our OLTP database have a CreateDate (Not NULL) and an UpdateDate (NULL). I presume that the date correlation will try and use the CreateDates and ignore the UpdateDates???

Subject: Wow! Didn't even know this existed
Posted by: mcflyamorim (view profile)
Posted on: Monday, October 26, 2009 at 9:47 AM
Message: Hi MByrdTX,

I don't know if I understood correctly, but here we go!

1. You have to use this feature carefully, because it creates some indexed views, and that can be dangerous. So my advice is: before putting it into a production server, do a lot of tests on a test server. Take a look at the internals article for more information.
2. To be able to use this feature, both columns must be NOT NULL. And remember, the datetime column must be a key column of a clustered index. So in this case, you would need to create a clustered key using CreateDate as the key column.

Subject: Really "Must read" article
Posted by: Rodrigo Moraes (view profile)
Posted on: Monday, October 26, 2009 at 11:52 AM
Message: Great article. Congratulations Fabiano.
Soon I'll implement this in my developer environment.
Thank you

 









