This question is largely driven by curiosity, as I do have a working query (it just takes a little longer than I would like).

I have a table with 4 million rows. The only index on this table is an auto-increment BigInt ID. The query is looking for distinct values in one of the columns, but only going back 1 day. Unfortunately, the ReportDate column that is evaluated is not of the DateTime type, or even a BigInt, but is char(8) in the format of YYYYMMDD. So the query is a bit slow.

  SELECT Category 
    FROM Reports 
   where ReportDate = CONVERT(VARCHAR(8), GETDATE(), 112) 
GROUP BY Category

Note that the date conversion in the above statement simply converts the current date to the YYYYMMDD format for comparison.

I was wondering if there was a way to optimize this query based on the fact that I know that the only data I am interested in is at the "bottom" of the table. I was thinking of some sort of recursive SELECT function which gradually grew a temporary table that could be used for the final query.

For example, in pseudo-SQL:

N = 128
TemporaryTable = SELECT TOP {N} * 
                   FROM Reports 
               ORDER BY ID DESC 

/* Once we hit a date < Today, we can stop */
if(TemporaryTable does not contain ReportDate < Today) 
  N = N**2
  Repeat Select

/* We now have a smallish table to do our query */
  SELECT Category 
    FROM TemporaryTable 
   where ReportDate = CONVERT(VARCHAR(8), GETDATE(), 112) 
GROUP BY Category

Does that make sense? Is something like that possible?

This is on MS SQL Server 2008.

Why not index ReportDate? – Martin Smith Oct 12 '10 at 16:34
@Martin I'm not the data-owner, plus I am really just curious if the above is even possible. Ideally we would index on the report date (and have it be a non-char type such as bigint or actual date type) – Matt Oct 12 '10 at 16:36
Will SELECT MAX(ID) FROM YourTable WHERE ReportDate < CONVERT(VARCHAR(8), GETDATE(), 112) find the limit of the range of interest or not necessarily? If so that might get the scan to start at the end of the index working backwards and it can stop as soon as it finds the first record for a previous day. – Martin Smith Oct 12 '10 at 16:40
yes, what you're trying to do is possible - but without an index on ReportDate, it'll never be anywhere as fast as it could be.... – marc_s Oct 12 '10 at 17:02

I might suggest that you do not need to convert the date that is stored as char data in YYYYMMDD format; that format is inherently sortable all by itself. I would instead convert the date you are searching for into that format.

Also, the way you have the conversion written, the current DateTime is converted for every individual row, so even storing that value once for the whole query could speed things up... but I think just converting the date you are searching for into that char format would help the most.
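A minimal sketch of that suggestion (untested, assuming the table and column names from the question): compute the converted value once into a variable so the per-row predicate is a plain char comparison.

```sql
-- Sketch: convert GETDATE() a single time instead of per row.
DECLARE @Today char(8);
SET @Today = CONVERT(varchar(8), GETDATE(), 112);

SELECT Category
FROM Reports
WHERE ReportDate = @Today   -- plain char(8) equality, no per-row conversion
GROUP BY Category;
```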

I would also suggest getting the index(es) you need created, of course... but that's not the question you asked :P


Why not just create the index you need?

create index idx_Reports_ReportDate 
    on Reports(ReportDate, Category)
I am not the data owner - working on updating the index, but looking for alternative solutions (not to mention curious about alternatives in general) – Matt Oct 12 '10 at 19:01

No, that doesn't make sense. The only way to optimize this query is to have a covering index for it:

CREATE INDEX ndxReportDateCategory ON Reports (ReportDate, Category);

Update

Considering your comment that you cannot modify the schema, then you should modify the schema. If you still can't, then the answer still applies: the solution is to have an index.

And finally, to answer your question more directly: if you have a strong correlation between ID and ReportDate, the ID you seek is the biggest one that has a ReportDate smaller than the date you're after:

SELECT MAX(Id) 
FROM Reports
WHERE ReportDate < 'YYYYMMDD';

This will do a reverse scan on the ID index and stop at the first ID from before your desired date (ie. it will not scan the entire table). You can then filter your reports based on this found max ID.
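Putting the two steps together, a sketch might look like this (untested, and it assumes, as discussed in the comments, that Id order matches ReportDate order; `ISNULL` covers the case where every row is from today):

```sql
-- Step 1: find the last Id belonging to a previous day (reverse index scan).
DECLARE @Today varchar(8), @MaxPrevId bigint;
SET @Today = CONVERT(varchar(8), GETDATE(), 112);

SELECT @MaxPrevId = MAX(Id)
FROM Reports
WHERE ReportDate < @Today;

-- Step 2: only touch rows above that Id, i.e. today's slice of the table.
SELECT Category
FROM Reports
WHERE Id > ISNULL(@MaxPrevId, 0)
  AND ReportDate = @Today
GROUP BY Category;
```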

Interesting - is this a reverse scan because it is looking at MAX(ID)? – Matt Oct 12 '10 at 16:48
Yes. MAX(Column) where there is an index on Column can be satisfied by a quick look at either end of the index. If you also have a filter, then a scan operator can quickly find the MAX (or MIN) by starting from the appropriate end and going row by row, checking the condition. The first record that passes the filter is the desired value. – Remus Rusanu Oct 12 '10 at 16:51
Note that this assumes there is a deterministic relation between Id and ReportDate, ie. no dates are reversed in the Id order. – Remus Rusanu Oct 12 '10 at 16:52
That is correct, they are deterministic – Matt Oct 12 '10 at 17:17
Others will google and land on this page. I prefer the rhetoric to be there and drive home the point that the solution to performance problems in relational tables is almost always in the shape of your schema (ie. a covering index), and almost never in the text of your query. – Remus Rusanu Oct 12 '10 at 19:13

I think you will find the discussion on SARGability, on Rob Farley's Blog to be very interesting reading in relation to your post topic.

http://sqlblog.com/blogs/rob_farley/archive/2010/01/21/sargable-functions-in-sql-server.aspx

An interesting alternative approach that does not require you to modify the existing column data type would be to leverage computed columns.

alter table REPORTS
add castAsDate as CAST(ReportDate as date)

create index rf_so2 on REPORTS(castAsDate) include (ReportDate)
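A sketch of how that index would then be used (untested): queries filter on the computed column rather than the original char(8) column, so the predicate is SARGable against the new index.

```sql
-- Seek on the computed-column index instead of converting per row.
SELECT ReportDate
FROM REPORTS
WHERE castAsDate = CAST(GETDATE() AS date);
```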
+1, good read, thanks – Matt Oct 12 '10 at 18:51
You're most welcome. – John Sansom Oct 12 '10 at 19:35

One of the query patterns I occasionally use to get into a log table with similar indexing to yours is to limit by a subquery:

DECLARE @ReportDate varchar(8)
SET @ReportDate = Convert(varchar(8), GetDate(), 112)

SELECT *
FROM
(
SELECT top 20000 *
FROM Reports
ORDER BY ID desc
) sub
WHERE sub.ReportDate = @ReportDate

20k/4M = 0.5% of the table is read.


Here's a loop solution. Note: you might want to make ID the primary key and index ReportDate in the temp table.

DECLARE @ReportDate varchar(8)
SET @ReportDate = Convert(varchar(8), GetDate(), 112)

DECLARE @CurrentDate varchar(8), @MinKey bigint


SELECT top 2000 * INTO #MyTable
FROM Reports ORDER BY ID desc

SELECT @CurrentDate = MIN(ReportDate), @MinKey = MIN(ID)
FROM #MyTable

WHILE @ReportDate <= @CurrentDate
BEGIN

  /* INSERT rather than SELECT ... INTO, so each batch accumulates
     in the temp table instead of failing to re-create it */
  INSERT INTO #MyTable
  SELECT top 2000 *
  FROM Reports WHERE ID < @MinKey ORDER BY ID desc

  IF @@ROWCOUNT = 0 BREAK  /* no older rows left; avoid an infinite loop */

  SELECT @CurrentDate = MIN(ReportDate), @MinKey = MIN(ID)
  FROM #MyTable

END

SELECT * FROM #MyTable
WHERE ReportDate = @ReportDate


DROP TABLE #MyTable
That is sort of what I had in mind - but without necessarily knowing what the size of the subquery is ahead of time. – Matt Oct 12 '10 at 16:46
