This question is largely driven by curiosity, as I do have a working query (it just takes a little longer than I would like).

I have a table with 4 million rows. The only index on this table is an auto-increment BigInt ID. The query is looking for distinct values in one of the columns, but only going back 1 day. Unfortunately, the ReportDate column that is evaluated is not of the DateTime type, or even a BigInt, but is char(8) in the format of YYYYMMDD. So the query is a bit slow.

  SELECT Category 
    FROM Reports 
   where ReportDate = CONVERT(VARCHAR(8), GETDATE(), 112) 
GROUP BY Category

Note that the date conversion in the above statement simply converts the current date to the YYYYMMDD format for comparison.

I was wondering if there was a way to optimize this query based on the fact that I know that the only data I am interested in is at the "bottom" of the table. I was thinking of some sort of recursive SELECT function which gradually grew a temporary table that could be used for the final query.

For example, in pseudo-SQL:

N = 128
TemporaryTable = SELECT TOP {N} * 
                   FROM Reports 
               ORDER BY ID DESC 

/* Once we hit a date < Today, we can stop */
if(TemporaryTable does not contain ReportDate < Today) 
  N = N**2
  Repeat Select

/* We now have a smallish table to do our query */
  SELECT Category 
    FROM TemporaryTable 
   where ReportDate = CONVERT(VARCHAR(8), GETDATE(), 112) 
GROUP BY Category

Does that make sense? Is something like that possible?

This is on MS SQL Server 2008.

Why not index ReportDate? – Martin Smith Oct 12 '10 at 16:34
@Martin I'm not the data-owner, plus I am really just curious if the above is even possible. Ideally we would index on the report date (and have it be a non-char type such as bigint or actual date type) – Matt Oct 12 '10 at 16:36
Will SELECT MAX(ID) FROM YourTable WHERE ReportDate < CONVERT(VARCHAR(8), GETDATE(), 112) find the limit of the range of interest or not necessarily? If so that might get the scan to start at the end of the index working backwards and it can stop as soon as it finds the first record for a previous day. – Martin Smith Oct 12 '10 at 16:40
yes, what you're trying to do is possible - but without an index on ReportDate, it'll never be anywhere as fast as it could be.... – marc_s Oct 12 '10 at 17:02

I might suggest that you do not need to convert the date that is stored as char data in YYYYMMDD format; that format is inherently sortable all by itself. I would instead convert the date you are searching for into that format.

Also, the way you have the conversion written, the current DateTime is converted for every individual row, so even storing that value once for the whole query could speed things up... but I think just converting the date you are searching for into that char format would help the most.
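A minimal sketch of that suggestion (untested, assuming the table and column names from the question): compute the converted value once into a variable so the per-row predicate is a plain char comparison.

```sql
-- Sketch: convert GETDATE() a single time instead of per row.
DECLARE @Today char(8);
SET @Today = CONVERT(varchar(8), GETDATE(), 112);

SELECT Category
FROM Reports
WHERE ReportDate = @Today   -- plain char(8) equality, no per-row conversion
GROUP BY Category;
```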

I would also suggest getting the index(es) you need created, of course... but that's not the question you asked :P


Why not just create the index you need?

create index idx_Reports_ReportDate 
    on Reports(ReportDate, Category)
I am not the data owner - working on updating the index, but looking for alternative solutions (not to mention curious about alternatives in general) – Matt Oct 12 '10 at 19:01

No, that doesn't make sense. The only way to optimize this query is to have a covering index for it:

CREATE INDEX ndxReportDateCategory ON Reports (ReportDate, Category);

Update

Considering your comment that you cannot modify the schema, then you should modify the schema. If you still can't, then the answer still applies: the solution is to have an index.

And finally, to answer your question more directly: if you have a strong correlation between ID and ReportDate, the ID you seek is the biggest one that has a ReportDate smaller than the date you're after:

SELECT MAX(Id) 
FROM Reports
WHERE ReportDate < 'YYYYMMDD';

This will do a reverse scan on the ID index and stop at the first ID from before your desired date (ie. it will not scan the entire table). You can then filter your reports based on this found max ID.
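Putting the two steps together, a sketch might look like this (untested, and it assumes, as discussed in the comments, that Id order matches ReportDate order; `ISNULL` covers the case where every row is from today):

```sql
-- Step 1: find the last Id belonging to a previous day (reverse index scan).
DECLARE @Today varchar(8), @MaxPrevId bigint;
SET @Today = CONVERT(varchar(8), GETDATE(), 112);

SELECT @MaxPrevId = MAX(Id)
FROM Reports
WHERE ReportDate < @Today;

-- Step 2: only touch rows above that Id, i.e. today's slice of the table.
SELECT Category
FROM Reports
WHERE Id > ISNULL(@MaxPrevId, 0)
  AND ReportDate = @Today
GROUP BY Category;
```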

Interesting - is this a reverse scan because it is looking at MAX(ID)? – Matt Oct 12 '10 at 16:48
Yes. MAX(Column) where there is an index on Column can be satisfied by a quick look at either end of the index. If you also have a filter, then a scan operator can quickly find the MAX (or MIN) by starting from the appropriate end and going row by row, checking the condition. The first record that passes the filter is the desired value. – Remus Rusanu Oct 12 '10 at 16:51
Note that this assumes there is a deterministic relation between Id and ReportDate, ie. no dates are reversed in the Id order. – Remus Rusanu Oct 12 '10 at 16:52
That is correct, they are deterministic – Matt Oct 12 '10 at 17:17
Others will google and land on this page. I prefer the rhetoric to be there and drive home the point that the solution to performance problems in relational tables is almost always in the shape of your schema (ie. a covering index), and almost never in the text of your query. – Remus Rusanu Oct 12 '10 at 19:13

I think you will find the discussion on SARGability, on Rob Farley's Blog to be very interesting reading in relation to your post topic.

http://sqlblog.com/blogs/rob_farley/archive/2010/01/21/sargable-functions-in-sql-server.aspx

An interesting alternative approach that does not require you to modify the existing column data type would be to leverage computed columns.

alter table REPORTS
add castAsDate as CAST(ReportDate as date)

create index rf_so2 on REPORTS(castAsDate) include (ReportDate)
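A sketch of how that index would then be used (untested): queries filter on the computed column rather than the original char(8) column, so the predicate is SARGable against the new index.

```sql
-- Seek on the computed-column index instead of converting per row.
SELECT ReportDate
FROM REPORTS
WHERE castAsDate = CAST(GETDATE() AS date);
```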
+1, good read, thanks – Matt Oct 12 '10 at 18:51
You're most welcome. – John Sansom Oct 12 '10 at 19:35

One of the query patterns I occasionally use to get into a log table with similar indexing to yours is to limit by a subquery:

DECLARE @ReportDate varchar(8)
SET @ReportDate = Convert(varchar(8), GetDate(), 112)

SELECT *
FROM
(
SELECT top 20000 *
FROM Reports
ORDER BY ID desc
) sub
WHERE sub.ReportDate = @ReportDate

20k/4M = 0.5% of the table is read.


Here's a loop solution. Note: you might want to make ID the primary key and index ReportDate in the temp table.

DECLARE @ReportDate varchar(8)
SET @ReportDate = Convert(varchar(8), GetDate(), 112)

DECLARE @CurrentDate varchar(8), @MinKey bigint


SELECT top 2000 * INTO #MyTable
FROM Reports ORDER BY ID desc

SELECT @CurrentDate = MIN(ReportDate), @MinKey = MIN(ID)
FROM #MyTable

WHILE @ReportDate <= @CurrentDate
BEGIN

  /* INSERT rather than SELECT ... INTO, so each batch accumulates
     in the temp table instead of failing to re-create it */
  INSERT INTO #MyTable
  SELECT top 2000 *
  FROM Reports WHERE ID < @MinKey ORDER BY ID desc

  IF @@ROWCOUNT = 0 BREAK  /* no older rows left; avoid an infinite loop */

  SELECT @CurrentDate = MIN(ReportDate), @MinKey = MIN(ID)
  FROM #MyTable

END

SELECT * FROM #MyTable
WHERE ReportDate = @ReportDate


DROP TABLE #MyTable
That is sort of what I had in mind - but without necessarily knowing what the size of the subquery is ahead of time. – Matt Oct 12 '10 at 16:46
