Why is a hash match operator in this very basic query

Question

I'm beginning to learn some about looking at execution plans and making queries more efficient

Consider these two basic queries

select distinct pat_id, drug_class, drug_name from rx 

select pat_id, drug_class, drug_name from rx

and their execution plans

enter image description here

index being used:

CREATE CLUSTERED INDEX [ix_overlap] ON [dbo].[rx] 
(
    [pat_id] ASC,
    [fill_date] ASC,
    [script_end_date] ASC,
    [drug_name] ASC
)

Even though the first query supposedly has the higher cost by a 4:1 margin it runs faster than the second one. Why is it that a simple distinct added to the query will add the (what I assume to always be bad, corrections are welcome) hash match operator? And why does it have the higher query cost relative to the second query if it runs faster.

Mark Storey-Smith · Accepted Answer · 2013-03-31 22:11:25Z

The first query is using a parallel plan, meaning the "work" was split into multiple tasks carried out by multiple threads. The cumulative CPU time was therefore higher than for the serial plan used for your second query.

As to why the distinct causes the hash match operator to appear in the plan; an aggregate or sort operation is required to determine the DISTINCT result. @SQLKiwi might pop up with a more indepth explanation shortly but the hash match operator is apparently favoured for larger result sets.

The optimizer tends to prefer the Hash Match Aggregate on larger rowsets, with fewer groups, where there is no reason to produce a sorted output, and where the incoming rows are not sorted on the DISTINCT expression(s). Larger inputs favour hash matching because the algorithm generally scales well (although it does require a memory grant) and can make good use of parallelism. Fewer groups are better for hashing because it means fewer entries in the hash table, and the memory needed to store unique values is proportional to the number of groups (and the size of the group). Hash matching does not require or preserve the order of the incoming row stream. [Source]

When parallelism happens is the work being done by different processor cores or are there too many unknowns to be able to answer that? (or is that a poor question to ask) So it seems there are times that it's unavoidable to have a hash match in your execution plans and you shouldn't always strive to compeletely eliminate them?
Yes, different cores. Where has this motivation to eliminate hash operators originated from? All operators have their uses and the majority of the time, the optimiser does a good job of deciding the most efficient way to execute a query.
From the SQL Server Bible 2008 A hash match is an unordered join method that builds a temp table and iteratively matches with data from another table. A hash match is more efficient if one table is significantly larger than the other table. This is the worst-case join method and is used when no suitable index is available.
So from that you could see my confusion when a perfectly good index is there :)
Pop into chat tomorrow if you want to talk through the index options more. I'm sure we can come up with a better approach.

asked	2 months ago
viewed	125 times
active	2 months ago

Why is a hash match operator in this very basic query

1 Answer

Your Answer

Not the answer you're looking for? Browse other questions tagged sql-server performance index or ask your own question.

Why is a hash match operator in this very basic query

1 Answer

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged sql-server performance index or ask your own question.

Related