SQL Server clustered index design for performance
Matthew Schroeder, Contributor
Clustered indexes in SQL Server are a critical consideration in the overall architecture of the
database. They are often overlooked, misunderstood or, if the database is small, considered
unimportant.
This article points out the importance of clustered indexes for overall system performance and
maintenance as your database grows. I will briefly cover how SQL Server clustered indexes are
stored on disk, why they should always increase over time and why it is best that clustered indexes
be static. I'll also touch on many-to-many tables, why they are used and how clustered indexes make
these tables more efficient.
Finally, it is absolutely critical that we touch on the new SQL Server 2005 partitioned table
concept and examine how partitioned tables affect clustered
Premium Access
Register now for unlimited access to our premium content across our network of over 70 information Technology web sites.
By submitting you agree to receive email from TechTarget and its partners. If you reside outside of the United States, you consent to having your personal data transferred to and processed in the United States.
Privacy
This was first published in January 2008
indexes. This will help you make the
right decisions from the very start.
Clustered indexes are created by default to match the primary key, which is defined on tables in
SQL Server. However, you can create a clustered index on any column and then define a primary key
on a separate column or columns. At this point, the primary key would be created as a unique
non-clustered index. Typically, a clustered index will match the primary key, but not necessarily,
so be careful. Given the variety of situations that can arise, I'll be discussing the clustered
indexes themselves, and for now ignore whether you choose to make them primary keys.
Clustered indexes actually hold the row data for SQL Server, so wherever your clustered indexes
are stored is also where your data is stored. The clustered indexes are organized into ranges of
data. For example, values 1 to 10 may be stored in one range and 90 to 110 in another range. Since
clustered indexes are stored as ranges, if you need to do a search on a range for an audit log, it
would be more efficient for the clustered index to be based on the date column that would be used
to return the date ranges. Non-clustered indexes work better for specific value searches, e.g.
"date = DateValue," rather than range searches, e.g. "date between date1 and date2."
Ever-increasing values for clustered indexes
Clustered indexes should be based on columns whose values constantly increase over time. In my
prior example on using the date column from an audit log, the date values for an audit log would be
constantly increasing and older dates would not be inserted into the table. This would be an
"ever-increasing" column. Another good example of an ever-increasing value is an identity column,
since, by design, it constantly increases.
Why am I spending so much time discussing ever-increasing values for clustered indexes? The most
important attributes of clustered indexes is that they are ever-increasing and static in nature.
The reason ever-increasing is so important has to do with the range architecture I outlined
earlier. If the values are not ever-increasing, then SQL Server has to allocate space within
existing ranges for those records rather than placing them in new ranges at the end of the
index.
If the values are not ever-increasing, then once the ranges fill up and a value comes in that
fits within a filled up index range, SQL Server will make room in an index by doing a page split.
Internally, SQL Server takes the filled up page and splits it into two separate pages that have
substantially more room at that point but take significantly more resources to process. You can
prepare for this eventuality by setting a fill factor of 70% or so, which gives you 30% free space
for incoming values.
The problem with this approach is that you continually have to "reindex" the clustered index so
it maintains a free space percentage of 30%. Reindexing the clustered index will also cause heavy
I/O load since it has to move the actual data itself and any non-clustered indexes have to be
rebuilt, adding greatly to maintenance time.
If the clustered index is ever-increasing, you will not have to rebuild the clustered index; you
can set a 100% fill factor on the clustered index, and at that point you will only need to reindex
the less-intensive, non-clustered indexes as time progresses, resulting in more up time.
Ever-increasing values will only add entries to the end of the index and build new ranges when
necessary. Logical fragmentation will not exist since the new values are continually added to the
end of the index and the fill factor will be 100%. The higher the fill factor, the more rows are
stored on each page. Higher fill factors require less I/O, RAM and CPU for queries. The smaller the
data types you pick for the clustered index, the faster the joins/queries will be. Also, since each
non-clustered index requires it to contain the clustered index key, then the smaller the clustered
index key and the smaller the non-clustered indexes will be.
The best data types for clustered indexes are generally pretty narrow. Referring to data type
size, it's typically a smallint, int, bigint or datetime. When datetime values are used as the
clustering index, they are the only column and are normally ever-increasing date values that are
often queried as range data. Generally, you should avoid compound (multiple columns) clustered
indexes except in the following situations: many-to-many tables and SQL Server 2005 partitioned
tables that have the partitioning column included as part of the clustered index to allow for index
alignment.
Many-to-many tables and clustered indexes
Many-to-many tables are used for their extremely fast join capabilities and their ability to
allow for quick re-association of records, from one owning record to another. Consider the
following structure:
Customer
|
CustomerID (bigint identity) |
Name |
Fieldn+ |
CustomerOrder
Orders
|
OrderID (bigint identity) |
Date |
Fieldn+ |
The clustered indexes in this structure would be CustomerID, OrderID. The compound would be
CustomerID/OrderID. Here are the benefits with this structure:
- The joins are all based on clustered indexes (much faster than joins to non-clustered
indexes).
- Moving an order to another customer only involves an update to the CustomerOrder table, which
is very narrow, with only one clustered index. Therefore, it reduces the blocking that would occur
if you had to update a wider table such as Orders.
- Use of a many-to-many table eliminates the need for some non-clustered indexes on the wider
tables such as Customer/Orders. Hence, it reduces the maintenance time on the large tables.
One negative result of this approach is the fragmentation that occurs on the CustomerOrder
table. However, that should not be a big issue, since the table is relatively narrow, has only two
columns with narrow data types and only one clustered index. The elimination of the non-clustered
indexes, which would be needed on the Orders table if it contained CustomerID, more than makes up
for this cost.
Clustered indexes and partitioned tables in SQL Server 2005
Partitioned tables in SQL Server 2005 are tables that appear to be a single table on the
surface, but behind the scenes ¬-- at the storage subsystem level -- they are actually multiple
partitions that can be spread across many filegroups. The table partitions are spread across
various filegroups based on the values in a single column. Partitioning tables in this manner
causes several side effects. I will just cover the basics here, to give you some understanding of
the factors involved. I recommend that you study partitioned tables before attempting to implement
them.
You can create a clustered index in this environment based on only one column. But, if that one
column is not the column the table is partitioned on,
 |
More on performance tuning and SQL Server indexes: |
|
|
|
 |
 |
then the clustered index is said to be non-aligned. If a
clustered index is non-aligned, then any snapping in/out (or merging) of partitions will require
you to drop the clustered index along with the non-clustered indexes and rebuild them from scratch.
This is necessary because SQL Server cannot tell what portions of the clustered/non-clustered
indexes belong to which table partitions. Needless to say, this will certainly cause system
downtime.
The clustered index on a partitioned table should always contain the regular clustering column,
which is ever-increasing and static, as well as the column that is used for partitioning the table.
If the clustered index includes the column used for partitioning the table, then SQL Server knows
what portion of the clustered/non-clustered indexes belong to which partition. Once a clustered
index contains the column that the table is partitioned on, then the clustered index is "aligned."
Partitions can then be snapped in/out (and merged) without rebuilding the clustered/non-clustered
indexes, causing no downtime for the system. Inserts/updates/deletes will also work faster, because
those operations only have to consider the indexes that reside on their particular partition.
Summary
SQL Server clustered indexes are an important part of database architecture and I hope you've
learned enough from this article to know why you need to carefully plan for them from the very
start. It is vital for the future health of your database that clustered indexes be narrow, static
and ever-increasing. Clustered indexes can help you achieve faster join times and faster IUD
operations and minimize blocking as the system becomes busy.
Finally, we covered how partitioned tables in SQL Server 2005 affect your choices for the
clustered index, what it means to "align" the clustered index with the partitions, and why
clustered indexes have to be aligned in order for the partitioned table concept to work as
intended. Keep watching for tips on non-clustered indexes (part two) coming in February and optimal
index maintenance (part three) in March.
SQL Server clustered and
non-clustered index design
Part 1: SQL Server clustered index design for
performance
Part 2: Designing SQL Server non-clustered indexes
Part 3: How to maintain SQL Server indexes
ABOUT THE AUTHOR
Matthew Schroeder is a senior software engineer who works on SQL Server database
systems ranging in size from 2 GB to 3+ TB, with between 2k and 40+k trans/sec. He specializes in
OLTP/OLAP DBMS systems as well as highly scalable processing systems written in .NET. Matthew is a
Microsoft certified MCITP, Database Developer, has a master's degree in computer science and more
than 12 years of experience in SQL Server/Oracle. He can be reached at [email protected].
Disclaimer:
Our Tips Exchange is a forum for you to share technical advice and expertise with your peers and to learn from other enterprise IT professionals. TechTarget provides the infrastructure to facilitate this sharing of information. However, we cannot guarantee the accuracy or validity of the material submitted. You agree that your use of the Ask The Expert services and your reliance on any questions, answers, information or other materials received through this Web site is at your own risk.
Join the conversationComment
Share
Comments
Results
Contribute to the conversation