Optimize T-SQL data types in SQL Server
Matthew Schroeder, Contributor
Many SQL Server developers do not give data types a lot of thought when they design database
schemas. Initially it doesn't seem to matter much and the performance looks pretty good, but, over
time, the system slows down. Let's take a look at the various SQL Server data types, the effects
they have on performance, and what you should consider when choosing a data type.
Optimizing data types means choosing them carefully, because they have a huge impact on I/O, CPU and RAM consumption. A larger data type takes up more room in the cache (RAM), resulting in a lower cache hit ratio, which causes more fetches from disk (more I/O). That in turn means heavier CPU load, longer update times, longer lock durations and increased contention on the table, which will show up as blocking.
This was first published in December 2007
Indexes defined on larger data types take longer to scan/seek because more
pages have to be read, which will impact RAM, CPU and I/O. Larger data types also increase
maintenance times on the various indexes, yielding longer "maintenance windows."
Primary key considerations
In SQL Server, the clustered index defaults to the primary key columns, so the data types you choose for the key matter doubly. For this discussion, we will assume that your primary key is the same as the clustered index, since that is the preferred approach.
Any data type you pick for the primary key (usually also the clustered index) is included within
every non-clustered index as a reference back to
the original row. For example, if you pick the data type
datetime as the primary key rather than an int type, any non-clustered indexes will be an
additional 4 bytes larger per row because the non-clustered index has to hold the primary key
value. In a table with two billion rows, those 4 extra bytes amount to roughly 7.5 GB per non-clustered index. You'll see the impact on CPU, I/O and RAM quickly accumulate in this scenario.
Smaller data types make it easier -- i.e., faster and less resource intensive -- for SQL Server
to ensure that any inserted or updated values are unique against the existing table values.
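The effect is easy to see in a schema sketch. The table and index names below are hypothetical, but the byte math is not: the clustering key is duplicated into every non-clustered index row.

```sql
-- Hypothetical schema: a narrow int clustering key keeps every
-- non-clustered index 4 bytes smaller per row than a datetime key would.
CREATE TABLE dbo.Orders (
    OrderID    int IDENTITY(1,1) NOT NULL,  -- 4 bytes
    OrderDate  datetime          NOT NULL,  -- 8 bytes
    CustomerID int               NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID)
);

-- Each row of this index silently carries the 4-byte OrderID clustering
-- key; clustering on OrderDate instead would add 8 bytes per row here.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID);
```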
Individual data type considerations
We will just cover the commonly misused data types here.
- Datetime – When storing a date and time value, most people default to the datetime data
type without considering smalldatetime. The datetime type is stored as two 4-byte integers, accurate to 3.33
milliseconds, with a range of Jan 1, 1753 to Dec 31, 9999. However, for many database operations, this
kind of range and precision is not necessary. The smalldatetime data type is stored in 4 bytes,
accurate to one minute, with a range of Jan 1, 1900 to Jun 6, 2079.
- Many datetime values can be stored as a smalldatetime when the accuracy requirements are not
precise to the second/millisecond and the dates fall within a range of a few decades. The benefit of
smalldatetime is apparent: at half the space, the table is much more efficient. Datetime data
types should not be used within a primary key unless frequent searches are done on date
ranges.
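As a sketch (the table is hypothetical), an audit table that only needs minute-level precision can use smalldatetime and halve the storage of each timestamp:

```sql
CREATE TABLE dbo.LoginAudit (
    AuditID   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    -- Minute precision and dates between 1900 and 2079 are enough
    -- here, so smalldatetime (4 bytes) replaces datetime (8 bytes).
    LoginTime smalldatetime NOT NULL
);
```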
- Bit vs. Char(1) – One bit column takes up 1 byte, up to 8 bit columns still share that
single byte, and a char(1) takes 1 byte of its own. It could be argued that using a char(1) to
store Y/N is more readable, but a bit column is far more efficient once the table contains more
than one such flag. You'll also eliminate the possibility of someone putting an A/B/C/etc. in the
char(1) field, leading to incorrect data/functionality. Generally, it is considered better
practice to use the bit data type. Even if the table contains only one bit column today, it
leaves room to add more bit columns later, with no need to modify the data type/data on an
existing char(1) column to reach the optimal storage layout.
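A hypothetical example of packed bit flags:

```sql
CREATE TABLE dbo.Customers (
    CustomerID  int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    -- Up to 8 bit columns share a single byte of row storage, and the
    -- type itself rules out stray 'A'/'B' values a char(1) would accept.
    IsActive    bit NOT NULL CONSTRAINT DF_Customers_IsActive    DEFAULT 1,
    IsPreferred bit NOT NULL CONSTRAINT DF_Customers_IsPreferred DEFAULT 0
);
```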
- Integer – Integer types consist of tinyint (0 to 255, 1 byte), smallint (-32,768 to 32,767, 2
bytes), int (-2,147,483,648 to 2,147,483,647, 4 bytes) and bigint
(-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807, 8 bytes). These data types are often used in
primary keys and non-clustered indexes, so it's important to pick the smallest data type that
will hold the values.
- It's tempting to just pick a large data type in SQL Server, but it's better to plan for the
needs of the next 5 to 10 years. You can always convert to a larger data type later, when CPUs,
I/O and RAM have more capacity to support it. Also, since this data type is often used in primary
keys (and consequently non-clustered index keys), the smaller the data type, the less work SQL
Server will have to do during index maintenance.
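For example (hypothetical table), each column gets the smallest integer type whose range comfortably covers the planned values:

```sql
CREATE TABLE dbo.Product (
    ProductID    int      NOT NULL PRIMARY KEY, -- millions of rows expected: int, not bigint
    CategoryID   tinyint  NOT NULL,             -- fewer than 256 categories planned
    WarehouseQty smallint NOT NULL              -- quantities stay well under 32,767
);
```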
- UniqueIdentifier – This data type is stored in 16 bytes, so it's very wide; it is used to
store GUIDs. It is commonly used as a primary key due to its unique nature. It is generally
easier to merge two (or more) existing databases that use UniqueIdentifiers as primary keys than
ones that use int types, since the latter would require every row to be renumbered for a
database merge.
With that in mind, be careful with this data type, since often you will have a UniqueIdentifier
as a primary key along with several UniqueIdentifier foreign keys, making all the indexes very
large, resource intensive and much more difficult to maintain. The increased row/index sizes also
increase the overall load on the server. UniqueIdentifiers make it harder for SQL Server to check
for "uniqueness" on inserts/updates, and that puts more pressure on the server. Generally, for
high-transaction OLTP systems, you are better off avoiding the use of UniqueIdentifiers as primary
keys.
When this data type is used as a primary key, you should use the NewSequentialID() function
rather than NewID(). NewSequentialID() always generates a GUID greater than any it has
previously generated on that machine, resulting in an "ever-increasing" clustered index, which
largely eliminates fragmentation on the clustered index. The NewID() function, by contrast,
generates random values and will cause heavy fragmentation on a clustered index due to page
splits.
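A minimal sketch (table name hypothetical); note that NewSequentialID() can only be used in a DEFAULT constraint, not called inline:

```sql
CREATE TABLE dbo.Sessions (
    SessionID uniqueidentifier NOT NULL
        CONSTRAINT DF_Sessions_SessionID DEFAULT NEWSEQUENTIALID(),
    CONSTRAINT PK_Sessions PRIMARY KEY CLUSTERED (SessionID)
);
-- New rows now land at the logical end of the clustered index,
-- avoiding the page splits that random NEWID() values cause.
```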
- Decimal – In SQL Server 2005 SP2 and greater (Enterprise, Developer and Evaluation
Editions only), you can turn on the vardecimal option (database properties/Options/VarDecimal
Storage Format Enabled). The vardecimal option varies the storage required for the decimal data
type, depending on the actual data stored in the column. You first enable it at the database
level; individual tables then opt in through the sp_tableoption 'vardecimal storage format'
option.
- In order to determine the space that will be saved on a particular table, run the
sp_estimated_rowsize_reduction_for_vardecimal (located in master) system stored procedure. No
changes need to be made on the column to use vardecimal and it will have no functional impact on
code, since it is merely performance-related functionality.
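The sequence looks roughly like this (the database and table names are placeholders):

```sql
-- Allow vardecimal storage in the database (SQL Server 2005 SP2+,
-- Enterprise/Developer/Evaluation editions only)...
EXEC sp_db_vardecimal_storage_format 'MyDatabase', 'ON';

-- ...then estimate the row-size reduction for a candidate table.
EXEC sp_estimated_rowsize_reduction_for_vardecimal 'dbo.Sales';
```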
- (n)Char(n) vs. (n)Varchar(n) – If a text value is always fixed in length, then it's best
to use char(n). Varchar data types require an additional 2 bytes of storage to store offset
information, so that space is wasted if the text size is fixed.
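For instance (hypothetical table), a fixed two-character code belongs in char(2), while genuinely variable text earns its varchar overhead:

```sql
CREATE TABLE dbo.States (
    StateCode char(2)     NOT NULL PRIMARY KEY, -- always exactly 2 chars: no offset bytes
    StateName varchar(50) NOT NULL              -- varies in length: the 2-byte overhead pays off
);
```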
- (n)Varchar(n), (n)Varchar(Max), VarBinary(n), VarBinary(max), (n)Text, and Image-- When
you plan to store large values, it is very important to pick the correct data type. These data
types have the potential to be an issue, depending on the sizes chosen. If you expect the text/data
size to be fairly small, then it is recommended that you store the data on the data row, meaning
you would want to use (n)Varchar(n) or VarBinary(n), which makes it subject to the 8k limit. If the
data is potentially large, then (n)Varchar(Max) or VarBinary(max) should be used, as the
VarBinary(max), (n)Varchar(Max), (n)Text, and Image data types can be stored "out of row."
Storing large object data "out of row" means that the regular data row merely contains a pointer
to another data row that stores the large object data, thus eliminating the 8k limit for the large
object and resulting in smaller data row sizes. If a query does not require/return the large
object, then it just scans/seeks on the smaller data rows, returning results much faster because
it traverses fewer data pages than if the large object were stored "in row."
If a query does require/return the large object, then it scans/seeks for the smaller data row
and uses the pointer on that row to return the large object. You are better off not using the
(n)Text or Image data types, replacing them with (n)Varchar(Max) and VarBinary(max),
respectively.
If you choose to use the VarBinary(max), (n)Varchar(Max), (n)Text, or Image data type then it is
best if the option to store the value "out of row" is chosen. This setting is controlled by calling
the system stored procedure sp_tableoption N'MyTable', 'large value types out of row', 'ON'.
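Putting the pieces together (table and column names are hypothetical):

```sql
CREATE TABLE dbo.Documents (
    DocID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Title varchar(200) NOT NULL,
    Body  varchar(max) NOT NULL  -- potentially large, so (max) instead of (n)
);

-- Push large Body values off the data row, keeping the main rows small:
EXEC sp_tableoption N'dbo.Documents', 'large value types out of row', 'ON';
```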
The size of the data rows and indexes affects a wide range of hardware resources, and SQL Server
performance improves dramatically as you reduce the size of the rows and indexes. A few quick
points to remember: always pick the smallest data type that fits, and use bits and integers (the
smaller the integer type, the better) with lookup tables where possible, avoiding (n)varchar
keys.
If you have to store large text/binary values, consider defining them as (Max) types and
setting the option to store the data "out of row." If the text data is fixed length, always
use (n)char(n) data types so 2 bytes are not wasted storing offsets. If your primary concern is
performance, avoid using any data types other than integers for primary/foreign keys, as this
will significantly reduce your data row and index sizes. SQL Server data types seem like a small
issue at design time, but they will have a huge impact on your system as it grows organically
over time.
ABOUT THE AUTHOR
Matthew Schroeder is a senior software engineer working on SQL Server database systems,
ranging in size from 2 GB to 3+ TB, with between 2k and 40+k trans/sec. Matt currently works for
the gaming vendor, IGT, providing services to gaming companies. He also works as an independent
consultant, specializing in SQL Server, Oracle and .NET for industries such as gaming, automotive,
e-commerce, entertainment, banking and non-profit. Matt specializes in OLTP/OLAP DBMS systems as
well as highly scalable processing systems written in .NET. He is a Microsoft certified MCITP:
Database Developer, has a master's degree in Computer Science, and has more than 12 years of
experience in SQL Server/Oracle. He can be reached at [email protected].