SQL Server clustering best practices
Hilary Cotter with Geoff Hiten, Contributors
SQL Server clustering is a high-availability technology for SQL Server instances. It involves
sharing server resources among two or more nodes (or servers), which have one or more shared
disks grouped into logical units called resource groups. A resource group containing at
least one IP address, network name and disk resource is called a virtual server. The cluster
service arbitrates ownership of the resource groups; only one node can own a resource group and
its associated resources at any given time.
This was first published in July 2006
Clustering basics
Each virtual server appears on the network as a complete system. When the virtual server
contains SQL Server resources, clients connected to the virtual server access resources on its
current host node. While the terms "active" and "passive" are often used in this context, they are
not fixed roles, as all nodes in a cluster are interchangeable. Should the current host, sometimes
designated as the primary, fail, the resource group will be transferred to another node (secondary
node) in the cluster. With clusters having more than two nodes or two instances, it is important to
set failover order by choosing the preferred node ownership order for each instance. The secondary
will become the primary and host the virtual server. Active client connections will be broken
during failover, but clients can reconnect to the virtual server, now hosted by the new node.
Work in progress is lost during the failover, and clients must reconnect; most commercial
applications now handle this reconnection seamlessly, but older or custom clients may have to
reconnect manually.
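For clients that do not reconnect on their own, the reconnect can be wrapped in a small client-side retry loop. Here is a minimal Python sketch; the function and parameter names are illustrative, and the `connect` argument stands in for whatever your driver provides (for example, a `pyodbc.connect` call against the virtual server name).

```python
import time

def connect_with_retry(connect, attempts=5, delay=2.0, backoff=2.0):
    """Call `connect()` until it succeeds, waiting between attempts.

    During a cluster failover the virtual server is briefly unreachable,
    so transient connection errors are expected and simply retried.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as exc:  # in real code, catch your driver's error class
            last_error = exc
            time.sleep(delay * (backoff ** attempt))
    raise last_error
```

With pyodbc, for instance, the `connect` argument might be `lambda: pyodbc.connect("SERVER=myvirtualserver;...")`; the retry window should be a little longer than your measured failover time.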
The goal of clustering is to provide increased availability to clients by having a hot standby
system with an automatic failover mechanism. SQL Server clustering is not a load-sharing or
scale-out technology. During any failover there will be a brief interruption in database
service. On large clusters with multiple nodes and instances, clients may experience degraded
performance during a failure event, but they will not lose database availability.
Clustering topologies
There are four types of cluster topologies -- or arrangements of nodes in a cluster:
- Single instance
- Multi-instance
- N+1
- N+M
Single instance: In this case, one node in a cluster owns all resource groups at any one
time and the other nodes are offline. Should the primary node owning the resources fail, the
resource groups will be transferred to the secondary node, which comes online and assumes
ownership of the resource groups, typically the disks containing your database files and
transaction logs. SQL Server then starts on the virtual server and recovers the databases,
rolling committed transactions in the transaction log forward and uncommitted transactions back.
This topology was formerly called active-passive. Single-instance clustering is most frequently
used for mission-critical applications, where the cost of downtime far outweighs the cost of the
wasted hardware resources of the secondary node sitting idle while offline.
Multiple instance: In this situation, one virtual server in a cluster owns some of the
resource groups and another virtual server owns other resource groups. At any one time, the virtual
servers themselves can be hosted by a single node or different nodes and would appear to clients as
named instances of a single server. In fact, they are named instances of a virtual server,
hence the name multiple instance. With multiple-instance clustering, previously called
active-active, the hardware requirements of each individual node are greater as each node may at
any one time be hosting two (or more) virtual servers.
Multiple-instance clusters are more cost effective than single-instance clusters because no
node sits offline or waiting. However, should one node host more than one
virtual server, performance for clients is typically degraded. Your best bet is to use multiple
instances when you require high availability but not high performance.
N+1: This is a modification of multiple-instance clustering topologies where two or more
nodes share the same failover node. The secondary node needs enough capacity to
support the load of all N servers at any one time should they all fail over simultaneously. You can
achieve cost savings if multiple clusters use the same failover node. However, the cost of an
individual node tends to be small in comparison to other related clustering costs, such as
storage.
Many people consider N+1 to be more cost effective than multiple-instance clustering because
there is only one secondary node offline (or waiting) for several active nodes. However, depending
on the hardware configuration of the failover node, it does not offer the performance of
multiple-instance clustering. Use N+1 in environments where cost constraints force you to reduce
the number of failover nodes and you need high availability but not high performance.
N+M: Here, two or more working nodes in a cluster share two or more standby nodes. This
topology is typically configured as an eight-node cluster with six working nodes and two
standbys, or five working nodes and three standbys.
N+M offers some of the cost benefits of N+1, but it has a lower chance of performance
degradation during a multiple failure event than N+1 since the failover node(s) do not have to
support the entire load of the failed nodes. Use N+M in environments where cost constraints force
you to reduce the number of failover nodes and at the same time provide a high level of
performance.
Clustering dependencies
SQL Server clustering has several dependencies:
- Network
- Hardware
- Software
Network dependencies: Clustering requires a private network among all nodes in a cluster.
Clustering services use a private communication channel on each node to keep in sync with each
other. This allows the cluster to communicate and act appropriately even if the public network is
offline. Looks-Alive and Is-Alive checks -- used by cluster services to determine if a cluster
resource group is "up" -- connect over the public networks to best emulate a client connection
process.
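The spirit of the Is-Alive check, connecting over the public network the way a client would, can be emulated with a simple TCP probe. This Python sketch is only an illustration; the cluster service's real checks are richer than a port test (for SQL Server resources they log in and query the instance).

```python
import socket

def is_alive(host, port, timeout=2.0):
    """Return True if `host:port` accepts a TCP connection within `timeout`.

    Mimics the idea behind the cluster service's Is-Alive check, which
    connects over the public network to emulate a client connection.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A monitoring script might call this against the virtual server name and SQL Server's listening port, alerting when the probe fails repeatedly.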
Hardware dependencies: Clustering requires specialized hardware: to share disk resources
between nodes you need specialized disk controllers, and the hardware must be certified by
Microsoft as meeting the requirements of clustering. You also need a second set of network
cards to provide the private network between cluster nodes.
Software dependencies: To benefit from clustering services, you need specialized versions
of the operating system (Windows 2000 and 2003 Enterprise or Data Center editions). You will also
need SQL Server 2000 Enterprise Edition, SQL Server 2005 Standard Edition (up to two nodes) or SQL
Server 2005 Enterprise Edition (up to eight nodes).
Clustering best practices
What follows is a list of clustering best practices. I have broken these down according to
dependencies.
Network best practices
There are two different and contradictory settings required for the public network and the
private network in clustering.
Private
Ensure the private network is private. Clustering requires a ping response time of 150 ms or
less. If your private network is saturated or congested with other traffic, you may find your
clusters failing over unexpectedly. On your private network, consider isolating traffic by
implementing a VLAN (virtual LAN) or a separate subnet, or by using a crossover cable for
two-node clusters. The actual traffic generated by cluster communication is small, so
high-bandwidth networks are unnecessary; they must, however, be low latency and reliable. Make
sure the following points are established on the private network:
- Use TCP/IP as the only protocol bound to the NIC.
- No default gateway is configured.
- No DNS servers are configured unless the cluster nodes are DNS servers, in which case 127.0.0.1
should be configured.
- No DNS registration or DNS suffix is configured.
- No WINS servers are configured.
- Static IP addresses are used for all nodes.
- NetBIOS over TCP/IP is disabled.
- No NIC teaming is used (two network interface cards aggregated to act as one logical NIC).
Public
For your public network, use at least two WINS or DNS servers on your cluster network segment
or VLAN. While installing your cluster you will have to resolve cluster, DC (domain controller)
and virtual server names, so a name server must be reachable on your network. A local name
server also decreases the time required for a node to fail over.
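Since cluster setup depends on resolving cluster, DC and virtual server names, it can be worth scripting a quick resolution check before you begin. A minimal Python sketch; the list of names you pass in is your own:

```python
import socket

def check_resolution(names):
    """Map each host name to its resolved IP address, or None on failure.

    Run this with your cluster, domain controller and virtual server
    names before installing to confirm the name servers are working.
    """
    results = {}
    for name in names:
        try:
            results[name] = socket.gethostbyname(name)
        except socket.gaierror:
            results[name] = None
    return results
```

Any `None` in the result means that name will not resolve during setup and should be fixed in DNS (or WINS) first.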
Use at least two DCs on your network. Clustering requires DCs not only during setup but also for
normal functioning and failover.
If you use NIC teaming for greater bandwidth throughput and reliability, do not configure it
while building the cluster. Add NIC teaming as a last step before final testing. Be prepared to
"undo" NIC teaming as an early step in troubleshooting. Microsoft Customer Support Services (CSS)
will likely direct you to disable teaming as a first diagnostic step, so be ready.
Both
Ensure that your network card settings are identical for every server in your cluster and that
they are not configured to automatically detect network settings.
Software best practices
Ensure applications are cluster aware and will not lose work or fail to meet the SLA during a
cluster failover.
Ensure transactions are as small as possible in your application and on any jobs that may run on
your clustered SQL Servers. Long-running transactions increase the length of time required to apply
the transaction log on the failover node and consequently increase the amount of time for
failover.
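Keeping transactions small often comes down to batching large modifications. This Python sketch shows the pattern against SQLite (chosen only because it ships with Python); the table and column names are hypothetical, and on SQL Server you would typically loop over DELETE TOP (n) instead.

```python
import sqlite3

def delete_in_batches(conn, batch_size=1000):
    """Delete archived rows in many small transactions instead of one huge one.

    Small transactions keep the amount of transaction log the failover
    node must roll forward or back short, which shortens failover time.
    """
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM events WHERE rowid IN "
            "(SELECT rowid FROM events WHERE archived = 1 LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # each batch commits as its own transaction
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total
```

The same idea applies to large UPDATEs and INSERT ... SELECT loads: commit in chunks so that no single transaction leaves a long recovery tail in the log.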
Do not run antivirus software on cluster nodes. If you must run antivirus software, be sure the
quorum disk and database files are excluded from the scans. Even in this configuration, there have
been reports of antivirus drivers interfering with cluster disk resource failover. Test your setup
and make sure it fails over as expected. Select another antivirus product if yours causes problems.
Make sure there are no password expiration policies in use for any of the cluster-related
accounts. Cluster accounts should:
- be the same for all nodes in the cluster;
- be domain accounts (but not domain admin accounts) with local administrative rights on each
node in the cluster (SQL Server 2005 forces you to set up domain-level groups for these
accounts and then grants appropriate rights to the groups);
- have the least security privileges needed, to minimize the damage that could be done to the
node or other servers on your network should the password be compromised or the account
hijacked by a buffer overflow.
Ensure all software components are the same version (e.g., SQL Server 2005 Standard Edition),
the same architecture (e.g., 64-bit for all OS and SQL Server components) and at the same service pack and
hot fix level. The exception is that individual SQL Server instances can be at different releases,
editions and hotfix levels.
Ensure all external software dependencies (COM components, file paths, binaries) are either
cluster aware or installed on all nodes in a cluster. MSDTC (Microsoft Distributed Transaction
Coordinator) is the most common external dependency in a cluster. While it is not necessary, many
people install it before installing SQL Server because installing it later is much harder.
When installing a cluster, consider installing a single-node cluster and adding nodes to the
cluster as required. This way, if the cluster setup fails while adding a single node, you are left
with a working cluster (although it could be a single-node cluster).
While applying hot fixes or service packs that require a system reboot, apply them to the
primary (current instance host), fail over to the secondary, reboot the primary, fail back to
the primary and reboot the secondary. Hot fixes and service packs are typically cluster aware
and install on all cluster nodes in a single pass.
Hardware
Ensure that your cluster is approved by the vendor and that it is part of the Microsoft Windows
Catalog with a specific endorsement for clustering.
Ensure each node in your cluster has identical hardware and components.
Regularly check vendor Web sites for potential hardware problems, fixes and BIOS patches for
each component in your cluster.
Use the appropriate RAID technology to ensure that your disk array is fault tolerant. Be as
proactive as possible in replacing failed or marginal disks. A disk failure will put a greater load
on the remaining disks in an array and may cause other marginal disks to fail. Depending on your
RAID technology, your RAID array may not be tolerant to more than one disk failure per array.
Ensure you have properly conditioned or charged batteries on your array controllers; they
prevent data loss or corruption in the event of a power failure.
Use uninterruptible power supplies and be sure you have redundancy in your power supplies.
Use Hot-Add Memory if it's supported by your SQL Server version, operating system and hardware.
Hot-Add Memory is a hardware technology that allows you to add memory to a running system; the OS
detects and uses the additional memory. Windows Server 2003, Enterprise and Data Center Editions,
as well as SQL Server 2005 Enterprise Edition can take advantage of Hot-Add Memory. Read about Hot-Add Memory Support in
Windows Server 2003.
Use ECC (Error Correction Code) memory chips, which store parity information used to reconstruct
original data when errors are detected in data held in memory.
Use fault-tolerant NICs and network devices (switches).
Summary
Clustering is a relatively new technology and has a reputation for being fragile. SQL Server
2000 clustering is far simpler than the earlier versions and has proven to be much more reliable.
Today, clustering on SQL Server 2000 and SQL Server 2005 is a highly reliable technology, but it
still has many dependencies that, if neglected, can prevent it from meeting your
high-availability goals. Foremost
among these dependencies is a staff that is trained and knowledgeable. Running a close second is
having operating processes and procedures that are designed to work specifically with a SQL Server
cluster. Ensure that you address all of your clustering dependencies to deliver high availability
with SQL Server clustering.
About the author: Hilary Cotter has been involved in IT for more than 20 years as a
Web and database consultant. Microsoft first awarded Cotter the Microsoft SQL Server MVP award in
2001. Cotter received his bachelor of applied science degree in mechanical engineering from the
University of Toronto and subsequently studied economics at the University of Calgary and computer
science at UC Berkeley. He is the author of a book on SQL Server transactional replication and is
currently working on books on merge replication and Microsoft search technologies.