Interacting With Data Using Multiple Databases/Servers

Question

All of the projects that I've had to deal with so far have only required a single database on a single server. I'm interested in learning more about how projects that need to scale move to multiple databases and/or servers to help manage the load. I'm aware of High Scalability, but I'm particularly interested in some code examples or additional resources where I could read more on the subject.

For instance:

How are joins constructed between two tables on multiple databases? (A code example here would be helpful).
Are there any special strategies for tracking which tables are in which database?
Does the application code need to know that one or more databases are spread across multiple servers? If not, at what level are the requests filtered?
When is it time to move beyond a 1 database/1 server setup? How common is it to need to do this?

This question might be better answered on Database Administrators. Nothing really wrong with it here either, though, so I'm just gonna check with DBA mods. If it's suitable there, would you like it migrated? — Anna Lear♦, Nov 8 '11 at 22:05
@AnnaLear - I guess it kind of depends on the answers. At this point, I'm more interested in the application side of the issue, so for now, I think it might be better here. — VirtuosiMedia, Nov 8 '11 at 22:12
@AnnaLear ack, agree with the OP then if they want app specific code. — jcolebrand, Nov 8 '11 at 22:57

TetonSig · Accepted Answer · 2012-02-13 01:32:08Z

Ok, let’s break it down:

How are joins constructed between two tables on multiple databases? (A code example here would be helpful).

This is pretty straightforward. In SQL Server, object names have anywhere from one to four parts:

Servername.databasename.schemaname.tablename

If all your tables are on the same server, in the same database, with the same owner/schema, you can just ignore the first three parts and use what you are most used to:

Select a.*,b.* from 
tableA a inner join 
tableB b on a.col1=b.col1

If one of your tables is in a different database and both use the default schema for their databases, then you simply add the database to the second table:

Select a.*,b.* from 
tableA a inner join 
databaseC..tableB b on a.col1 = b.col1

If you happen to be in a third database different from either of the ones you are querying you use both database names explicitly:

Select a.*,b.* from 
databaseD..tableA a inner join 
databaseC..tableB b on a.col1 = b.col1

If you end up using different schemas and/or owners you can add those in:

Select a.*,b.* from 
databaseD.john.tableA a inner join 
databaseC.accounting.tableB b on a.col1 = b.col1

And lastly, if you are very careful about it and have a very good reason, you can join a (usually small) table on another server (in SQL Server, one that has been set up as a linked server):

Select a.* from 
databaseD.john.TableA a inner join 
ATLANTA.databaseC.accounting.tableB b on a.col1 = b.col1
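
For anyone who wants something runnable without a SQL Server instance, here is a minimal sketch of the same idea in Python. It uses SQLite's ATTACH DATABASE as a stand-in for a second database on the same server. SQLite, the in-memory databases, the extra columns, and the sample rows are all assumptions made for illustration; only tableA, tableB, col1, and databaseC come from the queries above.

```python
import sqlite3

# One connection whose "main" database plays the role of the current database.
conn = sqlite3.connect(":memory:")
# Attach a second, separate in-memory database under the name databaseC.
conn.execute("ATTACH DATABASE ':memory:' AS databaseC")

conn.execute("CREATE TABLE main.tableA (col1 INTEGER, name TEXT)")
conn.execute("CREATE TABLE databaseC.tableB (col1 INTEGER, amount REAL)")
conn.executemany("INSERT INTO main.tableA VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.executemany("INSERT INTO databaseC.tableB VALUES (?, ?)", [(1, 10.5), (3, 99.0)])


def cross_database_join(connection):
    """Join tableA in the main database with tableB in the attached one."""
    sql = """
        SELECT a.col1, a.name, b.amount
        FROM main.tableA AS a
        INNER JOIN databaseC.tableB AS b ON a.col1 = b.col1
        ORDER BY a.col1
    """
    return connection.execute(sql).fetchall()


rows = cross_database_join(conn)
```

The qualified name (database.table here, database.schema.table in SQL Server) is the only thing that changes in the query; the join itself is ordinary SQL.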

When is it time to move beyond a 1 database/1 server setup? How common is it to need to do this? Are there any special strategies for tracking which tables are in which database?

I’ll combine these two because they go together. It is almost always fine to start with the assumption that one database on one server is enough, until your design, business, or technical constraints force you to use more.

So to answer your second question first, since you generally have a reason for having separate databases, it should be fairly obvious from knowing the design of your system where something is.

As to when and why it’s necessary to move beyond a single database: usually it’s a mix of business rules, politics, and/or technical reasons.

For instance, where I work we have 16 databases spread across 4 servers, including MainDB, ImageDB, referencetableDB, HighvolumeTransactionDB, ReportingDB, StagingDB, ProcessingDB, ArchiveDB, and FinancialDB. To give some examples of why they are separate:

FinancialDB: sensitive information.
ImageDB: specific, different storage and recovery requirements.
ReferenceDB: low transaction volume, high read volume.
ReportingDB: very high read volume; needs to be restored/replicated to various other environments, unlike a lot of the other data.
StagingDB: nothing permanent, just a beefed-up tempdb that we have more control over.
MainDB: interfaces with all the other DBs but needs differential backups, so we split the HighVolumeTransaction tables (which are relatively transient) out to their own DB to keep the backup a reasonable size.
ArchiveDB: lots of the same data from Main and Reporting, but with longer retention periods and harder-hitting queries digging deep into the data. If this were still combined with Main/Reporting it would bog our system down.

Does the application code need to know that one or more databases are spread across multiple servers? If not, at what level are the requests filtered?

In a broad sense, it probably does. At a minimum, the application needs to know which server it is pointing at in the database connection string: Processing, Reporting, Main, etc.

From there, it needs a database context to execute under. Generally that would be the one the application uses most, maybe even the original database from its one database/one server days. You CAN have the application explicitly switch database context on every call, but that makes it very hard to adjust the database without changing the app.

The usual (or at least, MY usual) approach is to always access data through one or maybe two main databases, create views into the other databases as necessary, and have the application talk to the database through stored procedures.

So to illustrate:

Let’s say you want to get a Client’s demographic information, Sales data and Credit balance and that’s spread across three tables originally all in the MainDB.

So you write a call from your app:

Select c.ClientName, c.ClientAddress, s.totalSales,f.CreditBlance from
Clients c join Sales s on c.clientid = s.clientid inner join AccountReceivable f on 
c.clientid=f.clientid where c.clientid = @clientid

Awesome. However, now any time we change a column name, or rename/move a table, we have to update the app code. So instead we do two things. First, create views for Clients, Sales, and AccountReceivable (you wouldn’t use Select * in practice, but I’m demoing here):

Use MainDB
GO
-- CREATE VIEW must be the first statement in its batch, so each one gets its own GO
Create view v_Clients as select * from Clients
GO
Create view v_Sales as select * from Sales
GO
Create view v_AccountReceivable as select * from AccountReceivable
GO

Then we’d also create a stored procedure, spGetClientSalesAR

Create proc spGetClientSalesAR @clientID int
as
Select c.ClientName as ClientName, 
       c.ClientAddress as ClientAddress, 
       s.totalSales as TotalSales, 
       f.CreditBlance as CreditBalance 
from
v_Clients c join v_Sales s 
    on c.clientid = s.clientid 
inner join v_AccountReceivable f 
    on c.clientid=f.clientid 
where c.clientid = @clientid

And have your app call that.

Now as long as I don’t change the interface on that stored proc, I can pretty much do anything I need to do to the backend database to scale up or out.

In the extreme, I could even turn my old MainDB into nothing but a shell of stored procedures and views, so that the definitions underneath the views we created look like this:

Create view v_Clients as select * from ServerX.DatabaseY.dbo.Clients
Create view v_Sales as select * from ServerQ.DatabaseP.dbo.Sales
Create view v_AccountReceivable as select * from ServerJ.DatabaseK.dbo.AccountReceivable

And your app would never know the difference, (assuming fast pipes and well staged data among other things).
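
To make the "app would never know the difference" point concrete, here is a small runnable sketch in Python with SQLite standing in for SQL Server. It is an illustration under assumptions, not the setup described above: SQLite has no stored procedures, so a Python function plays that role, and SQLite only lets TEMP views reference an attached database, so the views are created as TEMP views. The sample row and the reduced column list are made up; v_Clients, Clients, and DatabaseY are borrowed from the snippets above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                       # stands in for MainDB
conn.execute("ATTACH DATABASE ':memory:' AS DatabaseY")  # stands in for a second database

conn.execute("CREATE TABLE main.Clients (clientid INTEGER, ClientName TEXT)")
conn.execute("INSERT INTO main.Clients VALUES (1, 'Acme')")
conn.commit()


def get_client_name(connection, client_id):
    """The application's only query: it knows v_Clients and nothing else."""
    row = connection.execute(
        "SELECT ClientName FROM v_Clients WHERE clientid = ?", (client_id,)
    ).fetchone()
    return row[0] if row else None


# Step 1: the view points at the local table.
conn.execute("CREATE TEMP VIEW v_Clients AS SELECT * FROM main.Clients")
before = get_client_name(conn, 1)

# Step 2: move the data to the other database and repoint the view.
conn.execute("CREATE TABLE DatabaseY.Clients AS SELECT * FROM main.Clients")
conn.execute("DROP VIEW v_Clients")
conn.execute("DROP TABLE main.Clients")
conn.execute("CREATE TEMP VIEW v_Clients AS SELECT * FROM DatabaseY.Clients")
after = get_client_name(conn, 1)
```

get_client_name is untouched between the two steps, which is the whole benefit: the table moved, the view was redefined, and the caller kept working.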

Obviously that’s extreme and I’d be lying if I said everything was planned this way, but using stored procedures/views even if you do it while refactoring will allow you a lot of flexibility as your app grows from its humble one database/one server beginning.

TetonSig - Thanks for the answer. I wasn't able to get back to the question in time to award you the full bounty (I was travelling), but I created a new bounty for the question and will be able to award it to you in 24 hours. — VirtuosiMedia, Feb 20 '12 at 16:19
Wow, thanks. I appreciate that. It was a lot of fun answering the question. — TetonSig, Feb 20 '12 at 16:26

GrandmasterB · Answer 2 · 2011-11-08 22:15:24Z

The primary way I've encountered multiple database servers in the web-world (since the question is tagged PHP) is setups where there was one 'master' (write) database, and then one or more replicated 'slave' (read) databases. Database writes are performed against the 'master' database. The contents of that database are replicated to the 'slave' servers in near real-time. Queries - particularly intensive reports - are then run against one of the 'slave' databases to shift the load to those servers. Keep in mind, that particular setup is best for applications that have a lot of reads, but not a lot of writing. It is by no means the only way to arrange things.
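
As a rough sketch of how the application side of that split can look, the Python class below picks a target server per statement: writes go to the primary, reads rotate across the replicas. The class name, the keyword list, and the server names are hypothetical, and a real setup would hand back actual connections rather than strings.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the primary and spread reads across replicas."""

    WRITE_KEYWORDS = ("INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP")

    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over replicas; fall back to the primary if there are none.
        self._replica_cycle = itertools.cycle(replicas or [primary])

    def target_for(self, sql):
        """Return the server that should run this statement."""
        first_word = sql.lstrip().split(None, 1)[0].upper()
        if first_word in self.WRITE_KEYWORDS:
            return self.primary
        return next(self._replica_cycle)


router = ReadWriteRouter("db-primary", ["db-replica-1", "db-replica-2"])
targets = [
    router.target_for("INSERT INTO orders VALUES (1)"),
    router.target_for("SELECT * FROM orders"),
    router.target_for("select count(*) from orders"),
    router.target_for("UPDATE orders SET total = 5"),
]
```

Because replication is only near real-time, a read that must see a write the same request just made is usually sent to the primary as well; this sketch does not handle that case.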

Aaronaught · Answer 3 · 2011-11-08 22:21:31Z

How are joins constructed between two tables on multiple databases? (A code example here would be helpful).

They're not. NoSQL databases don't do "joins" at all, and even if you could do a SQL join across RDBMS servers, you wouldn't want to if you value performance (cf. fallacies of distributed computing).

Are there any special strategies for tracking which tables are in which database?

In a relational/SQL database, partitioning is normally done within the confines of a single server/database, using different files placed on different disks. Almost by definition a horizontal scaling solution means that all databases have all the tables and you have some sort of transactional mirroring, replication, or custom eventual-consistency solution to make sure all the data gets to where it's supposed to.

If you're actually splitting the database up logically and not just physically, then the mappings defined in your DAL or ORM will declare which tables are in which database.
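
In its simplest form, such a mapping is just a lookup table in the data access layer. The sketch below is hypothetical: the table and database names are invented for illustration, and an ORM would normally express the same thing through its own configuration rather than a hand-written dictionary.

```python
# Hypothetical mapping from logical table name to the database that owns it.
TABLE_LOCATIONS = {
    "Clients": "MainDB",
    "Sales": "ReportingDB",
    "AccountReceivable": "FinancialDB",
}


def qualified_name(table, schema="dbo"):
    """Return database.schema.table for a known table, or raise KeyError."""
    database = TABLE_LOCATIONS[table]
    return f"{database}.{schema}.{table}"


def databases_for(tables):
    """List the distinct databases a query over these tables would touch."""
    return sorted({TABLE_LOCATIONS[t] for t in tables})
```

Keeping the mapping in one place means that moving a table to another database is a one-line change in the DAL instead of a search through every query.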

NoSQL databases are a mix of partitioning solutions. Sometimes it's the "tables" (or more commonly, "collections") that get partitioned. Other times it's the "rows" (or "documents"). In some cases it's actually the columns, as in a column-oriented database like HBase. It totally depends on the technology you're using. The one thing that these all have in common is that the engine itself keeps track of it all, so all you have to do is request a document or row.

That is of course assuming you're actually making use of the sharding features and not just creating a bunch of different databases. If you're doing the latter, then you're on your own.
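
For a sense of what "on your own" involves, here is a minimal Python sketch of the routing step a sharding engine would otherwise do for you: mapping a row or document key to one of several servers. The shard names and the choice of SHA-256 with a modulo are assumptions for illustration, not how any particular product does it.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical server names


def shard_for(key, shards=SHARDS):
    """Map a row or document key to one shard using a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and wrap it onto the shard list.
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

A plain modulo like this remaps most keys whenever the number of shards changes, which is one reason engines with built-in sharding tend to use more elaborate schemes.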

Does the application code need to know that one or more databases are spread across multiple servers? If not, at what level are the requests filtered?

If they are different logical databases, yes. If they are only physically distributed then no - assuming that either your specific database natively supports sharding or you use a load balancing solution (for SQL databases). Also assuming that all of your operations are stateless; if you want horizontal scaling, you're going to have to give up ACID.

When is it time to move beyond a 1 database/1 server setup? How common is it to need to do this?

It's time when you've optimized everything you possibly can on the one server and still can't squeeze out enough performance due to constraints on the I/O load. If you have to ask the question, then it's too early.

Note that performance problems in a decent RDBMS product (Oracle, SQL Server) are more frequently due to poor design, poor indexing, poor queries, lock contention, and so on; these products can scale vertically to a ridiculous degree. So again, you should consider "moving beyond a 1 database/1 server setup" when you are absolutely certain that your performance problems are due to hardware limitations and not just a sub-par design/implementation.

Or, I guess, another reason some people switch to distributed databases is when they're not prepared to pay much (or any) money in licensing fees and want to ditch SQL as a conscious choice to trade the low cost for increased application complexity. Totally valid reason if you're a software startup but usually not applicable in the corporate sector.

+1 - I wasn't really considering NoSQL, but this is helpful all the same. Thanks. — VirtuosiMedia, Nov 8 '11 at 22:30

Henrik · Answer 4 · 2012-02-12 20:07:27Z

There are three major types of replication configurations for databases:

Master-Slave
Master-Master
Consensus

Master-Slave example: MySQL master + MySQL slaves, MongoDB

Master-Master example: CouchDB, Cassandra, Riak

Consensus example: ScalienDB

...to name a few.

These have different characteristics. Master-slave configs allow slave nodes to catch up with the master at their maximum rate while serving read requests very rapidly, while the master server is responsible for data integrity. Because all writes go to the master, readers on the slaves are never blocked by lock contention from a single, relatively slow writer. On the other hand, the slave servers are only eventually consistent, and you don't get the transaction isolation guarantees that you would have from reading only from the master. (further reading: ACID vs BASE, transaction isolation levels, database replication, MVCC/Isolation: Snapshot, Transactional Replication)

Master-master always allows writes on any node, so you'd have multiple authorities on what is true. This may or may not be a problem, depending on what your application is doing, but if you write conflicting data you may get multiple results the next time you read that key/row/column, which you'll have to merge with application logic and save back to the database. (further reading: CAP theorem, CouchDB replication, Riak replication, consistent hashing, Bitcask & StormDB, quorum with MongoDB on network split, merge resolution strategies)
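
Here is a toy Python sketch of that read, merge, and save-back cycle. Plain dictionaries stand in for two master nodes, and the merge rule (set union of shopping-cart items) is just one possible application-level strategy chosen for illustration; all names are hypothetical.

```python
def read_siblings(replicas, key):
    """Collect the distinct values that the replicas hold for one key."""
    seen = []
    for replica in replicas:
        value = replica.get(key)
        if value is not None and value not in seen:
            seen.append(value)
    return seen


def merge_carts(siblings):
    """Application-level merge: union of items, sorted for stable output."""
    merged = set()
    for cart in siblings:
        merged.update(cart)
    return sorted(merged)


# Two masters accepted different writes for the same key.
node_a = {"cart:7": ["book", "pen"]}
node_b = {"cart:7": ["book", "lamp"]}

siblings = read_siblings([node_a, node_b], "cart:7")
resolved = merge_carts(siblings)
for node in (node_a, node_b):  # save the merged value back to every node
    node["cart:7"] = resolved
```

A union merge never loses an added item but cannot express a removal, which is why the merge strategy has to be chosen per data type.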

Consensus-based databases with replication across nodes, such as Scalien, would always be consistent on writes, but at the cost of exchanging multiple messages before ACKing the write. This isn't much of a problem if you have fast Ethernet and you don't need to write to disk before ACKing, which you won't need if your minimum of three servers are on different server racks with separate power supplies (if one dies, the other two make sure they've saved to disk). (further reading: Paxos, Paxos Commit, two-phase commit with distributed transactions, three-phase commit)
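
The "minimum of three servers" arithmetic can be sketched in a few lines of Python. This is only the majority-acknowledgement rule, not Paxos or any real consensus protocol: there are no rounds, leaders, or recovery, and the node dictionaries are hypothetical stand-ins for servers.

```python
def majority(n_nodes):
    """Smallest number of nodes that forms a majority."""
    return n_nodes // 2 + 1


def write_with_quorum(nodes, key, value):
    """Apply the write only if a majority of nodes can acknowledge it."""
    reachable = [node for node in nodes if node["up"]]
    if len(reachable) < majority(len(nodes)):
        return False  # not enough acknowledgements, so nothing is applied
    for node in reachable:
        node["data"][key] = value
    return True


nodes = [
    {"up": True, "data": {}},
    {"up": True, "data": {}},
    {"up": False, "data": {}},
]
ok_with_one_down = write_with_quorum(nodes, "k", "v1")
nodes[1]["up"] = False
ok_with_two_down = write_with_quorum(nodes, "k", "v2")
```

With three nodes the majority is two, so the cluster keeps accepting writes with one node down and refuses them with two down.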

Misc further reading: (book: 'Elements of Distributed Computing', vector clocks, version vectors, matrix vectors, logical clocks, bakery algorithm, interval tree clocks, actors and reactive programming and reactors, software transactional memory, transactors, AKKA, Stact, fallacies of distributed computing, gossip protocols, Cassandra's anti-entropy gossip protocol extensions, distributed hash tables, papers on merging data in a distributed setting, ZooKeeper architecture, InfoQ-presentation on "asynchronous protocol", HBase architecture, MapReduce paper, Amazon Dynamo paper that started all NoSQL-stuff, queueing, rabbitmq high-availability clustering)

I hope I gave some food for thought :). You can follow me on twitter @henrikfeldt if you want tweets about this stuff, too.

Henrik · Answer 5 · 2012-02-12 20:26:17Z

OK, so here's another viewpoint on scalability.

Let's discuss what it means for things to be data, what it means to have behaviour and what it means to have application logic.

Normally, when one ventures into the land of enterprise applications and the like, one would have exposure to the idea of layering. Of course, layering is all over the place in computers, such as in the networking stack (the OSI model), or in graphics (Photoshop), or in SOA (services may call siblings or children, but never parents).

However, the specific type of layering that has been abused with no regard whatsoever is that of the 'GUI', 'Business Logic Layer' and 'Data Access Layer'. I mean, yeah, the idea is good in principle, like communism is good in principle, but in reality it's not.

Let's have a look at why. The argument I'm going to use is about coupling: points in one layer that touch points in another layer. Whenever people start creating an n-tier (aka layered) app in the default enterprisey mode, they create sooo many points of contact between the layers.

At its core, the idea is that layers are interchangeable; but they are not! Why? Because of all the call-site coupling.

Instead, have a look at why networking is decoupled! Because the interface is a byte stream over a single file pointer that points to an open socket! The layers in the OSI model relate to each other the way the 'chain of responsibility' design pattern works in object orientation: each layer wraps the underlying layer, without knowing the semantics of the data in that underlying layer.

As a packet of data walks towards Ethernet and raw electrical signals at the bottom, it gets continuously wrapped by layers, each of which knows only its own specific message envelope, its own specific 'batch of bytes' that it can send, and nothing else. No layer needs to alter call paths depending on the contents of the packet.

Contrast this to n-tier where you would have to alter call-path in your application layers on a 'call' traversing your layers on its way to the database - for example, 'gold customers' are polymorphically a superset of 'normal customers' and so because we use 'table-per-subclass' we need to know about this now that the data (entity) is traversing the layers; both in the so called 'business logic layer' and in the data layer which is actually doing the saving.

It's neither scalable nor optimal from a computing perspective.

Why is it not scalable? Because the architecture is coupled, and then you're still inside the same old DB that you were trying to scale out to many nodes! But, because you need ACID for this, that and a third entity (data object) you need to have them in a single database that does transactions!

Righty, so with that rant out of the way; what other ways are there?

Well, there's the hated acronym 'SOA', i.e. service-oriented architecture. Of course, the Thomas Erls of the world would have you implement all your layers but with XML and SOAP instead.

For all of the above reasons, this is the wrong way to go, because you'd be coupling yourself to those XML proxies just like you'd couple yourself to the application layers as explained above.

Instead, use messaging, and let whatever implements the functionality listen for the messages. Your service surface then becomes a list of messages that you can send, and you haven't coupled your operations to your service facade. You don't even need to know what application or endpoint implements these operations, because all you're doing is publishing a message that some other routing mechanism will route to the correct consumer!
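
A bare-bones, in-process sketch of that publish/route/consume shape in Python is below. The MessageBus class, the message names, and the handlers are all hypothetical; a real system would put a broker and a network between the publisher and the consumers.

```python
from collections import defaultdict


class MessageBus:
    """Minimal in-process publish/subscribe router."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, message_type, handler):
        """Register a consumer for one type of message."""
        self._handlers[message_type].append(handler)

    def publish(self, message_type, payload):
        """Deliver the payload to every subscriber; return how many received it."""
        handlers = self._handlers.get(message_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


bus = MessageBus()
billed, emailed = [], []
bus.subscribe("OrderPlaced", lambda msg: billed.append(msg["order_id"]))
bus.subscribe("OrderPlaced", lambda msg: emailed.append(msg["order_id"]))

delivered = bus.publish("OrderPlaced", {"order_id": 42})
unrouted = bus.publish("OrderCancelled", {"order_id": 42})
```

The publisher only names the message type; it never references the billing or email code, so consumers can be added, moved, or replaced without touching the caller.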

Because you have decoupled the service facades from the actual operations that you want to perform, you can now add multiple services; in fact, this is how Netflix does it. Have a look at this presentation: http://www.slideshare.net/adrianco/global-netflix-platform. It's good!

Dibbeke · Answer 6 · 2012-02-14 12:00:49Z

There is a new SQL (ACID) database in beta that is claimed to have elastic scaling properties. There is a free beta program going on now and I suggest you have a look, it's called NuoDB.

Apparently it easily outperforms MySQL even on a single threaded machine, but scales happily to 70+ instances in certain benchmarks.

A single thread? How is it then a relevant benchmark? – Henrik Feb 14 '12 at 12:27
