In this post I’m going to talk about scaling database applications. For this discussion I’m going to assume the reader works at (or owns) a small company whose workload has suddenly grown. I’m going to further assume that the goal is to scale up to a larger size and performance level without replacing everything at once.
The Problems of Scaling
The number one problem that I have run into regarding scaling is the design of the software. The software I’m talking about is a custom application built by the company long ago, when the goal was to sell a product, not to ensure that the product would scale. The legacy software I have dealt with tends to be poorly designed, built by one person with only a basic knowledge of databases and software design. As a company grows, the normal and cheap way to scale up is to buy new hardware. Why? Because hardware is much cheaper than the developer wages required for a rewrite, and time is usually a critical factor. Eventually the company arrives at a point where new hardware cannot help or becomes prohibitively expensive. This is where this blog post comes in.
One of the easier solutions is called sharding (sometimes referred to as horizontal partitioning). You divide your data into two or more databases, then add a mechanism to your code that decides which database to retrieve each record from. If your customers are numbered, you can do something clever, like putting even-numbered customers in one database and odd-numbered customers in another. You’ll also need a mechanism to split your existing data between the databases. Coding the initial routing mechanism is relatively simple (depending on how your legacy code is organized); creating scripts or programs to divide your accounts can be a bit more difficult. Obviously, your common data will be duplicated (unless you move it to a third database), and you’ll need a mechanism to keep the duplicates in sync (perhaps scripts that copy from one database to the other, or administrative software changed to save to both).
As you contemplate your expansion, you’ll need to think about two things: where your actual performance bottleneck really is, and whether, if splitting in two will nearly double your performance, you should leap ahead and split into four. Let’s pretend your database resides on one physical server, not a virtual one, and that you have at least four open drive slots available. Now your choice is between buying two hard drives (each large enough for a copy of the database) or four. You’ll need to install a separate database on each drive to gain performance. You can use the same MS SQL Server (or Oracle, MySQL, etc.) instance and attach all the databases to it; this simplifies copying data between databases.
Deciding to go with four databases instead of two at this point will save you headaches if you need to scale further. The reason is that you can write your software to use modulo 4 arithmetic to decide which database each account is assigned to (instead of even/odd). Most of your developer time will be sunk into writing an app to divide the database, so you might go directly to four now and skip refactoring your data converter later to divide two databases into four.
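To make that concrete, here is a minimal sketch of the modulo-4 routing idea. It's shown in Python for brevity (the post's own code is C#), and the `database_for` helper is hypothetical; only the modulo arithmetic is the point:

```python
NUM_DATABASES = 4  # going straight to four avoids a second split later

def database_for(account_number: int) -> str:
    """Route an account to its database using modulo arithmetic."""
    return "StudentData" + str(account_number % NUM_DATABASES)

print(database_for(6))   # StudentData2
print(database_for(13))  # StudentData1
```

With even/odd routing this would be `account_number % 2`; the only change needed to move from two databases to four is the divisor, which is why choosing four up front costs so little extra code.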
This raises the question: how many ways can I divide my database? That depends on how much time you have to get your system scaled and what hardware resources you have. If your server cannot house eight hard drives, or you want to upgrade to multiple servers, you’ll have to account for hardware purchases and database licenses (unless you are lucky enough to have legacy software that already runs on a free database system).
Before you embark on an expedition of splitting your data into multiple databases, you should do some experimentation. First, be absolutely sure of where the performance bottleneck is located. If you’re running a web-based system and your real bottleneck is the web server, splitting the database will just cause you more work with no gain in performance. You should also analyze your server’s memory usage and hard disk technology. Can you increase your server memory, or replace your hard drives with hybrids or SSDs? Is your network or internet connection maxed out? It’s usually safe to assume your database CPUs are not maxed out; that is a rare situation. What about your software: is there any low-hanging fruit for increasing performance? Analyze your database structure and determine whether you have indexes on all foreign keys, “order by” fields, and fields that show up in your “where” and “group by” clauses. You can analyze query performance with the “Estimated Execution Plan” tool built into MS SQL Server Management Studio.
You’ll see a graphical breakdown of the relative cost of each step in your query. A trivial query is too simple to show the benefit of this tool, but run it on a large, complex stored procedure and it will break down which part of each query is costing you the most time. Focus on the big-ticket items first.
Time to Get Serious
I’m going to assume you have already made these upgrades, you have tuned your database to be as fast and efficient as possible, and your database is still the bottleneck.
MS SQL Server has a few nice tools that I use all the time. One such tool is the “script to” feature, available for any object in the database. You can script the entire database, then search and replace the names to create a different database inside your SQL Server instance. For my example, I’m going to use the script tool to create four student databases called StudentData0 through StudentData3. Pay attention to the filename portion of the database create script: you’ll need to point each database at its own hard drive. If you make a mistake here, it is easy to fix later. Just “detach” the database, copy the physical file to the other drive, and “attach” it again.
One other thing to consider in this endeavor: you’ll need a tool to make sure your four databases stay in sync. In other words, all databases need identical table structures, keys, indexes, stored procedures, etc. If one database is out of compliance, your software will suffer. You can purchase an off-the-shelf comparison tool or write your own, depending on your SQL Server sophistication.
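If you do write your own tool, its core is just a diff of schema metadata. Here is a toy sketch of the idea in Python, with schemas represented as plain dictionaries (a real tool would pull table and column lists from INFORMATION_SCHEMA; the `schema_diff` helper is made up for illustration):

```python
def schema_diff(reference: dict, other: dict) -> list:
    """Compare two schemas given as {table: [column, ...]} and report
    differences relative to the reference database."""
    problems = []
    for table, columns in reference.items():
        if table not in other:
            problems.append("missing table: " + table)
        elif other[table] != columns:
            problems.append("columns differ: " + table)
    return problems

db0 = {"Student": ["Id", "Name"], "Standing": ["Id", "Label"]}
db2 = {"Student": ["Id", "Name"]}
print(schema_diff(db0, db2))  # ['missing table: Standing']
```

Run the reference schema (database zero, say) against the other three on a schedule, and an empty result list means the databases are structurally in compliance.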
MS SQL Server has an automatic numbering feature called “identity”. When a table has an identity column, the server auto-increments the primary key number on each insert. Two parameters in the identity definition make things easy for us. First is the seed value, the starting number. Normally this defaults to 1, but we’re going to set the seed to 0 for database 0, 1 for database 1, and so on. Next is the “identity increment”, which determines how much the value grows with each insert. Normally this is one, but we’re going to set it to 4 for each table inside our databases. (Note that the “TRUNCATE TABLE” command will reset the numbering back to the seed value.)
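A quick way to convince yourself that the seed/increment trick works is to simulate it. The sketch below (Python, purely to illustrate the arithmetic; `identity_sequence` is a stand-in, not a SQL Server API) generates the numbers each database would hand out and checks that every one of them maps back to its home database via modulo 4:

```python
def identity_sequence(seed: int, increment: int, count: int) -> list:
    """Simulate SQL Server IDENTITY(seed, increment) numbering."""
    return [seed + i * increment for i in range(count)]

for db_number in range(4):
    ids = identity_sequence(db_number, 4, 5)
    # every id handed out by this database satisfies id % 4 == db_number
    assert all(i % 4 == db_number for i in ids)
    print(db_number, ids)  # e.g. 0 [0, 4, 8, 12, 16]
```

The sequences never collide across databases, so a student number alone tells you where the record lives.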
Now, when we insert a new student, database 0 will contain only student numbers where the number modulo 4 equals 0, database 1 only numbers where it equals 1, and so on.
Using an ORM
I’m going to use NHibernate for this example to show how to tweak the session factory and how to insert data into four databases. Here’s the sample code:
```csharp
public class QuadSessionFactory
{
    // one lazily-built factory per database
    private static ISessionFactory[] _sessionFactory = new ISessionFactory[4];

    private static ISessionFactory SessionFactory(int databaseNumber)
    {
        if (_sessionFactory[databaseNumber] == null)
        {
            _sessionFactory[databaseNumber] = Fluently.Configure()
                .Database(MsSqlConfiguration.MsSql2008.ConnectionString(
                    // adjust the server name and credentials for your environment
                    "Server=YOURSERVER;Initial Catalog=StudentData" + databaseNumber + ";Integrated Security=True"))
                .Mappings(m => m.FluentMappings.Add<StudentMap>())
                .ExposeConfiguration(config => { SchemaExport schemaExport = new SchemaExport(config); })
                .BuildSessionFactory();
        }
        return _sessionFactory[databaseNumber];
    }

    public static ISession OpenSession(int databaseNumber)
    {
        return SessionFactory(databaseNumber).OpenSession();
    }
}
```
You’ll notice that this code is nearly identical to the session factory I normally use in my demos. All I did was turn the “_sessionFactory” field into an array of four and refactor my getter into a method that takes the database number as a parameter. Now I can use my OpenSession method to open four different sessions, like this:
```csharp
using (ISession db = QuadSessionFactory.OpenSession(dbNumber))
{
    // query code here
}
```
Now you’ll have to get clever when you select data. Why? Because your data is spread across four databases, and you can’t know which student is in which database unless you are searching by student number. One solution is to query all four databases and filter afterward. A more efficient method is to run the same sub-select against each database and then concatenate the results at the end (sorting if desired). Oh what fun.
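Here is a sketch of that concatenate-and-sort approach, with four in-memory lists standing in for the StudentData databases (the `find_students` helper and the sample records are made up for illustration; in the real system each loop iteration would be an NHibernate query against one database):

```python
def find_students(databases, predicate, sort_key=None):
    """Run the same sub-select against every database, then
    concatenate (and optionally sort) the partial results."""
    results = []
    for db in databases:  # one round trip per database
        results.extend(s for s in db if predicate(s))
    if sort_key:
        results.sort(key=sort_key)
    return results

# four tiny in-memory "databases" standing in for StudentData0..3
shards = [
    [{"id": 0, "name": "Ann"}, {"id": 4, "name": "Dan"}],
    [{"id": 1, "name": "Bea"}],
    [{"id": 2, "name": "Cal"}],
    [{"id": 3, "name": "Abe"}],
]
a_names = find_students(shards, lambda s: s["name"].startswith("A"),
                        sort_key=lambda s: s["name"])
print([s["name"] for s in a_names])  # ['Abe', 'Ann']
```

The key point is that each database only evaluates the predicate over a quarter of the data; the merge at the end is cheap as long as the filtered result sets are small.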
Inserting data will involve a scheme to distribute it. You could choose a random number to select which database to insert into, or count the records in each database and insert into the one with the fewest active records (which costs some processing time). Another method is to keep a table in database zero with an identity field for the student table: insert into that table first to get the next number, then take its modulo 4. This also costs processing time, and you’d need a separate numbering table for each table of data you want to distribute. My choice would be the random number modulo 4. Simple and fast.
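The random-number-modulo-4 choice is a one-liner; the sketch below (Python, illustrative only) also shows why it is good enough in practice, since over many inserts the four buckets come out nearly even:

```python
import random

def pick_database(num_databases: int = 4) -> int:
    """Pick a target database for an insert: a random number modulo 4."""
    return random.randrange(1_000_000) % num_databases

counts = [0, 0, 0, 0]
for _ in range(10_000):
    counts[pick_database()] += 1
print(counts)  # each bucket lands near 2,500
```

Unlike the counting or central-identity schemes, this costs no extra queries per insert; the trade-off is that the databases are only statistically, not exactly, balanced.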
No matter how you slice your data, there will be some reference tables and data used by your application that should not be split between databases. You can’t split this data, and you can’t keep just one copy in one database either, because these tables take part in your relational queries (like student standing: freshman, sophomore, etc.). When you query each database for student and standing, you’ll need to join those records within that database, which requires duplication. This is OK: most reference tables are small, the space they waste is cheap, and performance is expensive.
You’ll also have to take steps to keep your databases in sync. You can loop your CRUD operations over every database whenever a change occurs. Another method is to change data in one designated database (say, database zero, to keep it simple) and use a trigger or stored procedure to sync the change to the others. Which method you choose may depend on how quickly you need all databases to be identical.
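The write-to-every-database loop is the simpler of the two options. A sketch, with dictionaries standing in for the four databases (`save_reference_row` is a hypothetical helper, not a real API):

```python
def save_reference_row(databases, table, row):
    """Apply a reference-table change to every database so the
    duplicated common data stays identical everywhere."""
    for db in databases:  # one write per database: N times the write cost
        db.setdefault(table, []).append(dict(row))

databases = [{}, {}, {}, {}]
save_reference_row(databases, "Standing", {"id": 1, "label": "Freshman"})
print(all(db["Standing"][0]["label"] == "Freshman" for db in databases))  # True
```

The cost is four writes instead of one, but reference data changes rarely, so this is usually acceptable; the trigger-based approach trades that immediate cost for a window during which the databases disagree.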
Another solution to this mess is to extract all the common data into its own database. This will break any relational integrity that you have set up, but it simplifies keeping your common data in sync. Keep in mind that this one common database might become your new bottleneck, undoing the work you put into splitting your primary data into four databases.
The task of splitting your database into parallel databases is not an easy one, and this blog post is nowhere near enough information to completely split your data. You’ll have to do more research. Your first task should be to set up a sample system with multiple databases holding your data, point your development systems at it, and determine what will break and what will work. You might be getting into a rat’s nest that is a bit too deep, and a simple and cheap test system is the place to make that discovery.
You can download the sample code here:
You can download the database create script here: