I’m going to start a new series about the use of SQL Server databases in software projects. As you can tell from my use of the word “Disorders,” this series is about designs that are messed up. These are usually problems created by a design mistake from someone who was not an expert in databases. I’ve run into these problems so many times that I’ve decided to discuss them here, in the hope that someone new to the field will read these posts and have an “a-hah” moment before the problem is discovered years later by someone like me.
Primary Keys
Anyone who has taken a database class knows that a table needs a primary key. Every course teaches this early on: the purpose is to provide a unique way to identify each record. This matters most when deleting, so that we don’t accidentally remove multiple records (or find ourselves forced to delete multiple records because there is no way to single one out). The other important purpose is to prevent duplicate records from being inserted when duplicates are undesired.
The easiest way to apply a primary key is to generate one. A generated primary key field is never seen by your customer (unless you’re using it as an account id or a similar identifier). It’s an internal number that provides a fool-proof way to uniquely identify records. Duplicate data can still exist in the other fields, but the generated key gives you a mechanism to delete one duplicate record at a time (assuming that is what you’re attempting to do).
How do you generate the primary key? One obvious and very common approach is to make the primary key an integer, read the max primary key from the table, increment it by one, and use that value in your insert statement. This is one of the most common disorders I run into in the field, and the mess it creates is frustrating to fix. The problem with this scheme is that a distinct amount of time passes between reading the max number and inserting the new data. Software developers who are not thinking about high loads and parallelism make this mistake; I’ve seen code where the max number is read early in a class method, then the insert query is built and additional computations are performed before the insert finally runs. The longer that window, the more likely it becomes that two records with the same primary key will be inserted at the same time. Let me show you what I’m talking about in a diagram:
In this diagram there are two users inserting data into the same table; this is probably a website, but it could be a multi-user Windows application sharing a common SQL Server database. User 1 starts the process first by reading the max primary key, which is 5. Then User 2 starts the process and also reads the max primary key as 5. At this point the whole process is occurring in the software, not the database, and it’s happening in parallel. Next, User 1 increments the max primary key to 6, then User 2 does the same. User 1’s insert occurs and a new record with a primary key of 6 is created on the server. Then User 2 attempts to insert a record with a primary key of 6, and SQL Server returns a primary key violation error.
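Stripped down to T-SQL, the race-prone sequence looks something like this sketch (MyTable and MyPrimaryKey are hypothetical names, the same ones used in the stored procedure below; in real application code the read and the insert are usually separated by even more logic, widening the window):

    DECLARE @id AS int

    -- Two sessions can both execute this read before either one inserts,
    -- so both end up holding the same max value (5 in the diagram above).
    SELECT @id = MAX(MyPrimaryKey) FROM MyTable

    -- Both sessions increment to the same new value (6)...
    SET @id = @id + 1

    -- ...and the second session's insert fails with a primary key violation.
    INSERT INTO MyTable (MyPrimaryKey, [Name]) VALUES (@id, 'some name')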
If your table already contains millions of records and you’re trying to fix this problem on a live system that is under heavy load, then you have very few options. The quick and dirty fix is to put your insert into a stored procedure and wrap the whole operation in a transaction:
CREATE PROCEDURE [dbo].[InsertIntoMyTable]
    @Name varchar(50)
AS
BEGIN
    BEGIN TRANSACTION

    DECLARE @id AS int

    -- UPDLOCK and HOLDLOCK hold a lock on the table until the commit,
    -- so a second session cannot read the same max value.
    SELECT @id = MAX(MyPrimaryKey) FROM MyTable WITH (UPDLOCK, HOLDLOCK)

    -- ISNULL handles the very first insert into an empty table.
    SET @id = ISNULL(@id, 0) + 1

    INSERT INTO MyTable (MyPrimaryKey, [Name]) VALUES (@id, @Name)

    COMMIT
END
This is obviously a contrived example, but you can see how I started a transaction, performed the read-max-and-increment, then inserted and committed the results. The purpose of the transaction and lock hints is to lock the table while computing the next id and inserting, making the entire operation atomic. Here’s a diagram of what will occur by using this scheme:
Hurray! The problem is solved!
Not so fast. First, this is what I refer to as a “Hack.” Why do I call it a hack? Notice the part where user 2 must wait for user 1 to insert their record. That is wasted time and resources. Also, if an insert takes too long to process, or the line of users waiting to insert their records grows, SQL Server may time out and return an error. You’ll need to fix this correctly before your traffic grows to the point where it becomes a major problem.
To fix this right, add an identity specification to the primary key field. Unfortunately, you can’t simply change the primary key of an existing table to use an identity specification. You’ll need to create a new table, transfer your data, drop the existing table, and rename the new table to the correct name (the one you just dropped). If your table contains a lot of records and this operation will take a long time, or if you can’t afford any down time during your upgrade, you’ll have to get creative. I can think of a couple of solutions involving a background process to sync the records while your system is running, followed by a quick switch to the new table. That effort is a whole subject in its own right, and I would prefer to design it against a system I can load-test. Therefore, if you can afford an outage, I recommend biting the bullet and doing it before this becomes a bigger problem.
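For a small table where you can take an outage, the swap looks something like this sketch (it assumes the same hypothetical MyTable from the stored procedure above and a maintenance window where nothing else is writing to the table):

    BEGIN TRANSACTION

    -- Create a replacement table with the identity specification in place.
    CREATE TABLE dbo.MyTable_New
    (
        MyPrimaryKey int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        [Name] varchar(50) NULL
    )

    -- Copy the data across, preserving the existing key values. SQL Server
    -- automatically reseeds the identity to the highest value inserted.
    SET IDENTITY_INSERT dbo.MyTable_New ON
    INSERT INTO dbo.MyTable_New (MyPrimaryKey, [Name])
    SELECT MyPrimaryKey, [Name] FROM dbo.MyTable
    SET IDENTITY_INSERT dbo.MyTable_New OFF

    -- Swap the new table into place.
    DROP TABLE dbo.MyTable
    EXEC sp_rename 'dbo.MyTable_New', 'MyTable'

    COMMIT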
Once you have fixed this problem, don’t forget to remove any code that reads the max, increments it, and uses the result in the insert query. No insert for a table with an identity specification should reference the primary key; if one does, SQL Server will reject it with an error about inserting an explicit value into an identity column, and it will appear that you have not fixed the problem.
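After the fix, the insert reduces to something like this (again using the hypothetical MyTable):

    -- The identity specification supplies MyPrimaryKey automatically.
    INSERT INTO MyTable ([Name]) VALUES ('some name')

    -- If the application needs the generated key, read it back immediately.
    SELECT SCOPE_IDENTITY()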
Avoid Generated Primary Keys
Another way to fix this issue is to avoid using a generated primary key altogether. If your data must be unique, take some time to analyze whether one, two, or several existing fields can serve as your primary key. This becomes unwieldy if other tables need a foreign key to your table, since the foreign key would have to repeat every column of the compound key. Still, there are several instances where avoiding a generated primary key is desirable.
A good example is a table that joins two other tables to resolve a many-to-many relationship. Here’s an ERD:
In this sample recipe database, there are recipes and users. When a user clicks the thumbs-up icon on a recipe, that recipe ends up in the user’s “favorites” list. Since we want to avoid duplicates, such as a user with the same recipe in their list twice, we need to keep the list unique. To do this, we can turn the UserId and RecipeId fields of the FavoriteRecipe table into a compound primary key. This avoids the need for a generated primary key field in the FavoriteRecipe table shown above. If you’re running a popular recipe website with millions of users, it also keeps you from running out of numbers for primary keys (another issue I have experienced), especially for a table that probably changes at a higher rate than the user and recipe tables.
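In T-SQL, the compound key looks something like this sketch (the User and Recipe tables, and their key column names, are assumed from the ERD above):

    CREATE TABLE dbo.FavoriteRecipe
    (
        UserId int NOT NULL,
        RecipeId int NOT NULL,
        CONSTRAINT PK_FavoriteRecipe PRIMARY KEY (UserId, RecipeId),
        CONSTRAINT FK_FavoriteRecipe_User FOREIGN KEY (UserId)
            REFERENCES dbo.[User] (UserId),
        CONSTRAINT FK_FavoriteRecipe_Recipe FOREIGN KEY (RecipeId)
            REFERENCES dbo.Recipe (RecipeId)
    )

A second thumbs-up from the same user now fails the primary key constraint instead of quietly creating a duplicate favorite.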
A child table might contain enough information to make each record unique by combining the parent foreign key with another piece of information. Here’s an example, continuing with the recipe database theme:
In this example, each recipe has a list of instructions. The instructions must be displayed in order, so there is an integer Order field. The order number needs to be sequential within each recipe and must not contain duplicates. Therefore, the RecipeId and Order fields can be combined to form the primary key of the RecipeDirections table. If this is set up before the software is written, there is an added bonus: the database throws an error if the software fails to sequence the order numbers correctly.
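Here’s a sketch of that table (RecipeDirections and its columns follow the example above; Order is bracketed because ORDER is a reserved word in T-SQL):

    CREATE TABLE dbo.RecipeDirections
    (
        RecipeId int NOT NULL,
        [Order] int NOT NULL,
        Instruction varchar(max) NOT NULL,
        CONSTRAINT PK_RecipeDirections PRIMARY KEY (RecipeId, [Order]),
        CONSTRAINT FK_RecipeDirections_Recipe FOREIGN KEY (RecipeId)
            REFERENCES dbo.Recipe (RecipeId)
    )

Inserting step 3 twice for the same recipe now violates the primary key, catching a sequencing bug in the software immediately.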
Carefully Plan Your Database First
Here’s the most important step: carefully plan your database layout before you write your first line of code. You’ll probably need to change the design as you get into the details of your code, but spend the time up front to get your tables, primary keys, foreign keys, and other constraints set before writing code. If you do this, the database will enforce the integrity of your data even if you write some bad code. You should run into problems during development if you’re not handling your data correctly (like inserting child records before the parent record exists – I’ll discuss that design “disorder” in a later article).
When I create a new database, I like to write a SQL script that can drop the entire database and create a new one, with test data inserted at the end of the script. If I need to change a table (like adding an identity field that I forgot), I fix it in the script and rerun it to drop everything and recreate it fresh. When I’m done, I can comment out (or remove) the database drop portion and use the script to create a fresh database on a shared server (also, don’t forget to remove the test data section). Going forward, the script can serve as an upgrade script by appending alter statements after the initial script; you’ll need to wrap all your create table statements in an “IF” to avoid errors on tables that already exist. Scripting your database is another large subject, but you should get into the habit of creating your database from a script. Otherwise, you’ll be manually creating databases that end up out of sync with each other.
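The “IF” wrapper looks something like this (the OBJECT_ID check is one common way to do it in SQL Server; the table is just an example):

    -- Only create the table if it doesn't already exist.
    IF OBJECT_ID('dbo.Recipe', 'U') IS NULL
    BEGIN
        CREATE TABLE dbo.Recipe
        (
            RecipeId int IDENTITY(1,1) NOT NULL PRIMARY KEY,
            Title varchar(100) NOT NULL
        )
    END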
If you catch errors while the software is new and usage is low, they’re easy to fix. The longer a problem goes unidentified, the more difficult and expensive it becomes to fix. Eventually, as in the example I described earlier, fixing it properly can consume a lot of resources. When your website traffic is high and you’re experiencing a flood of insert errors, that is the very worst time to fix the problem, but it’s also the time when fixing it is critical.