I’ve worked for a lot of different companies, most of them small. Several of them saw serious growth in their user base, and every company I have worked for seemed to follow the same path from start-up to mid-sized company. Start-ups are usually staffed by amateur programmers who know how to write a small program and get it working. Inevitably the software becomes so large that they are overwhelmed and have no clue how to solve their deployment problems. Here are the problems they run into:
- The customers become numerous and bugs are reported faster than they can fix them.
- Deployments become lengthy and difficult, usually causing outages after deployment nights.
- Regression testing becomes an overwhelming task.
- Deployments cause the system to overload.
- Keeping environments in-sync becomes overwhelming.
This is where continuous integration techniques come into play. The first problem can be tackled by making sure system crashes are properly logged. If there is no record of what is going on in your production system, then you have a very big problem.
Problem number two can be easy to solve if it is tackled early in the software development phase, but it can only be solved by getting everyone on board with the solution. Many companies double down on manual deployments and do incredibly naive things like throwing more people at the problem. The issue is not the labor; the issue is time. As your software grows, it becomes more complex and takes more time to test new enhancements. Performing a scheduled deployment at night is a bad customer experience. The proper way to deploy to production is to do it in the background.
One method of performing this task is to create new servers, deploy the software to them, and test it before hooking the servers into your load balancer. The idea is to automate the web server creation process, install the new software on the new servers, and then add them to the load balancer with the new features turned off. The new software needs to be set up to behave identically to the old software when the new features are not turned on. Once the new servers are deployed, the old servers are removed from load balancing one at a time until they have all been replaced. During this phase, the load on your servers needs to be monitored (including your database servers). If something doesn’t look right, you have the option to stop the process and roll back.
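This rolling replacement can be sketched in a few lines. The `LoadBalancer` class, `provision_server`, and `health_check` below are hypothetical stand-ins for whatever your infrastructure actually provides; the point is the shape of the loop, not any particular vendor API:

```python
class LoadBalancer:
    """Toy in-memory server pool standing in for a real load-balancer API."""
    def __init__(self, servers):
        self.servers = list(servers)
    def add(self, server):
        self.servers.append(server)
    def remove(self, server):
        self.servers.remove(server)

def rolling_deploy(lb, old_servers, provision_server, health_check):
    """Replace old servers one at a time; abort and roll back on a failed check."""
    added = []
    for old in list(old_servers):
        new = provision_server()          # new build, new features switched OFF
        if not health_check(new):
            for server in added:          # roll back anything we already added
                lb.remove(server)
            return False
        lb.add(new)
        added.append(new)
        lb.remove(old)                    # drain one old server at a time
    return True
```

A healthy run ends with the pool fully replaced; a failed health check leaves the old fleet untouched, which is the roll-back option described above.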
Database changes can be the most challenging part. You’ll need to design your software to work properly with the old table, view, and stored procedure designs as well as the new ones. Once the feature has been rolled out and turned on, a future clean-up version can be rolled out (possibly with the next feature release) to remove the code that recognizes the old tables, views, and stored procedures. This can also be tested when new web servers are created, before they are added to the web farm.
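In code, "work with both designs" usually means a small compatibility shim in the read path. A minimal sketch, with entirely hypothetical column names (say the new schema splits an old `name` column into `first_name`/`last_name`):

```python
# Compatibility shim: the application tolerates both the old and the new
# schema during rollout. Column names here are purely illustrative.

def customer_display_name(row):
    """Build a display name from either the old or the new column design."""
    if "first_name" in row:                           # new, split-column schema
        return f"{row['first_name']} {row['last_name']}"
    return row["name"]                                # old single-column schema
```

The clean-up release mentioned above is simply the deletion of the old-schema branch once no server reads the old columns anymore.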
Once everything has been tested and properly deployed, you can announce the new feature and then switch it on. Remember, everything should be tested and deployed by the time the new feature is switched on. If you are running a large web farm with tens of thousands (or more) of customers, you may want to do a canary release. A canary release can be treated like a beta release, but it doesn’t have to be. Randomly choose 5% of your customers and switch the feature on for them early in the day of the release. Give it an hour of monitoring and see what happens. If everything looks good, add another 5% or 10% of your customers. By the time you reach 20% of your customers you should feel confident enough to go up to 50%, then follow that with 100%. All customers can be switched on within a four-hour period, which allows enough time to monitor and give a go or no-go on proceeding. If your bug-tracking logs report an uptick in bugs when you switch on the first 5%, turn the feature back off and analyze the issue. Fix the problem and proceed again.
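A common way to implement the ramp-up is to hash each customer into a stable bucket from 0 to 99 and enable the feature for buckets below the current rollout percentage. A sketch, assuming string customer IDs (a stable hash like SHA-256 is used rather than Python’s built-in `hash`, which is salted per process):

```python
import hashlib

def canary_bucket(customer_id):
    """Map a customer to a stable bucket in 0..99."""
    digest = hashlib.sha256(str(customer_id).encode()).hexdigest()
    return int(digest, 16) % 100

def feature_enabled(customer_id, rollout_percent):
    """Buckets below the rollout percentage get the feature.

    Because the bucket is stable, raising the percentage from 5 to 20 to 50
    only ever adds customers -- nobody who had the feature loses it mid-ramp.
    """
    return canary_bucket(customer_id) < rollout_percent
```

The flag check runs on every request, so flipping the percentage in configuration is the only "deployment" the switch-on requires.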
I’ve heard the complaint that a canary release is like a beta program: the first 5% are beta testing your software. My answer to that is: if you release to 100% of your customers at the same time, doesn’t that make all of your customers beta testers? Let’s face the facts, the choice is not between different versions of the software. The choice is between how many people will experience the software you are releasing, 5% or 100%. That’s why I advocate random customer selection. The best scenario rotates the customers each release so that each customer will be in the first 5% only one in twenty releases. That means every customer shares the pain 1/20th of the time, instead of a 100% release where every customer feels the pain every time.
Regression testing is something that needs to be considered early in your software design. Current technology gives developers the tools to build this right into the software. Unit testing, which I am a big advocate of, needs to be done for every feature released. The unit tests must be designed along with the software, and you must have adequate code coverage. When a bug is found and reported, a unit test must be created to reproduce the bug before the bug is fixed. This gives you regression testing ability. It also gives the developer instant feedback. The faster a bug is reported, the cheaper it is to fix.
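The bug-to-test workflow looks like this in practice. The function and the bug here are hypothetical (a price calculation that once lost a cent to float rounding); the shape of the test is the point: it was written to reproduce the report, and it now guards against regression on every build:

```python
import unittest

def apply_discount(price_cents, percent):
    """Apply a percentage discount to a price in cents.

    Hypothetical fix: an earlier version used float math and lost a cent
    on some inputs; integer arithmetic keeps the result exact.
    """
    return price_cents - (price_cents * percent) // 100

class DiscountRegressionTest(unittest.TestCase):
    """Created when the bug was reported; runs on every build thereafter."""

    def test_reported_bug_case(self):
        # The originally reported failing input.
        self.assertEqual(apply_discount(1999, 10), 1800)

    def test_zero_discount_is_identity(self):
        self.assertEqual(apply_discount(1999, 0), 1999)
```

Once checked in, the build server runs this suite on every check-in, which is what delivers the instant feedback described above.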
I have worked in many environments where there is a team of QA (Quality Assurance) workers who manually find bugs and report them back to the developer assigned to the enhancement causing the bug. The problem with this workflow is that the developer is usually on to the next task and “in the zone” on the next difficult coding problem. If that developer needs to switch gears, shelve their changes, fix a bug, and deploy it back to the QA environment, it slows down the workflow. If the developer checks in their software and the build server catches a unit test failure and reports it immediately, then that developer will still have the task in mind and be able to fix it right there. No task switching is necessary. Technically, many unit test failures are found locally if the developer runs the unit tests before check-in, or if the system has a gated check-in that prevents bad builds from being checked in (forcing the developer to fix the error before they can continue).
When your software becomes large and the number of customers accessing your system is large, you’ll need to perform load testing. Load testing can be expensive, so young companies are not going to perform this task. My experience with load testing is that it is never performed until after a load-related deployment disaster occurs. Then load testing seems “cheap” compared to hordes of angry customers threatening lawsuits and cancellations. To determine when your company should start load testing, keep an eye on your web farm and database performance. You’ll need to track your baseline performance as well as the peaks. Over time you’ll see your server CPU and memory usage go up. Keep yourself a large buffer to protect against a bad database query. Eventually your customer base will reach a point where you need to load test before deployments, because unpredictable customer behavior will overwhelm your servers in unexpected ways. Your normal load rides around 50% one day, and then, because of year-end reporting, you wake up and all your servers are maxed out. If it’s a web server load problem, that is easy to fix: add more servers to the farm (and keep track of what your load balancer can handle). If it’s a database server problem, you’re in deep trouble. Moving a large database is not an easy task.
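The baseline-plus-buffer rule can be expressed as a trivial check against recent utilization samples. A minimal sketch, assuming CPU readings on a 0–100 scale and a buffer you choose yourself (30% here is only an example):

```python
def needs_capacity(cpu_samples, buffer_percent=30):
    """Flag when the peak CPU reading eats into the desired headroom buffer.

    cpu_samples: recent utilization readings, 0-100.
    A 30% buffer means peaks should stay at or below 70%; anything higher
    means a single bad query or a year-end traffic spike has no room left.
    Returns (alert, baseline, peak).
    """
    baseline = sum(cpu_samples) / len(cpu_samples)
    peak = max(cpu_samples)
    return peak > (100 - buffer_percent), baseline, peak
```

Running this over each server’s samples (web and database alike) turns "keep an eye on it" into an automatic signal that it is time to add capacity, or to start load testing before deployments.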
For database operations, you’ll need to balance your databases between server instances. You might also need to increase memory or CPUs per instance. If you are maxed out on the number of CPUs or memory per instance, then you are left with only one choice: Moving databases. I could write a book on this problem alone and I’m not a full-time database person.
One issue I see is that companies grow and build their environments by hand. This is a bad practice. There are a lot of tools available to replicate servers and stand up a system automatically. What inevitably happens is that the development, QA, staging, and production environments drift out of sync. Sometimes shortcuts are taken in the development and QA environments, and that can cause software to perform differently than in production. This guarantees that deployments will go poorly. Configure environments automatically. Refresh your environments at regular intervals. Companies I have worked for don’t do this enough, and it always causes deployment issues. If you are able to build a web farm with the click of a button, then you can perform this task for any environment. By guaranteeing each environment is identical to production (except at a smaller scale), you can find environment-specific bugs early in the development phase and ensure that your software will perform as expected when it is deployed to your production environment.
Databases need to be synchronized as well. There are tools to sync the database structure, and this task needs to be automated as much as possible. If your development database can be synced up once a week, then you’ll be able to purge any bad data that has accumulated during the week. Developers need to adjust their workflow to account for this process. If there are database structure changes (tables, views, functions, stored procedures, etc.), they need to be checked into version control just like code, and the automated process needs to pick up these changes and apply them after the base database is synced down.
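The core of such an automated process is a migration runner: schema changes live in version control as ordered scripts, and a tracking table records which ones have already been applied, so the same run is safe on a freshly synced database or a current one. A self-contained sketch using SQLite (table and migration names are illustrative):

```python
import sqlite3

# Ordered schema changes, checked into version control just like code.
MIGRATIONS = [
    ("001_create_customers",
     "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)"),
    ("002_add_email",
     "ALTER TABLE customers ADD COLUMN email TEXT"),
]

def apply_pending(conn, migrations):
    """Apply, in order, every migration not yet recorded as applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
    done = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    applied = []
    for name, sql in migrations:
        if name not in done:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_migrations VALUES (?)", (name,))
            applied.append(name)
    conn.commit()
    return applied
```

Because the runner is idempotent, the weekly refresh job can simply sync the base database down and call it; anything already applied is skipped.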
Why spend the time to automate this process? If your company doesn’t automate this step, you’ll end up with a database that has sat un-refreshed for years. It might have the right changes; it might not. The database instance becomes the wild west. It also fills up with test data that slows down your development processes. Many developer hours will be wasted trying to “fix” an issue caused by a bad database change that was never properly rolled back. Imagine a database where the constraints are out of sync: once the software is working on the development database, it will probably fail in QA, and more troubleshooting time is wasted. If your QA database is out of sync too, your developers end up fixing environment-related issues all the way up the line until the software is deployed and crashes on the production system. Now the development process is expensive.
Other Sources You Should Read
Educate yourself on deployment techniques early in the software design phase. Design your software to be easy and safe to deploy. If you can head off the beast before it becomes a nightmare, you can save yourself a lot of time and money. Amazon has designed its system around microservices. The philosophy is to keep each software package small, which makes it quick and easy to deploy. Amazon deploys continuously at a rate that averages more than one deployment per second (50 million per year):
Facebook uses PHP, but they have designed and built a compiler to improve the efficiency of their software by a significant margin. They then deploy a 1.5-gigabyte package using BitTorrent. Facebook does daily deployments using this technique:
I stumbled across a blogger who used to work for GitHub. He has a lengthy but detailed blog post describing how to make deployments boring. I recommend all developers read this article and begin to understand the process of deploying software:
Believe it or not, your deployment process is the largest factor determining your customer experience. If your deployments require you to shut down your system in the wee hours of the morning to keep a system-wide outage from affecting customers, then you’ll find it difficult to fix bugs that might affect only a handful of customers. If you can smoothly deploy a version of your software in the middle of the day, you can fix a minor bug and run the deployment process without your customers being affected at all. Ultimately, there will be bugs. How quickly you can fix the bugs and how smoothly you get that fix deployed will determine the customer experience.