One of my areas of expertise is mass data transfers using APIs, FTP, WebDAV, or SOAP connections. In this post I’m going to talk about techniques for increasing the throughput of reading data from a Web API and putting it into a database. I’ll also discuss the issues that come up along the way and how to get around them.
Accessing Data From the Web
When data is read from a web-based API, it typically arrives in small, bite-sized chunks. These chunks can be in the megabyte range, but a web connection typically has a time limit. In many applications the data is divided by customer, account number, or some other division, which requires your application to hit the web API many times to get the full data set or all the updates.
Let’s say, for instance, you want to create a Windows service that connects to a Web API and downloads data for 1,000 customer accounts. This process will be performed hourly to keep the customer data up to date (I’m making this example up, but it’s similar to some of my experience). If you were to create a program that loops through the accounts and accesses the API for each one, your timing would look something like this:
As you can see, there is a significant amount of “wasted” time spent waiting for each connection to be established. If you are downloading your data into a temp file and then inserting it into a database, you also pay the cost of saving the temp file before the database insertion starts.
The best way to increase download speed is to open multiple connections in parallel. The timing then overlaps, so some connections are being established while others are downloading. In the temp-file case I just mentioned, you can also overlap threads that are saving to temp files with one or more that are inserting into the database. The timing resembles something like this:
If your process performed nothing but inserts into a database, you would not gain much by parallelizing it, because your database can only insert data so fast; that would be your bottleneck. With an API connection, the bottleneck could be your network bandwidth, the speed of the remote web server, the local server running your service, or the insertion speed of your database. Parallelizing the process eliminates the connection wait time, and that may be the only time you can reclaim (assuming your network bandwidth is maxed out while you are downloading data). To find your optimum, run a few tests to see how fast you can download the same data using varying numbers of threads.
Here’s what you’ll typically see:
If you do a quick and dirty Excel chart, it should look like this:
As you can see, the speed plateaus at some point, and you’ll need to decide how many parallel processes are enough. In practice, the number is often capped by how much trouble you have retrieving the data reliably; if a database is involved, that usually means the frequency of deadlocks you run into.
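The thread-count tests above are easy to script. Here is a minimal, language-agnostic sketch in Python (the post’s services would be C#, but the measurement idea is the same); `fetch_account` is a hypothetical stand-in for the real API call, simulated with a short sleep:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_account(account_id):
    """Stand-in for a real API call: simulates connection setup plus download time."""
    time.sleep(0.02)          # simulated connection + transfer latency
    return account_id

def time_batch(worker_count, account_ids):
    """Download all accounts with a fixed-size thread pool; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        list(pool.map(fetch_account, account_ids))
    return time.perf_counter() - start

if __name__ == "__main__":
    accounts = list(range(50))
    for workers in (1, 2, 4, 8):
        print(f"{workers:2d} threads: {time_batch(workers, accounts):.2f}s")
```

Running this with your real fetch function (and real accounts) gives you the numbers for the plateau chart: keep raising the worker count until the elapsed time stops improving.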
Decoupling the Tasks
Another way to increase the throughput of your system is to decouple downloading the data into temp files from importing it into your database, using two Windows services that perform the separate tasks. The first service would connect to the web servers using parallel processes and fill a directory of temp files that acts as a buffer; you would need some sort of cut-off point, such as total megabytes used or number of files downloaded. This service would do nothing more than read the data and create temp files. The second service would read any temp files, insert the data into the database, and delete the temp files. You can multi-thread this service too, or keep it sequential to avoid database contention. It would check the temp file directory at a set interval and, when it detects data to import, process all the files until they are completed.
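The two halves of that buffer can be sketched in a few lines. This is a minimal illustration in Python rather than the two actual Windows services; `spool_payload`, `drain_spool`, and the cut-off constant are my own names, and error handling is omitted:

```python
import json
import os
import uuid

MAX_SPOOL_FILES = 100   # cut-off point so the buffer can't grow without bound

def spool_payload(spool_dir, payload):
    """Downloader side: write one downloaded payload as a temp file in the buffer."""
    if len(os.listdir(spool_dir)) >= MAX_SPOOL_FILES:
        return False                       # buffer full; wait for the importer to catch up
    path = os.path.join(spool_dir, f"{uuid.uuid4().hex}.json")
    with open(path, "w") as f:
        json.dump(payload, f)
    return True

def drain_spool(spool_dir, insert_into_db):
    """Importer side: import every spooled file, deleting each one after a successful insert."""
    imported = 0
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        with open(path) as f:
            insert_into_db(json.load(f))
        os.remove(path)                    # only delete once the insert has succeeded
        imported += 1
    return imported
```

Note that the importer deletes a file only after the insert succeeds, so a crash between the two services never loses data; at worst a file is imported twice, which your import logic should tolerate.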
Once you begin to use parallel processes, you’re bound to run into deadlocks. I typically increase the number of processes until I hit lots of deadlocks, then attempt to fix the deadlock problem, and finally back the number of parallel processes down. Deadlocks are difficult to troubleshoot because they are asynchronous and difficult to reproduce reliably.
To reduce deadlocks you’ll have to be aware of how your data is being changed in your database. Are you performing only inserts? If there are deletes, then you’ll need to limit the number of rows deleted per transaction. SQL Server will typically escalate to a table lock once a single statement acquires more than about 5,000 row or page locks: How to resolve blocking problems that are caused by lock escalation in SQL Server.
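Limiting rows per transaction just means deleting in batches and committing between them. On SQL Server you would use something like `DELETE TOP (n)`; the sketch below shows the same pattern against SQLite (used here only as a convenient stand-in database), with `delete_in_batches` being my own illustrative helper:

```python
import sqlite3

def delete_in_batches(conn, table, batch_size=1000):
    """Delete all rows from `table` a batch at a time, committing between batches
    so no single transaction holds enough locks to trigger lock escalation."""
    total = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} LIMIT ?)", (batch_size,))
        conn.commit()                      # release locks before the next batch
        if cur.rowcount == 0:
            break                          # nothing left to delete
        total += cur.rowcount
    return total
```

Each commit releases the locks the batch acquired, so concurrent inserters and readers get a chance to run between batches instead of queuing behind one huge delete.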
Here is another good resource of deadlock resolution: INF: Analyzing and Avoiding Deadlocks in SQL Server
If your deadlocks occur during updates, make sure the table is indexed so that updates take row locks instead of scanning. Adding a rowlock hint to your queries might help, but it is only a “hint” and does not force your queries to follow your desired locking plan, so don’t assume that a locking hint will solve the issue. Also, the nolock hint does not apply to inserts or deletes.
One last method for getting around persistent deadlocks, though I normally add it only as a safety net, not to “solve” a deadlock problem: you can write code that retries a failed operation. If you retry, use a back-off escalation technique: on the first error, wait one second and retry; if it errors again, wait two seconds; on the next error, wait four seconds; and so on. You’ll also need to account for the possibility of a complete failure and stop retrying at some point.
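The back-off loop just described can be sketched in a few lines. This is a minimal Python illustration; `with_backoff` and its parameters are my own names, not from any library:

```python
import time

def with_backoff(operation, max_attempts=5, first_delay=1.0):
    """Retry `operation`, doubling the wait after each failure (1s, 2s, 4s, ...).
    Gives up and re-raises the last error after `max_attempts` failures."""
    delay = first_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise                    # complete failure: stop retrying
            time.sleep(delay)
            delay *= 2                   # back-off escalation: 1s, 2s, 4s, ...
```

In a real service you would catch only the specific deadlock/transient error (in SQL Server, error 1205), not every exception, so genuine bugs still surface immediately.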
One other thing you’ll have to anticipate is some sort of communication failure. The remote web servers might be down, your internet connection might fail, or your database might be off-line. In each of these situations you’ll need a contingency plan. Typically, I reset the schedule and exit. You can get more sophisticated: check whether all your threads have failed, then exit and push the next scheduled update out an hour, or a day. That keeps your system from retrying over and over and wasting resources when the remote web servers might be down for the weekend.
Logging is one aspect of software that seems to be overlooked a lot. You should log all kinds of data points when creating a new program, then remove the debug logging after your software has been deployed for a month or two. Keep an eye on your logs to make sure the program continues to run; problems such as memory leaks or crashes can cause your service to stop completely, and you’ll want the logged exception so you can fix it. Your temp files might also fill up the hard drive of your server (watch out for this), so make sure any temp files you create are properly removed when they are no longer needed.
When I use temp files, I typically make the same object responsible for both creating and removing the temp file. That object is wrapped in a “using” block and implements the IDisposable pattern to make sure the temp file is removed even if the program exits early. See one of my previous posts on how this is accomplished: Writing a Windows Service with a Cancellation Feature.
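For readers outside .NET, the same create-and-clean-up ownership pattern exists in most languages; here is a sketch using a Python context manager (the analogue of C#’s using/IDisposable), with `TempDataFile` being my own illustrative name:

```python
import os
import tempfile

class TempDataFile:
    """Owns a temp file for its whole lifetime: created on enter, removed on exit,
    even if the body raises. Analogous to C#'s using block + IDisposable pattern."""

    def __enter__(self):
        fd, self.path = tempfile.mkstemp(suffix=".dat")
        os.close(fd)                     # caller reopens the path as needed
        return self.path

    def __exit__(self, exc_type, exc, tb):
        if os.path.exists(self.path):
            os.remove(self.path)         # cleanup runs on success and on error
        return False                     # do not swallow exceptions
```

Because removal lives in `__exit__` (the C# equivalent being `Dispose`), every exit path through the block, normal or exceptional, deletes the file, which is what keeps temp files from slowly filling the server’s disk.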
Windows Service Cancellation
One of the first things I do when I write a data-updating application is identify how I can recover if the application crashes in the middle of its work. Can I clean up any partial downloads? Sometimes you can set a flag in a record to indicate that the data has finished downloading. Sometimes there is a count of how many records are expected, which you can verify against the actual number present. Sometimes you can compare a date/time stamp on the data to download against what is in the database. The point is, your program will probably crash, and it will not crash at a convenient point where all the data being downloaded is completed and closed up nice and neat. The best place to perform a cleanup is right at the beginning, when your program starts up.
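The record-count check described above amounts to a simple startup comparison. A hypothetical sketch (the function name and the shape of the two count maps are my own, for illustration):

```python
def find_incomplete_downloads(expected_counts, actual_counts):
    """Compare the record count each account was supposed to deliver against what
    actually landed in the database; anything short gets re-downloaded on startup."""
    return [acct for acct, expected in expected_counts.items()
            if actual_counts.get(acct, 0) < expected]
```

Run this once at service start-up, before scheduling any new work, and queue the returned accounts for a fresh download; that way a mid-download crash heals itself on the next run.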
After you get your crash-proof data accuracy working, make sure your service application can be cancelled at any point. Check the cancellation token inside every long-running loop. If you are inside a Parallel.ForEach or similar loop, you’ll need to call loopState.Break(), not return or break. Test stopping your service at various points in your program; some points might take time (like a large database operation), but most stop requests should complete almost instantly. Getting this right will help you stop your process cleanly when you are testing after you deploy your system to a production environment.
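The check-the-token-every-iteration pattern looks like this in sketch form; here a Python `threading.Event` plays the role of the C# CancellationToken, and `process_accounts` is a hypothetical worker:

```python
import threading

def process_accounts(account_ids, cancel: threading.Event):
    """Process accounts one at a time, checking the cancellation flag on every
    iteration so a stop request never has to wait for the whole batch."""
    done = []
    for account_id in account_ids:
        if cancel.is_set():
            break                        # analogous to loopState.Break() in Parallel.ForEach
        done.append(account_id)          # stand-in for the download + insert work
    return done
```

The key point carries over directly: the cancellation check lives inside the loop body, so the longest a stop request can wait is one iteration’s worth of work, not the whole batch.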
Testing a New System
Normally a company that provides an API will have test connections available. However, this might not be enough, because the test data is typically very small. If you already have accounts and are allowed to test with real data (for example, when the data you are working with is not classified or restricted), then you can set up a test database with a connection to actual data. If you are not allowed to access real data, you’ll need to build a test system of your own.
In most of these instances I dummy out the connection part and put text files in a directory, with the data in each text file representing what would be received from the API. Then I can test the remainder of my program without an API connection. A more thorough test setup would involve standing up an API on your local IIS server: you can write a quick and dirty C# API that reads data from a directory and serves it up when requested. Then you can test your application with data you generate yourself.
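The dummied-out connection is just a class with the same interface as the real API client that reads canned files instead. A minimal sketch (the `StubApi` name, the one-JSON-file-per-account layout, and the `fetch` method are my own illustrative choices):

```python
import json
import os

class StubApi:
    """Drop-in replacement for the real API client during testing: each account's
    response is a canned JSON file named <account_id>.json in `data_dir`."""

    def __init__(self, data_dir):
        self.data_dir = data_dir

    def fetch(self, account_id):
        """Return the canned response for one account, as the real client would."""
        path = os.path.join(self.data_dir, f"{account_id}.json")
        with open(path) as f:
            return json.load(f)
```

As long as the rest of your service talks to the client only through `fetch`, you can swap the stub in for the real connection and exercise every downstream step, parsing, temp files, database inserts, without touching the network.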
Make sure you test large data sets. This is something that has bitten me more than once. Normally you can find a pattern in your data and write a program to generate megabyte-sized files. Then test your program to see whether it can handle large data sets, and record the timing difference between smaller data sets and larger ones. This gives you an estimate of how long your program will take to download data once it is deployed (assuming you know the file sizes of the production data).
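Generating those megabyte-sized files is a one-off script. A sketch, assuming a simple CSV shape for the account data (the column names and row pattern here are made up for illustration):

```python
import csv
import os

def generate_test_file(path, target_bytes):
    """Write a CSV of synthetic account rows until the file reaches target_bytes,
    so the importer can be timed against realistically sized data."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["account_id", "name", "balance"])
        i = 0
        while f.tell() < target_bytes:   # stop once the file is big enough
            writer.writerow([i, f"customer-{i:07d}", f"{(i % 9999) / 100:.2f}"])
            i += 1
    return os.path.getsize(path)
```

Generate a small file and a large one, run both through your import path, and the timing ratio gives you the deployment estimate mentioned above.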
I created this blog post to give the uninitiated a sense of what they are in for when writing a service to handle large data downloads. It hardly scratches the surface of the problems such an application will run into, but these are some of the most common issues, and the solutions, that I have learned over the years.