One of the big challenges in running a small software startup is service availability and support. This was, perhaps, a little easier in the bad old days when software shipped in boxes and "support" meant picking up a phone between 9 and 5 on business days, then mailing out a CD or floppy disk (remember those?) with updated bits. Nowadays, much of the software people are building "ships" as a service running 24×7 on the web, or at least has an online service component to it. What’s more, customers increasingly reside around the world, and expect near-realtime responses, especially if their business depends on your software service. You can imagine, then, that for a 1- or 2-person dev shop, making a web service highly available to customers is a particularly daunting challenge.
There are "business" solutions you can use, including limiting your support service level agreement and hiring more support people, perhaps in different time zones around the world. These may be appropriate depending on your customer needs and your budget.
You can also throw technology at it. Here’s what we are trying to do in that vein, on 5 Blocks Out:
(1) Ship quality software. This is motherhood and apple pie, but bears repeating. Lots of software companies still imagine they can somehow cut costs by shipping half-baked code. Sorry, it doesn’t work. If you don’t invest in writing good code in the first place, you will pay orders of magnitude more for it later in support costs and frustrated customers. The only valid exception I can think of is prototype code.
I feel like we have lots more to do here, but for small fry we’re doing well thus far. We try to be thoughtful and minimalist about what we build in the first place. Then, if it gets built, it gets tested. Rails’ built-in testing facilities and various plugins like test/spec help a great deal. Capistrano (for automated deployment and rollback) and Firebug have also proven to be vital. Next on the list is eycap.
(2) Outsource work to a Web Host. For a software startup, the question is not whether to do this, but with whom. Outsourcing the heavy lifting of buying/building and maintaining server farms, managing network bandwidth, and handling some of the basic application-level services is a no-brainer. Pay as you go, and use the time saved to focus on your core competencies. If and when your business gets big enough, you can always pull some of the strategic responsibilities in-house.
We currently use HostingRails. Generally speaking, their servers run well and their rates are reasonable. I have run into reliability problems lately, though, and despite responsive support it is time to consider other hosts. Slicehost, 3Tera, and the increasingly fashionable Amazon S3 are on the "must review" list. I’ve also taken a brief look at Google App Engine, and concluded its platform is too high-level for our needs.
(3) Automate monitoring. Automated watchdogs can stay up 24×7. You, my friend, cannot. Put down the Red Bull and get yourself a suite of automated watchdogs to monitor your software’s health.
We’re getting going on this now. For starters, the Exception Notification plugin has proven itself mighty useful for after-the-fact diagnosis and debugging of Rails apps. If you’re running a Rails app, I highly recommend this plugin.
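To make the idea concrete, here is a minimal Python sketch of exception notification in general. It is not the Rails plugin's API: the decorator name `notify_on_exception` and the `send(subject, body)` callable are hypothetical, and in a real app `send` would be whatever delivers mail or SMS for you.

```python
import functools
import traceback
from typing import Callable


def notify_on_exception(send: Callable[[str, str], None]):
    """Wrap a function so unhandled exceptions are reported via `send`.

    `send(subject, body)` is a hypothetical delivery hook (email, SMS, ...).
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                subject = f"[ERROR] {func.__name__}: {type(exc).__name__}: {exc}"
                body = traceback.format_exc()  # full stack trace for later diagnosis
                try:
                    send(subject, body)
                except Exception:
                    pass  # a failing notifier must not mask the original error
                raise  # re-raise the original exception so normal handling continues
        return wrapper
    return decorator
```

The wrapper only reports and re-raises; it does not swallow the error, so the request still fails the way it would have without the notifier.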
Next, you need some third-party ping action. We are currently trying out Pingdom; Troy tells me Alertra is also supposed to be a good-quality monitoring service. The idea behind these services is to hit your site every minute (or 5 minutes, or 15… you decide) from a series of servers around the world, and let you know whether and how quickly your service is responding to HTTP, pings, and so on. When something goes wrong you get notified by email and/or SMS text message. Pingdom also collects and aggregates the data so you can look at reports over time.
Happily, Pingdom has already paid for itself: immediately after turning it on I found our service was bouncing daily (well, middle-of-nightly), sometimes down for many minutes at a time. HostingRails gave us a month of additional free hosting as compensation, and promised to watch the server more closely. We’ve had zero downtime since then.
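For the curious, the basic check these ping services run can be sketched in a few lines of Python. This is only an illustration of the idea, not how Pingdom or Alertra actually work: the names `check_once`, `monitor`, and `CheckResult` are made up, it runs from a single machine rather than servers around the world, and `notify` stands in for whatever sends your email or SMS.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    url: str
    ok: bool
    status: Optional[int]        # HTTP status, or None if no response arrived
    elapsed_ms: float            # how long the attempt took
    error: Optional[str] = None


def check_once(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch `url` once and record whether, and how quickly, it answered."""
    start = time.monotonic()
    status, error = None, None
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:   # server answered with 4xx/5xx
        status, error = exc.code, str(exc)
    except OSError as exc:                  # DNS failure, refused, timeout, ...
        error = str(exc)
    elapsed_ms = (time.monotonic() - start) * 1000.0
    ok = error is None and status is not None and 200 <= status < 400
    return CheckResult(url, ok, status, elapsed_ms, error)


def monitor(url: str, interval_s: float, notify: Callable[[CheckResult], None],
            checks: int, check=check_once, sleep=time.sleep) -> None:
    """Run `checks` probes, calling `notify` only when the site changes state."""
    was_up = True  # assumption: the site is up when monitoring starts
    for i in range(checks):
        result = check(url)
        if result.ok != was_up:   # went down, or came back up
            notify(result)
            was_up = result.ok
        if i < checks - 1:
            sleep(interval_s)
```

Something like `monitor("https://example.com/", 60, print, checks=5)` would probe a placeholder URL once a minute and print a result only when the site goes down or comes back. Notifying on state changes rather than on every failed probe keeps a long outage from flooding your phone.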
Going a level deeper is also essential: you need realtime monitoring built into the service itself, so that you can detect and resolve problems when their symptoms first surface, or even before then. Monit, God, and Munin are on my "must review" list for this sort of thing.
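As a rough illustration of catching symptoms early, here is a small Python sketch that compares a couple of host metrics against limits. I have not reviewed how Monit, God, or Munin do this, so treat it as the general threshold idea only; the function names, metric names, and limits are all made up for the example.

```python
import os
import shutil
from typing import Dict, List


def collect_metrics(path: str = ".") -> Dict[str, float]:
    """Gather a few cheap health readings from the host."""
    usage = shutil.disk_usage(path)
    metrics = {"disk_used_pct": 100.0 * usage.used / usage.total}
    try:
        metrics["load_1m"] = os.getloadavg()[0]   # Unix only
    except (AttributeError, OSError):
        pass  # load average unavailable on this platform
    return metrics


def find_problems(metrics: Dict[str, float], limits: Dict[str, float]) -> List[str]:
    """Return a human-readable line for every metric that is missing or over its limit."""
    problems = []
    for name, limit in limits.items():
        value = metrics.get(name)
        if value is None:
            problems.append(f"{name}: no reading")
        elif value > limit:
            problems.append(f"{name}: {value:.1f} exceeds limit {limit:.1f}")
    return problems
```

Run from cron or a background thread, something like `find_problems(collect_metrics(), {"disk_used_pct": 90.0})` would flag a disk that is filling up before the service actually falls over.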
I’d love to hear back on other tools and tactics startup developers are using to build and run high-availability software services. Without losing sleep, that is.