When I posted about our efforts to improve Second Life reliability, a common response was “sure, but the proof is in the pudding.” And so it is. The first step in demonstrating our commitment is to be transparent about whether we succeed or fail in delivering on our reliability promises. To this end, we’ve created a new Service Quality Metrics page, where you can find objective statistics on how well Second Life is running.
The first statistic we’ve published covers large-scale problems: it measures how much usage was lost to major events affecting large numbers of Residents, such as the VPN and database problems of the past few months. It gives an imperfect but consistent view of the severity of system-wide outages. It does not cover localized events such as region and client crashes, or inventory loss. We will add more data covering those events in the future (including some metrics already available in Meta Linden’s Key Metrics, but in a more readable format).
Where do we get these numbers? Second Life is large enough now that it’s possible to predict how many Residents would normally be logged in at any given moment, so we can compare this prediction to reality when a problem occurs, and estimate how many people are unable or unwilling to use the system.
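For readers who want to see the arithmetic, here is a minimal sketch of that comparison. It is not our actual tooling: the `Sample` type, the function names, the five-minute sampling interval, and the numbers in the usage note are all assumptions made up for illustration. The idea is simply to add up the shortfall between predicted and actual concurrency over the course of an event.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One concurrency measurement (hypothetical structure, for illustration)."""
    minute: int        # minutes since the start of the window
    predicted: float   # Residents we would normally expect to be logged in
    actual: float      # Residents actually logged in


def lost_resident_hours(samples: List[Sample], interval_minutes: float = 5.0) -> float:
    """Sum the shortfall below the prediction, expressed in Resident-hours.

    Samples where actual usage meets or exceeds the prediction count as zero
    loss, so a busy period does not cancel out an outage.
    """
    total = 0.0
    for s in samples:
        shortfall = max(0.0, s.predicted - s.actual)
        total += shortfall * interval_minutes / 60.0
    return total


def percent_usage_lost(samples: List[Sample]) -> float:
    """Shortfall as a percentage of the usage we predicted for the window."""
    expected = sum(s.predicted for s in samples)
    if expected == 0:
        return 0.0
    lost = sum(max(0.0, s.predicted - s.actual) for s in samples)
    return 100.0 * lost / expected
```

With made-up samples of 1000 predicted Residents at each of four five-minute marks and 1000, 400, 700, and 1100 actually online, the shortfalls are 0, 600, 300, and 0, which works out to 75 Resident-hours lost, or 22.5% of predicted usage for that window.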
So, when looking at the service outage graph, note the big spikes in loss due to unplanned outage in July and August. I addressed some of the causes in an older post, but I’d like to provide more objective data on this. Here is the breakdown of the root causes of all unplanned downtime for the period of June, July, and August.
Clearly five main issues stand out:
* “Database Crash” is a fairly straightforward event wherein one of the critical Second Life databases crashes, leaving logins and many in-world operations blocked while the database restarts.
* “Release Overtime” occurs when a planned update takes longer than scheduled; we’ve been too optimistic with our release schedules recently and this category has grown.
* “VPN Failure” is a breakdown of a VPN link which provides connectivity between distant Second Life servers. This has been a common failure recently due to a bug in new VPN technology which we deployed and then had to roll back.
* “Network Failure” is the failure of one of our bandwidth providers to reliably move traffic between points on the Internet, which disrupts our fragile VPNs.
* “Power Outage” refers to a loss of electrical power in one of our datacenters. This is rare and happened once during this three-month period.
As discussed in my prior post, we have plans in place to significantly reduce the impact of the top four offenders in the near term, and the recurrent VPN failures have already been partially mitigated. The smaller targets will be more challenging to deal with, but we have plans for them too. Likewise, planned downtime is a consistent headache, and we will continue to reduce it by improving our release process.