We’ve just updated the Quality Metrics page, and the numbers show what you already know: April was not a good month for Second Life Grid availability. Our internal outage tracking tool estimates that about 630,000 usage hours were lost to global system failures over the course of the month, which is about 1.9% of the total (up from 0.06% in February and 0.22% in March), and Resident surveys clearly indicate great unhappiness coinciding with these failures. (We define lost usage as the time Residents would have spent logged in but did not, due to Grid failures; it is meant as a global availability metric and does not cover local failures like sim crashes, inventory problems, and the like. See the actual [black] vs. predicted [blue] concurrency graph excerpt, right.) I’d like to address the causes of these failures and, in general terms, what we are doing about them.
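To make the arithmetic behind that estimate concrete, here is a minimal sketch of how a lost-usage figure can be computed from predicted and actual concurrency samples. The function, the sampling interval, and the figures below are illustrative assumptions, not our internal tooling:

```python
# Illustrative sketch only: estimates usage hours lost to an outage by
# integrating the shortfall between predicted and actual concurrency.
# All names and figures here are hypothetical.

def lost_usage_hours(predicted, actual, interval_hours=1.0):
    """Sum each sample's shortfall of actual vs. predicted concurrent
    users, weighted by the sampling interval, in usage hours."""
    return sum(max(p - a, 0) * interval_hours
               for p, a in zip(predicted, actual))

# Hypothetical hourly samples around a two-hour outage: concurrency
# collapses from a predicted ~30,000 users to nearly zero, then recovers.
predicted = [28000, 30000, 31000, 30000]
actual    = [28500,  2000,  1500, 29500]

print(lost_usage_hours(predicted, actual))  # -> 58000.0 lost usage hours
```

At that scale, roughly eleven comparable outages would account for April’s 630,000-hour total.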
What Happened, and Why?
So, on to why April was so tough: virtually every piece of mission-critical Second Life Grid infrastructure failed catastrophically at least once during the month. Here are the biggest sources of downtime; I list these not to shirk responsibility, but to illustrate the near-perfect storm:
- **Intra-Grid Network:** By far the largest event was the near-total loss of intra-grid network connectivity on April 4/5, caused by a line failure within the backbone of our primary network provider – who delayed, botched, and again delayed the fix, stretching what should have been a relatively brief outage into many hours… and then the failure was repeated on a smaller scale on April 25. We’re working with the provider at the executive level to address the risks that allowed this to happen.
- **Central Database:** We encountered a new crash bug in our central database, resulting in a series of four database outages. We have subsequently identified a work-around, and should be able to avoid this particular crash in the future. In general, eliminating this cluster as a scalability bottleneck and failure point is a very high priority for Linden Lab.
- **Asset Storage Cluster:** The asset storage cluster crashed during routine maintenance; this was a repeat of the crashes that afflicted us in December and January. The vendor had previously assured us that this problem was fixed; since it clearly was not, we continue to work with them to investigate the cause and potential solutions.
- **Data Center/Transient Data Services:** Our data center in San Francisco cut power to many of our most critical servers over the course of two days. We had advance notice of this one, but working around the power cuts did cause some disruptions. This was a learning process as well, and we’ve refined our processes for dealing with events like this.
For more information on the systems that make up the Second Life Grid and how their failures impact the platform, please check the Service Disruptions page. Because component failures are never wholly avoidable, our goal (and the goal of online services everywhere) is to reduce the extent to which the Second Life Grid is affected by the loss of key systems. With a platform this technologically complex, adding the necessary redundancy and failure management is a long process, but we have not been standing still: on average, a database crash in April cost about 14,000 lost usage hours, versus 53,000 hours for a similar crash last August (at a time when substantially fewer people were using Second Life). We will continue to reduce the impact of these crashes (on both logins and in-world functionality) until they cease to be significant, at which point they can come off the Service Disruptions page.
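As a rough, back-of-the-envelope illustration of that improvement, using only the per-crash figures quoted above (and ignoring user growth, which would make the comparison look even better):

```python
# Back-of-the-envelope comparison of per-crash impact, using the figures
# quoted above. This understates the improvement, since substantially
# more people were using Second Life in April than last August.
april_per_crash  = 14_000   # lost usage hours per database crash, April
august_per_crash = 53_000   # lost usage hours for a similar crash, August

print(f"April's per-crash impact was "
      f"{april_per_crash / august_per_crash:.0%} of last August's.")
# -> April's per-crash impact was 26% of last August's.
```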
Clearly, though, there is still a great deal of work left to do. Our long-term strategy includes specific plans to eliminate the risks behind each of the failures listed above. We make progress every day, and continue to hire the best and brightest technical staff at our offices in San Francisco, Mountain View, Seattle, Boston, and Brighton. (See our hiring page.) In the meantime, our service record is not perfect, but we are confident that we have identified the key areas for improvement and will continue to move forward.
Finally, I’ll mention our new consolidated status reports page, which replaces system status updates on our primary blog. Look there for all Grid-status information, including upcoming scheduled outages: http://status.secondlifegrid.net/