Weekend Grid Outages

Although the second half of 2008 showed a significant reduction in usage hours lost to outages (we reduced outage hours by over 50%), stability challenges on the Grid have increased over the past month. This weekend was especially painful, and it marked the first full MySQL crash I’ve experienced since joining Linden Lab (it occurred just after 4pm PT on Sunday).

When the central database crashes, it takes approximately one hour to rebuild tables and indexes before it can accept queries and become fully functional again. This is the main reason we employ the painful triage process of temporarily blocking logins while the database is in an overloaded state. These 5-10 minute “blocking periods” have substantially less impact on Residents than a full database crash followed by a 50-60 minute restart cycle. Neither is acceptable, however, and I want to keep you updated on our efforts to stabilize the infrastructure.

We focus a great deal on the central database, but there are many other interdependent infrastructure components and services that have also been contributing to our stability problems. One of the positives to come out of this weekend’s outages was the chance to gather data and complete a detailed analysis of the patterns that have been causing failures during our highest-load periods. In addition to having our best development and operations people watching Grid activity this weekend, we’ve brought in some of the best MySQL professional services teams to help us tune and optimize, and to recommend long-term architectural changes. As the leader of our operations and infrastructure team, my immediate priority is to tune and optimize queries so that we can once again handle Resident transactions during peak load. This work focuses mainly on validating configurations (some of which were found to be in error this weekend) and on moving high-load query processes off the central database and onto slave databases that have more headroom at peak, essentially spreading the load and protecting the central database.
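
To make the read/write split concrete, here is a minimal sketch in Python of the kind of routing described above: writes continue to hit the central master, while heavy read-only queries go to a randomly chosen slave. This is an illustration, not our actual code; the hostnames, credentials, table name, and the MySQLdb driver are all assumptions for the example.

    import random
    import MySQLdb  # assumes the MySQLdb / mysqlclient driver is installed

    MASTER = {"host": "central-db.internal", "user": "app", "passwd": "secret", "db": "main"}
    SLAVES = [
        {"host": "replica-1.internal", "user": "app", "passwd": "secret", "db": "main"},
        {"host": "replica-2.internal", "user": "app", "passwd": "secret", "db": "main"},
    ]

    def get_connection(read_only=False):
        # Route read-only work to a slave with spare headroom at peak;
        # everything else still goes to the central master.
        target = random.choice(SLAVES) if read_only else MASTER
        return MySQLdb.connect(**target)

    # Example: a heavy reporting query that is safe to run against a replica.
    conn = get_connection(read_only=True)
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(*) FROM transactions "
                   "WHERE created_at > NOW() - INTERVAL 1 DAY")
    print(cursor.fetchone())

Because slaves replicate asynchronously, only queries that can tolerate slightly stale data are good candidates to move; writes, and reads that must see their own writes, stay on the master.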

In parallel, a separate engineering team is poring over the existing code base and developing a long-term strategy for data services that will scale properly. I’m including a write-up below from one of our engineering leads detailing some of our efforts to re-architect the service. I’ll also be monitoring the forums and responding to your questions. As I have said in previous posts, our execution and delivery on promises of stability are what count, but I also want to communicate openly, even when the message is a difficult one to deliver.

Please join me in the forums, where you can post your Grid-specific questions.

Here is a more detailed view of our ongoing development efforts (authored by our Systems Infrastructure development lead, Sardonyx Linden):

When we started building Second Life, the unique nature and scale of the challenge we set ourselves posed us many difficult questions. Among our difficulties was getting to grips with our data model: we started out by writing SQL queries against a single central database, and we added tables and columns whenever we needed new functionality. This intentional lack of architecture gave us a wonderful means to bootstrap ourselves: we had our hands full creating the machinery of a virtual world, and focusing on the perfect data architecture too early would have been inappropriate.

As Second Life has grown, our data model has matured, and we are moving away from this one-database-fits-all model. There are two reasons for this. At some point, a single database (even with numerous replicas) will clearly not be able to keep up with the increasing query load. In addition, a clean internal architecture makes the system easier for our engineering and operations teams to maintain, extend, and scale.

Our existing data layout is sprawling: there are more than 100 tables in our main databases. This means that we have to be careful in choosing the order in which to reconstruct data services: we pick the busiest and most important services first. For instance, the vibrant nature of the Second Life economy generates a heavy query load, so Linden Dollar transactions are among our early targets for conversion. Developing an internal REST-based Linden Dollar API has been a substantial process. We distilled over a hundred scattered SQL queries into a small, elegant interface. We developed correctness and stress tests for the interface. We converted simulators, other daemons, batch scripts, and data warehousing tools to the new APIs. With numerous short cycles of development and testing, we ensured that the new code base stayed close to our main line of development throughout. There are still databases behind the new API, but we can now partition the data and scale to accommodate heavier load without touching any of the code that acts as a client of this API. We will be rolling out the new API on a limited scale over the coming months. Residents should see no changes as a result of this work.
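
As a purely illustrative sketch of what a client of such a REST-style transaction service might look like, the snippet below POSTs a balance transfer as JSON. The endpoint URL, field names, and response shape are invented for the example and are not the actual internal API.

    import json
    import urllib.request

    def transfer_linden_dollars(payer_id, payee_id, amount, description):
        # Build a JSON body describing the transfer and POST it to a
        # hypothetical internal transaction endpoint.
        payload = json.dumps({
            "payer": payer_id,
            "payee": payee_id,
            "amount": amount,
            "description": description,
        }).encode("utf-8")
        request = urllib.request.Request(
            "http://transactions.internal.example/transactions",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read())

The value of the abstraction described above is that callers see only the HTTP interface, so the data behind it can be partitioned and scaled without changes to simulators, daemons, or batch scripts.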

We have other, similar projects underway to give us cleaner, more modular access to other critical infrastructure, such as agent inventory (“where’s my stuff?”) and space services (“what piece of the world should a simulator own?”). These initiatives will help us to provide a more stable and responsive Second Life experience, even as our user base continues to grow.

In addition, we keep a close eye on high-quality open source technologies for internal use, so that we can give the engineers behind Second Life the best tools to work with. Sometimes these technologies augment or replace older approaches. For instance, we have adopted Django as the framework for most of our internal web development. We chose Django after a comprehensive bake-off in which we compared the performance and elegance of an application developed under several popular Python web frameworks. In other cases, we see gaps in our internal service offerings that we would like to fill, such as fast, robust messaging, and we are actively building benchmarks and experience with contenders in those areas.
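
For readers unfamiliar with Django, here is a deliberately tiny sketch of the kind of internal web view the framework makes easy to write; the view name and the fields in the response are invented for illustration and are not part of any Linden Lab tool.

    import json
    from django.http import HttpResponse

    def grid_status(request):
        # Return a small JSON summary that an internal dashboard could poll.
        # The values here are hard-coded placeholders for the example.
        status = {
            "logins_enabled": True,
            "central_db_healthy": True,
        }
        return HttpResponse(json.dumps(status), content_type="application/json")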
