Fault tolerant != Uncrashable

Some information about the system outage last week:

Last Thursday, around 1:30 AM SLT, the asset server crashed. The asset
server is an essential component of the SL cluster: residents first
noticed slowdowns and missing assets (textures, avatar appearance,
etc.), and then the entire grid had to be taken offline. It was a long
and painful night for the on-call responders.

The asset server is on a fault-tolerant distributed filesystem. On one
hand, this makes Thursday’s crash pretty mysterious: we’re not sure
exactly why it went down. On the other hand, the asset server has never
crashed like this before, so it’s been doing a fairly good job of
surviving disk failures and the like.

We’re still investigating what caused the crash and how to prevent it
from happening again. Going forward, we’re also considering different
configurations for system-critical data storage.

~~ beez Linden

This entry was posted in Operations.

1 Response to Fault tolerant != Uncrashable

  1. eggy lippmann says:

    OK, sorry if I’m pointing out the obvious here…
    Are all the nodes in your asset cluster perfectly homogeneous? As in, same hardware, same software, and were they all bought within a narrow timeframe?
    This would explain why the whole cluster crashed simultaneously 🙂
    In my experience, hardware has a rather narrow MTBF window. If you have two computers with similar usage patterns (proper load-balancing ensures that you do), they will often die around the same time.
    Hardware/software monocultures additionally introduce single points of failure. If something tickles a bug on one node, then it’s likely to occur on another.
