For the curious, I want to shed more light on the causes of the stability problems Second Life has experienced over the past four days. First, some background: the two facilities which currently make up the grid use encrypted IPSec tunnels (a form of VPN) to securely communicate with each other over the Internet. This is what allows, say, a simulator in Dallas to query our databases in San Francisco. Needless to say, if these tunnels don’t work, SL doesn’t work very well either.
On Wednesday, we experienced a failure wherein our primary bandwidth provider started dropping about 95% of our IPSec traffic bound from one site to another, but only for some of the tunnels. Other, non-IPSec traffic was largely unaffected, so this was a very difficult problem to diagnose and, crucially, to get the provider to fix. While waiting for that, we started moving banks of simulators onto unimpaired tunnels. Once a good percentage of SL was back up, we enabled logins, but we didn’t get all the sims back up until the provider fixed the problem.
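To make the triage step concrete, here is a minimal sketch of how you might classify tunnels as healthy or impaired from probe results, so simulators can be moved onto the good ones. This is purely illustrative: the function names, tunnel labels, and 5% loss threshold are my own assumptions, not Linden Lab's actual tooling.

```python
# Hypothetical sketch: classify IPSec tunnels by observed probe loss.
# All names and thresholds here are illustrative assumptions.

def loss_rate(sent: int, received: int) -> float:
    """Fraction of probes lost on a tunnel."""
    if sent == 0:
        return 0.0
    return 1.0 - (received / sent)

def classify_tunnels(probe_results: dict, max_loss: float = 0.05) -> dict:
    """Split tunnels into 'healthy' and 'impaired' by observed loss.

    probe_results maps tunnel name -> (probes sent, replies received)."""
    healthy, impaired = [], []
    for tunnel, (sent, received) in probe_results.items():
        if loss_rate(sent, received) > max_loss:
            impaired.append(tunnel)
        else:
            healthy.append(tunnel)
    return {"healthy": healthy, "impaired": impaired}

# Example: one tunnel dropping ~95% of probes, one clean.
results = classify_tunnels({"dfw-sfo-1": (100, 5), "dfw-sfo-2": (100, 99)})
print(results)  # → {'healthy': ['dfw-sfo-2'], 'impaired': ['dfw-sfo-1']}
```

The point of probing per tunnel rather than per link is exactly the failure mode described above: the underlying connectivity looked fine, and only some encrypted tunnels were being dropped.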
At 3AM on Thursday, we had trouble again, and two failures actually overlapped. Our provider started dropping traffic again, with a different set of tunnels affected, and, around the same time, our largest tunnel failed entirely. This was apparently due to a bug in the router model we’re using, which the vendor believed had been fixed but, it seems, had not been. I don’t have all the answers on that yet, but we’re now in the process of taking those routers out of service. It was several hours before the bandwidth provider fixed the traffic problem and we were able to get the failed routers back online.
While all of this has been going on, and continuing still today, a serious bug in the Second Life software has been causing 5-minute lock-ups of the communication channel between simulators and our various databases. This is a difficult bug to describe, but essentially, we see a very tiny percentage of database queries stall indefinitely after returning their data. Our software notices after five minutes and restarts the database client, but by then some queries have usually been lost. This causes affected sims to occasionally (or in a few cases regularly) experience inventory problems, search problems, transaction problems, the works. Without any code changes, the bug seems to have gotten more common over the past week; on Friday evening the engineers working on the problem identified a potential fix, which we’ll be testing over the weekend.
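The recovery behavior described above is a classic watchdog pattern: run each query through a client, and if a query stalls past a deadline, tear the client down and build a new one, accepting that the in-flight work is lost. Here is a small sketch of that pattern. The class name, the demo client, and everything except the five-minute figure are my own illustrative assumptions, not Second Life's actual code.

```python
# Hypothetical sketch of a query watchdog: if a query stalls past a
# deadline, restart the database client and count the query as lost.
import threading
import time

STALL_TIMEOUT = 300.0  # five minutes, matching the behavior described

class QueryWatchdog:
    def __init__(self, make_client, timeout=STALL_TIMEOUT):
        self._make_client = make_client
        self._timeout = timeout
        self._client = make_client()
        self.restarts = 0

    def run_query(self, query):
        """Run a query; on a stall past the timeout, restart the client.

        Returns the result, or None if the query was lost to a stall."""
        result = []
        done = threading.Event()

        def worker():
            result.append(self._client(query))
            done.set()

        threading.Thread(target=worker, daemon=True).start()
        if done.wait(self._timeout):
            return result[0]
        # The query stalled: restart the client; the in-flight query is lost.
        self.restarts += 1
        self._client = self._make_client()
        return None

# Demo with a simulated client and a short timeout (0.1 s instead of 300 s):
def _make_demo_client():
    def client(query):
        if query == "stall":
            time.sleep(10)  # simulate a query that never comes back
        return query.upper()
    return client

watchdog = QueryWatchdog(_make_demo_client, timeout=0.1)
print(watchdog.run_query("select 1"))  # → SELECT 1
print(watchdog.run_query("stall"))     # → None (client restarted)
print(watchdog.restarts)               # → 1
```

The weakness the post describes falls straight out of this design: the watchdog restores service, but any queries that were queued behind the stalled one can be dropped, which is how a rare stall turns into visible inventory, search, and transaction problems.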
Finally, there was an unrelated problem Friday night: a disk drive in one of our databases failed in a peculiar way, which caused it to stay in service but perform far too slowly to keep up with load. (Normally drives which fail are automatically shut down by our RAID systems.) Logins and some residents’ inventory remained unreliable while we diagnosed the problem.
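This "slow but not dead" failure mode is worth a sketch, because it explains why the RAID system didn't help: controllers typically eject a drive on hard I/O errors, but a drive that answers every request, just very slowly, can pass that check while dragging the whole array down. One mitigation is to compare each drive's service time against its peers and flag outliers. The function, drive names, and 5x-median threshold below are illustrative assumptions, not a description of Linden Lab's monitoring.

```python
# Hypothetical sketch: flag drives that respond, but far more slowly than
# their peers in the same array. Names and thresholds are illustrative.

def find_slow_drives(service_times_ms: dict, factor: float = 5.0) -> list:
    """Flag drives whose average service time exceeds `factor` times the
    median average across all drives in the array.

    service_times_ms maps drive name -> list of recent service times (ms)."""
    averages = {d: sum(ts) / len(ts) for d, ts in service_times_ms.items()}
    ordered = sorted(averages.values())
    median = ordered[len(ordered) // 2]
    return [d for d, avg in averages.items() if avg > factor * median]

# Example: sdc answers every request, but an order of magnitude too slowly.
print(find_slow_drives({
    "sda": [4.0, 5.0, 6.0],
    "sdb": [5.0, 4.0, 5.0],
    "sdc": [180.0, 220.0, 200.0],
}))  # → ['sdc']
```

Comparing against the array's own median, rather than a fixed limit, keeps the check meaningful across arrays with different baseline workloads.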
The end result is a pretty rough week for Second Life Residents, and I can only offer apologies for that. The good news is that we have many projects in the works which will help improve the reliability of SL. We’re working on new infrastructure which will help us withstand the failures above, as well as store and transmit data more reliably, and we’re working on reducing both client and server crashes. Several of these projects are new, thanks to our growing technical team. We’ll share more information on these efforts in the near future.