I’m Neuro Linden, one of the new team of Production Operations engineers here at Linden Lab tasked with assisting our support and operations staff with maintaining the health of the grid. I’d like to supply more information about the Second Life service problems of the past few days.
There were two primary problems: a change to the way we perform rolling simulator updates, and a hardware failure in the VPN servers that allow distant Second Life servers to communicate. I’ll describe both of these failures in detail below.
We’ve recently changed the method used to deploy rolling simulator updates, in an effort to speed up and stabilize the process. Unfortunately, when we used the new system on Monday, it left many regions shut down, unable to report their status to our core systems, or running the old software version. The release team has identified several bugs in this new system and is working to correct them before we use it again. Thursday’s rolling restart will use the old system.
While developers and operations staff were recovering the regions affected by the new update method (at about 8pm PDT), a fault developed with two of our VPN servers. Operations recovered one successfully, but at 11:30pm PDT the second suffered a hardware failure while we were attempting to recover it, so we set out to replace it. While we were configuring the replacement machine, it suffered a ‘stall’ that required us to intervene physically, and we weren’t able to get it up and running until 6am PDT. This returned our VPN to full service, and most of the affected regions came back to life.
Some regions still had issues after this time, and our developers and production operations staff worked through the morning recovering them, with the bulk recovered by around 9am PDT.
We offer our apologies to all residents affected by these problems. I would like to reiterate Ian Linden’s previous comments that we are working on several projects to improve the reliability of Second Life, one of which is replacing our VPN systems with more robust technology.