Second Life Grid Update from FJ Linden

Back with the monthly grid update.  It’s been a bumpy few weeks, with Level 3 outages and central database issues.  The good news is that LLnet (our data center fiber network) remains ahead of schedule, and we should be starting traffic migration in the next week.  We’ve also made some headway on asset storage.  Right now, central database issues are our core focus; they have been at the center of most of the recent grid problems.

LLnet
LLnet has two benefits.  It removes our dependency on VPNs for traffic between data centers, and it lays the foundation for using diverse internet providers, which will allow us to ride out an outage on a single provider (currently Level 3) and potentially improve latency.  Most of our widespread and highest-impact outages have been network related, and that is why LLnet has been my top priority since joining Linden Lab this past summer.  I expect final testing to be complete by the end of January, with production traffic cutover immediately after.

Improving Asset Storage
In the meantime, we have also been working to significantly reduce load on the Isilon storage clusters.  I indicated last month that we would discuss this further, so I want to touch on our storage strategy.  We’ve actually been working in a tiered storage environment for a number of months.  The Isilons act as our primary storage, holding the assets that are accessed on a regular basis.

As you may imagine, however, most assets are either accessed very infrequently or not at all.  To determine how often assets are used, we’ve been running a detailed “collection” process.  This process identifies rarely used (or dead) assets and moves them to bulk storage, off the primary Isilon hardware.  This is critical to the stability of the Isilons: we have been pushing the storage limits of these clusters, and a large number of assets in the “not frequently accessed” category have been taking up critical capacity.  Moving these to bulk storage will not only provide the necessary headroom and improved reliability, it will also place each asset on the right type of storage for its usage.  We’ve also been using file compression on the Isilons as a “mid-tier” storage category, where we keep assets on the Isilons for faster access but minimize the actual space used.
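To make the tiering idea concrete, here is a minimal sketch of how a collection pass might assign an asset to a tier based on when it was last accessed.  This is an illustration only, not our actual collection code: the names (Tier, AssetStats, choose_tier) and the 30-day and 365-day windows are assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Tier(Enum):
    PRIMARY = "primary"        # uncompressed, on the primary cluster
    COMPRESSED = "compressed"  # "mid-tier": compressed, still on the primary cluster
    BULK = "bulk"              # moved off the primary cluster to bulk storage


@dataclass
class AssetStats:
    asset_id: str
    last_access: Optional[datetime]  # None means the asset was never accessed


def choose_tier(
    stats: AssetStats,
    now: datetime,
    hot_window: timedelta = timedelta(days=30),    # hypothetical threshold
    cold_window: timedelta = timedelta(days=365),  # hypothetical threshold
) -> Tier:
    """Pick a storage tier from how recently the asset was accessed."""
    if stats.last_access is None:
        return Tier.BULK  # dead asset: never accessed
    age = now - stats.last_access
    if age <= hot_window:
        return Tier.PRIMARY
    if age <= cold_window:
        return Tier.COMPRESSED
    return Tier.BULK


if __name__ == "__main__":
    now = datetime(2009, 1, 1)
    for asset in (
        AssetStats("recent", now - timedelta(days=5)),
        AssetStats("stale", now - timedelta(days=100)),
        AssetStats("dead", None),
    ):
        print(asset.asset_id, choose_tier(asset, now).value)
```

A real collection process would presumably weigh more than a single timestamp (access counts and asset size, for example), but the shape of the decision is the same: regularly used assets stay on primary storage, and everything else is compressed or moved out.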

HTTP Dataserver and Agent Inventory Services
A quick update on a couple of our data access layer projects: HTTP Dataserver and Agent Inventory Services.  Both of these projects are close to completion.  You may recall from my previous posts that we are trying to simplify the messaging protocols between the Simulators and the back end databases (HTTP Dataserver), as well as the messaging from the databases to the viewer (Agent Inventory Services).  Implementation depends on a central server code release that we expect to deploy by the end of January, with these two projects following for release in February.  Also, expect a follow-up blog post from one of our infrastructure leads, Sardonyx Linden, giving more details on our data architecture direction.

Read about the Central Database in the January post
Finally, I’ve purposely not addressed our database issues, as I want to devote the January update to that component of the infrastructure.  Our central database has been a source of instability over the past few weeks, and we have been spending considerable time investigating root causes.  Given the complicated nature of the service, none of these issues has been easy to identify, but I’m expecting that we will have answers over the next few weeks, and I’ll comment on the issue in the forum thread.  Please post your Grid-related questions for me there.

Frank

This entry was posted in Grid Stability and Reliability, Resident Experience.