Frank Ambrose (FJ Linden)'s Blog

Weekend Grid Support

Friday, January 23rd, 2009 by: Frank Ambrose (FJ Linden)

Before we hit the weekend, I wanted to update you on our progress on Grid stability. We have taken a number of major steps this past week to improve database performance, which included offloading high impact queries to slave databases, and completing a master database conversion on Thursday. These actions, along with many other performance tweaks, have significantly improved query response times and reduced load on the central database.

While I am optimistic that these measures are helping to stabilize the Grid, I also want to be ready for any contingency. We have mapped out a detailed response, escalation, and triage plan across the entire company. One of the major changes for this weekend will be the corrective steps we take if the central database becomes overloaded and unstable. Past procedure was to block logins almost immediately, to allow load to subside and protect the database from a complete crash (which is what happened last Sunday). This is a rather drastic approach to protecting the database, and we’ve now put some “throttling” mechanisms in place that will be our first step to reduce load, rather than blocking logins. Throttling literally means turning features and functions off, and potentially degrading the in world experience. One of these throttles would disable group queries to the central database, and would cause group features to be unavailable during that period of time. While degrading the resident experience is not a preferred step, we believe that it is far less punitive than the drastic measure of blocking logins. So if some in world features are temporarily disabled, you would still be able to log in and enjoy other in world features that do not tax the central database.
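
The escalation order described above (turn features off first, block logins only as a last resort) can be sketched roughly as follows. This is a minimal illustration and not the actual Grid mechanism: the load thresholds, the `choose_response` function, and every feature name other than group queries are assumptions invented for the example.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds, expressed as the fraction of central database
# capacity in use. Only "group_queries" is named in the post; the other
# throttles are placeholders.
THROTTLES = [
    (0.80, "group_queries"),
    (0.85, "search"),
    (0.90, "profiles"),
]
BLOCK_LOGINS_AT = 0.95  # assumed last-resort threshold


@dataclass
class GridResponse:
    disabled_features: list = field(default_factory=list)
    logins_blocked: bool = False


def choose_response(db_load: float) -> GridResponse:
    """Pick the least drastic response for the given database load."""
    # Each throttle switches on once the load reaches its threshold.
    disabled = [name for threshold, name in THROTTLES if db_load >= threshold]
    # Logins are blocked only after every throttle is already active.
    return GridResponse(disabled, db_load >= BLOCK_LOGINS_AT)
```

Under these assumed numbers, a load of 0.82 disables only group features and leaves logins open, which matches the intent that residents can still log in while some features are unavailable.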

We have an entire support team standing by to react to any potential problem, and we will communicate aggressively with residents through multiple channels (in world, web, blogs, etc.) to keep you informed of any problems and our progress in correcting them. While I am optimistic that none of these recovery steps will have to be taken, I think it is very important to continue proactively reaching out to the resident community about our plans.

I’ll be in the forums later to answer questions, and continue to thank you for the very constructive and helpful feedback.

Grid Emergency Maintenance

Wednesday, January 21st, 2009 by: Frank Ambrose (FJ Linden)

I blogged earlier this week in response to our grid outages over this past weekend. We have put enormous efforts into fine tuning the data layer, specifically in optimizing queries and cleaning up the data structure.

However, there is a major maintenance step that needs to be completed tomorrow. We are scheduling a 60 minute maintenance window beginning at 5:30am PST. During this window, we will be migrating our central database to an optimized slave database, and making that slave the master. In this way, we expect a significant bump in performance and wanted to take this action as quickly as possible, most especially before our highest load times over the weekend.

During the maintenance window, we will be blocking logins, but those residents who are already in world will not be bumped offline. However, there will be a degradation in performance, as access to the database will be blocked, so transactions, teleporting, asset management and actions requiring a database call will not be available. Movement within a region, chat, and voice will all remain available during the database migration.

While we are sorry about this late notification, and the need to complete this maintenance work, I believed it was important to move quickly and aggressively to address our current data stability challenges, so I advocated completing this work tomorrow. We have made some real progress this week, and this maintenance activity will begin to take full advantage of this work. Thanks for your patience.

I will be in the forums for a short time this evening if you would like to comment.

Weekend Grid Outages

Monday, January 19th, 2009 by: Frank Ambrose (FJ Linden)

Although the second half of 2008 showed a big reduction in usage hours lost to outages (we reduced outage hours by over 50%), stability challenges have increased over the past month on the Grid. This weekend was especially painful, and the first time since joining Linden Lab that I’ve experienced a full MySQL crash (this occurred just after 4pm PT on Sunday).

When the central database crashes, it takes approximately 1 hour to rebuild tables and indexes before accepting queries and becoming fully functional. This was the main reason for us to employ the painful triage process of temporarily blocking logins, while the database is in an overload state. These 5-10 minute “blocking periods” are substantially less Resident impacting than a full database crash, and a 50-60 minute restart cycle. However, neither is acceptable and I wanted to continue updating our efforts to stabilize the infrastructure.
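
To put the trade-off above in numbers, here is a small worked calculation using only the figures from this post. It is a simplification: it compares minutes of blocked logins only and assumes each blocking period and each restart has the same impact per minute, which the post does not claim.

```python
BLOCK_MINUTES = (5, 10)     # from the post: length of one login blocking period
RESTART_MINUTES = (50, 60)  # from the post: restart cycle after a full crash


def break_even_blocks(block_minutes, restart_minutes):
    """How many blocking periods cost as many minutes as one crash restart."""
    return restart_minutes / block_minutes


# Shortest blocks against the longest restart, and the reverse.
best_case = break_even_blocks(BLOCK_MINUTES[0], RESTART_MINUTES[1])   # 60 / 5
worst_case = break_even_blocks(BLOCK_MINUTES[1], RESTART_MINUTES[0])  # 50 / 10

print(f"One crash costs as much as {worst_case:.0f} to {best_case:.0f} blocking periods")
```

By this rough measure, a single avoided crash offsets somewhere between 5 and 12 blocking periods.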

We focus a great deal on the central database, but there are many other interdependent infrastructure components and services that also have been contributing to our stability problems. One of the positives to come out of this weekend’s outages was our ability to gather data and complete some detailed analysis of the patterns which have been causing failures during our highest load periods. In addition to having our best development and operations resources watching the Grid activity this weekend, we’ve also brought in some of the best MySQL professional services teams to help us tune and optimize, as well as recommend long term architectural changes. As the leader of our operations and infrastructure team, my immediate priority is to tune and optimize queries to get us back to a position where we can manage our Resident transactions during peak load. This mainly focuses on validating configurations (some of which were found to be in error this weekend) and moving high load query processes that hit the central database to slave databases that have more headroom at peak (essentially spreading the load and protecting the central database).
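
The idea of moving high load queries off the central database can be illustrated with a small routing sketch. It is not the Grid's actual code: the `QueryRouter` class, the `high_load_read` flag, and the host names are all hypothetical, and a real deployment would also have to account for replication lag on the slaves.

```python
import itertools


class QueryRouter:
    """Send writes to the master and spread flagged reads across slaves."""

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin over the slaves; None when no slave is configured.
        self._slaves = itertools.cycle(slaves) if slaves else None

    def route(self, sql, high_load_read=False):
        """Return the host that should run this statement."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and high_load_read and self._slaves is not None:
            return next(self._slaves)
        # Writes and ordinary reads stay on the central (master) database.
        return self.master
```

For example, with `QueryRouter("central-db", ["slave-1", "slave-2"])`, successive flagged `SELECT` statements alternate between the two slaves, while any `UPDATE` still goes to `central-db`.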

In parallel, we have a separate engineering team that is poring over the existing code base and developing a long term strategy for our data services that will properly scale. I’m attaching a write up below from one of our engineering leads detailing some of our efforts to re-architect the service. I’ll also be monitoring the forums and responding to your questions. As I have also said in previous posts, our execution and delivery on promises of stability are what count, but I also want to be open in communication, even if it is a difficult message to deliver.

Please join me in the forums where you can post your Grid specific questions.

Back with the monthly grid update.  It’s been a bumpy few weeks, with Level 3 outages and central database issues.  The good news is that LLnet (data center fiber network) continues ahead of schedule and we should be starting traffic migration in the next week.  We’ve also made some headway in the area of asset storage.  Right now, central database issues are our core focus and have been at the center of most of the recent grid problems.

LLnet
LLnet will not only end our dependency on VPNs for inter data center traffic, but also lay the foundation for diverse internet providers, which will allow us to handle an outage on a single provider (currently Level 3) and potentially improve latency.  Most of our widespread and highest impacting outages have been network related, and that is why LLnet has been my top priority since joining Linden Lab this past summer.  I expect final testing to be complete by the end of January, and production traffic cutover immediately after.

Improving Asset Storage
In the meantime, we have also been working to significantly reduce load on the Isilon storage clusters.  I know that last month I indicated that we would discuss this more and wanted to touch on our strategy with storage.  We’ve actually been working in a tiered storage environment for a number of months.  The Isilons act as our primary means of storage, for those assets that are accessed on a more regular basis.

As you may imagine, however, most assets are either accessed very infrequently, or not at all.  To determine how often assets are used, we’ve been running a detailed “collection” process.  This process identifies those rarely used (or dead) assets and moves them to bulk storage, off of the primary Isilon hardware.  This is of primary importance to the stability of the Isilons, as we have been pushing the storage limits of these clusters, and a large number of assets in the “not frequently accessed” category have been taking up critical capacity.  So moving these to bulk storage will not only provide us the necessary headroom and improved reliability, it will properly place assets on the right type of storage (depending on usage).  We’ve also been using file compression on the Isilons as a “mid-tier” storage category, where we can maintain assets in the Isilons, for faster access, but minimize actual space used.
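
The three tiers described above (primary, compressed “mid-tier”, and bulk) could be assigned with a rule along these lines. This is only a sketch: the post does not say how the collection process classifies assets, so the idle-time cut-offs, the tier labels, and the `choose_tier` function are assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical cut-offs; the post does not give the real thresholds.
COMPRESS_AFTER = timedelta(days=30)
ARCHIVE_AFTER = timedelta(days=180)


def choose_tier(last_access, now):
    """Return the storage tier for an asset based on its last access time.

    last_access is None for assets that have never been read ("dead" assets).
    """
    if last_access is None:
        return "bulk"
    idle = now - last_access
    if idle >= ARCHIVE_AFTER:
        return "bulk"                # rarely used: move off the primary cluster
    if idle >= COMPRESS_AFTER:
        return "primary-compressed"  # mid-tier: stays on the cluster, compressed
    return "primary"                 # regularly accessed: leave as is
```

With these assumed cut-offs, an asset last read 60 days ago would stay on the primary cluster in compressed form, while one untouched for a year would move to bulk storage.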

HTTP Dataserver and Agent Inventory Services
A quick update on a couple of our data access layer projects: HTTP Dataserver and Agent Inventory Services.  Both of these projects are close to completion.  You may recall from my previous posts that we are trying to simplify messaging protocols between the Simulators and back end databases (HTTP Dataserver), as well as messaging from databases to the viewer (Agent Inventory Services).  Implementation is dependent on a central server code release we expect will be deployed by the end of January, followed by these two projects for release in February.  Also, expect a follow-on blog post from one of our infrastructure leads, Sardonyx Linden, giving more details on our data architecture direction.

Read about the Central Database in January post
Finally, I’ve purposely not addressed our database issues, as I want to spend the January update on that component of infrastructure.  Our central database has been a source of instability the past few weeks, and we have been spending considerable time investigating root cause issues.  Given the complicated nature of the service, none of these issues have been easy to identify, but I’m expecting that we will have answers over the next few weeks, and I’ll comment on the issue in the forum thread. Please post your Grid related questions for me there.

Frank

FJ Linden here, with my monthly grid update.

It’s been a good stretch of grid stability over the last month, with one very poor day in the mix.  Some central database issues and then a Level 3 outage in the middle of the month cascaded into a series of problems, although we were able to isolate and fix them in just over 3 hours.  However, that event only served to reinforce just how important it is to bring LLnet online, and quickly.  On that topic, I’m pleased to start this month’s updates with the status of LLnet.

LLnet 30 Days Ahead of Schedule
LLnet, our private fiber optic ring, is a good 30 days ahead of schedule. This network, which will privately interconnect our datacenters, will allow us to move away from VPN reliance. “LLnet” fiber facilities have been delivered into our 3 data centers, and are currently in the configuration and testing phase with the routing infrastructure.  This work should be concluded by the end of this week, and we will then start full testing in a production environment.  We want to move as quickly as possible, but also do not want to destabilize the grid for the sake of speed, so we will take most of December to finish production testing, and begin cutover of live traffic in late December or early January.  We have thousands of machines across the data centers, so the cutover process is expected to take about 60 days, but we have been very good (so far) at beating our projected dates.

HTTP Dataserver
On the infrastructure project front, we’ve completed most of the HTTP Dataserver project to migrate all C++ MySQL traffic from the MySQL protocol to HTTP(S). This project will allow us to move farther away from VPN dependency as well as off of the MySQL wire protocol over the WAN, to better enable tracking and monitoring of queries. We expect to be through testing in the next week.
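
As a loose illustration of why HTTP(S) makes queries easier to track than a raw database connection, the sketch below wraps a named query in an HTTPS request and counts it on the way out. Everything specific here is hypothetical: the endpoint URL, the JSON payload shape, the query name, and the `build_query_request` helper are invented for the example and do not describe the real HTTP Dataserver interface.

```python
import json
from collections import Counter
from urllib.request import Request

# Hypothetical endpoint; the real service URL is not given in the post.
DATASERVER_URL = "https://dataserver.example.invalid/query"

# Per-query tally: the kind of tracking a single HTTP entry point makes easy.
query_counts = Counter()


def build_query_request(query_name, params):
    """Build (but do not send) an HTTPS request for a named query."""
    query_counts[query_name] += 1
    body = json.dumps({"query": query_name, "params": params}).encode("utf-8")
    return Request(
        DATASERVER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Because every query passes through one function and one URL, counting, logging, or rate limiting can be added in a single place, and the traffic can be encrypted with HTTPS instead of relying on a VPN.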

Agent Inventory Services
Agent Inventory Services is scheduled to be deployed with the server code update in January.  This is one of the ongoing projects to address inventory issues for Residents.

These projects are both designed to provide more reliability, especially as it relates to inventory delivery and database queries, by better handling messaging across the databases and simulators, as well as back to the viewer. I intend to use my December/January post to talk about our strategy for inventory services, our storage strategy and our thoughts on our data architecture.

My primary goal has always been to improve grid stability and reliability and we are making great strides on that front. We’re not through the woods yet, but I want to re-emphasize how important I believe it is to address “foundational” issues that have the potential to cause huge impairment (like network problems), and then decide how we scale other components of the infrastructure.

Finally, I have made some internal organizational changes over the past month, that I hope will begin to drive more specialization in some key areas.  This included adding a new network director, and more focused team leads managing databases, asset management, and data services.  My belief is that, in addition to sound technical strategy, we need the right organizational alignment and specialized technical skills to achieve long term stability and scalability on the grid.

Links:

Second Life Grid Status Reports

RSS Feed for SL Grid Status Reports Page

Second Life Grid Status via Twitter

Service Disruptions Wiki page

FJ Linden here, to report on the latest Ongoing Updates from the Grid.

As I promised in my first post, this will be a regular monthly communication to keep all of you up to date on our efforts to improve grid stability and reliability. I’m finishing up my 3rd month at the Lab and have some significant progress to report.

I’m happy to report that we have an approved plan to move away from VPN reliance. We’ve finalized a design and chosen facility and equipment partners to build and deploy a private fiber optic ring to interconnect our datacenters. “LLnet” will be the designation of our private network and we have established an aggressive timeframe to activate it. I’m pushing hard to bring LLnet online by the end of this year (‘08), and begin a phased migration off of the VPNs immediately after. Given the amount of traffic to move, I would estimate completion of this project by February or March of ‘09 at the latest. So we have a light at the end of the tunnel on one of our biggest stability issues.


Hello, I’m Frank Ambrose, the Senior VP of Global Technology, and I’d like to take this opportunity to let you know about some of the work we’re doing on the Second Life Grid.

By way of introduction, I’m a recent hire here at the Lab, having joined to lead our global technology team. Specifically I’ll be focused on grid infrastructure and our stability initiatives. As noted in the press release, I come to the Lab from many years at AOL (and prior to that MCI), where I experienced the kind of explosive growth, global scale and inherent stability challenges we face here at Linden Lab.

More than anything else, my tenures at those companies taught me the direct relationship between platform stability and user experience. I’m looking forward to applying that lesson, and a host of others, as we work to maintain, build and improve this complex virtual world. I am keenly aware of the pain that any service outage can cause and am both excited and confident that Linden Lab has focused the right resources to achieve this critical objective.

Given the complexities in our architecture, our stability efforts span many individual areas, most of which were detailed by Ian Linden’s May posting. Some areas will be addressed through short-term initiatives, while others will require significant re-architecture, software changes and new physical hardware. Throughout it all, we’re committed to making the transition to a more stable world as seamless and transparent to you as possible. To that end, members of my team will be using the blog regularly to provide updates on plans and progress towards meeting our stability goals.

As part of our wider stability plan, we’re targeting 4 major infrastructure points with both long- and short-term goals: Intra-Grid Network, Asset Storage Cluster, Central Databases, and Host/Transit Data Services. The strategy is to develop and deploy near-term solutions to improve stability, while looking more broadly at our architecture (hardware, software, networks, etc). In the near term we’ve got a number of projects in flight to address some of these problem points. A couple of examples are:

- Asset collection. We’re collecting many assets that are on our storage clusters, but are rarely (if ever) accessed. These assets take up critical space on the clusters and potentially degrade performance and stability as we hit volume thresholds. We’ll be moving these files to different storage mechanisms and, while they will still be easily accessible, it will help us to avoid pushing the limits of our existing storage clusters, while still preserving all existing assets in a reliable storage environment.

- Reducing the need for VPN connections.  Since we don’t encrypt communication between simulators and our databases, there needs to be a safe means to communicate across data centers and so we use VPN connections. The connections don’t scale well and can be unreliable, so establishing a new communications mechanism that is safe, scalable and reliable is another short-term project.

These projects are just a sampling of the work that is currently being done to improve stability, and I’ll be reporting on their progress, as well as other short-term projects, in the coming months.

We have a lot of work to do, but be assured that we have the right resources and internal focus to achieve our stability goals. From personal experience, I’ve encountered many equally complex challenges, especially in my time at AOL, and these problems are all solvable with the right level of attention and technical talent. We certainly have both; now we will start delivering.