Archive for the 'Operations' Category
Tuesday, December 2nd, 2008 by: M Linden
Greetings everyone!
Over the last several months we’ve been hard at work making Second Life more relevant, more usable and more reliable. Our work is showing up in Second Life’s usage statistics. This past Sunday we hit another concurrency high of 76,946, and yesterday log-ins for the previous 60 days crossed the 1.4 million mark.
What have we been up to?
Reliability is a top strategic focus for the Lab. In October, FJ Linden described how we are launching LL Net (our private fiber optic ring connecting our data centers) to provide additional redundancy and eliminate our reliance on VPNs. I am happy to report that this project is ahead of schedule and other improvements are underway.
Hello, I’m Frank Ambrose, the Senior VP of Global Technology, and I’d like to take this opportunity to let you know about some of the work we’re doing on the Second Life Grid.
By way of introduction, I’m a recent hire here at the Lab, having joined to lead our global technology team. Specifically, I’ll be focused on grid infrastructure and our stability initiatives. As noted in the press release, I come to the Lab from many years at AOL (and prior to that MCI), where I experienced the kind of explosive growth, global scale and inherent stability challenges we face here at Linden Lab.
More than anything else, my tenures at those companies taught me the direct relationship between platform stability and user experience. I’m looking forward to applying that lesson, and a host of others, as we work to maintain, build and improve this complex virtual world. I am keenly aware of the pain that any service outage can cause and am both excited and confident that Linden Lab has focused the right resources to achieve this critical objective.
Given the complexities in our architecture, our stability efforts span many individual areas, most of which were detailed in Ian Linden’s May posting. Some areas will be addressed through short-term initiatives, while others will require significant re-architecture, software changes and new physical hardware. Throughout it all, we’re committed to making the transition to a more stable world as seamless and transparent to you as possible. To that end, members of my team will be using the blog regularly to provide updates on plans and progress towards meeting our stability goals.
As part of our wider stability plan, we’re targeting four major infrastructure points with both long- and short-term goals: Intra-Grid Network, Asset Storage Cluster, Central Databases, and Host/Transit Data Services. The strategy is to develop and deploy near-term solutions to improve stability, while looking more broadly at our architecture (hardware, software, networks, etc.). In the near term we’ve got a number of projects in flight to address some of these problem points. A couple of examples are:
- Asset collection. Our storage clusters hold many assets that are rarely (if ever) accessed. These assets take up critical space on the clusters and can degrade performance and stability as we hit volume thresholds. We’ll be moving these files to different storage mechanisms. They will remain easily accessible and preserved in a reliable storage environment, and the move will keep us from pushing the limits of our existing storage clusters.
- Reducing the need for VPN connections. Since we don’t encrypt communication between simulators and our databases, we need a safe means to communicate across data centers, so we use VPN connections. These connections don’t scale well and can be unreliable, so establishing a new communications mechanism that is safe, scalable and reliable is another short-term project.
These projects are just a sampling of the work that is currently being done to improve stability, and I’ll be reporting on their progress, as well as other short-term projects, in the coming months.
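For readers who want a concrete picture of the asset collection project, here is a minimal sketch of one way rarely accessed assets could be picked out for a move to different storage. It is an illustration only: the post does not say how candidates are actually identified, so the last-access threshold, the `Asset` record and the function names below are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Asset:
    # Hypothetical record; the real asset schema is not described in the post.
    asset_id: str
    size_bytes: int
    last_accessed: datetime


def select_cold_assets(assets: Iterable[Asset], now: datetime,
                       max_idle: timedelta) -> List[Asset]:
    """Return assets that have not been accessed within max_idle (assumed rule)."""
    return [a for a in assets if now - a.last_accessed > max_idle]


def reclaimable_bytes(cold_assets: Iterable[Asset]) -> int:
    """Space that would be freed on the primary cluster by moving these assets."""
    return sum(a.size_bytes for a in cold_assets)
```

The assets are only selected here, not deleted, which matches the stated goal of preserving every asset while relieving the primary clusters.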
We have a lot of work to do, but be assured that we have the right resources and internal focus to achieve our stability goals. I’ve encountered many equally complex challenges, especially in my time at AOL, and these problems are all solvable with the right level of attention and technical talent. We certainly have both. Now we will start delivering.
So, you know a lot about Second Life, right? You’ve got ideas, big ideas, and are tired of waiting around for Linden Lab to make it just so. The challenge of stabilizing, maintaining, and extending the Second Life Grid, and the dozens of services and thousands of machines it entails, excites you. If this describes you, you should check out our job listings!
Three of the prioritized positions are web developers, production operations engineers, and production operations developers, but many other openings are currently available.

We’ve just updated the Quality Metrics page, and the numbers show what you already know: April was not a good month for Second Life Grid availability. Our internal outage tracking tool estimates that about 630,000 usage hours were lost to global system failures over the course of the month, which is about 1.9% of the total (up from 0.06% in February and 0.22% in March), and resident surveys clearly indicate great unhappiness coinciding with these failures. (We define lost usage as how much time Residents would have spent logged in but did not, due to Grid failures; it is meant as a global availability metric and does not cover local failures like sim crashes, inventory problems, and the like. See actual[black] vs predicted[blue] concurrency graph excerpt, right.) I’d like to address the causes for this, and what we are doing about it in general terms.
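To make the lost-usage definition above concrete, here is a rough sketch of how such a figure could be derived from predicted and actual concurrency samples. It is not the internal outage tracking tool: the function names, the fixed sampling interval, and the choice to ignore intervals where actual concurrency exceeds the prediction are all assumptions made for illustration.

```python
from typing import Sequence


def lost_usage_hours(predicted: Sequence[float], actual: Sequence[float],
                     interval_hours: float) -> float:
    """Sum the shortfall of actual vs. predicted concurrency, in user-hours.

    Each sample is a concurrency reading taken every interval_hours.
    Intervals where actual >= predicted contribute nothing (an assumption).
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    shortfall = sum(max(p - a, 0.0) for p, a in zip(predicted, actual))
    return shortfall * interval_hours


def lost_fraction(lost_hours: float, total_hours: float) -> float:
    """Lost usage as a fraction of total usage hours for the period."""
    return lost_hours / total_hours


if __name__ == "__main__":
    # Hypothetical half-hourly samples around a single outage.
    predicted = [60000, 62000, 61000]
    actual = [60000, 20000, 61500]
    print(lost_usage_hours(predicted, actual, 0.5))
```

As a sanity check on the April figures, 630,000 lost hours at about 1.9% implies a monthly total of roughly 33 million usage hours.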
Linden Lab Production Operations has open positions for Production Operations developers and systems engineers in Australia, Singapore, the United States, and the United Kingdom. The Production Operations team is responsible for ensuring that the Second Life grid, the world’s largest collaborative real-time development environment, is up and running.
Linden Lab Operations is a Debian Linux shop. We rely extensively on OSS, and our in-house systems are usually written in Python or PHP. Our team is made up of folks who have been involved in large-scale grid management and site operations for years.
We’re looking for people who can rapidly pinpoint and diagnose network failures, deployment issues, and performance bottlenecks, who can also create tools which will improve grid stability. Production Operations works extensively with the Concierge, System Engineering, Governance, I-world, and Development teams to triage and respond to grid problems; therefore, the ability to communicate effectively with techies and non-techies is critical. The successful candidate will have substantial *nix experience and script-fu, familiarity in managing large system installations, and no fear of complex, dynamic systems.
If this sounds like you, please click here and submit your resume for one of the “Production Operation” postings (Developer or Systems Engineer).
[07:54 AM - Resolved] The database upgrade has now been completed. Thank you for your patience whilst this was going on. – Matthew
[07:05 AM - Update] The database upgrade is now under way. – Matthew
As part of a plan to increase the performance and stability of Second Life, we will be upgrading one of our central database systems on Wednesday, April 9th between 7:00 a.m. and 8:00 a.m. As a side effect, the following services will be impacted or disabled:
Second Life Functions:
- Logins
- Teleporting
- L$ Transactions
- Profiles
This upgrade is one more step towards improved performance and reliability of Second Life. We appreciate your patience during this hour.
[Completed 4:30 p.m. PST - Kate] Operations has concluded changes and restoration of group and profile services. Please clear group cache via the debug/Advanced menu or relog to have group and profile services restored.
As part of a plan to increase overall stability of the grid today during peak usage hours, our operations team will make some changes at approximately 1:00pm Pacific which will reduce overall database load and create a more reliable experience for everyone. As a side effect of these temporary changes, some group and avatar profile services will not be available.
Specifically:
- Avatar profile information will not be transmitted to the viewer. This affects both floating and embedded profile windows.
- General group information (name, charter, etc.) will not display in floating or group embedded group info windows.
- Groups will not show their member lists.
- Group owners and officers will not be able to eject group members.
- Group proposals will open the UI, but will fail to create.
- About Land will show 0 for traffic. (Please note: this is temporary, and impacts only the display of traffic, not the recording of it.)
Please note that all of these effects are temporary and will return to normal behavior when we re-enable these services later today.
We will update this blog post to indicate when these services are re-enabled.
Over the next week or two, we will be making some changes to the database cluster that we believe will significantly reduce the effects of peak loading that many of you have experienced over the past several weeks. The mitigating measure we’re taking above is something we will only use until those permanent changes are in place.
Thanks for your patience as we work to improve the Second Life experience.
[Completed 4:30 PM PST] Operations has concluded changes and restoration of group and profile services. Please clear group cache via the debug/Advanced menu or relog to have group and profile services restored.
Update 12:13 PM PST: We will begin these group changes earlier than originally expected and will commence momentarily.
As part of a plan to increase overall stability of the grid today during peak usage hours, our operations team will make some changes at 1:00pm SLT that will reduce overall database load and create a more reliable experience for everyone. As a side effect of these temporary changes, some group and avatar profile services will not be available.
Specifically:
* Avatar profile information will not be transmitted to the viewer. This affects both floating and embedded profile windows.
* General group information (name, charter, etc.) will not display in floating or group embedded group info windows.
* Groups will not show their member lists.
* Group owners and officers will not be able to eject group members.
* Group proposals will open the UI, but will fail to create.
* About Land will show 0 for traffic.
Please note that all of these effects are temporary and will return to normal behavior when we re-enable these services at approximately 4:30pm SLT. At that time you should either relog or run Client -> Clear Group Cache (a Debug option) in order to refresh group behavior.
We will update this blog post to indicate when these services have actually been disabled, and again when they are re-enabled.
Over the next week or two, we will be making some changes to the database cluster that we believe will significantly reduce the effects of peak loading that many of you have experienced over the past several weeks. The mitigating measure we’re taking above is something we will only use until those permanent changes are in place.
Thanks for your patience.
[RESOLVED 04:21 AM PST] The faulty server is back in line, and you should see no more problems.
*****
We have seen reports that a number of our residents are experiencing inventory-related problems, such as slow or failed loading and trouble picking things up. Those same residents might find themselves unable to log back in if they leave Second Life.
The underlying problem here is one of our asset servers. Our Ops Team is aware of the situation and working to resolve it as quickly as possible.
[UPDATED 2:11 p.m. Pacific --teeple]
The tests are over. Please remember to relog or clear your group cache to restore normal group functionality. Thanks!
In order to test load mitigation strategies, the Operations Team will be disabling multiple in-world functions for 30 minutes, starting at 1:30 p.m. Pacific.
Specifically:
- Profile information will not load. This affects both floating and embedded profile windows.
- General group information (name, charter, etc.) will not display in floating or group embedded group info windows.
- Groups will not show their member lists.
- Group owners and officers will not be able to eject group members.
- Group proposals will open the UI, but will fail to create.
- About Land will show 0 for traffic. This is temporary.
At the conclusion of the test, you’ll need to either relog or run Client -> Clear Group Cache (a Debug option) in order to refresh group behavior.
We will conclude these tests as quickly as possible, and apologize for the inconvenience they cause.