The Second Life 1.18.5 Server release included updates for several systems, including new Python libraries, backbones (pieces of infrastructure which handle a variety of services, such as agent presence and capabilities, and proxy data between systems), and simulators. The deploy, as planned for November 6th, did not require any downtime – all components could be updated live. We planned to perform the rollout per our patch deploy sequences: updating central systems one by one, then simulators.
Read on for the day-by-day, blow-by-blow sequence of events which followed…
Tuesday, November 6th
Prior to the 1.18.5 Server deploy, at around midnight (all times are Pacific Standard Time) we suffered a VPN outage to our Dallas co-location facility, which caused many regions to drop offline. The system recovered on its own after about an hour, and our ISP’s initial investigation pointed to hardware issues with the network infrastructure.
Starting at 10:00am we began the actual update of the servers to the Second Life 1.18.5 code. We started by updating the “backbone” processes on central machines one by one, such as the login servers, tackling the “non-risky” machines first. At 11:00am we got to the “risky” machines, which handle agent presence (i.e. the answer to “is so-and-so online?”) as well as several other key services. Closely monitoring the load on the central database (which usually shows increased load when something goes wrong) as well as internal graphs which track the number of residents online, we started making updates. Everything seemed to be going well.
At about 11:15am the various internal communication channels lit up with reports of login errors. We stopped updates of these central systems (7/8ths of the way through) and started to gather data. We have seen this problem in the past when hardware issues or bugs caused the “presence” servers to spin out of control, but this time there were no obvious failures; for unknown reasons they simply weren’t responding to requests from the login servers. Hoping for a quick fix (i.e. a simple configuration change that could be applied live), we spent about 30 minutes trying to determine the cause, then gave up and rolled back to the previous code.
(Fortunately, in this case, a rollback was straightforward, and simply resulted in “unknown” agent presence for about 10 minutes. Rollbacks are not always so easy – see below!)
Simultaneously, logins to jira.secondlife.com and wiki.secondlife.com failed. These failures were also due to the update (but, as it turned out, for different reasons). Once the dust had settled on the rollback it was easy to roll back one more machine to restore these logins.
Completely unrelated to the update, the database load on the central systems required us to pause the Tuesday stipend payouts, delaying the payouts for several hours. (As more and more residents have joined Second Life, and the central systems have grown busier, the time taken for stipend payouts had crept up to 24 hours. The code responsible for the process has been rewritten and the November 13th run completed in just 3 hours.)
Wednesday, November 7th
Several Lindens continued the investigation and determined a source of the issues seen on Tuesday: the “agent presence” system had been updated to use object pools to increase performance, but the number of objects in each pool was set too low. After some work, we were able to replicate this failure in test environments to verify the fix. The updated code was re-distributed to the machines making up the service, and we prepared to try again on Thursday.
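To illustrate the failure mode (a rough sketch in Python, not the actual backbone code – the class and pool size are hypothetical): a fixed-size object pool that blocks when exhausted will stall every caller once concurrent demand exceeds the pool size, which from the outside looks like a service that simply stops answering.

```python
import queue

class ConnectionPool:
    """Minimal fixed-size object pool; blocks when every object is checked out."""

    def __init__(self, factory, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self, timeout=None):
        # If the pool is sized below peak concurrency, this call blocks (or
        # times out), and callers -- like login servers querying presence --
        # see requests that never complete.
        return self._pool.get(timeout=timeout)

    def release(self, obj):
        self._pool.put(obj)
```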
(Little did we know that the insufficient object pools were merely a symptom, not the root cause.)
Thursday, November 8th
Once again unrelated to the software update, the hardware work originally scheduled for Oct 31st was finally done. Unfortunately, the addition of new hardware to the asset cluster didn’t go as smoothly as planned – as old hardware was removed the “fail over” appeared to, er, fail. From approximately 10:15am through 10:50am, assets could not be saved. This also caused login failures: when a resident logs off, the simulator needs to upload the attachments as assets before that resident can log in again, and the simulators were stuck waiting.
After the asset cluster was happy again, we proceeded with the 1.18.5 Server update. The first half of the central systems were updated by 12:00pm. We paused to ensure that the system was behaving as expected, then continued at about 12:30pm, completing the updates. Shortly thereafter, as the number of online residents passed 46,000, the servers began failing in a new way. Although most of Second Life was functioning properly, many logins were slow or failed, and some group chat failed as well. We diagnosed the problem as an unrecognized dependency – the central backbones were assuming that the simulator backbones would close a connection, but the simulator backbones (which had not yet been updated) were assuming the central backbones would close it instead. This wasn’t a problem in test environments or before concurrency passed a certain threshold, because the connections would eventually close automatically; they just didn’t close fast enough to keep up once more residents were online. Once this root cause was identified (by about 2:15pm) we were able to change the code in the central backbones to resume closing the connections, since that was the faster fix. Restarting the central backbones did cause residents to appear offline for a short period of time, which was unexpected (and is being investigated).
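A minimal sketch of the mismatch (hypothetical names; the real backbones are considerably more involved) – when each side assumes the other owns the close, nothing closes the connection except the eventual timeout:

```python
# Sketch only: each side of the backbone connection assumes the other will
# close it, so connections linger until they time out on their own.

class Connection:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def central_backbone_old(conn):
    # Old contract: the central side closed the connection once the reply
    # was sent.
    conn.close()


def central_backbone_new(conn):
    # New code assumed the simulator backbone would close it instead...
    pass


def simulator_backbone(conn):
    # ...but the not-yet-updated simulator side still expected the central
    # backbone to own the close.
    pass


if __name__ == "__main__":
    conn = Connection()
    central_backbone_new(conn)
    simulator_backbone(conn)
    # Neither side closes; timeouts keep up at low concurrency, but leak
    # connections faster than they expire once tens of thousands of
    # residents are online.
    print("connection closed?", conn.closed)  # -> False
```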
Starting after 3pm we initiated a rolling restart to update the simulators and complete the deploy, a process which took about 5 hours. During a rolling restart, in order to reduce network traffic and load on central systems, the service is in an unusual state – regions are not allowed to move to new simulators in case of a crash. Additionally, during the “geographic” restart (where regions restart in a wave traveling North to South), crash reports sent by simulators contain bogus data. (The code has been updated but old processes are still running.) This unfortunately makes detection and diagnosis of issues problematic. There was anecdotal evidence that some regions were crashing a lot, but we were unable to verify that this was not simply due to bad hardware until after the process was complete.
After the post-roll cleanup, it became clear that the crashing was not an anomaly. A few contingency plans were discussed, including rollbacks for specific regions, but we were primarily in a data-gathering phase.
Friday, November 9th
As sleepy Lindens stumbled back into work, one incorrect (but ostensibly harmless) idea was tried; unfortunately, due to a typo, this accidentally knocked many residents offline at around 9:40am. Shortly thereafter, more testing including complete rollbacks on simulator hosts showed that the new code was indeed the culprit, but it took a while longer to identify the cause. By 12:00pm the investigation had turned up a likely candidate – and an indication that a simple widespread rollback of the code would not, in fact, be safe or easy!
The crashing was caused by the simulator “message queue” getting backed up. A server-to-viewer message (related to the mini-map) had been changed to move over TCP (reliable, but costly) instead of UDP (unreliable, but cheap and fast). On regions with many avatars, this would cause the simulator to become backed up (storing the “reliability” data) and eventually crash. We have a configuration file switch that allows us to toggle individual messages between TCP and UDP on the fly, but while testing we discovered a second issue – another file necessary for the UDP channel also needed to be updated, and it could not be changed on the fly; if we flipped the switch back from TCP to UDP without it, the simulator would crash. (The UDP to TCP update on-the-fly worked, which is how we were able to do the rolling restart in the first place.)
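As a rough illustration of why the TCP switch hurt crowded regions (hypothetical names and numbers, not our actual message code): a reliable channel has to retain every sent message until it is acknowledged, so when sends outpace acks the backlog grows without bound.

```python
import collections

# Sketch of a reliable channel: copies are kept for retransmission until the
# viewer acknowledges them.

class ReliableChannel:
    def __init__(self):
        self.unacked = collections.deque()

    def send(self, msg):
        self.unacked.append(msg)             # retained until acked

    def ack(self, count):
        for _ in range(min(count, len(self.unacked))):
            self.unacked.popleft()


channel = ReliableChannel()
sends_per_tick = 400      # e.g. mini-map updates on a crowded region
acks_per_tick = 250       # viewers/network can't keep up

for tick in range(1000):
    for _ in range(sends_per_tick):
        channel.send(b"minimap-update")
    channel.ack(acks_per_tick)

# Backlog grows by (sends - acks) every tick; on a long-running, crowded
# simulator this is the queue growth that eventually crashes it.
print("unacked backlog:", len(channel.unacked))
```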
By testing on individual simulators, we were able to confirm that switching back to UDP eliminated the problem, although this required stopping the simulators before throwing the switch. We co-opted an existing tool used for “host-based” rolling restarts (which had been used once in the past), and had it shut down the simulators on each host (doing several hosts in parallel), update the two configuration files, and restart the simulators. After significant testing, we used this tool to perform another rolling restart of the service, which was completed by 11pm on Friday, including subsequent cleanup.
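The shape of that tool, sketched in Python (the commands, paths, and host names below are placeholders, not our actual tooling), is roughly: stop the simulators on a host, push the two configuration files, restart them, and work through several hosts in parallel without swamping the central systems.

```python
#!/usr/bin/env python
# Simplified sketch of a host-based restart flow; "sim-control" and the
# config paths are hypothetical stand-ins.
import subprocess
from concurrent.futures import ThreadPoolExecutor

HOSTS = ["sim-host-001", "sim-host-002", "sim-host-003"]
CONFIG_FILES = ["/etc/sim/message_transport.cfg", "/etc/sim/udp_channel.cfg"]

def update_host(host):
    # Stop all simulators on the host first: the transport switch cannot be
    # flipped back to UDP while they are running.
    subprocess.check_call(["ssh", host, "sim-control", "stop", "--all"])
    for cfg in CONFIG_FILES:
        subprocess.check_call(["scp", cfg, f"{host}:{cfg}"])
    subprocess.check_call(["ssh", host, "sim-control", "start", "--all"])

# Several hosts in parallel, but bounded so central systems aren't swamped.
with ThreadPoolExecutor(max_workers=4) as pool:
    for host, _ in zip(HOSTS, pool.map(update_host, HOSTS)):
        print(f"{host}: restarted")
```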
Saturday, November 10th
Unrelated to the deploy (but included here to clear up any confusion), on Saturday at 5:20pm we suffered another VPN outage, which resulted in hundreds of regions being offline for just under two hours. The cause was the expiration of a certificate used for the VPN. We replaced the certificate, and our DNOC team brought the affected regions back up.
What Have We Learned
Readers with technical backgrounds have probably said “Well, duh…” while reading the above account. There are obviously many improvements that can be made to our tools and processes to prevent at least some of these issues from occurring in the future. (And we’re hiring operations and release engineers and developers worldwide, so if you want to be a part of that future, head on over to the Linden Lab Employment page.)
Here are a few of the take-aways:
- Our load testing of systems is insufficient to catch many issues before they are deployed. Although we have talked about Het Grid as a way to roll out changes to a small number of regions to find issues before they are widely deployed, this will not allow us to catch problems on central systems. We need better monitoring and reporting; our reliability track record is such that even problems such as login failures for 1/16th of residents aren’t noticed for a significant period of time.
- When problems are detected, we don’t do a good enough job internally in communicating what changes went into each release at the level of detail necessary for first responders to be most effective.
- Our end-to-end deployment process takes long enough that responding to issues caused during the rollout is problematic.
- Our tools for managing deploys have not kept pace with the scale of the service, and manual processes are error prone.
- Track date-driven work (e.g. certificate expiry) more closely; build pre-emptive alerts into the system if possible (see the sketch after this list).
- Be more skeptical about doing updates while the service is live, especially when they involve third-party providers.
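On the certificate point above: even a small scheduled check can flag an expiring certificate weeks before it takes regions offline. A minimal sketch (the endpoint and warning threshold are placeholders, not our actual VPN configuration):

```python
import datetime, socket, ssl

# Minimal scheduled expiry check for a TLS endpoint; run from cron and page
# someone when it prints a warning.
HOST, PORT, WARN_DAYS = "vpn.example.com", 443, 30

ctx = ssl.create_default_context()
with socket.create_connection((HOST, PORT), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()

expires = datetime.datetime.utcfromtimestamp(
    ssl.cert_time_to_seconds(cert["notAfter"]))
days_left = (expires - datetime.datetime.utcnow()).days
if days_left < WARN_DAYS:
    print(f"WARNING: {HOST} certificate expires in {days_left} days")
```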