Update 2008-07-22 12:30 PM : The reversion is done. The investigation begins.
Update 2008-07-22 11:56 AM : The reversion is 80% complete and should finish soon. Also, thanks for the reports (via JIRA and comments) about performance issues with sims running 1.23. We have a hypothesis that the server build erroneously included additional debug information, which could lead to memory bloat (and thus crashes) and reduced performance. We’ll be doing additional experiments later today to try to confirm that hypothesis and see if that is the cause of the problems. At this point, though, it’s not much more than a hunch.
Update 2008-07-22 10:04 AM : Okay, you probably saw this coming… Now that the 1.23.2 update is out to half of the sims, we’re seeing a greatly increased simulator crash rate relative to 1.22 – nearly 10 times as high. This was not seen during previous 1.23 deploy attempts, on the Preview Grid, or on the pilot roll hosts – we’re baffled. Call stack analysis doesn’t point to a single smoking gun, either.
Between sobs and gnashing of teeth, we’re going to revert the 1.23 regions to 1.22 with another rolling restart already underway (expected to complete within about 4 hours), while the dev team attempts to understand what happened.
Please direct any positive karma towards the engineers working on this. When I was a fresh new Linden two years ago shepherding releases, a planned Wednesday downtime deploy – which stretched to 6 hours – would often be followed by additional hours of unplanned downtime and subsequent patches pushed out over multiple days to address show-stopper issues with new features and regressions. And then we’d often live with critical issues until the next iteration. We now have the means to push changes out much more slowly and with limited downtime per region, and in a non-monolithic fashion. For example, the central systems have been happily running 1.23 code for several weeks now. As the code complexity and size of Second Life has increased, the system as a whole is more sensitive to changes – and we also have better monitoring. Since we value the stability of Second Life, this has made us approach the roll-out of recent releases with more caution. It certainly causes more visible action, and angst within Linden Lab, but we believe the stability and quality of Second Life is improved.
Update 2008-07-22 08:56 AM : The first-half rolling restart is complete.
Update 2008-07-22 06:00 AM : The rolling restart for the first half of the grid has begun. Today, we are deploying to odd-numbered hosts.
Update 2008-07-21 09:06 PM : The pilot roll to 304 regions has been completed. Regions which have been updated are running version 126.96.36.199647.
The “fun” (here, here) continues – the issue that prevented the rollout of 1.23.1 has been fixed and verified, and the fix sat on the Preview Grid over the weekend, along with a couple of security fixes for previously outstanding issues. We plan to roll out 1.23.2 this week, starting with a “pilot roll” on Monday, followed by updating the rest of Second Life over Tuesday and Wednesday. See the original 1.23 blog post for a full list of issues.
Here’s the schedule in detail:
- Monday evening (PST): a pilot roll to 150 regions
- Tuesday morning, 5AM-10AM : we will deploy server version 1.23.2 to half of Second Life.
- Wednesday morning, 5AM-10AM : we will deploy server version 1.23.2 to the rest of Second Life.
As usual with rolling restarts, this is a change on the server side; there will be no required client updates associated with this rolling restart. Regions will receive warnings starting five minutes before they are restarted. There is no way to delay the restart of a given region. Regions should restart within 10 minutes of going down. If your region stays down for more than 20 or 30 minutes, please contact support.
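For the curious, the staged schedule above can be sketched roughly as follows. This is only an illustration, not our actual deploy tooling: the `Host` type, the `plan_waves` function, and the host numbers are all made up for the example. It also assumes that pilot hosts are skipped in the later waves and that the second half is simply the even-numbered hosts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    number: int  # hypothetical numeric host ID


def plan_waves(hosts, pilot_numbers):
    """Split hosts into three waves: pilot, odd-numbered, then the rest."""
    pilot = [h for h in hosts if h.number in pilot_numbers]
    # Assumption: hosts already updated in the pilot are not restarted again.
    remaining = [h for h in hosts if h.number not in pilot_numbers]
    first_half = [h for h in remaining if h.number % 2 == 1]
    second_half = [h for h in remaining if h.number % 2 == 0]
    return [
        ("pilot", pilot),
        ("first half (odd hosts)", first_half),
        ("second half (even hosts)", second_half),
    ]


# Ten made-up hosts, two of which are picked for the pilot roll.
hosts = [Host(n) for n in range(1, 11)]
waves = plan_waves(hosts, pilot_numbers={3, 4})

for name, wave in waves:
    print(name, [h.number for h in wave])
```

Splitting by host number parity gives two halves of roughly equal size without keeping a separate list, and stopping between waves leaves half of the grid on the old version if a problem shows up.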