Second Life 1.18.5 Server Deploy Post-Mortem

The Second Life 1.18.5 Server release included updates for several systems, including new Python libraries, backbones (infrastructure processes which handle a variety of services, such as agent presence and capabilities, and proxy data between systems), and simulators. The deploy, as planned for November 6th, did not require any downtime – all components could be updated live. We planned to perform the rollout per our patch deploy sequences: updating central systems one by one, then simulators.

Read on for the day-by-day, blow-by-blow sequence of events which followed…


Tuesday, November 6th

Prior to the 1.18.5 Server deploy, at around midnight (all times are Pacific Standard Time) we suffered a VPN outage to our Dallas co-location facility, which caused many regions to drop offline. The system recovered on its own after about an hour, and our ISP’s initial investigation pointed to hardware issues with the network infrastructure.

Starting at 10:00am we began the actual update of the servers to the Second Life 1.18.5 code. We started by updating the “backbone” processes on central machines – such as the login servers – one by one, tackling the “non-risky” machines first. At 11:00am we got to the “risky” machines, which handle agent presence (i.e. the answer to “is so-and-so online?”) as well as several other key services. Monitoring the load on the central database (which usually rises when something goes wrong) as well as internal graphs which track the number of residents online, we started making updates. Everything seemed to be going well.

At about 11:15am the various internal communication channels lit up with reports of login errors. We stopped the updates of these central systems (7/8ths of the way through) and started to gather data. We have seen this problem in the past, when hardware issues or bugs caused the “presence” servers to spin out of control, but this time there were no obvious failures; for unknown reasons they simply weren’t responding to requests from the login servers. Hoping for a quick fix (i.e. a simple configuration change that could be applied live), we spent about 30 minutes trying to determine the cause, then gave up and rolled back to the previous code.

(Fortunately, in this case, a rollback was straightforward, and simply resulted in “unknown” agent presence for about 10 minutes. Rollbacks are not always so easy – see below!)

Simultaneously, logins to jira.secondlife.com and wiki.secondlife.com failed. These failures were also due to the update (though, as it turned out, for different reasons). Once the dust had settled on the rollback, it was easy to roll back one more machine to restore these logins.

Completely unrelated to the update, the database load on the central systems required us to pause the Tuesday stipend payouts, delaying them for several hours. (As more and more residents have joined Second Life, and the central systems have grown busier, the time taken for stipend payouts had crept up to 24 hours. The code responsible for the process has since been rewritten, and the November 13th run completed in just 3 hours.)

Wednesday, November 7th

Several Lindens continued the investigation and determined a source of the issues seen on Tuesday: the “agent presence” system had been updated to use object pools to increase performance, but the number of objects in each pool was set too low. After some work, we were able to replicate this failure in test environments and verify the fix. The updated code was re-distributed to the machines making up the service, and we prepared to try again on Thursday.

(Little did we know that the insufficient object pools were merely a symptom, not the root cause.)
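
For the technically curious, here is a rough illustration of the failure mode the pool fix addressed. This is a toy Python sketch – not our backbone code, and the class and the numbers are invented – but it shows how an undersized object pool turns into requests that quietly stall or time out rather than into an obvious crash:

    import queue
    import threading
    import time

    class ObjectPool:
        """Toy fixed-size pool, standing in for the pooled objects in the presence service."""

        def __init__(self, size):
            self._free = queue.Queue()
            for i in range(size):
                self._free.put("obj-%d" % i)   # hypothetical pooled resource

        def acquire(self, timeout):
            # When every object is checked out, callers block here. From the outside
            # this looks like "the presence servers aren't responding", not a failure.
            return self._free.get(timeout=timeout)

        def release(self, obj):
            self._free.put(obj)

    def handle_request(pool, results):
        try:
            obj = pool.acquire(timeout=2.0)
        except queue.Empty:
            results.append("timed out")        # the login server gives up on this request
            return
        try:
            time.sleep(0.5)                    # pretend to do a presence lookup
            results.append("ok")
        finally:
            pool.release(obj)

    if __name__ == "__main__":
        pool, results = ObjectPool(size=2), []  # pool sized for test load, not production load
        threads = [threading.Thread(target=handle_request, args=(pool, results))
                   for _ in range(40)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print(results.count("ok"), "served,", results.count("timed out"), "timed out")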

Thursday, November 8th

Once again unrelated to the software update, the hardware work originally scheduled for October 31st was finally performed. Unfortunately, the addition of new hardware to the asset cluster didn’t go as smoothly as planned – as old hardware was removed, the “fail over” appeared to, er, fail. From approximately 10:15am through 10:50am, assets could not be saved. This also caused login failures: when a resident logs off, the simulator needs to upload their attachments as assets before that resident can log in again, and the simulators were stuck waiting.

After the asset cluster was happy again, we proceeded with the 1.18.5 Server update. The first half of the central systems were updated by 12:00pm. We paused to ensure that the system was behaving as expected, then continued at about 12:30pm, completing the updates. Shortly thereafter, as the number of online residents passed 46,000, the servers began failing in a new way. Although most of Second Life was functioning properly, many logins were slow or failed, and some group chat failed as well. We diagnosed the problem as an unrecognized dependency – the central backbones assumed that the simulator backbones would close a connection, but the simulator backbones (which had not yet been updated) assumed the central backbones would close it instead. This wasn’t a problem in test environments, or before concurrency passed some threshold, because the connections would eventually close automatically; they just did not close fast enough to keep up once more residents were online. Once this root cause was identified (by about 2:15pm), we changed the code in the central backbones to resume closing the connections, since that was the faster fix. Restarting the central backbones did cause residents to appear offline for a short period of time, which was unexpected (and is being investigated).
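
In case “who closes the connection” sounds abstract, here is a heavily simplified sketch of the mismatch. It is hypothetical Python (the request framing and names are made up, and the real backbones are not written this way); the point is only that when neither end closes promptly, idle sockets pile up faster than keep-alive timeouts can reap them:

    import socket

    def query_simulator_backbone(host, port, payload, close_after_reply=True):
        """Send one request to a (hypothetical) simulator backbone and read the reply."""
        sock = socket.create_connection((host, port), timeout=10)
        try:
            sock.sendall(payload)
            return sock.recv(65536)
        finally:
            if close_after_reply:
                # Old central-backbone behavior: close explicitly after every reply.
                # Safe regardless of what the peer assumes.
                sock.close()
            # else: assume the peer will close. If the peer makes the mirror-image
            # assumption, the socket lingers until some idle timeout fires -- fine
            # at low volume, but at ~46,000 residents online the open connections
            # accumulate faster than they are reaped, which is the failure we hit.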

Starting after 3pm we initiated a rolling restart to update the simulators and complete the deploy, a process which took about 5 hours. During a rolling restart, in order to reduce network traffic and load on central systems, the service is in an unusual state – regions are not allowed to move to new simulators in case of a crash. Additionally, during the “geographic” restart (where regions restart in a wave traveling North to South), crash reports sent by simulators contain bogus data. (The code has been updated, but old processes are still running.) This unfortunately makes detection and diagnosis of issues problematic. There was anecdotal evidence that some regions were crashing a lot, but we were unable to verify that this was not simply due to bad hardware until after the process was complete.

After the post-roll cleanup, it became clear that the crashing was not an anomaly. A few contingency plans were discussed, including rollbacks for specific regions, but we were primarily in a data-gathering phase.

Friday, November 9th

As sleepy Lindens stumbled back into work, one incorrect (but ostensibly harmless) idea was tried; unfortunately, due to a typo, this accidentally knocked many residents offline at around 9:40am. Shortly thereafter, more testing, including complete rollbacks on simulator hosts, showed that the new code was indeed the culprit, but it took a while longer to identify the cause. By 12:00pm the investigation had turned up a likely candidate – and an indication that a simple widespread rollback of the code would not, in fact, be safe or easy!

The crashing was caused by the simulator “message queue” getting backed up. A server-to-viewer message (related to the mini-map) had been updated and changed to move over TCP (reliable, but costly) instead of UDP (unreliable, but cheap and fast). On regions with many avatars, this would cause the simulator to become backed up (storing the “reliability” data) and eventually crash. We have a configuration file switch that allows us to toggle individual messages between TCP and UDP on the fly, but while testing we discovered a second issue – another file necessary for the UDP channel also needed to be updated, it could not be changed on the fly, and if we flipped the switch back from TCP to UDP without it the simulator would crash. (The UDP to TCP switch on the fly worked, which is how we were able to do the rolling restart in the first place.)
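
To make that failure mode concrete, here is a toy model of the backlog – our own illustration, not simulator code, with all of the numbers invented. Each frame, every avatar in the region generates one mini-map update; the reliable channel can only send (and later retire, once acknowledged) a fixed number of them per frame, and the per-message “reliability” bookkeeping for everything still unsent or unacknowledged has to be held in memory:

    from collections import deque

    def minimap_backlog(avatars, frames, sends_per_frame):
        """Toy model: how much reliable-send bookkeeping is still held after `frames` frames."""
        pending = deque()                          # one entry per unretired reliable message
        for frame in range(frames):
            for avatar in range(avatars):
                pending.append((frame, avatar))    # update queued, awaiting send + ack
            for _ in range(min(sends_per_frame, len(pending))):
                pending.popleft()                  # sent and acknowledged; bookkeeping freed
        return len(pending)

    if __name__ == "__main__":
        # A quiet region keeps up; a crowded one falls further behind on every frame,
        # so memory use grows without bound until the simulator falls over.
        print("quiet region:  ", minimap_backlog(avatars=5,  frames=1000, sends_per_frame=20))
        print("crowded region:", minimap_backlog(avatars=60, frames=1000, sends_per_frame=20))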

By testing on individual simulators, we were able to confirm that switching back to UDP eliminated the problem, although this required stopping the simulators before throwing the switch. We co-opted an existing tool used for “host-based” rolling restarts (which had been used once in the past), and had it shut down the simulators on each host (handling several hosts in parallel), update the two configuration files, and restart the simulators. After significant testing, we used this tool to perform another rolling restart of the service, which was completed, including subsequent cleanup, by 11pm on Friday.
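
The co-opted tool does roughly the following. This is a sketch of the idea in Python rather than the actual script, and the host names, file paths and control commands are placeholders, but the shape – stop everything on a host, push the two files, start everything again, several hosts at a time – is the same:

    import concurrent.futures
    import subprocess

    HOSTS = ["simhost-001", "simhost-002", "simhost-003"]       # placeholder host names
    CONFIG_FILES = ["/etc/sl/message_transport.xml",            # placeholders for the two
                    "/etc/sl/udp_channel.xml"]                  # files that had to change

    def update_host(host):
        """Stop every simulator on one host, push the new configs, start them again."""
        def run(cmd):
            subprocess.run(["ssh", host] + cmd, check=True)     # fail loudly, stop the rollout
        run(["sl-sim-control", "stop-all"])                     # placeholder control command
        for path in CONFIG_FILES:
            subprocess.run(["scp", "./staging" + path, host + ":" + path], check=True)
        run(["sl-sim-control", "start-all"])
        return host

    if __name__ == "__main__":
        # Several hosts at a time; any failure raises and halts the rollout so a human
        # can take a look before more regions are touched.
        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
            for done in pool.map(update_host, HOSTS):
                print("updated", done)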

Saturday, November 10th

Unrelated to the deploy (but included here to clear up any confusion), on Saturday at 5:20pm we suffered another VPN outage, which resulted in hundreds of regions being offline for just under two hours. The cause was the expiration of a certificate used for the VPN. We replaced the certificate, and our DNOC team brought the affected regions back up.
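
One of the take-aways below is to build pre-emptive alerts around date-driven work like this. As an illustration of the idea – a sketch only; the endpoint name is a placeholder, and a certificate that only lives on the VPN gear would need to be checked differently (e.g. by parsing the file with openssl) – a small cron job can warn weeks before a TLS certificate expires:

    import datetime
    import socket
    import ssl

    def days_until_expiry(host, port=443):
        """Days remaining on the TLS certificate presented by host:port."""
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        not_after = datetime.datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
        return (not_after - datetime.datetime.utcnow()).days

    if __name__ == "__main__":
        for endpoint in ["vpn.example.com"]:          # placeholder; list the real endpoints here
            remaining = days_until_expiry(endpoint)
            if remaining < 30:
                print("WARNING: certificate for %s expires in %d days" % (endpoint, remaining))
            else:
                print("OK: %s has %d days remaining" % (endpoint, remaining))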

What Have We Learned

Readers with technical backgrounds have probably said “Well, duh…” while reading the above account. There are obviously many improvements that can be made to our tools and processes to prevent at least some of these issues from occurring in the future. (And we’re hiring operations and release engineers and developers worldwide, so if you want to be a part of that future, head on over to the Linden Lab Employment page.)

Here are a few of the take-aways:

  • Our load testing of systems is insufficient to catch many issues before they are deployed. Although we have talked about Het Grid as a way to roll out changes to a small number of regions to find issues before they are widely deployed, this will not allow us to catch problems on central systems. We need better monitoring and reporting; our reliability track record is such that even problems such as login failures for 1/16th of residents aren’t noticed for a significant period of time. (A sketch of the kind of check we have in mind appears after this list.)
  • When problems are detected, we don’t do a good enough job internally in communicating what changes went into each release at the level of detail necessary for first responders to be most effective.
  • Our end-to-end deployment process takes long enough that responding to issues caused during the rollout is problematic.
  • Our tools for managing deploys have not kept pace with the scale of the service, and manual processes are error prone.
  • Track date-driven work (e.g. certificate expiry) more closely; build pre-emptive alerts into the system if possible.
  • Be more skeptical about doing updates while the service is live, especially when they involve third-party providers.
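
On the first bullet, here is a sketch of the kind of check we have in mind – the thresholds and the way login results are counted are invented for illustration, but the principle is just to compare the last few minutes of logins against a longer baseline and page someone when the success rate sags, even if only a small fraction of residents are affected:

    def success_rate(results):
        """results: one boolean per login attempt in the window (True = succeeded)."""
        return sum(results) / float(len(results)) if results else 1.0

    def should_page(recent, baseline, max_drop=0.03):
        """Page when recent logins are noticeably worse than the longer-term baseline.

        A 1-in-16 (~6%) failure rate is easy for humans to miss in a busy channel,
        but trivial for a threshold like this to catch within minutes.
        """
        return success_rate(recent) < success_rate(baseline) - max_drop

    if __name__ == "__main__":
        baseline = [True] * 990 + [False] * 10     # a normal day: ~1% login failures
        recent   = [True] * 150 + [False] * 10     # 1/16th of recent logins failing
        print("page the on-call!" if should_page(recent, baseline) else "all quiet")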

109 Responses to Second Life 1.18.5 Server Deploy Post-Mortem

  1. sirhc DeSantis says:

    Thank you. We just like to know whats going on. This is welcome.

  2. Suzi Phlox says:

    Yes, thank you for keeping us informed. I am glad to see that you are learning from this rather shoddy performance. Blame assessment accomplishes little but fixing the problems does.
    In my operation we try to always keep in mind the old saying: Don’t let your reach exceed your grasp. Linden Labs needs to engrave that on everyone’s forehead.

  3. Disraeli Calderwood says:

    Awesome play by play, thanks so much for the communication! Makes me feel more on the team (don’t take that as willingness to run cable though…) 😉

  4. Bart Kovacs says:

    Gutsy story, nice to hear that you are at least honest enough to confirm that you have not kept pace with the scale of the service as it is now.

    I’m awaiting anxiously to hear when you have reports about possible solutions to prevent the lessons learned in the future ;o)

  5. Darien Caldwell says:

    Thanks for the timeline. This sort of communication helps us residents better understand the frustrations faced by the teams working to improve the infrastructure. We know you are putting your best efforts into it. 🙂 Keep up the good work.

  6. Dreft Jurack says:

    Out of interest, do you publish a list of which sims are on what server locations? I’m curious since I live on a corner joining two of them and have a tough time crossing both and usually have to relog to put all my attachments back in order. But in any case, I’d be interested to know where I live physically.

  7. Linus Rossen says:

    This is the sort of transparency we need. Thank you for letting us all know.
    We’re all stakeholders in this, now onto the hard bit – learning from it…

  8. Marcuw Schnook says:

    Well, if that means I will finally be able to come back to SL for more than a few minutes, I truly hope you take lessons from this.

    That said, /tiphat for coming out as you do right now about the mistakes made.

  9. elvis woodget says:

    thx for the info, thats what we all need!
    a lot of people in game can’t understand how it works at all, but with a few posts like this they will in time understand how it works and know what the Lindens do to keep it running well.

    regards

  10. SharpKnife Beaumont says:

    Hehe, what a confession. Yes, it was a rough week, but that’s what you get messing with hardware and updates (don’t hit Murphy’s Law there!)
    And to be honest, Lindens, I think lots of residents will like this reporting. And to all those who say “hmm, other worlds are coming up”:
    sure, they are still years behind what SL is today. And it is a super complicated system to keep running. And please don’t forget, if one router
    burns out on your path to the servers, that’s just bad luck and not the fault of LL. Guys, SL still rocks and kudos to the concierge team!

  11. Nano Ashby says:

    It’s a really good breakdown of events, and it’s reassuring to see that you are attempting to evolve procedures and protocols to grow in line with the scope and complexity of the Grid/SL

  12. Nouar DeCuir says:

    “Well, duh…”
    Got knocked offline by a typo!

    Sounds like you have an ‘interesting’ week to look back on…
    Wonderful post, grid still alive: job well done. Go have a beer on me.

  13. Seraph Nephilim says:

    Given that the connection between the two facilities has repeatedly failed and seriously impacted SL, you should also be considering redundant connections that use different paths/certifications/technologies/etc. This way if one fails you still have the other. This is obviously a major choke point for SL.

  14. Forest says:

    Sounds like every rolling restart is a different breathtaking story…
    Thanks for showing what happens behind the curtains a little bit.

  15. Hal says:

    ‘Dont let your reach exceed your grasp” is a phrase guaranteed to deliver nothing but diminishing returns, one of the worst maxims ever uttered IMO…

    Reach for the stars, see where u get to…lag be damned… 🙂

  16. Takat Su says:

    1. Thank you for the above post mortem. it actually is somewhat of a relief to see that you are looking to your mistakes and trying to rectify them. Having had to update distributed systems, I understand it can be, well, interesting, the way they can find failure modes that are seemingly non-deterministic and related more to the price of chocolate beans in some remote village in the Andes Mountains.

    2. (and this is not an accusation it really is a serious inquiry) do you practice roll outs on some system – i.e. create a deployment checklist and then build up a (mini) system that is a duplicate of the main grid, give the checklist to someone who was not involved in its creation, and have that person do the checklist to catch any deployment checklist errors? I realize this would NOT have caught some of the above problems, but it is a question for future deployments that may be just as complicated.

  17. Thanks Joshua!
    This is good communication!
    We really appreciate!
    Sincerely,
    Lukas Mensing

  18. Milo Bellow says:

    “believe only half of what you see and none of what you read…”

  19. Maldoror Damone says:

    I’m not sure whether to laugh or cry after reading all that……

  20. Psistorm Ikura says:

    Thanks a lot for this extensive post mortem. It’s appreciated that you go to such lengths to inform us of just what went wrong. Also, it is good to read that you draw conclusions from last week’s issues and devise a set of useful pointers to keep in mind for now 🙂

  21. Moose Maine says:

    Great job of tracking what you’re doing. A year ago, there used to be massive in-world notices when problems were encountered. Many of us fought the system all weekend, but didn’t see any Linden “hey, we’ve got a problem, and we know about it” notices… I think many of us would have just stayed offline and out of your way had we known.

    With that being said, and having worked in those type of server rooms, kudos to those that went with no sleep this weekend. We feel your pain and really do love you! You ARE doing a great job!

  22. Christi Maeterlinck says:

    Oh, Linden dear Linden, poor dear system…
    Slow down, as every single one of your users has been telling you for over a year now. Sort out the old effing errors before developing super new services that create new ones.
    Looks like you’re beginning to notice the message your community is sending you.
    Thank you!

  23. Blinders Off says:

    You know what? I know you folks have a lot on your plate. But I have NEVER seen any company have so many connectivity problems, so many server outages, so much downtime or so much whine whine whine problems. I’m sorry, but you folks are charging more for your service than ANY hosting company in my experience.

    I know a guy runs a little back-street server shop. On his wall is a great big sign: “100% Uptime guaranteed or your money back”. How does he do it? Redundant servers and mirror backups, just like any reliable service. When was the last time you saw Google offline? How about Yahoo? Microsoft? Quake? Unreal? (Yeah right, like Linden Lab has more cutting edge activity than any of those companies).

    When you charge more for an island than it takes to buy a new car… I have very little sympathy for downtime or excuses for such. You’re playing with the big boys now. Time to act more like pros than game-playing teens. Sorry, no kudos for this one folks. When you charge $5000+ a year for virtual space, you have an obligation to your customers to get it right.

    I’m not trying to bust your chops here. Oh wait, yes, I guess I am in this case. We’ve had enough excuses. If I may suggest: stop playing games with new “features” and get to work stabilizing your platform.

  24. Drako Nagorski says:

    well first off, TY JOSHUA LINDEN!!!! havent seen this kind of real transparency for…. well ever! WTG 😀
    and now for the bitching that you should know youre gonna get:

    ” * Our load testing of systems is insufficient to catch many issues before they are deployed. Although we have talked about Het Grid as a way to roll out changes to a small number of regions to find issues before they are widely deployed, this will not allow us to catch problems on central systems. We need better monitoring and reporting; our reliability track record is such that even problem such as login failures for 1/16th of residents aren’t noted for a significant period of time.
    * When problems are detected, we don’t do a good enough job internally in communicating what changes went into each release at the level of detail necessary for first responders to be most effective.
    * Our end-to-end deployment process takes long enough that responding to issues caused during the rollout is problematic.
    * Our tools for managing deploys have not kept pace with the scale of the service, and manual processes are error prone.
    * Track date-driven work (e.g. certificate expiry) more closely; build pre-emptive alerts into the system if possible.
    * Be more skeptical about doing updates while the service is live, especially when involving third-party providers”

    well you said it yourself…. FIX IT 😀 youve done good so far… also read the blogs? maybe just a little?

    **on review of some older blogs i found posts by Lindens…. the helpful ones were from Joshua and Jack Linden. WTG 😀

  25. Elrik Merlin says:

    Thanks for the transparency… this helps us a great deal to know that you are willing to own errors and learn (we hope) from mistakes. The more you are this open the more we can respect you. It’s been said elsewhere, but I still feel that solving old problems is more important than adding new features. Good luck and thanks for keeping us informed.

  26. Dekka Raymaker says:

    “and the simulators were suck waiting.” (our simulators suck?)

    I guess that was a Freudian slip then, rather than just a normal typo. 🙂

    Thanks for the info, even though I’m not a techy I believe I can understand what happened, good post.

  27. Porsupah Ree says:

    At any point, was any Linden heard to declare they’d chosen the wrong week to stop sniffing glue?

    (For the unfamiliar: http://www.imdb.com/title/tt0080339/quotes)

    Seriously, thanks for the report. It’s a welcome insight into a fur-raisingly complex process.

  28. Fledermaus Messmer says:

    You Lindens are simply amazing! I have never seen a company so candid!! Any change in so complex a system is difficult, and you are definitely improving with each major change. I SO wish there were a position available, especially work-from-home, that I were qualified to fill!!!

  29. Magian Merlin says:

    Very honest post. I really appreciate it a lot to hear the whole truth and nothing but the truth coming from Linden Labs for a change.

    It’s also a very good assessment of your current weaknesses. Acknowledging them is the first step in solving them. There is no point in being in denial about them or covering them up. The residents will see right through it if you cover it up. Honesty is appreciated as can be seen from most of the replies to this post.

    Again, great job guys.

  30. Travis Lambert says:

    @23

    Indeed, Linden Lab *does* have more cutting edge activity going on than at Yahoo, Quake, Unreal, etc. Of course, ‘cutting edge activity’ is a subjective term, so we can both be right & wrong at the same time 😉

    I’ll say this though: When I’m not playing Second Life, I’m playing Everquest. You know… that old MMORPG that’s been around since 1999? 😉

    I play on the Nameless server. At least once a week, the whole server crashes… sometimes for hours… and not a word from Sony. At least every other day, one of the regions I’m in crashes. For months, I’ve been plagued by the infamous “West Bug” – which…. you guessed it…. makes it so my entire screen blanks out when I look West. (Its really great trying to grind out xp in a dungeon when you can’t look West) 😉

    Everquest doesn’t even begin to have the dynamic content that Second Life does…. and inherently with dynamic content, comes system challenges.

    My point is: Second Life, and Linden Lab by no means have a corner on the market of system instability. Everquest, which is run by Sony – also has its share of system challenges. They can’t even fix a West Bug that affects anyone running nVidia 7xxx cards after months of complaints.

    I think Linden deserves some Kudos for their post-mortem here. That aint being a cheerleader – its giving credit where credit’s due 😉

  31. simon sugita says:

    I’m sure stabilizing is the main thing LL already does, if you read clearly what these updates and rolling restarts and new hardware were supposed to do. Yes, new add-ons/functions for SL maybe made SL less stable, I noticed this too, but they also made SL much broader for the mass public, and the more people are interested in SL and LL, the more it will benefit our VR world called SL 😉 I’m sure.

    Ty Linden*labs for keeping us residents updated about what’s going on, and good luck to us all (seems you could use it most)

  32. Argent Stonecutter says:

    It seems to me that you could have used the het-grid capability to test-update the simulator changes, since the simulator-central-server protocol was flexible enough that you were running the old simulator code with the new central-server code.

    Perhaps it would be advisable to treat central server updates as separate operations from simulator updates?

  33. Solomon Draken says:

    I really appreciate you being candid and telling us directly what happened. I can only speak for myself and some of the residents who have echoed my opinion before: as long as you tell us what’s going on we’ll be much happier. I fully understand how complex a system you are working on, and with this many people using it and it being updated all the time, stuff is bound to happen. Again, thanks for sharing it with us straight up.

  34. Slartibartfast Magicthise says:

    In my company, we follow the simple rule “Do No Harm”. But usually what we do ends up necessarily destructive, and we end up in the “Rob Peter to Pay Paul” mode. This usually upsets the people on the short end of the stick, and since the customer is always right, we put our nose to the grindstone and present them with a “You Scratch My Back We’ll Scratch Yours” proposition. Time is money and the early bird gets the worm, so it’s our “Early To Bed Early To Rise, Makes A Man Healthy, Wealthy, and Wise” approach that allows us to stay ahead of the pack. Sure, we’re busier than a long-tailed cat in a rocking chair factory, but if we can build a better moustrap, people will beat a path to our door. Although we can lead a horse to water, we can’t make him drink, we can still make bacon while the griddle’s hot. Since the pen is mightier than the sword, you must keep in mind that those who live by the sword also die by the sword. Always remember; “Only You Can Prevent Forest Fires”.

  35. Minimalillusions says:

    Thank you and burn in hell quite while im opening my shop. Damned.

  36. Gully Miles says:

    Joshua, thanks so much for this.

    You come across as an intelligent group of people, making it up as you go along (in the best possible sense). And explanations like this are welcome — partly because as a resident it’s good to know what went wrong, and partly also because it’s good to know that this level of careful thought goes in to the post-mortem analysis.

  37. Vylixan Fallon says:

    Thank you very much for the explanation. It’s appreciated. Thank you.

    Keep on doing the good work. You get a lot more understanding when you explain things 😉

  38. bobbyb30 zohari says:

    So how long did it take to figure out….

    Its always interesting how you ALWAYS blame the hardware!!!(Getting rather old and lame, find a new scapegoat).

  39. Sue Saintlouis says:

    Thank you Joshua!!! We need more posts like these on the blog!

    Having had to go back to 1.18.3 (1.18.4 crashes in less than 60 seconds for me), I hope that 1.18.5 will work better :).

  40. lucy lukas says:

    Good work guys , thanks for the information, its great to be kept informed , thanks 🙂

  41. Dirk Felix says:

    Please do the same for the platform and client – DUH

  42. Irie Tsure says:

    If only LL communicated this professionally concerning the six-month group notice failure ‘critical’ issue. *sighs*

  43. Mackloutz Broch says:

    it would be really nice if we didn’t have to download the game.
    it would be cool if, like RuneScape, you didn’t have to download anything,
    because I can’t play the game since it won’t download the new version

  44. selena weames says:

    this is great…now can you tell us why myself and many many others can not teleport without crashing since downloading the new viewer? it’s not our caches, it’s not our firewalls, it’s not our attachments. it never happened before the new viewer. help us please.

  45. I’ve always thought that LL needed someone whose responsibility was to review changes and write a change log.

  46. Istephanija Munro says:

    Thanx for sharing the recent updates, interesting read… and bravo to the “what we have learned” chapter, i hope you can sort these things out, it would certainly improve all our experiences!

  47. Millie Thompson says:

    Good work LL! And thank you for the explanation of the work you have provided us. Things have come a long way since 12/18/2002 ^^

  48. Laraya Mills says:

    Josh – really appreciate some contents of your posting. But do not appreciate the header… it is not “post mortem” really – for your users (that is “we”) – that is just the very normal daily experience, no matter what happens at yours’ in the background. Can only say – as far as I am concerned – the days when everything runs smoothly and without any problems are the days which I experience to be exceptional. When I experience a day that runs smoothly, it makes me suspicious even. The issues that happen do not really change – nor the stories around them (except some technical details). If you should ever get tired of writing such postings, for that very part (getting tired of writing such postings) you shall get my full understanding. I think too simply maybe (Europe here) – but a thing people pay money for is supposed to work – and the moment when it does not, that is supposed to be the exception, not the rule. Good luck further – Lara

  49. Kristian says:

    Cut the handwringing and get Windlight and New Search out!

  50. A few quick answers:

    Dreft Jurack asks, “do you publish a list of which sims are on what server locations?”

    No, as this is something that may change without notice. However, as a convenience to ourselves, we do maintain regional clustering – sim1000 and sim2000 are more likely to be in different physical locations and/or communicate over different VPNs than sim3456 and sim3457. (You can look at Help > About in the viewer to see which sim is currently running the region you’re in. But remember, that can change!) We also try to ensure that adjacent regions are in the same colo facility when possible to alleviate region crossing time issues.

    Takat Su asks, “do you practice roll outs on some system – i.e. create a deployment checklist … to catch any deployment checklist errors?”

    Yes, albeit without the level of detail you suggest due to time and resource limits. (Did I mention we’re hiring?) I’m normally intimately involved in the server deploys and create said checklists and get them reviewed, with rollback clauses in many cases. In this case, I was actually the person pushing the buttons for most (but not all) of the steps involved. (Did I mention we’re hiring?)

    Dekka Raymaker points out a typo – fixed!

    Argent Stonecutter suggests, “you could have used the het-grid capability to test-update the simulator changes”

    That’s true. The extant het-grid capability, however, is just that – a capability. We are not yet at the level where we have good tools for using it; those are being developed. In practice, our use of het-grid has been prone to operator error, and so we have not used it as much as we would like. Getting over that hump to catch issues earlier is becoming a high priority.

    Argent continues, “Perhaps it would be advisable to treat central server updates as separate operations from simulator updates?”

    Yes – in fact, for future “live” deploys I’m planning to do centrals one day and simulators the next, since it’s no longer practical to do everything on one day. This provides opportunities for additional testing in this state.

  51. Pingback: [COMPLETE] Rolling Restart: Coming Soon to a Sim Near You « Official Linden Blog

  52. Kahni Poitier says:

    How long until the servers (my guess would be asset servers, but what do I know) are fixed to the point where I don’t have attachments locking up on me when I teleport? If it’s not my AO that freezes on me, it’s one of my HUDs; if not that, the hair doesn’t rez, or the shoes won’t rez, or I can’t remove items that DO give me issues.

    It’s getting REALLY told to teleport, relog, teleport, relog……..

  53. Kahni Poitier says:

    err, getting really “OLD” to teleport, relog, ………

  54. Ron Crimson says:

    We need more blogs like this one. Then release them as a pocket book named “Linden Lab: What Really Happened Behind The Curtains.” It would make a kick-ass thriller to read during my next vacation. LOL 🙂

  55. Ravanne Sullivan says:

    How about retasking your VPN to the backup/load balancing role it is meant for and getting a reliable dedicated connection to the co-lo?

  56. chrism mollor says:

    Guys

    This is what we want to see:

    “What Have We Learned”

    Thankyou.

    yes – there are a lot of “well duh”s in there. Hell – there always are when you are doing this sort of thing. What I felt in the past is that while you may have learned things – they may not have been made public, or were possibly unlearned very quickly. This was – shall we say – “sub-optimal”.

    What I appreciate in the above is that you have given facts, related those facts to what was going on, reflected on them and come up with ways forward. Good.

    I’ve often run folks off site not because of mistakes – but because they didn’t learn from them – or even give indication that they could learn from them.

    Mistakes happen and systems need to grow and develop. This I think most of us can understand and, to some extent, appreciate. But to not listen to feedback and to not reflect on those mistakes – well – that is plain bad. But…

    This shows that things are moving in the right way.

    Thanks for your report and I hope (only sort of) to hear more of these detailed post-mortems later ;-)

    Also – don’t forget to post the things that went well. That is equally important. The residents need to know that every update also has its upsides – so do you.

    Please keep this up. It’s appreciated.

  57. Pingback: The Grid Live » Second Life News for November 13, 2007

  58. The explanation is nice but still lacks many, Many things, such as:
    Warnings… It amazes me and many other sim owners and members how a company that can pop up a blue window to every single person online doesn’t bother to warn its users of things that can be affecting their income, their profiles, their visitors, land titles and descriptions.
    How an update has:
    Deleted or seems to have deleted scripts from vendors.
    Has Groups tools working when they feel like working.
    has stopped some objects from accepting payments.
    How the last rolling restart was a Rollback and changed many things a user may have done before hand.

    So SL users go and check all those things NOW !!!

    Too many are tired of hearing:
    We’re sorry for the inconvenience
    and
    Lets Restart your Region

  59. Daten Thielt says:

    Brilliant blog going. About all your hardware problems – is it the sim servers or other servers? If it were me I would find a decent stable hardware setup and then use that for everything, that way hardware problems would occur a lot less

  60. chrism mollor says:

    @Joshua Linden…

    “Yes – in fact, for future “live” deploys I’m planning to do centrals one day and simulators the next, since it’s no longer practical to do everything on one day. This provides opportunities for additional testing in this state.”

    Nooooo….

    Leave a day inbetween at least. Let things settle. Let the team take a step back and appreciate what they’ve done and also get some rest.

    Updates are stress. Too much stress degrades performance. Not enough testing leads to wrong assumptions (witness above).

    Leave at least a day for the dust to settle…

    please…

    you know it makes sense…

    And did we say – thanks for listening to us?

  61. Bee Mizser says:

    Thanks Joshua for the candid and honest appraisal of what went wrong. It’s good to see you recognise your weaknesses.

  62. Thunderclap says:

    Great job of explaining in detail what happened, what you did about it and the resolution. Excluding 36 (you need a new video card), I think this is the most positive blog response I have seen.
    This is bleeding edge stuff, and I respect that you guys are being honest and saying it.
    And I think that the other person simply wanted to know if his sim was on a blade that was active at which colocation facility.

  63. Judi Newall says:

    TY for the info, I understood about half of it but I’m learning fast! Much appreciate these kind of posts though.

  64. johnny says:

    so how about a refund for the sim owners

  65. Gracie Foden says:

    Very thoughtful of you to take the time to explain and keep us up to date.

  66. roland francis says:

    How do you mean, good news? Compliments on that change of the unwritten rules 🙂

  67. Drako Nagorski says:

    hmm i cant post my bug 😦

  68. Drako Nagorski says:

    OMG OMG BUG!!!!!!!!!!!

    i cant place objects from my inventory on the ground. the object doesnt place, and the selection particles apparently go from the center of my HUD to wherever i try to put the object. ive sent multiple bug reports on this and a camera bug that focuses on the HUD instead of inworld. im posting here cause Joshua replies to posts and i dont trust the bug reporting process.

  69. sylvie matova says:

    @23 [..]I know a guy runs a little back-street server shop. On his wall is a great big sign: “100% Uptime guaranteed or your money back”. How does he do it? Redundant servers and mirror backups, just like any reliable service. When was the last time you saw Google offline? How about Yahoo? Microsoft? Quake? Unreal? (Yeah right, like Linden Lab has more cutting edge activity than any of those companies). [..]

    You’re having a laugh aren’t you?

    Google – Gmail often goes down, Google chat about once a week.

    Yahoo – Mail services extremely unreliable and they’ve spent two years trying to get their new email service out of beta and failed.

    Microsoft – Where do you want me to start? Windows update server very intermittent, downloads have failed twice this year telling all customers that they were unlicensed pirates.

    Ebay – They take most of the UK site down every Friday morning for maintenance (or they used to – I’ve given up using it on Fridays).

    And if you think Linden Labs aren’t cutting edge then please let me know what is. MMORPG systems like this are where it’s at. The others are just dumb websites.

    On a better note – Thanks Lindens for posting a very detailed and mostly helpful blog entry. It’s what everyone has been crying out for for ages. MORE PLEASE.

    One thing that would make it better: please post in-world notices when these issues are occurring. It used to happen and is very helpful.

  70. Drako Nagorski says:

    ¬¬ respond to my bug #69

  71. Cindy says:

    Thank you… finally I see some light, after being so long in the dark!

  72. Marianne McCann says:

    Thanks.

    More like this. 🙂

  73. Yami Katayama says:

    I’m normally a little critical of what gets posted here – but I have to say that I’m impressed with this level of confession that things simply didn’t go right, and a list of lessons learned from this entire fiasco. Now, as long as someone actually DOES something with this list and it doesn’t just become something that someone – a year from now – looks back and says “DUH! We should have seen it then!”.

    I am very glad that we are being given a blow by blow of what happened – we, the everyday users and merchants, are some of your largest investors when you consider what we do to stimulate the economy. I do think we deserve some insight into what is going on with our investment. As one person put it, owning a sim “costs more than owning a new car” – with this kind of investment, we do deserve some feedback as to what is going on – and some accountability. I finally feel like we have seen some accountability in this post. Now, if someone actually takes the next step and implements change to do something with that accountability, things might actually get better.

    I also agree with another poster that you guys REALLY need to listen to us more – We tell you things are wrong and we get brush offs and “the company line” answers. A lot of us feel like our ability to interact at all with LL has been taken away. Live help gone and any support at all is hidden behind the web site somewhere. We realize that this entire thing is a work in progress – but if you don’t listen to the people who are USING the product on a day to day basis, how can you possibly judge how it’s working? For a LONG time now, people who use the product have been saying “fix stability” and “fix what’s wrong BEFORE you implement more features”. Maybe stop announcing these “features” LONG before they are ready so people will stop hounding you to get them out BEFORE YOU FIX THE UNDERLYING PROCESSES THAT WILL SUPPORT THEM! If those are not stable, you cannot possibly expect that anything will be MORE stable with the new features on top of them. Pushing to get them out is only going to make underlying, broken or crippled processes worse.

    … and to Slartibartfast Magicthise… wtf???

  74. Lenny Looming says:

    I would love to work for LL. I’m a single 35 y/o guy with a programming degree who lives and breathes all things virtual and have no ties that bind. So when you guys gonna send me a plane ticket? I can be there yesterday…

    I’m actually checking out the job postings now to see what I might be interested in doing.

  75. alf lednev says:

    The explanation was nice, Joshua – not really honesty, more a confession. Everyone inworld knew that release sucked, but only after 2 weeks… a belated confession?

    Hiring staff? Try and hire some competent IT Project Managers. The IT Project Management protocols have been around for decades; sadly it seems no one in LL actually knows how to apply them. Seems way too many techos running around and no real IT managers. The company may have expanded, but its management practices haven’t. No. 23 was right: when you’re charging big dollars to “play” in your world, you had better start giving value for money. When the first viable competition comes along, a lot of Lil Lindens will be unemployed as people leave in droves.

    Another old saying is “too little, too late”

  76. chrism mollor comments, “don’t forget to post the things that went well.”

    I refrained from doing so, but since you’re asking:

    * Despite all the difficulties, at least 20,000 residents were online at all times throughout the debacle.
    * During the worst point of the update (Thursday 1pm), we still had nearly 50,000 residents online.
    * Most residents and most regions were unaffected by any of the updates (apart from presence glitches and region restarts)
    * Where practical, changes were rolled back as soon as the problems became clear and the rollback was deemed safe.

    Compare this to monolithic updates in the days of yore, where following a 6-hour downtime the login storm would crush the servers resulting in 5 hours of follow-up work to restore the service, and days of lingering bugs.

    See, we do learn and improve things! 🙂

    chrism mollor continues, “Leave a day inbetween at least. Let things settle. Let the team take a step back and appreciate what they’ve done and also get some rest.”

    That’s a great sentiment. At least for now it’s not practical as we do want to keep the platform moving forward. We ship software to get fixes out and improve infrastructure, and we do want to get it out as fast as is practical. We also have a very small team doing the deploys as well as many other tasks, and there is pressure to get things done and out of the way. We’re working to evolve the platform so that development and deployment of different components is even further decoupled, but it’s gonna take a while to get there.

    Yami Katayama writes, “We tell you things are wrong and we get brush offs and “the company line” answers. A lot of us feel like our ability to interact at all with LL has been taken away.”

    Linden Lab has grown a lot in the past few years, and that can lead to challenges. When Linden was smaller, you’d probably have your issues heard directly by a developer who was involved in a change and could make a fix. Now there are more developers and a lot more support personnel trying to help and speed things along. We get more done, but this comes at the cost of not being as personally involved in every issue. (When I take the time to write things like this blog post or this comment, I’m not spending the time to fix some script issues or help two teams coordinate their projects.) It also means that feedback tends to be diluted. Trust me – there’s very little “company line” here. If you feel you’re not getting a straight answer, it’s probably because the person posting doesn’t know the problem as well as you do!

    (I’m not sure if that helped or even made any sense.)

  77. Blinders Off says:

    @30 You make good points. All of them valid. Sony isn’t exactly known for friendly customer service (I found that out when I purchased a Palm PDA. Haven’t seen those on sale at Best Buy in a long time. LOL.) But… Sony doesn’t charge the price of a car to use their system either, do they? I can play Unreal and Quake free, for the cost of the software. When people are paying $295 a month for a piece of virtual land, I expect it to run like the new car it’s costing to operate. If I bought a car and the engine died 5 times a day and the car wouldn’t start for 45 minutes several times a month and I constantly had to shut it off and restart it just to get it to drive above 20mph… I think I’d be on the dealer’s doorstep the same way I’m on Linden Lab’s doorstep here.

    The basic problem is that LL is trying to shove too much into one box. There are thousands of people on SL who would be perfectly happy with fewer toys and a stable platform. If people a year ago had been given the choice between flexis and sculpties… or a lag-free crash-free platform… which do you think they would choose? If given a choice now between a stable platform and windlight… which would you choose?

    Shoot, if given a choice between GROUP TEXT CHAT working and the next gizmo LL has in mind to lag the system even more… know what I’d choose? Know what I’d rather have than the next toy? GROUP NOTICES actually getting to every member in the group. That would be amazing.

    @70. Don’t even get me started on Micro$oft. I’ll back you all the way there. I sometimes think Philip Linden is an alt for Bill Gates. LOL. BUT, as far as Google and Yahoo and Quake and Unreal and the others go… wha? Sorry dude, I’ve been a heavy user of the internet for years, and while there are occasional glitches in ANY system, I have never, ever seen Google totally offline, or their mail servers borked for any longer than a few minutes over a spread of several months. And I assure you that Google is just as bleeding edge as Linden Lab – if not more so (when was the last time we saw Linden Lab draw in satellite information and display it on a world-wide map system… with users being able to construct buildings on that map?). As for Unreal and Quake etc, I never, ever lost an Unreal feed in the middle of a game. Not once. And that’s with about 20 players battling 30 or 40 NPC monsters in real time and so much fireworks it was hard to see which was player and which was monster.

    So, we have three avatars standing around on a sim chatting, and wham, suddenly they can’t move, chat dies, and the sim crashes. Oh wow, real server taxing there…

    If Linden Lab charged $49.95 a month to host a sim (or even $95 a month) I don’t think I’d be saying much. But at $295.00 a whack plus a $1,650 setup fee to stack 4 sims to a server… at those prices I expect that engine to run smooth.

    * llTargetOmega and llSetRot haven’t worked for months. LL knows this and has failed to fix it.

    * Group IMs been borked for months. LL is aware of that and has failed to correct the problem. That’s simple CHAT people. That’s not bleeding edge. Chat rooms have been doing it for decades.

    * Group Notices aren’t getting out to all members. What does that take that is bleeding edge? Looking up a list of names and sending a notecard to each one? Wow, that’s a tough one. It’s understandable why it would be taking tech support almost a year, and they still haven’t got that fixed. Accessing a datafile and transfer of a notecard must be a real nightmare.

    Would make Google Quake in their boots, it would be so Unreal. XD

  78. Jayden B says:

    Thanks for the info Joshua, many of us are starting to see the complexity of the world we “live” in. With so many intertwined systems it is obvious even small changes and human error can have big repercussions.

    One thought could be a presence of in world people, either Linden staffers or responsible volunteers (say drawn from Jira regulars) who can monitor in world and report weirdness.

    So many times we see the common issues before a major outage, even a few minutes advance notice that things are not right might help.

    This could well be naive, but sometimes I wonder if the Lindens are ever in world any more. 😉

  79. Deandra says:

    Sadly, I’m left to this arena for this. Apologies in advance…

    On the SL website, any attempts at logging in brings me right back to “Resident log-in” … no errors, just that. I enter the correct first and last name and correct password and blammo, back at the beginning. I can log into the game with no problems (same information), just not into the site to check transaction history. Anyone else?

    PS – Great job on the info <–see? I can be on topic 🙂

  80. Pingback: Computer Freezes with Second Life » Kabalyero

  81. Deandra says:

    Update: My nephew changed the date on my computer to the 30th of November. Apparently that was the problem — If anyone ever has that issue, check your computer’s date setting! 😀

  82. nel shan says:

    This is the kind of post I want to see more of.

  83. Funkfenomenon Felisimo says:

    I do have one overwhelming question…

    Why then are the Lindens not yet open sourcing the server as they (if I’m not mistaken) state they plan to do?

    I’ve been very active in the open-source environs for years and when a project is made available online there appears to be no end to what can be accomplished.

    Take for example the OpenSim project that aims, openly, to replicate everything that is done here in Second Life. The thing is, at the rate they are travelling with their development, they will undoubtedly surpass LL in a few years’ time and you can bet there will be a mass exodus when that time comes.

    I ask why LL doesn’t open source right away? Rather than hiring a bunch of IT guys they could get countless tons of free help from all over the globe to update, modify, tweak and polish the Second Life server software, and it would also remove all manner of complaint coming from the general populace.

    It makes me think too because I own 9,704 Sq. M. of land “so far” and that’s $40 a month on top of my $9.95 premium. When you consider $50 a month that’s $600 I’m paying in a year right now.

    I know of other 3D worlds where user-created content is incorporated, like City of Heroes, and they only charge a flat $15 a month no matter what you build or how large your group is.

    I see the OpenSim project becoming the personal web servers of tomorrow where 3D “sites” will exist all over the place that one can fly through.

    So my stance at present involves deciding if I’d rather consider working for LL on a closed-source project that is bug-ridden, or devote my free time to growing the OpenSim project and doing my best to make it a bug-free reliable and steadfast virtual world server that will knock the pants off LL in a few years…

    Face it – competition is bound to be ferocious very soon. I mean, if LL boasts 10 Million registered users then it’s obvious someone somewhere is going to financially back a competitive environment unless LL opens the source to their server…

    Just my 2 cents…

  84. Weedy says:

    While I can appreciate your retrospective, saying nothing during the operation was inexcusable. The inability to teleport, and hard-crashing as a result, coupled with the absence of updated reporting, caused me to revisit my own system, needlessly reset my router and make a tech support call to my ISP.

    If folks at LL were as forthright as they were at sweeping things under the carpet, you might actually gain some support. Unlike this way, which is far too “after the fact” PR.

  85. DR Dahlgren says:

    Interesting, and appreciated. On your first two bullets under Take Aways:

    I have been saying exactly this same thing about how the client and server updates are performed since late 2007. While it is sad to know that we as residents, some quite skilled, are not paid any heed, it is gratifying to note that LL is now truly aware of these problems, and may actually do something about them to make the updates and upgrades a smoother path for us all, residents and Lindens alike.

    And just an FYI, as a Director Level Project Manager, well versed in software deployments, far more mission critical than SL, I have applied, several times. LL didn’t even have the grace to say “Thanks but no thanks”. Just ignored. 😦

    DRD

  86. archie lukas says:

    yah Information
    Thanks for sharing

    Just a thought – better ‘live news’ at the time of the problem, so people with a problem know the limitations and can reduce their own expectations?

    It would resolve a lot of cursing and teeth gnashing, mine are getting worn down, see?

  87. Kidd Krasner says:

    Speaking as an experienced SW QA engineer, I’d be happier if the take-aways from the post-mortem were more action oriented. Stating them simply as lessons learned is only the tiniest step towards using what you learned.

    So, for example, when I see that a typo caused a problem, my conclusion is that the process should be changed to require code inspections, or prohibit the bypassing of code inspections, even when in panic mode. (If it actually was inspected, then more digging needs to be done.)

    When I hear “manual processes are error-prone”, I conclude “dedicate a real budget to process automation, with a goal of eliminating almost all manual processes within some time frame (hopefully, less than a year). Even if it means losing a job req from some other place.”

    I think the worst, though, is “When problems are detected, … communicating what changes went into each release at the level of detail.” This is a red flag for anyone in software process management, indicating that the process is unmanaged. Perhaps it’s simply unfortunate wording, but the details of each release need to be in place before the release is frozen, not after a problem is detected.

  88. Matthew Dowd says:

    There are (anecdotally) complaints that the performance of the grid is degrading. There are also complaints of the client not performing well, and a large number of reports in jira of it suddenly eating 100% CPU.

    Many of these coincide with LL moving more services from UDP to TCP. I have wondered for some time whether these changes are causing performance problems both client and server side. The problem mentioned in this analysis only supports that worry:

    “A server-to-viewer message (related to the mini-map) was updated and changed to move over TCP (reliable, but costly) instead of UDP (unreliable, but cheap and fast). On regions with many avatars, this would cause the simulator to become backed up (storing the “reliability” data) and eventually crash”

    Is there any analysis as to the possible performance problems being caused by moving UDP based protocols to TCP based ones?

    Matthew

  89. Thanks very much for the information its great to learn about new things!

  90. Milo Bellow says:

    Im Thinking All This Faux Honesty From LL Might Not Be A “Cunning (PR) Stunt” After All.
    Perhaps Its A Blog Version Of Truth Or Dare?
    If So…What Was The Question?
    😮

  91. Blinders Off says:

    @84 I agree. And know what’s sad? I know a way LL could charge LOTS LESS for land and sims and still make money hand over fist. But for the genius visionaries they are supposed to be, it is amazing to me they can’t figure out how to provide virtual land within the range of the normal pocketbook. I mean, I figured it out, and I’m no rocket scientist. That’s my beef with them. They’re charging Porsche prices and delivering bicycle service. (Oh wait, bicycles usually operate correctly. Let me restate that… LOL)

  92. DR Dahlgren says:

    @ Kidd et Al

    Clearly, watching the way client and server code is released – with bugs showing back up that had been removed several versions back – shows that it is not “unfortunate wording” but a true problem.

    I cannot even count how many times, in the year-plus I have been a member, a roll-out has contained a bug that had already been fixed – and it later turned out that it was not a new bug with the same symptoms but, in truth, the same bug put back into a new release. The appearance is that the different pieces of code going into a release were not properly tracked, and that old code with the bug was used instead of the code with the fix.

    Simple human error, yes. Anyone who works with versioned files has probably done the same thing. It’s no biggie when you are trying to sort out which mod of some shirt is the latest you created and you are the only one working on it, but it is a major problem when many people are working on something and they all have different versions that appear the same according to the version data they have.

    Hope this all gets sorted. I do think that providing the details is a good start. It gives us some assurance that the process isn’t as totally haphazard as it often appears.

    DRD
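
    One common guard against exactly this kind of reappearing bug is to land a regression test together with every fix, so a build that reintroduces the old code fails instead of shipping. The fragment below is only a hypothetical illustration of that practice – the function, ticket number, and threshold are invented and are not Linden Lab code.

    # Runnable with pytest, or by calling the test function directly.

    def minimap_transport(avatar_count):
        """Pretend deploy-time decision: crowded regions keep mini-map updates
        on cheap, droppable UDP so reliability bookkeeping cannot pile up."""
        return "udp" if avatar_count >= 40 else "tcp"

    def test_svc_1234_minimap_stays_on_udp_for_crowded_regions():
        # If an older build without the fix is merged back in, this fails.
        assert minimap_transport(avatar_count=80) == "udp"

    if __name__ == "__main__":
        test_svc_1234_minimap_stays_on_udp_for_crowded_regions()
        print("regression guard passed")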

  93. Amanda Ascot says:

    I might be a little strange, Joshua, but I found your report to be most interesting. I understand enough of what was going on to appreciate the magnitude of this sleuthwork. I’ve stepped in before, when people were whining about things going wrong after an update, to argue in defense of the people responsible for managing this complex system you guys are keeping running, however shaky it might be at times.

    I appreciate your candidness about all this, and I think it points out, to those who actually attempted to follow what you wrote, just how complex the infrastructure for Second Life is. The average user just wants things to run smoothly. It isn’t that easy.

    I think it also points out, however, the need for something I’ve been calling for over and over: upgrades need to be done on Monday – not in the middle or near the end of the week. Wednesday Madness has become almost an in-joke among residents, who are then left hanging with an almost unusable Second Life for the rest of the week and through the weekend, until everyone at Linden Lab drags themselves in on Monday to get to work on it. Give yourselves the benefit of an entire work week of five full days to fix the things that will always be broken with an update, so that there is at least a chance that the weekends – when users are active with scheduled events and other users finally have some spare time – will be relatively trouble-free. Even if it’s a minor update, this is important. You never really know how a complex network of processes is going to react to a small change somewhere in the system. If ever there was a real-life test bed for Chaos Theory, Second Life is it.

  94. Reg Mannonen says:

    Wish I could get back the five hours of work I lost – and the chance to release my newest product on our grand opening day – because someone forgot to renew a certificate, lol, but thanks for the update nonetheless.

  95. U M says:

    Will the Dallas point ever have better stability than it has had over the past year? Please, please at least try.

  96. Hewitt Huet says:

    I’m gonna be the black sheep and say: what’s the point of a detailed log of cock-ups? Yes, it’s hard to maintain such a unique infrastructure… Yes, it’s difficult to manage some of the crap that goes on when words like “griefing” and “ageplay” creep into the lexicon… BUT, LL has the temerity to charge REAL $$ for this service, and if it were all free or beta or something, we’d laugh off these hiccups. Paying prem + mucho tier (or little tier) and hearing this stuff is depressing. Don’t forget, LL, you threw the doors open for free accounts, allowed unlimited alt creation, and cut your personal relationship with residents to practically nil in the process. If you don’t stop aiming that gun at your foot (voice, sculpties, borked XML-RPC calls) you’re going to hurt yourself.

    The XML-RPC server has been almost dead for weeks – so much so that LL reopened a bug issue on the JIRA and bumped it up to critical. It has not been addressed AFAIK, and almost every vending machine in SL used it… hence the huge change SL Exchange had to make in their WHOLE system (many, many kudos to Apotheus Silverman, SLX God, for reopening the issue).

    Less talk, fewer new-feature rollout parties, and more work on what’s broken is what we need – not CSI, or IBM, or signing up a million new residents a week. Please don’t make this more work for the residents than it needs to be.

  97. Doris Haller says:

    Thank you.
    I don’t think this is the transparency I need… I don’t understand all those things: TCP, UDP, costly…

    I understand that there were problems and you think you’ve learned something. Very good.

    I am also happy that you sound as if you think you caused many of the problems (and it was not my fault this time 🙂 )

    Next time, could you give us a bit more detail as it is going on? I mean not as detailed as you did now, just something like “we are letting the system settle for 30 minutes now” or “it became clear that this was not an anomaly; we are discussing contingency plans”, so we know what you are doing and can bash you.
    Sometimes, when the whole system is stuck and obviously something is wrong, I wonder if anybody at LL knows about it (as I understand now, you do not always).
    The status page would be the right place to show things like that, but it is too unreliable. How many times has it said the grid was down when I was logged in, and how many times has everything gone wrong while this page claimed everything was OK… This page should also have a “Veto” button, to report that I don’t believe what it says. I would press it in the cases I described above, and you could count how many residents press it. If there are a lot, check your system.

  98. Pingback: Tenth Life : Like watching a train wreck

  99. Max Kleiber says:

    Very clear and very welcome post.

  100. Simon Nolan says:

    Thanks, Joshua, for your explanation of what happened. I really appreciate the “What Have We Learned” section. Though I wasn’t in-world during any part of the upgrade, I would have missed the Linden messages that others mentioned were not sent out. I hope that is something that will be considered as well as a learning opportunity.

  101. Rooke Ayres says:

    Joshua,

    Thank you very much for the full explanation. Being one of those readers with a technical background myself, I can fully appreciate the insane three day triage of the situation that your technical teams went through. I’ve gone through a few of those myself, and they are NOT fun. 😦

    Let’s hope that everything learned during the updates will make SL a better place to live in in the future. 🙂

  102. Tegg B says:

    Thx guys for the info – don’t let it slow you down from trying, though. Stuff is always going to happen, and you learn from it every time it does, I guess.

  103. chrism mollor says:

    @77 Josh:

    “That’s a great sentiment. At least for now it’s not practical as we do want to keep the platform moving forward. ”

    Josh – about taking a day between releases – it isn’t a sentiment; it’s sound practical advice and really a requirement…

    We all know that sometimes we need to take a step back to move things forward. We also all know that after a stressy time we need to kick back a little before we proceed. We have to feel that we have accomplished something – and, more importantly, BE ALLOWED to feel we have accomplished something. That’s the human side of things…

    From the tech side of things…

    If you don’t stop for a while and get your house in order, things get confused and bad things happen. You’ve seen this before. The tech side needs to settle – it needs to achieve its regular operational state.

    Take that day to plan and interpret what you have seen. Use it to start getting control and understanding how changes on one side affect the other.

    Good luck with the next release, and we hope that you continue to post the post-mortems – REGARDLESS of whether they turn into a debacle or a storming success…

    Openness is the way forward – perhaps more so than anything else?

    Chris

  104. ZigZag Freenote says:

    Lindens, PLEASE, restart & upgrade more often, I need more downtime. For whatever reason. I am not getting ENOUGH SLEEP!!!

    How about doing the upgrades in two steps? First upgrade some quieter sims and some sims that users would opt in to (maybe for a discount). OK, with 4 sims per server this could be tricky to manage, and I see you want to roll out as quickly as possible. But is that a good strategy?

    Sure, it wouldn’t trigger all the problems (like the one with slow connection cleanup), but while you have a good beta cycle with RC previews, the same is not well managed for servers.
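
    For what it’s worth, the two-step idea above can be sketched as a simple wave plan. The following Python fragment is only an illustration of the suggestion – the region attributes, thresholds, and function names are invented and are not part of Linden Lab’s deploy tooling.

    def plan_waves(regions):
        """regions: iterable of dicts like {"name": str, "opt_in": bool, "avg_avatars": float}."""
        opt_in = [r["name"] for r in regions if r["opt_in"]]
        quiet = [r["name"] for r in regions if not r["opt_in"] and r["avg_avatars"] < 5]
        rest = [r["name"] for r in regions if not r["opt_in"] and r["avg_avatars"] >= 5]
        return [opt_in, quiet, rest]

    def deploy(waves, upgrade, wave_is_healthy):
        """Upgrade wave by wave; halt (for a rollback decision) if a wave degrades."""
        for wave in waves:
            for region in wave:
                upgrade(region)
            if not wave_is_healthy(wave):
                return False  # stop here instead of upgrading the remaining waves
        return True

    if __name__ == "__main__":
        regions = [
            {"name": "OptInIsle", "opt_in": True, "avg_avatars": 12},
            {"name": "QuietCove", "opt_in": False, "avg_avatars": 2},
            {"name": "BusyPlaza", "opt_in": False, "avg_avatars": 40},
        ]
        ok = deploy(plan_waves(regions),
                    upgrade=lambda r: print("upgrading", r),
                    wave_is_healthy=lambda w: True)
        print("rollout completed:", ok)

    The trade-off is exactly the one the comment raises: a slower, wave-based rollout surfaces problems on a small set of regions first, but it also means running mixed server versions for longer.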

  105. Jessica Hultcrantz says:

    Hey,

    Thumbs up for some decent communications from the lab for once.
    We love to get the full story, guys – please continue that way in the future, and we will all have a better understanding of the problems instead of just getting furiously mad about your lack of communication. *wink*

  106. Angsty Rossini says:

    Um, I was in the middle of a class with 15 students, and I have just crashed out and cannot log back in: “something unexpected has happened”.

    Anyone want to go inworld and apologise to my students at Strayling for “abandoning” them??

    *sheesh*

  107. Anderson Philbin says:

    This is the best and most informative blog post I have seen since I joined last February. Seriously, I wouldn’t seek employment with LL – I really couldn’t take the level of user abuse you have to put up with.

    I look forward to the post which explains what actions you’ve taken to prevent a recurrence.

  108. ZigZag Freenote says:

    http://blog.secondlife.com/2007/11/14/difficulties-logging-in/

    Wow, that was a quick response to my request :))))))

    I think some better status updates would really be beneficial.

    It says GRID STATUS: ONLINE
    Online Now: 8017 (and falling)

    * Grid status should be yellow, saying something like DIFFICULTIES or UNSTABLE (also ROLLING RESTART and such).

    * There should be a link to the status page somewhere in the upper right corner of the SL web site or in the main menu line. Or just below the L$ and Get Land links. Hell, above them. I know it doesn’t look good to a passer-by if the status is not green, but users would appreciate it.

    * In-world notices for new blog entries would be nice, plus notices when service-related entries are updated. I am considering making a group for this, but I suppose LL should provide it so it would be reliable and timely.

    As can be seen from the responses here, people will be much less aggravated if there is more interaction.

  109. john canter says:

    In love with this site – fantastic. You might get a kick out of this: Download Unlimited Xbox 360 Backups For Life. Check it out.

Comments are closed.