Post-mortem: “Find” Database Issues

(NOTE: All dates/times are Pacific Daylight Time)

What follows is our post-mortem of the “find” database crashes of last week. Knowing that some Residents want the summary while others want details, both the Crash Obituary and the Full Coroner’s Report follow below.

The Crash Obituary

We noticed an unacceptable level of crashing databases on Wednesday evening. In order to repair the problem, we reluctantly disabled all queries. Starting Thursday morning, we upgraded the databases, but we did not provide a status update during the day – an oversight on our part. By Friday morning, the upgraded databases were stable, and we re-enabled search.

The Full Coroner’s Report

We refer to the database responsible for fielding search requests and a few other lookups as the “find” database. In actuality, there is a cluster of database hosts that can serve as the “find” host, with one “find” host active at any time.

In the event of a crash, a new machine becomes active, and the grid continues merrily along its way. Further, these database hosts are used solely for read-only lookups, and never store the primary version of any data. As a result, a single crash is not cause for panic.
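
To make that concrete: the “find” hosts field read-only lookups of the sort sketched below. The table and column names are purely illustrative (they are not our actual schema), and the read_only flag is simply one standard MySQL way of keeping a lookup host from accepting writes.

    -- Illustrative only: the kind of read-only, full-text lookup a "find" host serves.
    -- Table and column names are hypothetical, not the actual schema.
    SELECT parcel_id, parcel_name
      FROM parcel_search
     WHERE MATCH(parcel_name, description) AGAINST('beach club' IN BOOLEAN MODE)
     LIMIT 100;

    -- One standard way to keep such a host from accepting writes:
    SET GLOBAL read_only = 1;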

Unfortunately, these “find” databases crash on a regular, but infrequent, basis. We consulted with the database software vendors, and the suspected cause was an older version of the database software (MySQL 4.1). The vendor recommended updating to a more recent version (MySQL 5.0). That update – including the work to deal with the resulting incompatibilities – has been an ongoing project. Early last week, we completed final testing of the code changes to support this migration.
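
To give a flavor of those incompatibilities, here is one widely documented MySQL 4.1-to-5.0 change (not necessarily one we hit): starting with 5.0.12, JOIN binds more tightly than the comma operator, so queries that mix the two can fail with “unknown column” errors in their ON clauses and must be rewritten.

    -- Runs on MySQL 4.1; on 5.0.12+ it fails with
    -- "Unknown column 't1.id' in 'on clause'" because JOIN now binds
    -- more tightly than the comma operator. Table names are hypothetical.
    SELECT * FROM t1, t2 JOIN t3 ON (t1.id = t3.id);

    -- A 5.0-safe rewrite groups the comma-joined tables explicitly:
    SELECT * FROM (t1, t2) JOIN t3 ON (t1.id = t3.id);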

Wednesday morning (humming):

A configuration change was made to the databases to improve full text search results (some existing “stop words” functionality was turned off). After opening up the grid again, we noted no negative repercussions. The grid hummed merrily along through peak load (around 2:00pm).
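
For the curious, disabling the built-in stop words is a server-level change rather than a query change. The values and names below are illustrative rather than our production configuration, but in MySQL 4.1/5.0 the mechanics look roughly like this:

    -- In my.cnf (ft_stopword_file is not a dynamic variable, so a restart is required):
    --   [mysqld]
    --   ft_stopword_file = ''   # an empty value disables the built-in stop word list

    -- Existing FULLTEXT indexes must then be rebuilt to pick up the change:
    REPAIR TABLE parcel_search QUICK;   -- hypothetical table name

    -- Sanity check after the restart:
    SHOW VARIABLES LIKE 'ft%';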

Wednesday night (concerned):

Starting around 10:30pm, the “find” database started to repeatedly crash. As mentioned above, isolated crashes have been happening for a while. However, the crash rate on Wednesday evening was excessive.

When a crash occurs, a new database host is tapped to become the “find” host, while the “crashed” host enters a multi-hour repair cycle. With the unexpected repeated crashes, we were potentially running out of database hosts suitable for use as “find” hosts.
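
For context, a “repair cycle” here means roughly what it sounds like. We won’t detail our exact tooling, but for MyISAM tables (the only MySQL engine with full-text indexes in these versions) recovering a crashed host generally involves checking and rebuilding its tables, which can take hours at this scale:

    -- Rough sketch of post-crash MyISAM recovery (hypothetical table name;
    -- the real process is more involved and largely automated):
    CHECK TABLE parcel_search;    -- detects corruption after an unclean shutdown
    REPAIR TABLE parcel_search;   -- rebuilds the data file and indexes; hours on large tables
    -- or, with the server offline, the command-line equivalent: myisamchk --recover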

The options were somewhat limited:

  • there was no evidence that the recent configuration change caused the crashes, given that the grid had survived peak load after the change;
  • a configuration rollback at that time would have required shutting the grid down for several hours;
  • meanwhile, there were limited resources (machines, time, people) available to do that work;
  • there was no way to tell whether the configuration change was beneficial; and
  • another configuration change could repeat the same problem (if the crashes were caused merely by making a configuration change, rather than by the new setting itself).

By midnight, the safest option was to disable the bulk of the queries (including search) going against the “find” database. The remaining database hosts would have minimal load and the spares would have time to recover. Unfortunately, even with search queries disabled, the active “find” database crashed again at 2:30am, and several more times on Thursday.

Thursday morning (bleary-eyed):

By morning, we had three new options:

  • roll back the configuration change and observe whether it had a notable effect (noting that the benefits wouldn’t be clear for 24 hours);
  • continue to swap database spares into play in response to crashes (since the crashes could easily have been coincidental); or
  • proceed with a more fundamental, but potentially risky, fix – upgrading the database software.

Since we planned to upgrade the database software anyway, and the required changes had already passed internal testing, we went with the last option, while making sure we had the resources to fall back to either of the first two if necessary. We performed final tests involving some regions on the main grid, then pushed updated dataserver code out to the grid an hour later.

Updating the dataserver affects every simulator machine that makes up the grid, but is invisible to Residents. In parallel, we started to upgrade one of the “find” database hosts to MySQL 5.0, as well as bring more spares running the older code online as backup.

By this point, the team was in “hurry up and wait” mode – nothing new to report, no changes. Our next step should have been to update Residents with a new blog post. Unfortunately, we failed to update the blog until the next morning.

Thursday evening (expectant):

By 10:30pm, a spare database had completed a successful upgrade to MySQL 5.0, though the upgrade took longer than expected. Upon completion, we activated that host as the “find” database for the main grid and cautiously re-enabled most of the search queries.
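
Before a spare like this goes live, the sanity checks look roughly like the following. This is a simplified sketch that assumes the host is a replication slave that has caught up with its source; it is not our exact runbook:

    SELECT VERSION();                  -- confirm the host is really running the 5.0 binary
    SHOW SLAVE STATUS;                 -- confirm replication is running and caught up
    CHECK TABLE parcel_search QUICK;   -- hypothetical table; quick corruption check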

Friday morning (relieved):

Since the “find” database remained stable overnight, the final remaining search queries were re-enabled on the grid.


103 Responses to Post-mortem: “Find” Database Issues

    1. LossAngeles says:

      >>Friday morning (relieved):

      LOL, Been there, done that !! Glad you got it sorted out 🙂

    2. Geeky Wunderle says:

      Awesome work guys, working in IT I could feel your pain as I read the entry.

    3. saijanai says:

      Good job with fix, bad job with reporting. Given a choice of which to choose…

    4. Foo Foden says:

      In my professional opinion…. Cool Beans.

      Besides, we knew it was down. Fun reading the narrative 🙂

      (string)Foo

    5. Oldsarge Dowd says:

      Later report but Very detailed, Good job with the fix….
      Thank you

    6. Dante Moreau says:

      thanks for working on this guys and for the info

    7. Seann Sands says:

      I too understand the pain of crashing servers- it happens…and it’s great to finally get real hard info about what happened…but, this still doesn’t answer the basic question: what about the 1+ days of in-world search downtime which equals thousands of real dollars in classified ads held within that search database? Will there be any reimbursement to advertisers? A few hours is one thing…but, days! Is there anything written in the TOS that addresses this?

    8. Dirk Felix says:

      This really doesnt provide the warm fuzzies and I find this team “ataboy we did it”, bitterseet. With all the warts that is Second Life, we’ll see how long this lasts. Work on creating a platform that cant be crushed when adding new features. Spaghetii is best served with marinara, not as a theory to code.

    9. Celierra Darling says:

      Thank you for the update, and thank you for acknowledging the status update / PR breakdowns! You guys are still on the hook for next time, though – hope those lessons stick with you. 😛

    10. ColeMarie says:

      It’s fixed? 0__o

    11. Nice of you to take the time and give a detailed report.
      In one of the updates in the future, it would be helpful to have a draw distance of 32m for meetings in a small area. This would help by easing the demand on the server for data that is not needed.

    12. Able Whitman says:

      Thank you very much, Everett! I don’t know what percentage of SL residents regularly read the Linden blog, but I suspect it’s a relatively small minority. As a regular reader, though, I really appreciate interesting and informative posts like this one. From your post-mortem, it’s obvious that a bunch of folks at LL worked like mad to solve a tough problem with a critical system in SL, and to you and everyone else who helped solve the problem: thank you very much!

      You mentioned that it was an oversight to go without an update on the blog for so long, but I think it’s entirely understandable that in the midst of your efforts, a blog post wasn’t foremost in your mind. As you’ve detailed, you were faced with several dilemmas, and given a choice between talking about them on the blog and actually trying to solve them, I’m glad you chose the latter.

      Of course, LL has no obligation to blog about the messy details of operations, but in the future, I do think that even a brief update to say “we don’t know exactly what’s going on but we’re working on it” goes a long way. Somewhere I heard the phrase “we fill the silence with our own insecurities”, and knowing that you’re up to something, even if you’re not even sure what it is, is certainly better than silence. After all, the only thing you have to gain from updates that just say “we’re working on it” is the fact that residents understand that you’re (still) working on it. And how bad can it be for people to know how hard the ops folks work to keep things running? 🙂

    13. Daniel Korro says:

      Nice to have a such detailed report but this confirms some concern. A this time you’re telling us the “high availability” (for search) is based on a SINGLE active find database for the whole grid and spares. Guy let’s start working on scalability … otherwise I expect new problems coming in the future.

      The grid is great but the growth rate will require better architecture than a “wait until it crashes and replace with a spare” strategy.

      BTW. Thanks for being trasparent in the explanation it’s always difficult to explain what’s going on.

    14. nubee says:

      Fine, however it would appear that there are more database problems. artifact prims to be precise.

      As admin of a sandbox that gets its fair share of prims, i’m now dealing with prims that have no owner, no creator, and are phantom.

      Artifacts of long deleted prims. Returning prims does not cause this. only deleting.

    15. Nyu Tamura says:

      I live in http://slurl.com/secondlife/Tatarstan/180/75/96
      Seemingly I have the empty parcel, but it says that I have 124 objects ( they are transparent )
      I cannot return them in any way

    16. Celierra Darling says:

      The bug about invisible objects is known to LL and is being worked on – see https://jira.secondlife.com/browse/SVC-242 . You can comment or vote for the issue there.

    17. Chris aka. sploosh ribble says:

      So, yea how long is this going to take i think i missed that from the announcement and does it mean like this caused us not to go on; because i typed in my password and i put it on remember my password and it wouldn’t let me log in. But by the way thanks so much for updating this =D

    18. martini says:

      Thank you for couple of things…..a) The report. It does show how much work goes into this type of thing and (as important, though it might not appear so at first glance)…b) A recognition that you should (slapped wrists) Have been making some comment during this issue. I hope that you will see that not putting out the small fire just lets the thing (and comments) get more out of control. It does seem like a lot of people were busy, but there will always be someone who can type even short comments into the blog.

      Well done and …..well…..well done 🙂

    19. Simon Kline says:

      I loved this article, thankyou i hope you got lots of sleep after that effort. I think it highlights how unpredictable software can be on a large system no matter how much testing anyone does. Lets hope this sticks.

    20. Sofia Westwick says:

      If you are running into invisible Prims. that have no owner can not be returned trageted or deleted, But you can run into them as if they are there or they are showing up on your Object list but have no name.

      I have ran across this alot recently. Currently the only fix for this problem is to reset the sim. This will clear the ghost prims out until next time.

      There will be next time so be ready to reset more then 1 time is short amount of time. Until Linden fix for this problem is put in if it is not included in the update Later today.

    21. Tactical Dagger says:

      Thank you for that detail in explaining the situation.

    22. Good info! Many thanks.

    23. Now this was a blog entry. 🙂 *Applause*

      This did not sound like a lot of fun, but it is appreciated. I’ve been wondering why PostGreSQL isn’t being used, actually, but I expect that has a reason.

    24. Babyy says:

      OMG they are totally lost here arnt they?

    25. nubee says:

      Celierra, nope. different bug. recent issue.

    26. SimonRaven says:

      Wow, thanks for the report, nice and detailed. I work in IT too, but not big iron, but us little guys still know the pain of a malfunctioning database or other such issues to cause many a headache. Awesome work grrls and boyz.

    27. Benja Kepler says:

      Well done – an informative blog entry.

      I look forward to the new network indicators of ‘merry’, ‘humming’ and the like.

      😉

    28. Elvis Orbit says:

      Last week was a hard week but good job and thanks for the post. I hope that this next update goes well.

    29. I still can’t understand why Second Life is using MySQL to store high availability information. This is not a criticism, only curiosity… it doesn’t really sound like a ‘wise’ choice to me.

    30. Sanderman Cyclone says:

      Good to hear the details about these problems, and the way you’ve fixed them. I would love more blog posts like this.

      btw. gratz with the database, I don’t care about the late update, people need to set priorities.

    31. coldFuSion says:

      Thanks for the detailed post.

      Would love to see updated details on how the upgrade is holding up since.

    32. Shadow P. says:

      Thank you for working on it. I’m glad to see that you acknowledged that it was a mistake not to update everyone, please do this in the future. 🙂

    33. mo dryke says:

      perfect: this the way you must always communicate with us
      next step? do it in real time.
      hope is back gain.

      /don’t want pretty skies
      // want decent support
      /// want stable platform
      //// don’t care about who made a sculptie

    34. Pingback: SecondLife - How To Make Money In Second Life » Search problem post-mortem

    35. Alice McConnell says:

      Very good report. Please, report things that go wrong this way. It makes us understand you guys and helps us not get mad or frustrated. Thanks.

    36. Steve says:

      Thanks for the update

    37. Little Ming says:

      To be honest, I am with LL on this one with the fix it now, tell them later tactic. While it left many of the residents scratching their heads wondering just what was going on, we’d rather you fix the problem than spend valuable time reporting every detail to us while in progress.

      At least you took the time to inform us afterwards, which I am content with. Thank you LL 🙂 and thank you Everett.

    38. Damanios Thetan says:

      Thanks for the (delayed) update.
      One question though, did you only upgrade the (read-only) ‘find’ database or also their source databases to MySQL 5.0?

      Ergo, could the current transactional issues (sim->database (ghosting), client->database (uploads/saving)) on the main databases be related to this upgrade?

    39. Yay, mySQL 5!
      Now let’s just hope the update in 6 hours goes well, now that we’re moving to new DBs!

      The one thing I really love about mySQL is the fact that it can act as a relatively platform independent frontend to other database systems.

      Good luck to the ops team for the rollout!

    40. Victor Komparu says:

      Agree with everyone who says that fixing the database is infinitely more important than updating the blog. We love to see the updates, but we love to see it working even more.

      Sounds like some good work all around. Please keep up the good communication work.

    41. Tegg B says:

      Thanks again for informing us, see we don’t always bite 🙂

    42. Kinzo Nurmi says:

      Now that you are using MySql5.0, does that mean that classifieds and the like will become UTF-8 compatible-meaning that we can use Unicode characters in our listings?

    43. Selina Greene says:

      Thanks so much for the detailed post – would love to see that for every big glitch we see. Just one question… how can a crash be both regular and infrequent at the same time? 🙂
      A refund for classifieds would be great if you could manage it!
      thanks again.

    44. todd action says:

      this is te sort of transparency we have been begging for. thank you for telling us in detail what has been going on

    45. Deira Llanfair says:

      Well done LL team – good work, good report – a professional job. 🙂

    46. Ann Otoole says:

      You disabled find rendering our paid advertisements useless for 2 days and failed to tell anyone about it. An “oversight”. Right. You owe a 2 day extension on all classifieds.

    47. theshiningclub says:

      I for one am still wondering if the outage in Find and Search will somehow be compensated to those of us who pay for their classifieds.

    48. Inigo Chamerberlin says:

      First of all I’m very uneasy that you see running a system that suffers regular query server crashes (to the point of instituting an automated substitution and recovery regime) as acceptable…

      Combined with, on such an obviously unstable system, your attempt at both a mySQL upgrade AND a change to the configuration of the search system simultaneously was asking for trouble.

      Even on a stable system it’s generally considered best to make one change at a time – it makes figuring out what broke the system this time SO much easier,,,

      Hopefully a lesson has been learned here?

    49. Issarlk says:

      Oh my god. I wouldn’t want to be in your shoes. We have enough problems with MySQL at our place with some of our clients E-commerce websites (And I figure those get less load than LL servers…)

      I can’t see why MySQL is used where studies have shown that it can’t handle load increase and crash and burns at a certain point. I would have used PostgreSQL from the start, had I been designing Second Life. I have yet to see a PostgreSQL server crash, even one at our place that gets a constant heavy load (because of the hardware, which is undersized).

    50. Sera Cela says:

      Now THESE are the posts that we want to see.

      Thanks LL 🙂

    51. miika says:

      Now this particular debacle becomes much clearer, and hopefully Linden will see a positive result in return in how the userbase looks at the incident.

      There’s still silence on the subject of extending/reimbursing classifieds though. Perhaps someone from Linden would like to figure out what their position is going to be on that during today’s downgrade? It may “only” amount to 2 days of lost advertising, but the knock on effect (inability for customers to reach stores/sims) has cost people, and even a token gesture by Linden to compensate the more quantifiable loss would go a long way towards restoring some goodwill.

    52. Dave says:

      Good luck with MySQL 5. We use it for some seriously heavy stuff and I’ve just been working with partitioning (http://dev.mysql.com/doc/refman/5.1/en/partitioning.html) which is ever so useful for large tables. Makes everything faster as long as you’ve got a clear index column to query since it splits the table up into separate files which can be anywhere. You may already have partitioning implemented. If not though, take a look. Might be worth a try.

      Dave

    53. Ann Otoole says:

      Linden Research Labs needs to extend the entire set of classifieds that were active during the outage by 2 days. If the database system were stable it would be a simple update query. However it is unwise to do this since the database architecture is clearly unstable. They need to grab all of the record id’s for these classified and then run a stored procedure to update each classified ad one at a time, committing after each 100 updates, and add 2 days onto each classified ad. Assuming of course they even have a database engineer on staff that knows what a stored procedure is *AND* knows how to write transactional procedures to ensure safe commits are verified. Given the sheer massive volume of database failures it is not logical to assume they have any credible database talent available.

    54. Vylixan Fallon says:

      Thanx for the report 🙂 reports are good /blogs too, silence is deadly 🙂

    55. Thank you for this detailed explanation! I can certainly empathise with the “nightmare” of MySQL crashing continuously just because somewhere, hidden in the code, is a tiny tiny bug that people will take (literal) years to find, and then all of a sudden go: “oh my, that was SO silly…” when you just suddenly find a way to rewrite a query subtly that will make MySQL never crash again.

      I certainly wish you from the bottom of my heart that you hit upon that tiny bug… one of these days 🙂 Since you have rewritten the code to go from 4.0 to 5.0, who knows, you might be much closer to a permanent solution. I seriously hope it’s the case!

      Issarlk, MySQL-bashing is a bit out of fashion these days. It was popular when MySQL’s popularity was on the rise, a few years ago, and everybody had “studies” to “prove” that RDBMS “X” was “better” than MySQL. MySQL is now mainstream; it means that it’ll basically handle everything you throw at it, and the only reason for it crashing, is mostly tricky queries that MySQL does not like, although they are SQL-compliant. Rewrite your application code, and the problems go away. In fact, that’s not different from any other mainstream database. To the best of my knowledge, all cases that I have ever gotten in contact with when people migrated from X to Y and suddenly found out Y to be “much more robust and stable” where all, without a single exception, badly coded queries, which Y managed to handle better while X didn’t like them.

      Stick to the RDBMS that you know well — and LL has been using MySQL for half a decade now! — and optimise your code for it, instead of changing RDBMS every other year or so in search for the “ultimate stability and robustness”. That is a chimera unworthy to search for; at the end of the road, you’re trading off years of expertise working on a product with all its quirks and tweaks for another one that has a new and different set of quirks and tweaks that you have to learn — and forfeit half a decade of experience.

    56. Kinzo — excellent point!!… That would be awesome!

    57. Damona Rau says:

      You did a good work, but this is your business… now you get applause like flightcaptian from the flightpassengers after a good touchdown, but why?!? it is there damn job…

      Much more importand is, what think Linden Labs about the tons of loss items and Money??? A damn good girlfriend lost 2 rental-systems, a lot of data and items for round about 40.000 disappears from her Inventory. But i know other peoples, they lost even more… much more. How you think to compensate this?

      Give us a stable system, and AFTER that, give us funny things, then you can get applause.

      I’m missing a statement for this.

      Damona

    58. Racal Hanner says:

      What an amazinginly informative report.Thanks guys.

    59. TT says:

      In a recent discussion with Cory Linden, someone asked why LL won’t upgrade their MySQL databases to 5.0, and Cory replied, in a condescending tone, that MySQL 5.0 is not stable enough for LL yet. The current blog post reaffirms that attitude by saying that upgrading to MySQL 5.0 “has been an ongoing project”.

      And yet, in light of the current problems, we are now told that not only is MySQL 5.0 suddenly stable enough for LL to use, but that it’s the instability of MySQL 4.1 (!) that caused all the issues in the first place.

      It looks to me as if listening to the resident (I forget who it was) who suggested upgrading to MySQL 5.0 a couple of months ago would have been far more productive and preventive than using the corporate “everything is under control” line on them. Don’t you agree?

    60. This sort of report is exactly the sort of post match analysis that makes the community a community. Intra-issue blogs would make a big difference too – even if it was just to say what functionality will be disappearing / appearing. That will be accepted well if the community believes that this sort of post will follow along once you guys get chance to breath.

      So good work, and keep up the open dialog.

    61. mcp Moriarty says:

      Here’s an idea to reduce database load. If I mute someone and they insist on sending spam anyway it ends up in my trash, and hitting the asset server anyway. That’s just stupid. It’s auto declined, send it to /dec/null as in lala land, not my trash. 😦

    62. Amara Barrymore says:

      This is all great and wonderful. I’m not one to complain about SL — cuz mostly I love it — but, when my cable is down for a day I get a credit. There should be something worked out for those of us who dump tons of lindens in Search since those days are not only wasted, but revenue is lost d/t absence of that utility. I know there’s nothing that can be done about revenue, but ad extension would be terrific.

    63. That sounds like a lot of hard work! Thanks, girls and boys!! I hope residents respect all the effort. When something works hardly anyone seems to think about it. When something breaks …

      Thanks again! L)

    64. Mel says:

      Now THIS was a good blog. It explained what the problem was, the decision making and the outcome and WHY things didnt just pop out fixed. Thank you so much for taking the time to write a coherent message that even non-techies could read and understand.

    65. Good writeup.

      Please consider changing when you first note something in the blog to when Search is disabled. Even an entry with comments turned off lets us know that LL is aware services aren’t working.

    66. I’m certainly with you tech guys and the nightmare, gallons of coffee et al. It’s clear that you guys had more important things to do than writing blog posts

      But on the other hand, in a company with 200+ emloyees (or so I’ve heard) and some of them working in customer communications, there must have been *someone* to have a free hand to keep people updated and informed.

      In fact I think a couple of post here (“we’re aware of it, we’re about to fix it, we’ll keep you posted, it should work now”), would have saved you thousands of related reports through the feedback and bug channels.

    67. Chris Anthony says:

      Wow – the horse has bolted, but good that you’ve shut the stable door anyway. People were left screaming for two days about missing search functionality with nothing but silence from your end. A post-mortem is just that – post the event. Updating folks on, at the very LEAST, your knowledge of the problem would have taken, at the very most, 5 minutes. These “over-sights” you mention are a real cause for worry – I hope you’re doing something about that – like established, published & distributed procedures would be a good start. For instance, in a previous blog statement it was written that a bug fix got “over-looked” & didn’t make it in the update. That tells me (among other things) that the bug wasn’t tested before the update was released – bad bad procedures.

      And you know – with the grid not being stable for any appreciable amount of time recently – do you think it’s wise to go ahead with this upgrade today? Well, clearly you do. I’m sorry for sounding so pessimistic, but I don’t have much to inspire me with confidence. Please, please surprise me !!

      Here’s hoping everything goes well with the upgrade today. Best of luck !!

    68. william Fish says:

      JUST POKING FUN REALLY… I FIND IT FUNNY….. poke @ LL

      “The Crash Obituary

      We noticed an unacceptable level of crashing databases on Wednesday evening. ”

      so what your saying is there’s an ACCEPTABLE level of crashing databases? 😛

      “Starting Thursday morning, we upgraded the databases,…” so this is why some (alot) of us had password “glitches” and had to reset our passwords?

      All n all im pleased with this post (minus leaving out the password glitch which still has me burning due to no response what so ever from a linden… oh there was robins responce but it was not public)

      Thanks Everett Linden for keeping us posted with how things are going and what lindens are doing. We need more of this to keep the masses at bay.

    69. Simba Fuhr says:

      MYSQL ??? *laught*
      thats not your serious !!

      mysql for more than 30 thousand requests per second ? OMG

      thats why second life crashes….

      USE AN OTHER DATABASE !!!!

    70. Moose Maine says:

      Dammed if you do, Dammed if you dont! As another resident that works in IT knows, these things take twice as long as expected and that even a well thoughtout backout plan is sometimes not perfect. The fact that Everett and the team went forward with a positive attitude shows that LL has a dedicated team we should all be proud of. There will always be those that are never happy, but for the other 98 percent of us, we thank you for your dilligence! Great Job!

    71. I’ve been involved with large IT systems before, and not being privy to the internal details of this one (beyond the tiny details that have been shared), I’m not going to play backseat driver. I did want to take a moment to add my voice to those thanking you for the increased level of transparency this blog entry shows, though. I hope it continues.

    72. Flack Quartermass says:

      Thanks for keeping us informed.

      Even detailed bad news is better than reading tea leaves (or conspiracy theories). Please continue to be forthcoming, it’s appreciated.

    73. Lorne Shepherd says:

      Re: #4 – “Spaghetii is best served with marinara, not as a theory to code.” Spoken like a true end-user, Dirk. 😦

      I’m Director of IT Operations for a large organization in RL, over 30 years of experience, so I have some feel for the size of this problem. Good jobs, Lindens. Thanks for the update, and kudos on solving the problem.

    74. The XO says:

      Nice work guys, hats off to you!!!! MySQL database crashes are a pain in the backside – big time!

      This in conjunction with different versions on different hosts on the clusters, while keeping the grid running (even minimally) plus pushing out a code update, along with modified queries/settings for the different versions of MySQL.

      One word: awesome! I can’t believe you managed to keep the grid up during this process. If it were down to me, I would have closed it and concentrated on repairs.

      To be honest, it wasn’t too bad anyway. A few relogs, a trip between sims… a bit of patience and seeing the odd grey texture… apart from that nothing I couldn’t live with on a temporary basis. Once you’ve been in SL a while you’ll pick up the “tricks of the trade” in order to make things work during problem times. They’re infrequent and can be lived with.

      Once again, nice work guys!

    75. Slovar Flossberg says:

      I only wanted to thank you for this quite interesting read. A few month ago I have been active in a department that had to deal with big databases which contained critical information (banking business) and I can imagine your situation.

    76. Chrysala says:

      Now this was a serious blog entry.. the kind of rundown of what happens that we need. Would be better if it were play-by play rather than post mortem, but nice to see anyway! Can we get this kind of information flow on all issues currently of concern? Might help that dropping retention rate. Anyway, Congratulations on defeating this particular monster!

      93/93

    77. Lord Humphrey says:

      I too welcome this informative blog post! It makes a change, and a change for the better!

      I do think however (before you all sit back and relax too much 😉 ) that you need to build on this and now inform us all what you intend to do to compensate us for our two days lost classifieds!

      If you charge in world for a service you need to be able to deliver…..or compensate if you don’t. Heck….if my phone service goes down not only do they not charge me but they pay me for every 24 hours that it is unavailable.

      I was about to make an analogy with my own in world business then, but considered this would amount to little more than free advertising on the blog …….. does my classified fee cover that? 😉 😉

      Thanks for the post mortem……now lets see what we get out of the wake 😉
      Best wishes!

    78. Angel Horner says:

      Great post guys and great work!

      @46 The update today is to repair problems to make it more stable, hardly any new features are being added. Suggestion: Read the notes before you blurt out uninformed opinions. It just makes you look ignorant.

    79. Scarlett Glenelg says:

      Good post ! Thanks ! But … omg what technology ….
      (As an Oracle DBA on *IX systems in rl) … im flabbergasted about LL accepting database servers crashing repeatedly on whatever basis …. if im not wrong this is 2007 AD ….

    80. Beezle W. says:

      Ooo, a blog entry that actually tells us what’s going on?

      More, please.

    81. ari blackthorne says:

      HAH! There are *still* a few whiners and complainers – sheesh – you whiners go make your own system. LL knows what they are doing. If you don’y like what they use and how they do it – don’t talk about it – do it yourself.

      I am happy to see the majority here are giving kudos!

      Yeah – me, too. Cheers LL. Even though you didn’t post to the blog, I was one of the *few* who still had faith in you guys!

      Too bad it is so easy to be negative and most posts on this blog are the same old whiners, whining about the same old things, over and over again.

      So, to all you whiners (most of whom I don’t see in this article – go figure) – kwitcherbitchin and move on. LL knows of the problems and they want it all fixed as badly as you – all of us do 😛

      Now, if only there were more votes for the texture alpha layer problem…

      LOL

    82. Jenny Carlos says:

      Well not so sure about if its really “Fixed” or not but I would think that updateing the software cant hurt and probably in most cases improves things.
      Good to see they are doing some things to improve and lets hope they didnt find someway to botch that up too lol.

      Thanks for your efforts.

      BUT lets not upset us all with this update your doing now 🙂

    83. Kathy Amsterdam says:

      Why exactly are you running anything on MySQL?? Use your cashflow and upgrade to SQL, ORACLE or even Linux. Regardess of what they pump on their website…they are NOT the most reliable. Its a well known fact in the development community that it has a breaking point….and I think you have found it.

      Now…when are you going to fix the high packet loss in every region?

    84. Merchant of SL says:

      I am impressed LL. I, for one, would like to see more of these detailed reports… Maybe as the issue is occuring instead of days later after my sales are suffering and i’m cursing your names… lol But it’s a step in the right direction. I look forward to the update today to fix some of the bugs, mostly because I need to clean, but some of the bugs you’re fixing today are much needed! Thanks again!

    85. Ann Otoole says:

      @55… Scarlett I too am a high end Oracle resource. But I cannot imagine what we would have to pay for end user licenses to use Oracle. There is no way Linden Research could afford a real database system without direct pass through of the licensing and support costs to the end users.

      humor()
      {
      So they use a freebie.
      One with full permissions no doubt.
      Perhaps we will see SL in a BIAB soon.
      }

    86. Celty Westwick says:

      I too am glad the problem is resolved. However, I do believe LL should recompense those who pay real money for classifieds that were not delivered for two days.

      If you paid for an ad in any media venue and it was not run you would either have it run later or be refunded your money. Those ads have gotten more and more expensive, and simply ignoring that you did not provide what thousands paid for is hardly acceptable.

      The direct cost of the ads alone for some of the higher payers was over $100 U.S., and few can afford to simply eat that loss. But the true cost was not only in undelivered ads but much more in untold lost sales as a result. I’m not a big SL tycoon, just a small, relatively new business owner (7th month), but regardless of scale, the impact is significant in trying to meet expenses and have some margin left at the end of the month.

      This is a request for LL to realize that equitable business practices are necessary for SL to succeed in the long run, to keep the confidence of users and avoid people seeking remedies in the courts, as is already occurring in other matters. Fairness and following basic business principles is not too much to ask.

      Problems and failures in systems do occur, but when you do not deliver what is paid for, it’s incumbent on a business to accept responsibility and make it right.

    87. Lucius Obviate says:

      @4

      Perhaps I am missing something but I fail to see how your post was anything more than the same petty griping that usually accompanies blog entries. We are all certainly entitled to complain and owning two businesses in SL. It irratates me that search was down however, having knowledge of the industry I know full well how difficult to maintain large scale databases can be. When that service rep puts you on hold while verifying your information because your ISP is having service issues it isnt because he is looking up your issue. It is more than likely because the database is being slow to respond or otherwise being completly unresponsive at the time. The simple point is that while its all good to complain and point out shortcomings, unless your actually offering something constructive no matter how well spoken you are it is nothing more than the typical “zomg LL you sux0r” type posts.

      @30

      Linden Labs owes you nothing. Taken the time to read the TOS of the company you are demanding reimbersment from before you do. If there is no mention in the TOS of such reimbersment for outage of service then your claims while economically sensible are unfounded and have nothing to support them. There are people who spend tens of thousands of L$ to have their classifieds posted and still you are the only person here demanding payment. The long of the short is you are not entitled to a single L$ of reimbersment and furthermore your general attitude of condescension is not one that would persuade someone to want to help you especially when it is not under any circumstances required.

      @48

      Thank you for illustrating a text book definition of the kind of post reffered to in my response to #4.

    88. Thanks for the explanation. Definitely appreciate the insight.

    89. sirhc DeSantis says:

      Yeah add my $0.02 ( and isn’t it quiet here for a whinesday ). thanks for the post mortem details LL. a lot of us are wireheads and understand what you went through. but please do this more often. if i know which parts/subsystems of the world are borked i can still play – even just practicing scripting/building/whatever. next time let us know as it happens. no need to keep comments open, just a blow by blow as you take a quick smoke break (or whatever). still sterling work though 🙂

    90. Vic says:

      Thanks LL for a full update on what happened – it was very worrying. I hate to agree with the normal ‘we pay for this service’ moaners – afterall, we all know the system suffers instability and yet remain subscribed. However, we did pay for classified ads and parcel directory fees that were unavailable for 2 days. I trust however that you are working on the task of giving pro-rata refunds as you have in the past.

      Please please please can I ask you to work on stablising the system before adding any more new features. New features are wonderful but with the current bugs they are not appreciated. I have to say the best version was 1.14, my group IMs haven’t been the same since 1.15 went live.

      Well done for providing us with SL, bottom line is we all love it – that is why we are all here…

    91. Lucius Obviate says:

      @60

      Again, the TOS does not bind LL in anyway to comp you for anything spent. Why? Because you are not required to spend real money in order to post them. That is a voluntary decision that is made on your part. I make enough money in my business to pay portions of my bills in RL and to upgrade my computer to better support my activties in SL and the other place I go to game and code. I have never once paid a red cent for L$ and I manage to own my own parcel which I maintain a store on, thirteen individual stalls through which I sell my wares. As far as SL goes I want for nothing and yet I manage to not spend a single USD.

      Your complaints are unfounded at best. You are expecting Linden Labs to comp you for a decision you made to voluntarily spend your own money. That would be no different than you coming to me and expecting me to comp you for your “losses”.

    92. Blinders Off says:

      With no intent on being negative, I noticed the following statement:

      ” The vendor recommended updating to a more recent version (MySQL 5.0). That update – including the work to deal with the resulting incompatibilities –has been an ongoing project. Early last week, we completed final testing of the code changes to support this migration. A configuration change was made to the databases to improve full text search results (some existing “stop words” functionality was turned off).”

      In more direct terms if I understand this properly, the vendor recommended upgrading to 5.0.

      Instead of doing that, someone decided to make a database change to “improve full text search results”… which change seems to have resulted in accelerated crashes of the database servers.

      I appreciate you burning the midnight oil rather than just shutting down. But that burning was apparently the result of yet another code implementation without sufficient testing. While I understand there are some tests that cannot be performed other than on the full-bore main grid, still, when someone has already-existing database issues, it would seem that altering code with an already-known-to-be-flaky SQL software version would be a questionable decision.

      A more sensible solution may have been to leave the code alone, and install SQL 5.0, like the vendor recommended.

    93. Dajobu Ling says:

      This is probably my favorite blog. I’m in IT like #1 and definately empathize.

      I’m glad everything went well bringing things back online with the upgrade. Great work. Keep it up!

    94. Lourdes Laysan says:

      Regarding what Celty said in post 60, this explains why most fellow shop owners I had talked to in the last few days were experiencing a sales slow down or stop over this period… I have to agree, that the damage to the economy was noticed, at least by those around me.

      And more so, I feel Celty’s pain, being a shop owner of about the same time period. I’m sure there’s some threshold when this sort of thing is less worrisome for expense meeting, but at the mid ranges where we probably are, it is significant and quite noticeable.

      I don’t know if there’s a function on the back end that extend the sales ad length universally, and if there isn’t that is problematic from a customer point of view (of buying ads).

      What would be more of a nightmare, however, would be going in and refunding two days of the amounts of weekly ad pay for every user -manually-… as this would take time away from code management and user requests and simply be massive. Even if you wrote a program to scan and do it, that coder could have been working on something else that may make the grid better.

      So I don’t know what the solution is to compensate for potentially lost sales if something like this happens again. Minimally a code update to extend the length of an Ad an extra X number of days would probably be the most effective use of time and personnel in my opinion.

      Well those are my two cents ont he issue, a little understanding of the overall issue now and a suggestion for a backend code update. Now, lets get back and read what others are saying while we await the grid come up in a few hours.

      Ciao!

      ~Lourdes

    95. good response Gwyn!!! You are so on point… Mysql has been around and tested and blown up and fixed …. It would be crazy to switch now… Though I made the switch to 5.0 months ago….

    96. Lourdes Laysan says:

      Just a quick additonal;

      Celty’s post is now @85 not 60 as I had posted at the time..(guess others were also commenting in the intervening time.)

      Second I think it should also be noted that we are happy with the owrk put in to stabilize and test the fix before it went live. Overall support of SL by the Linden’s, in every instance I’ve seen, has been superb. And when Linden does occasionally drop the ball (as we all sometimes will) they own up to the problem and admit their mistakes (somethng sorely missing in most other companies in this world).

      Overrall I’m very happy with SL, considering the amount of code that goes into this place I’m constantly amazed it works as well as it does.

      So just so we’re clear, my comments re Celty’s issue, are aimed more at the point of view of a shop owner dealing with a particular issue from a customer of services perspective. On the IT side, a business I have been in often, my opinion is clear, LL is doing a bang up job and providing a place where we can all play, run businesses, extend our real life into the virtual, provide *meaningful* interaction with people all over the world.

      Are there things to be improved… sure. But that will always be the case with anything. Its evolution 😉

    97. Onyx says:

      Is it possible to fix the “search” facility so that it can “find” groups or people whose names contain 2 characters at the beginning? For example, a friend of mine has a group called “U2 Family” when you try to find that group you get a “no results” in search in both group search and all search.

    98. Katrina Bekkers says:

      For once, a very good blog entry, with lots of details, humor, and lack of that “our way or the highway” arrogance we all love to hate.

      Thank you for your informative post, Everett. Better late than never.

      A couple of things left me a bit… Uhm… Baffled. First, is using MySQL 4.1, that’s sadly known as the Bugged MySQL. I guessed you already were on 5.x since ages, given your setup and problems. Well, this gives new hope, at least.

      Second, how you treat crashes. Accepting them is… Well… Absurd, in a mature IT environment. Working around them with a list of “spares” that should have recovered from the previous crash while the live server is going down really leaves me speechless.

      What about a good load balancing frontend? A true replication cluster (MySQL 5.x can do it decently, even if not with the performance and features of – say – Oracle Parallel Server)? Having a line of soon-to-crash failover servers taking care of the just-crashed live server’s queries is IMVVVHO suboptimal at best, and a colossal design sinkhole in practice.

      Well, probably you just worked with what you had. So I stop ranting, suggesting stuff you probably thought about and discarded, or have nobody able to take care for, and renew my thanks for a good blog post in AGES.

    99. Captain Noarlunga says:

      SO CAN SOMEONE TELL US WHY THE PACKET LOSSES ARE SO BAD JUST LATELY PLEASE?

    100. Appreciate the update and detailed report of what went wrong, and how you resolved the problem. Can’t ask for more than that.

    101. Pingback: The Second Life Grid Grind » Blog Archive » Break down of the break down on the blog

    102. Now THIS is the type of post I like to see. THANK YOU!! 🙂

      My Only Question/Suggestion:

      Why wasn’t the “Communications Monkey” (see Torley’s Dec 8, 2006 post) keeping us up to speed on this? I’m assuming LL isn’t using the CommMonkey idea anymore, because if you were, this “oversight” wouldn’t have happened.

      If you have indeed dropped the CommMonkey idea, PLEASE bring it back. As Torley wrote, “In Second Life, communication’s also needed to actively show you we’re listening AND responding, and we treasure you being here with us.”

      Good luck, Lindens. 🙂

      – – – –

      #84: “Why exactly are you running anything on MySQL?? Use your cashflow and upgrade to SQL, ORACLE or even Linux.”

      Uhmm, what?! Obviously, Kathy, you have NO CLUE what you’re talking about.

      For starters, THEY ALREADY RUN LINUX!!! =P

      MS SQL (I’m assuming that’s what you’re suggesting when you suggest that they “upgrade to SQL”) is PAINFULLY SLOW, even in small deployments. I don’t want to think about how it would perform on the scale of SL, not to mention that MS SQL only runs on Windows Server.

      Oracle is good, but extremely expensive; correct me if I’m wrong, but the licensing for something like SL would likely run in the millions of dollars. LL may be profitable now, but not quite THAT profitable.

      Plus, as Gweneth (#56) said, switching to another database engine will introduce a whole new set of problems. First off, they’ll have to re-learn a whole new set of quirks and workarounds, since EVERY database has its bugs and other oddities. On top of that, it’s very likely that just in the process of re-writing the code, new bugs will be introduced as old ones are fixed. In short, it would be a lateral move, at best.

      Just remember, it could be a lot worse. I work in a small medical office whose software uses a MS JET database (essentially a shared MDB file sitting in a dumb file server- all db operations client-side). Talk about fragile and SLOW, even on out gigabit network. It breaks a lot. XD

      Take care!

    103. Whoops, that should be directed at post #83, not #84. 😛
