(NOTE: All dates/times are Pacific Daylight Time)
What follows is our post-mortem of last week's “find” database crashes. Knowing that some Residents want the summary while others want details, both the Crash Obituary and The Full Coroner’s Report follow.
The Crash Obituary
We noticed an unacceptable level of crashing databases on Wednesday evening. In order to repair the problem, we reluctantly disabled all queries. Starting Thursday morning, we upgraded the databases, but we did not provide a status update during the day, which was an oversight on our part. By Friday morning, the upgraded databases were stable, and we re-enabled search.
The Full Coroner’s Report
We refer to the database responsible for fielding search requests and a few other lookups as the “find” database. In actuality, there is a cluster of database hosts that serve as “find” hosts, with one “find” host active at any time.
In the event of a crash, a new machine becomes active, and the grid continues merrily along its way. Further, these database hosts are used solely for read-only lookups, and never store the primary version of any data. As a result, a single crash is not cause for panic.
Unfortunately, these “find” databases crash on a regular, but infrequent, basis. We consulted with the database software vendor, and the suspected cause was an older version of the database software (MySQL 4.1). The vendor recommended updating to a more recent version (MySQL 5.0). That update, including the work to deal with the resulting incompatibilities, has been an ongoing project. Early last week, we completed final testing of the code changes to support this migration.
Wednesday morning (humming):
A configuration change was made to the databases to improve full text search results (some existing “stop words” functionality was turned off). After opening up the grid again, we noted no negative repercussions. The grid hummed merrily along through peak load (around 2:00pm).
Wednesday night (concerned):
Starting around 10:30pm, the “find” database began crashing repeatedly. As mentioned above, isolated crashes have been happening for a while. However, the crash rate on Wednesday evening was excessive.
When a crash occurs, a new database host is tapped to become the “find” host, while the “crashed” host enters a multi-hour repair cycle. With the unexpected repeat crashing, we were potentially running out of database hosts suitable for use as “find” hosts.
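To illustrate why repeated crashes are a problem even with spares on hand, here is a minimal sketch of the failover behavior described above. It is a toy model, not our actual tooling: the `FindPool` class, the host names, and the four-hour repair cycle are all assumptions made for the example (the real repair cycle is simply "multi-hour").

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FindPool:
    """Toy model: a pool of "find" hosts with one active at a time."""
    spares: List[str]                  # hypothetical hosts ready to take over
    repair_hours: float = 4.0          # assumed length of the repair cycle
    active: Optional[str] = None
    repairing: Dict[str, float] = field(default_factory=dict)  # host -> hour it is usable again

    def __post_init__(self) -> None:
        # Promote the first spare so that one host is active from the start.
        if self.active is None and self.spares:
            self.active = self.spares.pop(0)

    def _finish_repairs(self, now: float) -> None:
        # Hosts whose repair cycle has ended rejoin the spares.
        for host, ready_at in list(self.repairing.items()):
            if ready_at <= now:
                del self.repairing[host]
                self.spares.append(host)

    def crash(self, now: float) -> Optional[str]:
        """Record a crash of the active host at hour `now`; return the new active host."""
        self._finish_repairs(now)
        if self.active is not None:
            self.repairing[self.active] = now + self.repair_hours
        # Returns None when every host is still in repair.
        self.active = self.spares.pop(0) if self.spares else None
        return self.active


pool = FindPool(spares=["find1", "find2", "find3"])
for hour in (0.0, 1.0, 2.0):
    print(hour, "->", pool.crash(hour))
```

In this model, three crashes within two hours leave no usable “find” host, because each crashed host is still repairing when the next crash arrives. An isolated crash every few days never gets close to that limit.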
The options were somewhat limited, namely
By midnight, the safest option was to disable the bulk of the queries (including search) going against the “find” database. The remaining database hosts would have minimal load and the spares would have time to recover. Unfortunately, even with search queries disabled, the active “find” database crashed again at 2:30am, and several times later on Thursday.
Thursday morning (bleary-eyed):
By morning, we had three new options:
Since we planned to upgrade the database software anyway and the required changes had already passed internal testing, we went with the last option, while making sure we had the resources to fall back to either of the first two if necessary. We performed final tests involving some regions on the main grid, then pushed out updated dataserver code to the grid an hour later.
Updating the dataserver affects every simulator machine that makes up the grid, but is invisible to Residents. In parallel, we started to upgrade one of the “find” database hosts to MySQL 5.0, and to bring more spares with the older code online as backups.
By this point, the team was in a “hurry up and wait” mode: nothing new to report, no changes. Our next step should have been to update Residents with a new blog post. Unfortunately, we failed to update the blog until the next morning.
Thursday evening (expectant):
By 10:30pm, a spare database completed a successful upgrade to MySQL 5.0, but this took longer than expected. Upon completion, we activated the database as the “find” database for the main grid, and most of the search queries were cautiously re-enabled.
Friday morning (relieved):
Since the “find” database remained stable overnight, the final remaining search queries were re-enabled on the grid.