At approximately 6.30 am PDT, we noticed a significant drop in concurrency. Members of the Operations team identified a serious problem that prevented logins and asset fetches. Specifically, squid (web caching software) was no longer running on the majority of simulators, due to a broken file permission. Since all of our HTTP traffic flows through the local squid process, no asset traffic could flow while squid was down. As a result, auto-save, asset saves and asset loads did not work.
Operations immediately began a workaround to restore these sims as fast as possible. This resolved a large part of the problem, and assets that had been queued up now proceeded through the proper path without loss. Yet we noticed some failures after the “fix”: a small percentage of assets were being lost on certain sims.
At about 9.30 am, we uncovered the actual bug. The Operations team fixed the script that starts up and shuts down squid, then redeployed it. This was much more successful, and we had consistent performance with no asset loss. Concurrency had begun to recover at 8 am and was fully recovered by 11.15 am.
This bug will not recur, and it helped the team identify opportunities to improve the bug-fixing process as part of our ongoing stability efforts.
We apologize for the inconvenience, and we appreciate your patience during the fix.
Have a great weekend!