a guest
Jun 17th, 2018
As you may have noticed, we just had about 6 hours of downtime, caused by
hardware failure. It brought down all of our servers: running any process
on them caused them to lock up, with no illuminating logs or diagnostics.
This first affected our dev servers, which was annoying, and we were still
trying to isolate the cause an hour later when the same thing happened to
the game server.

We run the game under a debugger, which provides context and logging during
crashes and other issues; that context was missing when the servers died,
not so much 'terminating' as just 'going away'. We spent the next 20 minutes
trying to determine the cause, with a renewed sense of urgency, and all
indicators pointed to hardware failure. Normally, Amazon Web Services
automatically detects failures and lets you transition (relatively)
seamlessly onto new hardware. That autodetection didn't occur, and still
hasn't, suggesting a more significant underlying issue.

At that point, we realized we couldn't guarantee that cycling onto new
hardware would produce any better results, or that we even knew the scope
of the problem. We decided the best course of action was to start building
a parallel set of infrastructure in a redundant, partitioned AWS
availability zone. That effectively means starting over from scratch:
fresh files, security settings, config, everything. Slackie, one of our
behind-the-scenes 250s, started this undertaking and completed it about
2 hours later.

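For the curious, the kind of rebuild described above can be sketched with
the AWS CLI along these lines. Every resource ID below (AMI, security
group, instance, key name) is a placeholder, and the commands are an
illustration of the general approach, not our actual runbook:

```shell
# Launch a fresh instance in a different availability zone
# (here us-east-1b) from a baseline machine image:
aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type t3.medium \
    --key-name game-admin \
    --security-group-ids sg-0123456789abcdef0 \
    --placement AvailabilityZone=us-east-1b

# Check whether AWS has flagged the old instance's underlying
# hardware as impaired (the autodetection that never fired for us):
aws ec2 describe-instance-status \
    --instance-ids i-0123456789abcdef0 \
    --query 'InstanceStatuses[].SystemStatus.Status'
```

Because availability zones are isolated from each other's hardware and
power, a failure that takes out one zone shouldn't follow you to the other.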
When I returned home, our dev server had been up and stable for an hour,
and we copied the game persistence files over. We've brought the game up,
and the fact that it has stayed up long enough for me to write this news
gives us some cautious optimism. We apologize for the downtime, and while
we have no reason to expect it will happen again, if it does, we'll be
better prepared. Our redundant instance will cost some more money and
require some work to maintain and automate, but it will be worth it.

We plan on having a maintenance reboot in the next few days, and more
(scheduled) blog posts containing better news in a similar timeframe.

-Duende