- As you may have noticed, we just had about 6 hours of downtime, caused by
- hardware failure. This brought down all of our servers: running any
- process on them caused them to lock up, with no illuminating logs or
- diagnostics. This first affected our dev servers, which was annoying, and
- we were in the process of trying to isolate the cause an hour later when the
- same thing happened to the game server.
- We run the game under a debugger, which provides context and logging during
- crashes and other issues. This was missing when the servers died; they were
- not so much 'terminating' as just 'going away'. We spent 20 minutes afterwards trying to
- determine the cause, with a renewed sense of urgency, and all indicators
- pointed to hardware failure. Normally, Amazon Web Services automatically
- detects failures and enables you to (relatively) seamlessly transition onto
- new hardware. This autodetection didn't occur, and still hasn't, suggesting
- a more significant underlying issue.
- At this juncture, we realized that we couldn't guarantee that attempting to
- cycle onto new hardware would produce any better results, or that we knew
- what the scope of the problem was. We decided that the best course of
- action was to start building a parallel set of infrastructure on a
- redundant, partitioned AWS availability zone. This effectively means
- starting over from scratch, with fresh files, security settings, config,
- everything. Slackie, one of our behind-the-scenes 250s, started this
- undertaking and completed it about 2 hours later.
- When I returned home, our dev server had been up and stable for an hour, and
- we copied the game persistence files over. We've brought the game up, and
- the fact that it's been up long enough for me to write this news provides
- some cautious optimism. We apologize for the downtime, and while we have no
- reason to expect that it will occur again, if it does, we'll be better
- prepared. Our redundant instance will cost some more money and will require
- some work to maintain and automate, but will be worth it.
- We plan on having a maintenance reboot in the next few days, and more
- (scheduled) blog posts containing better news in a similar timeframe.
- -Duende