a guest
Jun 17th, 2018
As you may have noticed, we just had about 6 hours of downtime, caused by
hardware failure. It brought down all of our servers: running any process
on them caused them to lock up, with no illuminating logs or diagnostics.
This first affected our dev servers, which was annoying, and we were still
trying to isolate the cause an hour later when the same thing happened to
the game server.

We run the game under a debugger, which provides context and logging during
crashes and other issues; that context was missing when the servers died,
not so much 'terminating' as just 'going away'. We spent the next 20 minutes
trying to determine the cause, with a renewed sense of urgency, and all
indicators pointed to hardware failure. Normally, Amazon Web Services
automatically detects failures and lets you transition (relatively)
seamlessly onto new hardware. That autodetection didn't occur, and still
hasn't, suggesting a more significant underlying issue.

At that point, we realized we couldn't guarantee that cycling onto new
hardware would produce any better results, or that we even knew the scope
of the problem. We decided the best course of action was to start building
a parallel set of infrastructure in a redundant, partitioned AWS
availability zone. That effectively means starting over from scratch:
fresh files, security settings, config, everything. Slackie, one of our
behind-the-scenes 250s, started this undertaking and completed it about
2 hours later.

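For the curious, the kind of rebuild described above can be sketched with
the AWS CLI along these lines. Every resource ID below (AMI, security
group, instance, key name) is a placeholder, and the commands are an
illustration of the general approach, not our actual runbook:

```shell
# Launch a fresh instance in a different availability zone
# (here us-east-1b) from a baseline machine image:
aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type t3.medium \
    --key-name game-admin \
    --security-group-ids sg-0123456789abcdef0 \
    --placement AvailabilityZone=us-east-1b

# Check whether AWS has flagged the old instance's underlying
# hardware as impaired (the autodetection that never fired for us):
aws ec2 describe-instance-status \
    --instance-ids i-0123456789abcdef0 \
    --query 'InstanceStatuses[].SystemStatus.Status'
```

Because availability zones are isolated from each other's hardware and
power, a failure that takes out one zone shouldn't follow you to the other.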
When I returned home, our dev server had been up and stable for an hour,
and we copied the game persistence files over. We've brought the game up,
and the fact that it has stayed up long enough for me to write this news
gives us some cautious optimism. We apologize for the downtime, and while
we have no reason to expect it will happen again, if it does, we'll be
better prepared. Our redundant instance will cost some more money and
require some work to maintain and automate, but it will be worth it.

We plan on having a maintenance reboot in the next few days, and more
(scheduled) blog posts containing better news in a similar timeframe.

-Duende