On Tuesday, Jan 30 2024, from 09:06 UTC to 14:29 UTC, PhantomJsCloud's "Public Cloud" suffered an outage lasting 5 hours, 23 minutes. This is our first outage in many years. Below is a postmortem of the outage and the steps we are taking to ensure these kinds of outages do not occur in the future.
Impact:
- Public Cloud: outage for 5hrs, 23 min. (Tuesday Jan 30 2024 from 09:06 UTC to 14:29 UTC)
No Private Cloud customers were impacted by this outage.
Events Timeline
- Unexpected dip in requests: A highly unlikely fluctuation in customer usage of our Public Cloud caused our system to reduce the number of servers handling API requests. This may have been due to a partial network outage at our ISP, as the request count returned to normal a few minutes later. Our ISP did not log any impacting events.
- Misconfigured (down)scaling: This dip in requests exposed a misconfiguration in our autoscaling backend, which set the minimum number of servers far too low (10). Before this incident, even during periods of "low" activity our Public Cloud service ran far more servers than this, so the drop to 10 left us with far less capacity than needed.
- Overloaded servers: With too few servers to handle demand, every server became overloaded and flagged itself as such.
- Overloaded servers misreported as unhealthy: Our heartbeat monitoring system did not properly distinguish between "unhealthy" and "over capacity", so these overloaded servers were flagged to be recycled (a sketch of this distinction appears after the timeline).
- Perpetual too few and overloaded: Each new server that came online was immediately overwhelmed by requests and flagged for removal. There was never enough capacity to clear the request backlog, so our entire Public Cloud service entered a recursive, pathological "too few servers" state.
- Monitoring detected the outage: Our monitoring system properly tracked the downtime, and the first alert was sent at 09:06 UTC. This matches our ISP data showing the traffic drop corresponding to the outage.
- Misconfigured on-duty pager: Because it had been many years since our last production outage, we did not notice that our Monitoring Provider had changed its alert emails. The new alerts come from a different sender address and use a different layout than before. Because of this, our paging system did not generate the SMS messages required to alert off-hour support staff.
- Off-hour Ops staff did not monitor emails: Our team is based in East Asia and North America. Off hours, an outside firm monitors our alert system for critical service outages. They expected outages to be reported via SMS, which did not happen, and they did not monitor the support mailbox where the email alerts were delivered.
- Discovery by team: At approximately 14:10 UTC, a developer working on a related project noticed the high number of unread messages in our support list. A senior ops staff member was notified and, after a short diagnostic of our autoscaling system, applied a workaround by manually increasing the server count. This resolved the outage, and by 14:29 UTC the Public Cloud service was operating normally again.
Solutions
Ensuring that our issue alert and escalation system works as intended is our highest priority. We have resolved the breaking changes made by our Monitoring Provider, and we have put monthly alert-testing workflows in place to ensure alerts work properly in the future. We are still investigating backup alert systems for redundancy.
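As a rough illustration of the failure mode and the fix (hypothetical names and addresses; our paging pipeline is not literally implemented this way), the old rule effectively required an exact sender match, so a provider-side change silently dropped every page. The revised matching is deliberately looser, and a scheduled monthly test alert exercises the full email-to-SMS path.

```typescript
// Hypothetical email-to-SMS paging rule; addresses and keywords are illustrative.

interface AlertEmail {
  from: string;
  subject: string;
}

// Old behavior (roughly): an exact sender match, which silently broke when
// the monitoring provider changed its sender address.
function shouldPageOld(mail: AlertEmail): boolean {
  return mail.from === "alerts@monitoring-provider.example";
}

// Revised behavior: match a set of known sender domains plus subject keywords,
// so a cosmetic provider-side change is far less likely to drop pages.
const KNOWN_SENDER_DOMAINS = ["monitoring-provider.example", "alerts.example"];
const ALERT_KEYWORDS = /outage|down|critical/i;

function shouldPage(mail: AlertEmail): boolean {
  const senderOk = KNOWN_SENDER_DOMAINS.some((domain) => mail.from.endsWith(domain));
  return senderOk || ALERT_KEYWORDS.test(mail.subject);
}
```

A monthly synthetic alert run through this path catches a silent break within weeks instead of during the next outage.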
We have made changes to our autoscaling system so that the pathological state described above cannot recur. This was solved both by increasing the minimum capacity and by adjusting our autoscaling algorithm to provision servers more aggressively.
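For readers curious what this looks like in practice, here is a minimal sketch of a scaling policy with a sensible capacity floor and asymmetric scale-up behavior. The numbers and names are hypothetical, not our production configuration.

```typescript
// Hypothetical autoscaling policy sketch; all values are illustrative.

interface ScalingPolicy {
  minServers: number;          // hard floor, regardless of observed traffic
  maxServers: number;
  targetLoadPerServer: number; // requests/sec one server handles comfortably
  scaleUpFactor: number;       // overshoot applied when adding capacity
}

const policy: ScalingPolicy = {
  minServers: 50, // illustrative: the old floor of 10 was far below real low-traffic demand
  maxServers: 1000,
  targetLoadPerServer: 20,
  scaleUpFactor: 1.5, // provision ~50% more than strictly required when scaling up
};

function desiredServers(currentServers: number, requestsPerSec: number): number {
  const needed = Math.ceil(requestsPerSec / policy.targetLoadPerServer);

  // When more capacity is needed, overshoot so the request backlog can actually
  // be cleared; when shrinking, never go below the configured floor.
  const target =
    needed > currentServers ? Math.ceil(needed * policy.scaleUpFactor) : needed;

  return Math.min(policy.maxServers, Math.max(policy.minServers, target));
}
```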
Work to improve our heartbeat monitoring system is in progress. We have made the adjustments needed to avoid this specific scenario; however, more analysis of how the heartbeat system contributed to this outage still needs to be performed, including more elaborate stress-testing scenarios.
Since Oct 2023, most of our development resources have been focused on a new-technologies push, modernizing the underlying infrastructure piece by piece. We will take the lessons learned from this outage and apply them going forward.
Conclusion
This is the first outage of PhantomJsCloud in many years. While our distributed infrastructure has strong resilience to failures, our "unneeded" alert workflow broke without us knowing. We are committed to ensuring a stable, high-performance service, and will continue to update PhantomJsCloud to the latest versions of Chrome while we improve and expand our offerings.
Thank you to those at Novaleaf who worked with us to quickly enact the solutions needed to ensure this class of outage does not happen again.