Nov 20th, 2018
# Example Postmortem DD/MM/YYYY

**Date**: DD/MM/YYYY

**Author(s)**: Engineer 1, Engineer 2. These should usually be the people involved in dealing with the incident and its aftermath.

**Severity Level/Scope**: Outage level 1-5, as per the guidelines

### TL;DR
A short description, understandable by everyone in the company, of what happened and how the team is solving it at a high level.
### Impacted Services
- Service X was unresponsive to users in region Y
- Functionality Y was unavailable to all users
### Technical background
If necessary, technical details of the impacted services or failed resources, so that readers understand the context in which they failed.
### Root cause
A deep analysis that starts from the surface symptoms of the incident and works all the way down to a primary cause. This can usually be done using the [5 Whys](https://en.wikipedia.org/wiki/5_Whys). The investigation should strive to be [blameless](https://codeascraft.com/2012/05/22/blameless-postmortems/): it should find the faults in the system that allowed the incident to happen, not the human errors involved. Remember: processes fail, not people.
### Timeline
All times in CET:
- **9:00 am** First thing happened
- **11:00 am** We noticed X happening, so we pushed a fix to mitigate it
- **12:00 pm** Service returned to normal
### Lessons learnt
- Prevention
  - Item1
  - Item2
- Detection
  - Item1
  - Item2
- Mitigation
  - Item1
  - Item2
### Action points
| Definition | Team | Type | Priority | Reference |
| ---- | ---- | ---- | ---- | ---- |
| Set up firewall on deployment | platform | prevention | High | n/a DONE |
| Add CW alarm on elevated 499 errors | cx | detection | Low | JIRA Link |
| Restore user data from backup | cx | mitigation | Very High | JIRA Link |
### Statistics, logs and metrics
Graphs showing latency and downtime, links to log files, metrics that estimate the damage, and anything else the author deems relevant.