Apr 21st, 2017
# Chapters 10-15 of Site Reliability Engineering

## Possible discussion topics:

- Thoughts on this recommendation: on-call work should balance quantity (the percentage of time spent on on-call activities) against quality (the number of incidents that occur during a shift).
  - Quantity: Spend at least 50% of time on engineering; of the remainder, no more than 25% should be spent on-call.
  - Quality: If too many incidents occur during a single on-call shift, the SRE will not have time to properly carry out incident response duties such as root-cause analysis, remediation, and follow-up activities like writing a postmortem and fixing bugs. Google found that these activities take about 6 hours per incident on average, which implies a maximum of 2 incidents per 12-hour on-call shift.

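To make the numbers above concrete for discussion, here is a minimal Python sketch of the arithmetic. All function and constant names are made up for illustration. It also assumes the 25% on-call cap is read as a share of total time, which is only one interpretation of the wording above; pass a different `max_on_call_share` if your team reads it differently.

```python
# Figures come from the notes above; every name here is illustrative.
MIN_ENGINEERING_SHARE = 0.50  # at least 50% of time on engineering
MAX_ON_CALL_SHARE = 0.25      # ASSUMPTION: cap read as a share of total time
HOURS_PER_INCIDENT = 6        # average handling effort per incident
SHIFT_HOURS = 12


def max_incidents_per_shift(shift_hours=SHIFT_HOURS,
                            hours_per_incident=HOURS_PER_INCIDENT):
    """Whole incidents that fit in one shift at the average handling time."""
    return shift_hours // hours_per_incident


def shift_is_overloaded(incident_count):
    """True when a shift saw more incidents than can be handled properly."""
    return incident_count > max_incidents_per_shift()


def time_split_ok(engineering_hours, on_call_hours, other_ops_hours,
                  max_on_call_share=MAX_ON_CALL_SHARE):
    """Check the quantity guideline against hours logged over some period."""
    total = engineering_hours + on_call_hours + other_ops_hours
    if total == 0:
        return True  # nothing logged, so nothing to violate
    return (engineering_hours / total >= MIN_ENGINEERING_SHARE
            and on_call_hours / total <= max_on_call_share)


if __name__ == "__main__":
    print(max_incidents_per_shift())   # 2
    print(shift_is_overloaded(3))      # True
    print(time_split_ok(80, 40, 40))   # True: 50% engineering, 25% on-call
```

A team could feed its own shift logs into something like `shift_is_overloaded` to see how often the 2-incident ceiling is exceeded in practice.
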
- One Emergency Response recommendation was to intentionally break your systems to see whether they fail in the way you expect.
  - Does anyone want to share experiences of doing this at Pivotal?

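As a conversation starter for the "break it on purpose" topic, here is a small, hedged Python sketch of a failure drill. `FlakyBackend`, `Service`, and `run_failure_drill` are hypothetical names invented for this example, not anything from the book or from a real system. The drill states an expected failure mode (serve stale cached data when the backend is down, and fail loudly when there is no cached copy), then checks whether the system actually behaves that way.

```python
class FlakyBackend:
    """Hypothetical dependency that can be told to fail on demand."""

    def __init__(self):
        self.broken = False

    def fetch(self, key):
        if self.broken:
            raise ConnectionError("injected failure")
        return f"fresh:{key}"


class Service:
    """Hypothetical service expected to fall back to its cache on failure."""

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}

    def get(self, key):
        try:
            value = self.backend.fetch(key)
            self.cache[key] = value
            return value
        except ConnectionError:
            if key in self.cache:
                return self.cache[key]  # expected mode: stale but available
            raise  # expected mode: no cached copy, so surface the error


def run_failure_drill():
    backend = FlakyBackend()
    service = Service(backend)
    service.get("user-1")     # warm the cache while the backend is healthy
    backend.broken = True     # intentionally break the dependency
    degraded = service.get("user-1")
    try:
        service.get("user-2")
        uncached_failed = False
    except ConnectionError:
        uncached_failed = True
    return degraded, uncached_failed


if __name__ == "__main__":
    print(run_failure_drill())  # ('fresh:user-1', True)
```

The useful part of a drill like this is the mismatch case: if the observed behavior differs from the expectation written down beforehand, that gap is the finding.
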
- How can we build a stronger postmortem culture?
  - One suggestion: "In a monthly newsletter, an interesting and well-written postmortem is shared with the entire organization."
  - Does CF keep a canonical list of past incidents?
  - Should engineering teams be required to review this list?