bonsaiviking

How Complex Systems Fail

Feb 27th, 2013
Original PDF at http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
Kerning is so bad as to make the document unreadable. Reproduced in text below.
---------------------------------------------------------------------------------


How Complex Systems Fail
(Being a Short Treatise on the Nature of Failure; How Failure is Evaluated;
How Failure is Attributed to Proximate Cause; and the Resulting New
Understanding of Patient Safety)

Richard I. Cook, MD, Cognitive Technologies Laboratory
University of Chicago

Copyright (C) 1998, 1999, 2000 by R.I. Cook, MD, for CtL. Revision D (00.04.21)

1) Complex systems are intrinsically hazardous systems. All of the
interesting systems (e.g. transportation, healthcare, power generation) are
inherently and unavoidably hazardous by their own nature. The frequency of
hazard exposure can sometimes be changed but the processes involved in the
system are themselves intrinsically and irreducibly hazardous. It is the
presence of these hazards that drives the creation of defenses against
hazard that characterize these systems.

2) Complex systems are heavily and successfully defended against failure.
The high consequences of failure lead over time to the construction of
multiple layers of defense against failure. These defenses include obvious
technical components (e.g. backup systems, 'safety' features of equipment)
and human components (e.g. training, knowledge) but also a variety of
organizational, institutional, and regulatory defenses (e.g. policies and
procedures, certification, work rules, team training). The effect of these
measures is to provide a series of shields that normally divert operations
away from accidents.

3) Catastrophe requires multiple failures - single point failures are not
enough. The array of defenses works. System operations are generally
successful. Overt catastrophic failure occurs when small, apparently
innocuous failures join to create opportunity for a systemic accident. Each
of these small failures is necessary to cause catastrophe but only the
combination is sufficient to permit failure. Put another way, there are many
more failure opportunities than overt system accidents. Most initial failure
trajectories are blocked by designed system safety components. Trajectories
that reach the operational level are mostly blocked, usually by
practitioners.

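[Transcriber's aside: the arithmetic behind point 3 can be sketched in a few
lines. The Python snippet below is a minimal illustration only, assuming a
hypothetical system with four independent defense layers and made-up bypass
probabilities; none of the numbers come from the paper.]

    # Illustrative sketch only: a hypothetical system with four independent
    # defense layers. Each value is the (invented) probability that a given
    # small failure slips past that layer.
    layer_bypass_probability = [0.05, 0.10, 0.02, 0.08]

    # A catastrophe needs the trajectory to get past every layer at once, so
    # under the independence assumption the joint probability is the product
    # of the individual probabilities.
    joint = 1.0
    for p in layer_bypass_probability:
        joint *= p

    print(f"Worst single-layer bypass chance: {max(layer_bypass_probability):.0%}")
    print(f"Chance all layers are bypassed together: {joint:.6%}")
    # Prints roughly 10% for the worst single layer versus 0.000800% for the
    # combination: many failure opportunities, very few overt accidents.

[The independence assumption is itself the kind of naive model the paper
cautions against - latent failures interact - but even this toy calculation
shows why single-point failures rarely become catastrophes.]
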
4) Complex systems contain changing mixtures of failures latent within
them. The complexity of these systems makes it impossible for them to run
without multiple flaws being present. Because these are individually
insufficient to cause failure they are regarded as minor factors during
operations. Eradication of all latent failures is limited primarily by
economic cost but also because it is difficult before the fact to see how
such failures might contribute to an accident. The failures change
constantly because of changing technology, work organization, and efforts to
eradicate failures.

5) Complex systems run in degraded mode. A corollary to the preceding point
is that complex systems run as broken systems. The system continues to
function because it contains so many redundancies and because people can
make it function, despite the presence of many flaws. After-accident reviews
nearly always note that the system has a history of prior 'proto-accidents'
that nearly generated catastrophe. Arguments that these degraded conditions
should have been recognized before the overt accident are usually predicated
on naive notions of system performance. System operations are dynamic,
with components (organizational, human, technical) failing and being
replaced continuously.

6) Catastrophe is always just around the corner. Complex systems possess
potential for catastrophic failure. Human practitioners are nearly always in
close physical and temporal proximity to these potential failures - disaster
can occur at any time and in nearly any place. The potential for
catastrophic outcome is a hallmark of complex systems. It is impossible to
eliminate the potential for such catastrophic failure; the potential for
such failure is always present by the system's own nature.

7) Post-accident attribution of accidents to a 'root cause' is fundamentally
wrong. Because overt failure requires multiple faults, there is no isolated
'cause' of an accident. There are multiple contributors to accidents. Each
of these is necessarily insufficient in itself to create an accident. Only
jointly are these causes sufficient to create an accident. Indeed, it is
the linking of these causes together that creates the circumstances required
for the accident. Thus, no isolation of the 'root cause' of an accident is
possible. The evaluations based on such reasoning as 'root cause' do not
reflect a technical understanding of the nature of failure but rather the
social, cultural need to blame specific, localized forces or events for
outcomes.1

1 Anthropological field research provides the clearest demonstration of the
social construction of the notion of 'cause' (cf. Goldman L (1993), The
Culture of Coincidence: accident and absolute liability in Huli, New York:
Clarendon Press; and also Tasca L (1990), The Social Construction of Human
Error, Unpublished doctoral dissertation, Department of Sociology, State
University of New York at Stony Brook).

8) Hindsight biases post-accident assessments of human performance.
Knowledge of the outcome makes it seem that events leading to the outcome
should have appeared more salient to practitioners at the time than was
actually the case. This means that ex post facto accident analysis of human
performance is inaccurate. The outcome knowledge poisons the ability of
after-accident observers to recreate the view that practitioners had of
those same factors before the accident. It seems that practitioners "should
have known" that the factors would "inevitably" lead to an accident.2
Hindsight bias remains the primary obstacle to accident investigation,
especially when expert human performance is involved.

2 This is not a feature of medical judgments or technical ones, but rather
of all human cognition about past events and their causes.

9) Human operators have dual roles: as producers & as defenders against
failure. The system practitioners operate the system in order to produce its
desired product and also work to forestall accidents. This dynamic quality
of system operation, the balancing of demands for production against the
possibility of incipient failure, is unavoidable. Outsiders rarely
acknowledge the duality of this role. In non-accident-filled times, the
production role is emphasized. After accidents, the defense against failure
role is emphasized. At either time, the outsider's view misapprehends the
operator's constant, simultaneous engagement with both roles.

10) All practitioner actions are gambles. After accidents, the overt failure
often appears to have been inevitable and the practitioner's actions as
blunders or deliberate willful disregard of certain impending failure. But
all practitioner actions are actually gambles, that is, acts that take place
in the face of uncertain outcomes. The degree of uncertainty may change from
moment to moment. That practitioner actions are gambles appears clear after
accidents; in general, post hoc analysis regards these gambles as poor ones.
But the converse, that successful outcomes are also the result of gambles,
is not widely appreciated.

11) Actions at the sharp end resolve all ambiguity. Organizations are
ambiguous, often intentionally, about the relationship between production
targets, efficient use of resources, economy and costs of operations, and
acceptable risks of low and high consequence accidents. All ambiguity is
resolved by actions of practitioners at the sharp end of the system. After
an accident, practitioner actions may be regarded as 'errors' or
'violations' but these evaluations are heavily biased by hindsight and
ignore the other driving forces, especially production pressure.

12) Human practitioners are the adaptable element of complex systems.
Practitioners and first line management actively adapt the system to
maximize production and minimize accidents. These adaptations often occur
on a moment by moment basis. Some of these adaptations include: (1)
Restructuring the system in order to reduce exposure of vulnerable parts to
failure. (2) Concentrating critical resources in areas of expected high
demand. (3) Providing pathways for retreat or recovery from expected and
unexpected faults. (4) Establishing means for early detection of changed
system performance in order to allow graceful cutbacks in production or
other means of increasing resiliency.

13) Human expertise in complex systems is constantly changing. Complex
systems require substantial human expertise in their operation and
management. This expertise changes in character as technology changes but it
also changes because of the need to replace experts who leave. In every
case, training and refinement of skill and expertise is one part of the
function of the system itself. At any moment, therefore, a given complex
system will contain practitioners and trainees with varying degrees of
expertise. Critical issues related to expertise arise from (1) the need to
use scarce expertise as a resource for the most difficult or demanding
production needs and (2) the need to develop expertise for future use.

14) Change introduces new forms of failure. The low rate of overt accidents
in reliable systems may encourage changes, especially the use of new
technology, to decrease the number of low consequence but high frequency
failures. These changes may actually create opportunities for new, low
frequency but high consequence failures. When new technologies are used to
eliminate well understood system failures or to gain high precision
performance they often introduce new pathways to large scale, catastrophic
failures. Not uncommonly, these new, rare catastrophes have even greater
impact than those eliminated by the new technology. These new forms of
failure are difficult to see before the fact; attention is paid mostly to
the putative beneficial characteristics of the changes. Because these new,
high consequence accidents occur at a low rate, multiple system changes may
occur before an accident, making it hard to see the contribution of
technology to the failure.

15) Views of 'cause' limit the effectiveness of defenses against future
events. Post-accident remedies for "human error" are usually predicated
on obstructing activities that can "cause" accidents. These
end-of-the-chain measures do little to reduce the likelihood of further
accidents. In fact, the likelihood of an identical accident is already
extraordinarily low because the pattern of latent failures changes
constantly. Instead of increasing safety, post-accident remedies usually
increase the coupling and complexity of the system. This increases the
potential number of latent failures and also makes the detection and
blocking of accident trajectories more difficult.

16) Safety is a characteristic of systems and not of their components.
Safety is an emergent property of systems; it does not reside in a person,
device or department of an organization or system. Safety cannot be
purchased or manufactured; it is not a feature that is separate from the
other components of the system. This means that safety cannot be manipulated
like a feedstock or raw material. The state of safety in any system is
always dynamic; continuous systemic change ensures that hazard and its
management are constantly changing.

17) People continuously create safety. Failure free operations are the
result of activities of people who work to keep the system within the
boundaries of tolerable performance. These activities are, for the most
part, part of normal operations and superficially straightforward. But
because system operations are never trouble free, human practitioner
adaptations to changing conditions actually create safety from moment to
moment. These adaptations often amount to just the selection of a
well-rehearsed routine from a store of available responses; sometimes,
however, the adaptations are novel combinations or de novo creations of new
approaches.

18) Failure free operations require experience with failure. Recognizing
hazard and successfully manipulating system operations to remain inside the
tolerable performance boundaries requires intimate contact with failure.
More robust system performance is likely to arise in systems where operators
can discern the "edge of the envelope". This is where system performance
begins to deteriorate, becomes difficult to predict, or cannot be readily
recovered. In intrinsically hazardous systems, operators are expected to
encounter and appreciate hazards in ways that lead to overall performance
that is desirable. Improved safety depends on providing operators with
calibrated views of the hazards. It also depends on providing calibration
about how their actions move system performance towards or away from the
edge of the envelope.

Other materials:

Cook, Render, Woods (2000). Gaps in the continuity of care and progress on
patient safety. British Medical Journal 320: 791-4.

Cook (1999). A Brief Look at the New Look in error, safety, and failure of
complex systems. (Chicago: CtL)

Woods & Cook (1999). Perspectives on Human Error: Hindsight Biases and Local
Rationality. In Durso, Nickerson, et al., eds., Handbook of Applied
Cognition. (New York: Wiley) pp. 141-171.

Woods & Cook (1998). Characteristics of Patient Safety: Five Principles that
Underlie Productive Work. (Chicago: CtL)

Cook & Woods (1994). "Operating at the Sharp End: The Complexity of Human
Error," in MS Bogner, ed., Human Error in Medicine, Hillsdale, NJ; pp.
255-310.

Woods, Johannesen, Cook, & Sarter (1994). Behind Human Error: Cognition,
Computers and Hindsight, Wright Patterson AFB: CSERIAC.

Cook, Woods, & Miller (1998). A Tale of Two Stories: Contrasting Views of
Patient Safety, Chicago, IL: NPSF, (available as PDF file on the NPSF web
site at www.npsf.org).