This article is based on research we did on High Reliability Organizations (HROs) for our Testing and Finance 2012 talk. It is intended as a starting point for those interested in learning more about HROs. If nothing else, it provides an extended reference section for further reading and suggests some useful web resources.
What is an HRO?
HROs are organizations that succeed in avoiding catastrophes in an environment where accidents can be expected due to risk factors and the complexity of the system. Consider, for example, the organization behind oil rigs, where it is easy to imagine someone getting injured due to a mistake. What starts out as a small mistake may quite literally explode into a big problem. As Tim Harford notes:
On the morning of 6 July 1988, maintenance workers on Piper Alpha, the largest and oldest oil and gas rig in the North Sea, dismantled a backup pump to check a safety valve. The work dragged on all day and the workers stopped early evening, sealing the tube off and filling out a permit noting that the pump was unusable. An engineer left the permit in the control room, but it was busy and there were constant interruptions. Later in the evening, the primary pump failed and – pressed for time, not knowing about the maintenance, and unable to find any reason why the backup pump should not be used – the rig’s operators started up the half-dismantled pump. Gas leaked out, caught fire and exploded.
The explosion, serious in itself, was compounded by several other failures. Normally, a gas rig such as Piper Alpha would have blast walls to contain explosions, but Piper Alpha had originally been designed to pump oil, which is flammable but rarely explosive. The retrofitted design also placed hazards too close to the rig’s control room, which the explosion immediately disabled. Fire-fighting pumps, which were designed to draw in huge volumes of sea water, did not automatically start, because of a safety measure to protect divers from being sucked into the pump inlet. The safety system could have been overridden, but the control room had been destroyed. This also meant no evacuation could be coordinated, so the platform workers retreated to the rig’s accommodation block.
Two nearby rigs continued to pump oil and gas towards the blazing Piper Alpha. Their operators watched the inferno but fretted that they lacked the authority to make the expensive decision to shut down production. It might have made little difference anyway, given the presence of so much high-pressure gas in the supply lines. When this gas exploded, a fireball half the height of the Eiffel Tower engulfed the platform. The blast even killed two rescuers in a nearby boat, along with the rig crewmen whom they had hauled from the water. Other pipelines ruptured in the heat, feeding the fire and driving away another fire-fighting rescue boat. It was impossible to approach the rig and, less than two hours after the initial explosion, the entire accommodation block slid off the melting platform into the sea. One hundred and sixty-seven men died. Many of the fifty-nine survivors had leapt ten storeys into deadly cold waves. The rig burned for three more weeks, wilting like old flowers in a betrayal of mass, steel and engineering.
No single mistake was enough to bring Piper Alpha down, but the cumulative effect of the mistakes created a disaster. The workers stopped work on the valve to avoid getting too tired, as this might lead to mistakes (respecting company regulations). Regulations to keep staff from getting tired seem very sensible. However, following the rules meant that the team did not adapt to the situation. For example, they could have called for another crew to take over or continued until the work was finished. Whatever their solution, surely the crew doing the work had better information on the current job and more technical knowledge about the pump than the offshore management who set the rules? The crew couldn’t receive orders from the mainland in real time – there was no manager saying ‘stop work’. The orders were gathered into rules and procedures. These rules are dogma, but dogma that was designed to save lives, not end them. Ironically, and of course tragically, the team could not adapt because of the rules, but the rules did adapt because of the disaster.
Once the small problems started to accumulate, damage control procedures were shown to be outdated or even impossible to execute. Setting guidelines and promoting safety is good practice. However, it is essential that the actual people doing the work get involved. They know the environment best. With the help of experts they can develop rules and procedures but, on their own, they are not enough.
In 1977 the worst crash in aviation history occurred on Tenerife. Like Piper Alpha, it resulted from an accumulation of problems. The timeline went like this:
- KLM 4805 and Pan Am 1736 were diverted from Gran Canaria after a bomb exploded at the airport
- As the authorities searched for more bombs, more and more flights were diverted to Tenerife
- These extra planes blocked the taxiways
- As the two planes taxied, a dense fog came down and reduced visibility to 300 meters, which is 400 meters less than the legal limit
- In the tower, the air traffic controller used the term “OK”, which was not standard
Together these things led to the disaster. Ultimately, the system became blocked and the physical problems were compounded by human error. This is why high reliability is all about people and empowering those close to the problem to act.
In the 1960s Roberts, Simon, March, and Weick started thinking about organizations as entities for conducting complex operations. This was in contrast with the rational machine thinking that was popular at the time. They based their ideas on the Toyota Production System and suggested structuring organizations in order to support the operator, or, as in the case of Piper Alpha, the crewmen doing the work. They also suggested that those who have operational expertise and are close to the work should be empowered and supported, so they can solve any problems that may arise. Later, reflection was added as a key ingredient through working in cycles and reviewing each cycle. This thinking would eventually lead to the current literature on HROs.
How relevant is this literature for your organization? This is an essential question that needs to be asked before any organization embarks on a campaign of change. Van Stralen (2008) tells a remarkable story of a nursing home that wanted to transform into a chronic intensive care institute.
Risk awareness became the first concept for bedside staff to learn because of the state’s concerns regarding safety. During individual and group sessions, the SCF[1] staff demonstrated a lack of belief that they provided high-risk care, worked in a dangerous environment, or that their clients could die. To remedy this, the new medical director invited everyone to go to the parking lot for a two-to-three-hour picnic. This stunned them, and they began to explain why they could not participate – that a child could die from plugging of the tracheostomy airway, fluids entering the lungs, or falling out of bed over the guard rails. This awareness that children in the SCF could die suddenly helped to introduce methods of decision-making that would allow bedside staff to immediately engage a problem.
Risk awareness alone does not lead to reliability: it must change behaviors to acknowledge that risk. Later on, bedside clinical discussions helped staff to link risk with clinical interventions. One can evaluate risk as a probability: the odds an event will occur, or a possibility: the ease with which an event will occur. The concept of possibility facilitated a discussion of ambiguous or vague risks containing great threats.
With hindsight, it is easy to spot the need for reliability. Engineering teams are always part of a bigger system and, if the engineering team makes a mistake, this can bring construction to a halt. If a financial engineer releases faulty software it can stop a bank in its tracks causing great economic loss and unnecessary panic in the economic system. If the booking system at an airline crashes, the company might never recover.
No matter which branch of engineering you are in, the tiniest mistake might have grave consequences, as Piper Alpha showed. HROs know and accept this. Weick & Sutcliffe (2007) identify five characteristics that HROs share to cope, namely:
- Preoccupation With Failure
- Reluctance To Simplify
- Sensitivity To Operations
- Commitment To Resilience
- Deference To Expertise
The five characteristics make organizations agile and alert: the first three focus on anticipation and the last two on containment. Each characteristic is very well defined in this Wildland fire paper[2]:
Preoccupation With Failure
HROs treat any lapse as a symptom that something may be wrong that could have severe consequences. Several separate small errors that happen to coincide could signal a possible larger failure. HROs encourage reporting of errors and analyze experiences of a near miss for what can be learned. They are wary of the potential liabilities of success, including the temptations to reduce margins of safety and act on autopilot.
Reluctance To Simplify
HROs resist the common tendency to oversimplify explanations of events and to steer away from evidence that goes against management thinking or suggests the presence of unexpected problems.
Sensitivity To Operations
Information about operations and performance is integrated into a single picture of the overall situation and operational performance. Sensitivity to operations allows for early problem identification, permitting action before the problems become too substantial.
Commitment To Resilience
HROs recognize, understand and accept that human error and unexpected events are both persistent and omnipresent. Assuming that the organization will eventually be surprised, the capacity is developed to respond to, contain, cope with, and bounce back from undesirable change, swiftly and effectively.
Deference To Expertise
The loosening of hierarchical restraints enables the HRO to empower expert people closest to a problem. Often lower-level personnel are able to make operational decisions quickly and accurately. Leadership is shifted to people who currently have the answer to the problem at hand.
A focus on, and preoccupation with, failure creates the opportunity for prevention. To be effective in practice, an organization or team has to be able to spot errors and reflect on their cause and effect.
Small errors might be particularly hard to detect. Having a system in place to stop and reflect is a very effective way of raising a team’s sensitivity to operations. If a retrospective review of the work is done methodically, it can help strengthen all five of the characteristics that define an HRO. The Goodyear Fire Department knows this and has made reflection part of its operations. Consider this quote from a report[3] describing the practices of the Goodyear Fire Department:
On a summer day in Goodyear, firefighters respond to a house fire. Some of them enter through a back door, some through a front door and others climb to the roof where the house must be vented. It is midday and the temperature is near 100 degrees. Electrical power lines are feet from the house. Smoke is pouring out.
This is a routine call, one this crew and others just like it will respond to several times a week. After the crew has extinguished the fire, though, they will do something a little different. Before loading their trucks to return to the station, they pause. “This is the time to review what went well, what can be improved, or what didn’t work at all,” says Goodyear FD Chief Russ Braden.
If your team isn’t doing this kind of retrospective analysis, you are not using one of the best tools to detect problems. Retrospectives are an excellent way to build up data on your project, team and the interaction with wider organizations.
Without reflection there can be no acts of improvement. If improvement is lacking, an organization can only start functioning less effectively over time; there is no status quo. Eventually, without reflection and action, the only possible outcome is failure.
Teams need to develop a reluctance to simplify solutions during their retrospectives. They must learn to be diligent. They should not skip a retro because nothing out of the ordinary happened. Remember, Chief Braden reviewed a routine and uneventful call. Making reviews and retrospectives part of the procedure helps, but truly great retrospectives only happen in safe environments where there is trust within the group. One technique that enables trust, safety, and some retrospective analysis is the ‘five whys’. It was introduced at Toyota and incorporated into its Production System by Ohno, who believed it to be a valuable tool for unveiling the nature of a problem and finding its solution. If this basic tool doesn’t do the trick, go deeper using techniques like root cause analysis, the fault tree or even the Theory U approach by Scharmer.
The most important thing is that you find a way to learn together as a team how to spot small problems that can accumulate into a catastrophe.
Pushing responsibility to act down to those doing the actual work, and deferring to expertise, is difficult in a strict ‘command and control’ organization. Many of the organizations where small mistakes can cause a catastrophe are organized in this manner. One organization that has adapted is the US Navy. It is an archetype of a hierarchical organization, but the flight decks of its aircraft carriers are an exception: they are high reliability environments. To get a better feel for the operations on board a carrier, consider the following quote from Weick et al.:
Imagine that it’s a busy day, and you shrink San Francisco Airport to only one short runway and one ramp and gate. Make planes take off and land at the same time, at half the present time interval, rock the runway from side to side, and require that everyone who leaves in the morning returns that same day. Then turn off the radar to avoid detection, impose strict controls on radios, fuel the aircraft in place with their engines running, put an enemy in the air, and scatter live bombs and rockets around. Now wet the whole thing down with salt water and oil.
(Senior officer, Air Division).
Sensitivity to operations is key in a carrier battle group. The term “situational awareness” is embedded firmly in the contextual language of the armed forces. On the bridge of an aircraft carrier, flight, weather, intelligence and the ship’s data all come together. There are systems to identify aircraft on radar, keep track of inventory, control cruise missiles and much more. The bridge is a single central control center which can provide the status of the ship and any deviations. It is the place to get the big picture of current operations.
Sensitivity to operations means visualizing your environment in order to indicate if any deviations from the expected occur.
The importance of having an up-to-date big picture of current operations isn’t limited to the navy. In the world of finance, for example, it is just as relevant. Many financial organizations pull together data ‘dashboards’ to get an insight into their current situation. These dashboards answer questions such as:
- Are we receiving all our market data correctly and on time? (For example, do the Reuters and Bloomberg information match up?)
- What are our margins like?
- What was our production last period?
- What are our outstanding exposures?
This list can go on for a while and depends on the business specifics of the organization. It is clear that sensitivity to operations is also relevant for financial organizations. However, the operations do not limit themselves to pure financial indicators. It is also important to consider if our continuous builds are in good shape. Are all the components we use for our client web portal still integrating? How many bugs need to be fixed for the next release? What are the tasks for the current iteration and what is their status? This list goes on and on. We need lots of information from a wide range of sources to gain situational awareness. Once the information is tailored to your needs, be sure to automate its generation. Your company might not be under fire from an enemy air force, but that doesn’t excuse you from having all relevant data analyzed and ready to report your status. Only with a setup like this will you be able to detect a small “lapse as a symptom that something may be wrong”. Visualization is the key. Ideally any visitor to the team could immediately see if things are going as expected or if some deviation from the norm has occurred.
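As an illustration of automating one such check, here is a minimal Python sketch of the first dashboard question: do two market data feeds match up? Everything in it is an assumption made for the example rather than something taken from a real system: the feeds are plain dictionaries of symbol to price, the names `Check`, `compare_feeds` and `summarize` are hypothetical, and the 0.1% tolerance is arbitrary. A real dashboard would pull from the vendors' own interfaces and choose thresholds with the people who know the business.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str    # instrument symbol
    ok: bool     # True if the two feeds agree within tolerance
    detail: str  # human-readable explanation for the dashboard


def compare_feeds(feed_a, feed_b, tolerance=0.001):
    """Compare two {symbol: price} feeds (hypothetical format).

    A symbol fails when it is missing from either feed or when the
    relative price difference exceeds `tolerance` (assumed: 0.1%).
    """
    checks = []
    for symbol in sorted(set(feed_a) | set(feed_b)):
        a, b = feed_a.get(symbol), feed_b.get(symbol)
        if a is None or b is None:
            checks.append(Check(symbol, False, "missing from one feed"))
            continue
        # Fall back to 1.0 so two zero prices do not divide by zero.
        scale = max(abs(a), abs(b)) or 1.0
        rel_diff = abs(a - b) / scale
        checks.append(
            Check(symbol, rel_diff <= tolerance, f"relative difference {rel_diff:.4%}")
        )
    return checks


def summarize(checks):
    """Reduce all checks to one status line plus the list of failures."""
    failed = [c for c in checks if not c.ok]
    status = "OK" if not failed else "DEVIATION"
    return status, failed


if __name__ == "__main__":
    # Made-up sample prices, for illustration only.
    feed_a = {"EURUSD": 1.3050, "GBPUSD": 1.6000, "USDJPY": 79.50}
    feed_b = {"EURUSD": 1.3051, "GBPUSD": 1.6200}
    status, failed = summarize(compare_feeds(feed_a, feed_b))
    print(status)
    for check in failed:
        print(f"  {check.name}: {check.detail}")
```

The point of reducing everything to a single status plus a short list of failures is that a visitor can read it at a glance, while the details remain available to whoever is closest to the problem.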
Commitment to resilience is important for any organization. Human error will happen and it’s best to accept this and deal effectively with the consequences. On the flight deck of US carriers an amazing ritual is nurtured. If an engineer loses a tool on the flight deck, everything stops. Everybody, regardless of rank, helps search the deck. Afterwards, the reporting engineer is literally applauded instead of disciplined. The crew has accepted the fact that tools can be lost and human error occurs. Instead of attacking the engineer, they attack the problem. A screwdriver on the deck can be sucked into a running jet engine, causing mayhem. Finding the missing screwdriver logically becomes top priority. The reporting engineer didn’t mess up by losing a screwdriver; they saved the day by preventing a catastrophe. This kind of thinking will help managers develop people who can solve problems by creating and sustaining an environment where reporting problems quickly is rewarded.
On the shop floor
You might not be working in a dangerous environment such as a flight deck. However, human error will show its ugly face sooner or later within your organization or team. It might be overoptimistic planning, accidentally deleting a file, crashing the continuous build, a fire in the server area or worse. Accept that these things, and many more, can and will happen. The way you deal with them can have a major impact on the team and organization.
So, the next time a team member mentions that they have messed up, you should really consider praising them during the upcoming retrospective. The team got the information quicker and knew what they were up against. Maybe there is a chance to correct course or at least communicate the problem in the wider organization. Problems that are hidden seldom sort themselves out and waiting for a miracle isn’t a strategy. Having proper procedures in place can strongly support the drive for resilience, but it has to start with trust within teams, with trust throughout the organization being the utopia. One of the key findings in a report[4] on an HRO implementation for wildfire fighters elegantly phrased this as:
In a just culture, there is no stigma attached to speaking up about errors or defects in the system. In fact, people are rewarded and praised for doing so.
It is almost impossible to understand why other organizations don’t shift leadership to people who currently have the answer to the problem at hand; it is HRO versus hierarchical. Often management methodologies mimic deference to expertise by defining roles. Each role is appointed to those who likely have the answer to the problem at hand. But common sense is best. No roles are available for those completely unexpected events, hence roles alone can never be enough to guard against the unexpected and thus to become a true HRO.
HROs break with the division of conception and execution of work by growing expert, self-managing teams. These teams are devoted to learning. Although strict processes for working safely are in place, they can be changed by the team at any time if there is a need.
The real conclusion is that it all starts with one thing: trust.
Derby, E., Larsen, D. (2006), Agile Retrospectives: Making Good Teams Great, Pragmatic Bookshelf
Harford, T. (2011), Adapt: Why Success Always Starts with Failure, Little, Brown
Ohno, T. (1988), Toyota Production System: Beyond Large-Scale Production, Productivity Press
Scharmer, O. (2007), Theory U: Leading from the Future as it Emerges, The Society for Organizational Learning, 1st edition
Stralen, D. van (2008), High-Reliability Organizations: Changing the Culture of Care in Two Medical Units, MIT Design Issues, Vol 24 No 1. Available from: http://ccrm.berkeley.edu/pdfs_papers/3.09/Design-Issues-PICU-Subacute.pdf
Weick, K., Sutcliffe, K. (2007), Managing the Unexpected: Resilient Performance in an Age of Uncertainty, Wiley and Sons
http://www.wildfirelessons.net/HigherLogic/System/DownloadDocumentFile.ashx?DocumentFileKey=a24e5f28-7b14-41e3-b544-73320fe28f26&forceDialog=0 [update 4/4/2016 no longer online, available on request]
http://wildfirelessons.net/documents/Tailboard_AARs.pdf [update 4/4/2016 no longer online, available on request]
http://www.wildfirelessons.net/HRO.aspx [update 4/4/2016 no longer online, available on request]
1. Sub-acute Care Facility
2. High Reliability Organizing Implementation at Sequoia and Kings Canyon National Parks.
Available here: http://wildfirelessons.net/documents/SEKI_HRO_Case_Study_LLC.pdf/ [update 4/4/2016 no longer online, available on request]
3. Trues, R. – Instant Reply Available here: http://wildfirelessons.net/documents/Tailboard_AARs.pdf [update 4/4/2016 no longer online, available on request]
4. High Reliability Organizing Implementation at Sequoia and Kings Canyon National Parks.
Available here: http://wildfirelessons.net/documents/SEKI_HRO_Case_Study_LLC.pdf [update 4/4/2016 no longer online, available on request]