These questions provide the framework for all information collection. There are also three additional tools that help organize all the pieces of information: a timeline, diagrams and photos, and process maps. A comprehensive investigation requires the collection of all relevant information, documented in a clear and coherent format.
This paper provides a basic explanation of each step and each tool. The objective is to simplify and improve the way individuals and groups investigate and solve problems. First, we’ll cover two important points that apply to every aspect of an investigation: the importance of focusing on principles and of being specific in communication.
Principles are constants. They do not change from problem to problem. The cause-and-effect principle is fundamental to all problems and does not change from one problem to the next. It can be universally applied to equipment failures, supply chain problems, production outages, customer service issues and people problems. By focusing on the principle of cause-and-effect, an organization can develop a consistent approach to investigate and solve all problems.
There is a truth to any incident that has already happened. The layout of a town is the truth. Creating a map of that town is truly an objective exercise because the roads already exist in a particular way. The map should match the actual layout of the town, just as the investigation should match the incident that occurred.
Many people think of cause-and-effect as a linear relationship, where an effect has a cause. In fact, cause-and-effect relationships connect based on the principle of a system. A system has parts just like an effect has causes. For example, the equipment downtime occurred because a part failed. We find that the part failed because of fatigue. The next question is “Why did it fatigue?” and the why questions can keep going. Most organizations mistakenly believe that an investigation is about finding the one cause - or “root cause.” An effect doesn’t have one cause; an effect has causes. The causes reveal different ways that the problem can be solved.
The word analysis means to break down into parts. Failure analysis, problem analysis and root cause analysis all start with a problem which is then broken down into its parts. The parts of a problem are the causes. The more severe the incident, the more detail is added to the investigation.
A common mistake that organizations make in investigations is the tendency to categorize an entire incident into one cause. As an incident is broken down into detail, more and more causes are revealed. Understanding these detailed causes reveals additional ways that the problem could possibly be solved. As the causes get more specific the solutions also get more specific. Problems are not solved in general. Problems are solved when specific action is taken. “The devil is in the details.”
Organizations may try to group an entire investigation into one category. This makes the incident more general, not more specific. The five favorite generalizations organizations mistakenly use are human error, procedure not followed, equipment failure, training inadequate and design. Many groups believe that the end of an investigation has been reached if they can get to one of these five categories. Don’t stop too early – ask two or three more why questions to get more specific information.
This next section covers the three basic steps in every investigation: what’s the problem, why did it happen and what should be done to prevent it?
Step 1 What’s the Problem? – Definition
Everyone seems to know that defining the problem is the first step in an investigation. How this is done varies widely. Some groups write a lengthy problem statement and then debate the wording for 30 minutes or more. Some identify multiple problems within an incident. A facilitator should remember that people see problems differently. When someone states their view of the problem, be prepared for someone else to disagree and offer a different problem. The word problem is problematic because people use it for whatever they see as the “bad thing.”
To accurately define a failure, there are four simple questions: What is the problem, When did it happen, Where did it happen and How were the overall goals impacted? Instead of writing a long problem description, simply answer these four questions in an outline format. Don’t write responses as complete sentences, just short phrases.
The question, “How were the overall goals impacted?” captures the magnitude of an issue. The first question was “What’s the problem?” which is the individuals’ point of view. The organization views the problem as any deviation from the ideal state. For example, a manufacturing company’s overall goals (ideal state) are typically no safety injuries, no environmental issues, no customer issues, no production problems and no excess materials or labor spending.
The goals that were impacted in a negative way provide the starting point for the analysis. Step 2 does not start with what people see as “the problem(s).” The analysis begins with the impact to the overall goals. People see problems differently, but defining every “problem” by how it negatively impacts the goals provides a consistent starting point. Start with the impact to the overall goals to define your next problem.
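The Step 1 outline described above can be sketched as a small record. This is only an illustration in Python: the `ProblemOutline` class, its field names and the sample incident are hypothetical and are not taken from the paper or its template.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemOutline:
    """Step 1 outline: short phrases, not complete sentences."""
    what: str
    when: str
    where: str
    # Goal name -> short phrase describing the negative impact.
    # Goals that were not impacted are simply left out.
    goal_impacts: dict[str, str] = field(default_factory=dict)

    def starting_points(self) -> list[str]:
        """Impacted goals, which are where the Step 2 analysis begins."""
        return [f"{goal}: {impact}" for goal, impact in self.goal_impacts.items()]


# Hypothetical incident, invented purely for illustration.
outline = ProblemOutline(
    what="Pump seal leak",
    when="March 3, 9:05 AM, during startup",
    where="Line 2, feed pump",
    goal_impacts={
        "Production": "4 hours downtime",
        "Materials/Labor": "Seal replacement, 6 labor hours",
    },
)

for point in outline.starting_points():
    print(point)
```

Keeping the goal impacts separate from the "what" field reflects the point made above: the analysis starts from the impacted goals, not from any one person's statement of the problem.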
Step 2 Why did it happen? – Analysis
It’s important to remember in this step that every effect has causes (plural). People may try to identify the single cause of an issue, what is commonly referred to as the “root cause.” There is not a single cause to any incident. There are causes.
The fire triangle illustrates that there is no single cause for a fire; there are causes – heat, fuel and oxygen. Controlling any one of these causes will reduce the risk of the fire. Most people mistakenly believe oxygen is a “contributing factor” to a fire, meaning on its own it can’t produce a fire. In reality, there is no difference between a contributing factor and a cause. A cause, by definition, is required to produce an effect. Oxygen is required for fire; it is therefore a cause of fire. On its own, oxygen will not produce a fire. Neither will heat nor fuel. Fire requires all three causes: heat, fuel and oxygen. An effect requires all of its causes.
The most effective way to communicate the causes of an incident is in a visual format. The cause-and-effect analysis should start with the impact to the overall goals and then ask Why questions, moving to the right. Why questions take us backwards in time through the failure. Visually breaking down the cause-and-effect relationships as the information is collected is the simplest way to document the investigation and its supporting evidence.
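The fire triangle and the Why questions above can be sketched as a small data structure. This is a minimal illustration in Python, not the paper's own tool: the `Cause` class, its `occurs` and `outline` methods, and the present-or-absent treatment of a controlled cause are all assumptions made for the example. The text speaks of reducing risk, which this sketch simplifies to true or false.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    """One box on a cause-and-effect diagram."""
    description: str
    evidence: str = ""
    # Every entry is required for this effect, so the list acts as a logical AND.
    causes: list["Cause"] = field(default_factory=list)

    def occurs(self, controlled: set[str]) -> bool:
        """False if this cause, or any cause it depends on, has been controlled."""
        if self.description in controlled:
            return False
        return all(c.occurs(controlled) for c in self.causes)

    def outline(self, depth: int = 0) -> list[str]:
        """Indented text rendering; each level answers one more 'Why?'."""
        lines = ["    " * depth + self.description]
        for c in self.causes:
            lines.extend(c.outline(depth + 1))
        return lines


fire = Cause("Fire", causes=[Cause("Heat"), Cause("Fuel"), Cause("Oxygen")])

print("\n".join(fire.outline()))
print(fire.occurs(controlled=set()))        # all three causes present
print(fire.occurs(controlled={"Oxygen"}))   # one required cause removed
```

Because each effect requires all of its causes, controlling any single entry in `causes` is enough to stop the effect in this model, which is why every cause on the diagram is a candidate for a solution.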
The focus of this step is an accurate cause-and-effect analysis to a sufficient level of detail. It is during this analysis step that detail is added to the timeline, the diagrams and photographs are utilized and the specific steps of the processes that were in place are identified. These additional tools are used to collect and organize all available information to ensure the cause-and-effect analysis is accurate.
The facilitator is typically moving back and forth between the different tools and the cause-and-effect analysis as information is collected. A complete analysis starts with the negative impact to the overall goals and captures the causes with supporting evidence.
Step 3 What should be done? – Solutions
The solutions step is where specific actions are defined to prevent the issue from occurring. This step begins once the analysis step is complete. The solutions step breaks into two parts: what are the possible solutions and from those which ones are the best solutions? Possible solutions are collected first so that different ways to solve the problem can be identified. The analysis is objective and based on evidence, while finding the best solutions is subjective and creative.
Each of the causes in the analysis can be challenged with the question, “Is there a way to control this cause?” Ideas come from the people who are involved with the problem. The managers, engineers and supervisors will have ideas. The designers, manufacturers and vendors also will have ideas. And the people who operate and maintain the system or equipment on a daily basis will have ideas. Ask for input - especially ask the people closest to the work. The people who work in the system in question must be part of the problem solving process. They are the ones who actually execute the process. There is a significant amount of brainpower within organizations that is underutilized because people are not regularly asked for their ideas.
The best solutions are selected based on how effective the solution is and the level of effort required for implementation. The effectiveness of the solution is a function of its reduction on the impact to the overall goals. The level of effort is a function of the resources, cost and time to implement the solution. The possible solutions can be ranked based on effectiveness and effort so that the best solutions are revealed. These best solutions are then part of an action plan with specific owners and due dates.
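The ranking of possible solutions by effectiveness and effort can be sketched as follows. The paper does not give a scoring formula, so everything specific here is an assumption for illustration: the 1 to 10 scales, the `Solution` class, the ratio used as a score, and the sample solutions.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    description: str
    effectiveness: int  # assumed scale: 1 (little impact reduction) to 10 (large)
    effort: int         # assumed scale: 1 (cheap, quick) to 10 (costly, slow)

    @property
    def score(self) -> float:
        # Assumed ranking rule, not taken from the paper: benefit per unit of effort.
        return self.effectiveness / self.effort


# Hypothetical possible solutions for an invented valve incident.
possible = [
    Solution("Redesign valve assembly", effectiveness=9, effort=9),
    Solution("Add valve check to startup checklist", effectiveness=6, effort=2),
    Solution("Retrain all operators", effectiveness=3, effort=6),
]

# Highest score first: the most effective solutions for the least effort.
ranked = sorted(possible, key=lambda s: s.score, reverse=True)
for s in ranked:
    print(f"{s.score:.1f}  {s.description}")
```

A ratio is only one way to combine the two measures. A group could just as well plot effectiveness against effort and pick from the high-effectiveness, low-effort corner; the point is that the selection criteria are written down before owners and due dates are assigned.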
Organizing the Investigation - 3 Steps & Tools
In the process of defining the problem and determining why it happened (analysis) there are some other tools that prove very helpful and should be part of the incident documentation. Defining the failure and its impact on the overall goals in Step 1 is a very specific set of questions that typically takes less than 5 minutes. During or immediately after defining the problem people may begin offering additional information about the failure – in no particular order. People may offer causes, sequence of events, a process step that was skipped or they may draw a picture describing the layout or part.
Regardless of what people offer it should be captured in the appropriate tool. Some information will be in both the timeline and the cause-and-effect analysis. A diagram may contain a drawing of the part. The timeline may contain some history about the part and when it failed. The cause-and-effect analysis will contain the causes of why the part failed. The facilitator’s role is to keep the group focused on the three basic questions and organize all the information into its appropriate location(s). Following are some notes about the three investigation tools.
- Capture the Timeline
A timeline defines the chronological order of occurrences for a given issue. A timeline is also referred to as a sequence of events. The simplest way to create a timeline is in a table format with the headers Date, Time and Description. Each entry on a timeline corresponds to a specific date and time. The entries on a timeline are much easier to collect and read if they are captured as short phrases instead of complete sentences. Long entries are easier to read if they are broken out sequentially and entered below the previous entry.
The timeline shows what happened at a specific date and time, but it does not explain why it happened. A timeline is dependent on time while a cause-and-effect analysis is dependent on causes (why questions). The timeline entry may be “9:05AM, Valve opened”, but the causes of why the valve was opened are located in the cause-and-effect analysis. The timeline is always a vertical table of information while the cause-and-effect analysis (ThinkReliability Cause Map) branches out in different directions.
Larger issues always have a timeline. The background information can also be added to the timeline instead of written as a separate paragraph. Many companies include a background write-up, but the timeline is a simpler format that makes updates and edits much easier to do. The time scale on a timeline can be years, days, hours, minutes or seconds. The time scale can also change throughout the timeline as long as each entry is placed in the proper chronological order.
A timeline can be a very effective tool in an investigation, but it’s not needed every time. The timeline complements a thorough cause-and-effect analysis; it doesn’t replace it. Many organizations consider a timeline the analysis of the failure – the timeline is their “investigation tool.” Simply identifying the sequence of events doesn’t explain the cause-and-effect relationships which are fundamental to a complete failure analysis.
When investigating a problem, a timeline will always have a corresponding cause-and-effect analysis.
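The timeline table described above (headers Date, Time and Description, entries in chronological order) can be sketched in a few lines of Python. The entries and the `timeline_rows` helper are hypothetical and serve only to show the format; the “Valve opened” entry echoes the example in the text.

```python
from datetime import datetime

# Hypothetical entries, captured as short phrases in the order people offered them.
entries = [
    (datetime(2024, 3, 3, 9, 5), "Valve opened"),
    (datetime(2019, 6, 1, 0, 0), "Pump installed"),  # background history
    (datetime(2024, 3, 3, 9, 7), "Seal leak observed"),
]


def timeline_rows(entries):
    """Return (Date, Time, Description) rows in chronological order."""
    rows = []
    for when, description in sorted(entries, key=lambda e: e[0]):
        rows.append((when.strftime("%Y-%m-%d"), when.strftime("%H:%M"), description))
    return rows


print(f"{'Date':<12}{'Time':<8}Description")
for date, time, description in timeline_rows(entries):
    print(f"{date:<12}{time:<8}{description}")
```

The time scale jumps from years to minutes between the first and second rows, which is fine as long as the order is chronological. The table records what happened and when; the reasons why belong in the cause-and-effect analysis.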
- Use Diagrams, Drawings & Photos
Visual tools such as diagrams, drawings, sketches and photographs give everyone a common view of the issue. Without these, everyone has their own mental picture of the failure. A simple sketch on paper or a dry erase board immediately provides the group with a picture that everyone can edit, improve, point at and comment on. Don’t overlook the importance of a simple sketch. People are sometimes concerned about their artwork, but even a simple sketch can significantly improve the information exchange. The more detail that’s added to a drawing, the more specific the discussion can be. Mechanical drawings and diagrams from manuals or the original equipment manufacturers are also important during the investigation to improve on the accuracy of a sketch.
Photographs can also be very helpful because they can accurately capture consequences of the failure. Photographs provide a huge amount of context and detail in an investigation. A picture can significantly improve the amount of information conveyed about a failure. Digital cameras allow people to take plenty of pictures so that the most relevant can be selected. Digital photos can also be added directly to the electronic record as the investigation progresses.
- Review the Processes
Identifying the processes that were in place before the failure occurred is extremely important in order to prevent the incident from occurring again. A recurring problem is symptomatic of not implementing solutions within the processes that created the failure. A thorough investigation must include a review of the processes that produced the failure. Similarly, a mechanic must know how the processes within a transmission work in order to explain how the transmission failed. During the investigation, a clear understanding of the current work process helps explain what specifically led up to the failure. The process needs to be understood so that specific improvements can be made within the process to ensure that the failure doesn’t happen again. All of the action items (solutions) from the investigation are implemented in work processes upstream of where the problem(s) occurred.
A Complete Investigation
The ultimate output of an investigation is the implementation of action items to prevent the failure from occurring. Determining the action items occurs in Step 3, “What should be done?” but the implementation of the action items happens after the investigation is completed. The documentation of the three questions and the additional tools should occur as the investigation is taking place. Capturing all the pieces of information as the investigation proceeds makes for a quicker, clearer, more organized and ultimately more effective investigation. Consistent application of these tools helps develop a culture that not only investigates problems well, but also prevents them in the first place.
For a free copy of the Cause Mapping Investigation Template in Microsoft Excel, which contains a tab for each of the steps and tools covered in this paper, visit our web site at www.ThinkReliability.com