Uptime Elements Root Cause Analysis

Reliabilityweb.com

Uptime Elements’ root cause analysis (Rca) is an overarching term that incorporates a variety of methods and tools to anticipate potential issues, factors, deficiencies or events that could negatively affect the enterprise, its value stream, the safety of its employees or compliance with government regulations.

Root cause analysis is part of a more general problem-prevention and problem-solving process and an integral part of risk management [RI], reliability [RE], loss elimination [OPX] and continuous improvement [CI].

Because of this, root cause analysis is one of the core building blocks in an enterprise’s sustainability and continuous improvement efforts. It is important to note that root cause analysis in itself will not produce any results; it must be made part of a larger problem-solving effort for quality improvement and the knowledge gained from it converted into actions.


Figure: Uptime Elements Reliability Framework and Asset Management System (Reliabilityweb.com)


There are many misconceptions surrounding root cause analysis that have significantly limited the value that this process could offer.

The truths about RCA include:

  • RCA is intended to prevent issues, or events that could lead to a problem, not to do a post-mortem after the fact,
  • It can be used to evaluate business and work processes, procedures and practices, as well as physical or capital assets, to determine inherent risks and losses,
  • All causal factors and forcing functions should be clearly identified, not just the presumed root cause, and
  • A formal RCA is not necessary on everything, but the logic and thought processes are viable even on simple problem-solving exercises.


The Root Cause Analysis process

Figure 1 below provides a high-level view of the RCA process. The primary driver for these analyses is to manage risk and losses and to sustain reliability, but additional requests can be, and often are, made when there is a suspicion that a value stream or work process may be deviating from optimal condition. Regardless of the initiating source, the execution steps are the same.



Figure 1: Uptime Elements Root Cause Analysis (Reliabilityweb.com)



RCA for Reliability Engineering

A top-level evaluation can be, and often is, completed by a reliability engineer. When a full, formal RCA is required, however, a temporary cross-functional team composed of the reliability engineer, the stakeholders associated with the target system or process, and subject matter experts is needed to provide a broader perspective. These teams are assigned to a specific process, business or manufacturing, and given one to three weeks to complete their tasks. Team members are assigned specific tasks, such as gathering drawings, specifications, etc., to expedite the timeline. The team meets for brief updates and problem-solving sessions throughout the evaluation. Problem-solving sessions should be limited to no more than four hours per session to achieve maximum benefits from participants. This is the maximum duration for high-intensity problem-solving activities. After four hours, mental fatigue begins to diminish productive input.

Detailed Root Cause Analysis Process

Root cause analysis (RCA) is an integral part of the enterprise’s reliability, risk management and loss elimination process. In the normal sequence of tasks in these three crucial areas, it is performed on those value stream and infrastructure systems that are deemed critical by the criticality analysis. The normal progression is to start with the most critical and then proceed in descending order until all assets have been evaluated.

The intent of an RCA is to identify incipient issues that might exist in the processes, procedures and practices, as well as the capital assets, that could present a risk, loss or failure that would affect the enterprise or its value stream. Root cause analysis is not limited to capital or physical assets.



It is just as effective when used to evaluate a business or work process, such as planning or scheduling, as when evaluating a tangible asset.


Figure 2: Detailed Root Cause Analysis Process (Reliabilityweb.com)

  1. Select system or asset: Assets are selected based on the relative criticality ranking as determined by the Criticality Analysis (Ca). Business and work processes are selected based on their importance or because someone in the organization suspects a problem with them.
  2. Collect information: For physical asset evaluation, this information includes functional specifications, vendor or as-delivered specifications, installation drawings, commissioning documentation, and operating and maintenance procedures. For business and work processes, it includes value stream and process maps and a detailed description of the expected deliverables from the process.
  3. Design review: This should be the first step in both RCA and RCFA and provides a baseline or benchmark reference that is invaluable for the remaining steps in the evaluation. The design review includes, but is not limited to, mass-balance; single-point failures; materials selection; component selection and sizing; input boundary conditions; output boundary conditions; and design operating envelope.
  4. Identify inherent issues: In most cases, the design review will identify inherent issues that may singularly or in combination result in a risk, loss or reliability issue. Each of these should be clearly identified and recorded.
  5. Quantify risks and issues: The design issues identified in the preceding step are quantified to better define their severity and the probability that they will occur at some point in the future.
  6. Application review: A thorough evaluation of the application is the next step in the process. Based on numerous studies, an average of 27% of asset failures are caused by misapplication, i.e., the assets are operated outside their design envelope.
  7. Identify gaps and issues: The application review mirrors the design review, and a gap analysis is performed to identify and quantify the differences, if any, between the design and the application.
  8. Prepare risk assessment: This task pulls together the knowledge gained in the design-application evaluations, evaluates the potential risks and develops a risk assessment that quantifies the cumulative risk and probability of occurrence.
  9. Determine corrective actions: A list of corrective actions that address each of the issues identified in the preceding steps is compiled and ranked by relative risks.
  10. Prepare cost-benefit analysis: A cost-benefit analysis is required before the risk analysis can be submitted to executive management for consideration and approval. The analysis should be developed on a line-item basis so that each recommendation can be evaluated independently of the total.
  11. Submit recommendations for approval: The risk assessment, recommended corrective actions and cost-benefit analysis are submitted to the executive leadership team for approval.
  12. Execute MOC process: If approved by the executive leadership team, the proposed changes must go through the Management of Change (MOC) process for an engineering and financial review and to ensure that all changes are fully documented, BOMs and drawings are changed, and any skills and/or training requirements are met.
  13. Implement changes: The proposed changes are implemented, preferably in a test or control area, to confirm they will achieve the desired results without creating other issues.
  14. Track changes and results: The RCA is not complete until the desired results have been achieved for any changes made. This step tracks the results as well as monitors for adverse impacts of the changes.
  15. Update MOC documents: If the changes generated the expected results without side effects, all MOC documents, e.g. drawings, BOMs, work orders, etc., are updated and the results annotated. If expectations are not met, the project is routed back to RCA-9.
  16. Institutionalize change: The changes should once again be evaluated to determine their potential in other areas of the organization. If they apply elsewhere, the changes follow the established MOC process for implementation.

Root cause analysis (RCA) and root cause failure analysis (RCFA) are disciplined, step-by-step methodologies that lead to the discovery of the prime cause or causes (the root cause) of a potential (RCA) or actual (RCFA) failure, product quality problem, regulatory compliance problem or other issue that impacts performance, cost or reliability. Compare the overviews of RCA (Figure 1) and RCFA (Figure 2).
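
Steps 5, 9 and 10 of the detailed process (quantify risks, rank corrective actions by relative risk, and prepare a line-item cost-benefit analysis) lend themselves to a small worked example. The Python sketch below is illustrative only: the issue names, the 1-to-5 severity scale, the probabilities and the cost and benefit figures are all hypothetical, and scoring risk as severity multiplied by probability is one common convention assumed here, not a method prescribed by this article.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """One inherent issue found during the reviews (hypothetical data)."""
    name: str
    severity: int          # assumed scale: 1 (minor) to 5 (severe)
    probability: float     # assumed annual probability of occurrence, 0.0 to 1.0
    fix_cost: float        # estimated cost of the corrective action
    annual_benefit: float  # estimated annual loss avoided if corrected

    @property
    def risk_score(self) -> float:
        # Assumed convention: risk = severity x probability
        return self.severity * self.probability

    @property
    def benefit_cost_ratio(self) -> float:
        return self.annual_benefit / self.fix_cost


def rank_by_risk(issues):
    """Order corrective actions from highest to lowest relative risk."""
    return sorted(issues, key=lambda i: i.risk_score, reverse=True)


issues = [
    Issue("Pump operated below minimum flow", severity=4, probability=0.5,
          fix_cost=20_000, annual_benefit=60_000),
    Issue("Single-point failure in lube system", severity=5, probability=0.1,
          fix_cost=50_000, annual_benefit=40_000),
    Issue("Undersized coupling", severity=2, probability=0.3,
          fix_cost=5_000, annual_benefit=10_000),
]

# Report each recommendation as its own line item so that it can be
# evaluated independently of the total.
for issue in rank_by_risk(issues):
    print(f"{issue.name}: risk={issue.risk_score:.2f}, "
          f"benefit/cost={issue.benefit_cost_ratio:.2f}")
```

Keeping the risk score and the benefit-to-cost ratio as separate columns, rather than folding them into one number, lets the leadership team see when a high-risk item is also expensive to fix.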

Effective Use of Root Cause Analysis

Root cause analysis usually cannot be performed sitting in a conference room, office, or in front of a computer. While the process does require working group sessions, as well as individual and group interviews, the heart of the process is gathering factual data that can be used to isolate, identify, and quantify the real reason or reasons that could result in the abnormal behavior of the system or process that is being investigated. To do this, the investigator or team must roll up their sleeves and get dirty.

The RCA process requires hands-on interviews, inspections, testing, and evaluations that can only be done in the plant or in the field. Theoretical evaluations have their place, but to use the RCA process effectively, the investigators must clearly understand the design and operating dynamics of the investigated system and confirm any and all factors, assumptions or hypotheses that may be offered by those involved in the event being investigated.

Effective use of RCA requires discipline and consistency. Each investigation must be thorough and each of the steps defined in the process must be followed. Perhaps the most difficult part of the analysis is separating fact from fiction. Human nature dictates that everyone involved with a business or work process or with an asset-centric system is conditioned by their experience, and their natural tendency is to filter input data based on this conditioning. This includes the investigator. This filtering often produces preconceived ideas and perceptions that destroy the effectiveness of the RCA process.

It is important for the investigator or investigating team to put aside their perceptions, base the analysis on pure fact, and not assume anything. Any assumptions that enter the analysis process through interviews and other data-gathering processes should be clearly stated. If the assumptions cannot be confirmed or proven, they must be discarded.

The practice of RCA is predicated on the belief that problems are best prevented by correcting them or eliminating the causal factors that lead up to them before they happen.

Despite the seeming disparity in purpose and definition among the various types of root cause analysis, there are some general principles that can be considered universal.

Root cause failure analysis is not a single, defined methodology

There are several types or philosophies of RCA in existence. Most of these can be classified into four very broadly defined categories based on their field of application: safety-based, production-based, process-based and asset failure-based.

  1. Safety-based RCA is performed to find causes of accidents related to occupational safety, health, and environment.
  2. Product or production-based RCA is performed to identify causes of poor quality, production and other problems in manufacturing related to the product.
  3. Process-based RCA is performed to identify causes of problems related to processes, including business systems.
  4. Asset failure-based RCA is performed for failure analysis of assets or systems in engineering and the maintenance area.


The RCA process involves eight steps:


  1. Define the problem: Sounds simple but rarely is. Lack of accurate data and lapses in the memory of plant personnel often make it impossible to determine the true problem. In most cases, the best available information will identify the symptoms but rarely the problem.
  2. Collect data and evidence: Opinions are great but of no value in an RCFA investigation. Accurate data and other evidence that confirms the problem are not only helpful but needed to isolate the true cause of the problem.
  3. Compare design vs. application data: Statistically, 27% of asset-related reliability problems are caused by operating the asset outside of the designed operating envelope. In many cases, the root cause of a problem can be resolved through a simple comparison.
  4. Identify possible causal factors: Causal factors are not the actual trigger of an event or incident but are actions or inactions that singularly or in combination lead to the root cause. An Ishikawa or Cause-and-Effect diagram (also called a Fishbone Diagram) is ideal for this purpose.
  5. Isolate the cause and causal factors: All steps up to this point lead to this one. Isolating the true root cause from all the possible, even probable, causal factors and root causes is never straightforward or easy. In most cases, a final decision will depend on testing a series of hypotheses and, by a process of elimination, deriving the most likely cause.
  6. Develop solutions and recommendations: The purpose of the RCFA is to solve and eliminate an event, incident or failure. Developing a viable, cost-effective solution requires a concentrated effort that may require input from QA, Regulatory Compliance and certainly stakeholders. Any final solution must be compliant with management of change (MOC) procedures.
  7. Implement the recommendations: Once the recommended solution has been fully vetted by the MOC process and approved by the executive leadership team, the change can be implemented. The preferred approach is implementation on a small scale to test and verify the solution before any widespread implementation.
  8. Track the recommended solutions to ensure effectiveness: A key part of the preceding steps is development and inclusion of specific performance indicators that will quantify the effectiveness of the solution. Installation of the corrective actions should be closely monitored and evaluated to assure the desired results have been achieved.
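
The "simple comparison" in step 3 can be sketched as a check of field readings against the design operating envelope. The following Python example is a minimal illustration under stated assumptions: the parameter names, the design limits and the observed values are all hypothetical, and a real envelope would come from the functional and vendor specifications gathered during data collection.

```python
# Hypothetical design operating envelope: parameter -> (minimum, maximum)
DESIGN_ENVELOPE = {
    "flow_m3_per_h": (80.0, 120.0),
    "suction_pressure_bar": (1.5, 4.0),
    "fluid_temp_c": (10.0, 60.0),
}


def find_gaps(envelope, observed):
    """Return the parameters whose observed value lies outside the design range.

    Each gap is reported as (parameter, observed value, (minimum, maximum)).
    Parameters with no observation are skipped rather than guessed.
    """
    gaps = []
    for name, (low, high) in envelope.items():
        value = observed.get(name)
        if value is None:
            continue
        if not low <= value <= high:
            gaps.append((name, value, (low, high)))
    return gaps


# Hypothetical as-operated readings gathered in the field
observed = {
    "flow_m3_per_h": 65.0,
    "suction_pressure_bar": 2.2,
    "fluid_temp_c": 71.0,
}

for name, value, (low, high) in find_gaps(DESIGN_ENVELOPE, observed):
    print(f"{name}: observed {value}, design range {low} to {high}")
```

A gap found this way is a candidate causal factor, not a confirmed root cause; it still has to be tested against the evidence in the later steps.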


Root Cause Analysis Tools

  • Design Review: All designs have some inherent issues that singularly or in combination may cause an issue, event or failure at some point in the future. The design review evaluates the design for these issues and verifies the operating envelope for the asset or system. This envelope defines the boundary conditions that are acceptable for asset reliability and sustainability. The result provides the foundation for all other RCA and RCFA methods.
  • Application Review: Similar to the design review but focused on the installation and boundary conditions of the application. A gap analysis that compares design to application boundary conditions will resolve a significant percentage of asset-related issues.
  • 5 Whys Analysis: A problem-solving technique for discovering the root cause of a problem. This technique helps users to get to the root of the problem quickly by simply asking “why” several times until the root cause becomes evident.
  • Barrier Analysis: An investigation or design method that involves the tracing of pathways by which a target is adversely affected by a hazard, including the identification of any failed or missing countermeasures that could or should have prevented the undesired effect(s).
  • Fault Tree Analysis: An investigation and analysis technique used to record and display, in a logical, tree-structured hierarchy, all the actions and conditions that were necessary and sufficient for a given consequence to have occurred.
  • Cause Mapping: A simple but effective method of analyzing, documenting, communicating, and solving a problem to show how individual cause and effect relationships are interconnected.
  • Cause and Effect Analysis: Also called an Ishikawa or fishbone diagram, it identifies many possible causes for an effect or problem and then sorts ideas into useful categories to help in developing appropriate corrective actions. The design of the diagram looks like the skeleton of a fish, hence the designation “fishbone” diagram.
  • Change Analysis: Looks systematically for possible risk impacts and appropriate risk management strategies in situations where change is occurring. This includes situations in which system configurations are changed, operating practices or policies are revised, new or different activities will be performed, etc.
  • Failure Mode and Effects Analysis (FMEA): A technique to examine an asset, process, or design to determine potential ways it can fail and its potential effects on required functions, and subsequently identify appropriate mitigation tasks for highest priority risks.
  • Fault Tree Analysis: This analysis tool is constructed starting with the final failure or event and progressively tracing each cause that led to the previous cause. This continues until the trail can be traced back no further. Once the fault tree is completed and checked for logical flow, it can be determined which changes would prevent the sequence of causes or events with marked consequences from occurring again.
  • Sequence of Events Analysis: Diagrams each step or action leading up to an incident or event. Each step or action is time-stamped and annotated with any assumptions or contributing factors leading up to the action. The diagram extends beyond the trigger incident or event to map actions taken to resolve the event.
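
The tree structure described under Fault Tree Analysis can be sketched in a few lines of Python. This is an illustrative sketch, not a tool from this article: the event names and probabilities are hypothetical, basic events are assumed to be statistically independent, and the tree uses only AND and OR gates.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Event:
    """A basic event (leaf) with an assumed probability of occurrence."""
    name: str
    probability: float


@dataclass
class Gate:
    """An intermediate or top event combining its inputs with AND or OR logic."""
    name: str
    kind: str  # "AND" or "OR"
    inputs: list = field(default_factory=list)


def probability(node) -> float:
    """Probability of a node, assuming all basic events are independent."""
    if isinstance(node, Event):
        return node.probability
    child_probs = [probability(child) for child in node.inputs]
    if node.kind == "AND":
        # All inputs must occur
        return prod(child_probs)
    if node.kind == "OR":
        # At least one input occurs: 1 - P(none occur)
        return 1.0 - prod(1.0 - p for p in child_probs)
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical tree: the pump fails if power is lost OR
# (the primary seal leaks AND the backup seal leaks).
top = Gate("Pump fails", "OR", [
    Event("Loss of power", 0.02),
    Gate("Both seals leak", "AND", [
        Event("Primary seal leaks", 0.10),
        Event("Backup seal leaks", 0.05),
    ]),
])

print(f"P({top.name}) = {probability(top):.4f}")
```

Working from the top event down mirrors the way the tree is constructed: start with the final failure and trace each contributing cause until the trail can be followed no further.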

What Every Reliability Leader Should Know

The terms root cause analysis (RCA) and root cause failure analysis (RCFA) are incorrectly used interchangeably, causing much confusion whenever they are discussed. The latter, RCFA, is more commonly used but provides fewer benefits:

  • Root cause analysis (RCA): RCA is performed to prevent the possibility of an incident, event or failure from occurring. Much as a design failure modes and effects analysis (DFMEA) is used to remove potential or inherent deficiencies during the design process, RCA is used throughout the Operate/Maintain stages to anticipate and prevent causal factors that would or could result in a problem.
  • Root cause failure analysis (RCFA): RCFA uses many of the same techniques and tools as RCA, but it is performed only after an incident, event or failure has occurred. Its purpose is to prevent a recurrence of the same incident, event or failure, and in most cases it does not consider other issues inherent in the asset.

Because of the time and level of effort involved, many organizations do not perform formal or complete RCA and rarely perform RCFA. Instead, use is limited to RCFA when mandated events, such as injuries, environmental excursions, or other regulatory compliance violations, force them to. In these events, a full RCFA is mandated by the regulatory agency, and a full, detailed investigation must start within 24 hours following the incident.

Root cause analysis is not a one-size-fits-all methodology.

There are many different tools, processes, and philosophies for accomplishing RCA. As an optimization tool, RCA is exceptionally valuable. Its framework and analysis methodologies, such as Ishikawa diagrams and PFMEA, are ideally suited to identifying inherent causal factors or forcing functions caused by improper design, operation, or maintenance. With this prior knowledge, steps can be taken to prevent the factors, sustain the asset portfolio’s reliability and performance, and extend its useful life.

Organizations must continually improve processes, reduce costs, and cut waste to remain competitive. To improve any process, each failure or problem, including potential failures, needs to be analyzed using tools and techniques for developing and implementing corrective actions. A variety of methods, techniques and tools are available, ranging from a simple checklist to sophisticated modeling software. They can be used effectively to lead us to appropriate corrective actions. Applying continuous improvement tools can optimize work processes and help any organization improve its results, regardless of the size or type of business environment.

RCA introduces organizational improvements in many situations, produces lasting improvements and, most importantly, provides a learning process for thoroughly understanding relationships, causes and effects, and solutions. By practicing RCA, we avoid acting on merely possible causes and delay a response until the last responsible moment, when the actual root cause of an effect has been identified.

Key Points:

  • RCA is a problem prevention methodology; RCFA is a problem-solving method.
  • Both use a step-by-step methodology that leads to the discovery of the root cause and causal factors of a potential or actual incident, event, or failure. In many instances the causal factors, singularly or in combination, result in other issues or problems that are also preventable.
  • The primary purpose of performing an RCA is to analyze potential problems or events to identify:
      • What can be done to prevent degradation from design operating parameters?
      • How can the causal factors or forcing functions be eliminated?
  • RCA can help to transform a reactive culture into a forward-looking culture that solves problems before they occur.
  • To be effective, RCA and RCFA must be performed systematically; a cross-functional team effort is required with a fixed time limit assigned.


Terrence O'Hanlon

Terrence O’Hanlon, CMRP, and CEO of Reliabilityweb.com® and Publisher for Uptime® Magazine, is an asset management leader, specializing in reliability and operational excellence. He is a popular keynote presenter and is the coauthor of the book, 10 Rights of Asset Management: Achieve Reliability, Asset Performance and Operational Excellence. www.reliabilityweb.com

R. Keith Mobley, CMRP, MBB
