A Risk Threshold Investigation Is Born

The risk threshold investigation (RTI) approach was developed by Anthony (Mac) Smith to support the Metropolitan Sewer District of Greater Cincinnati Wastewater Treatment (MSDGC WWT) Division in its quest to improve reliability. This article takes you through the birth of RTI, explaining what RTI is, why it was created, where it fits into a maintenance reliability program strategy, how it is conducted and showing it in action with some actual results.

Introducing RTI

The thought process of RTI began with this question:

“Are there some potential equipment problems or failures in systems that, individually, could be quite serious even though a system, as a whole, is 'well-behaved'?”

Well-behaved systems (i.e., 20/80) typically exhibit a relatively small number of serious problems. Troublesome systems (i.e., 80/20) have, by contrast, many more components, perhaps 50 percent or higher, in their boundaries that could be classified as contributing to showstopper problems. The purpose of RTI is to ensure that while reliability-centered maintenance (RCM) is being conducted on troublesome systems, some serious problems are not being overlooked in relatively well-behaved systems that may never be subjected to any form of RCM analysis.

The RTI process is conducted at the system level, not the component level. This is because it is already known that the 20/80 systems have a failure history that is relatively benign. This is not so with the 80/20 systems, which have a history of “eating our lunch.” The 20/80 systems needed a simple way to “find the needle in the haystack.” Thus, RTI was born.

RTI came to be because the criticality approach was found wanting in several aspects. In one component-based criticality study, a component could collect enough points from its 23 criticality criteria questions to be considered “critical” by its definitions but would not necessarily be a “risk” because the consequences of its failure to the plant are relatively benign. The majority of consequences of component failures are not really risks in the true sense of the word. The idea is to concentrate on identifying the problems that could really be showstoppers and try to avoid them.

In RTI analysis, only seven consequence areas are considered: safety, environment, downtime, operations, regulatory requirement, single point failure and economics. The RTI approach relies on the experience of system expert maintenance technicians, operators and reliability engineers to consider if any specific problem in the 20/80 system boundary could be a risk in one or more of the seven designated consequence areas.

The basic idea is to qualitatively identify potential problems or failures in the 20/80 systems that could realistically occur and result in a problem in one or more of the seven consequence areas and also cross a threshold that is judged to be sufficiently severe to warrant special attention to mitigate or eliminate risk.

How RTI Fits Into the R&M Strategy

In 2011, the MSDGC WWT Division formulated a reliability and maintenance (R & M) strategy and formed teams to improve three key areas:

  • Reliability engineering, with special emphasis on environmental protection and risk reduction using RCM, experience-centered maintenance (ECM), risk threshold investigation (RTI), root cause analysis (RCA) and defect elimination (DE);
  • Planning and scheduling of maintenance work, with emphasis on increasing proactive and decreasing reactive maintenance;
  • Monitoring asset condition, with a focus on internalizing and expanding coverage to all applicable assets using as many predictive, condition monitoring technologies and related capabilities (e.g., machinery laser alignment) as cost-effectively feasible.

The reliability engineering element of the overall strategy is depicted in Figure 1.¹

Let’s take a look at each best practice noted in Figure 1.

Figure 1: The reliability engineering portion of the R & M strategy

Pareto Analysis to Identify Systems for RCM, ECM and RTI Analyses – The bar chart at the top depicts the number of failures in the top 10 systems (identified by code numbers) in the largest of seven MSDGC WWT plants. This is done to direct action first on the most troublesome systems. Variations on this approach to rank the systems employ cost analysis, maintenance labor hours, downtime hours, or other measures affecting the organization’s goals and objectives.

The highest bars to the left of the graph, called a Pareto chart, identify the 80/20 systems or bad actors. They represent the roughly 20 to 30 percent of systems that produce roughly 70 to 80 percent of failures. Those bars for systems to the right of the graph show fewer failures and are called 20/80 systems. The majority of those systems, roughly 70 to 80 percent, produce only 20 to 30 percent of the failures.²,³
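As a rough illustration of the ranking described above, the Python sketch below sorts systems by failure count and flags as bad actors the systems needed to reach a chosen share of all failures. The system codes, the failure counts and the 80 percent cut-off are illustrative assumptions, not MSDGC WWT data.

```python
from typing import NamedTuple


class RankedSystem(NamedTuple):
    code: str                # system code number (hypothetical)
    failures: int            # failure count over the review period
    cumulative_share: float  # share of all failures up to and including this system
    bad_actor: bool          # True for the 80/20 systems


def pareto_rank(failure_counts: dict[str, int], cutoff: float = 0.8) -> list[RankedSystem]:
    """Rank systems from most to fewest failures and flag the bad actors."""
    total = sum(failure_counts.values())
    if total <= 0:
        raise ValueError("at least one failure is needed to rank systems")
    ranked = []
    running = 0
    ordered = sorted(failure_counts.items(), key=lambda item: item[1], reverse=True)
    for code, count in ordered:
        # A system is a bad actor if the systems ranked ahead of it
        # have not yet accounted for the cut-off share of failures.
        bad_actor = running / total < cutoff
        running += count
        ranked.append(RankedSystem(code, count, running / total, bad_actor))
    return ranked


# Hypothetical failure counts for ten systems (assumed data).
counts = {"S01": 40, "S02": 30, "S03": 10, "S04": 8, "S05": 5,
          "S06": 3, "S07": 2, "S08": 1, "S09": 1, "S10": 0}
ranking = pareto_rank(counts)
bad_actors = [s.code for s in ranking if s.bad_actor]
print(bad_actors)  # ['S01', 'S02', 'S03']
```

In this made-up data set, three of the ten systems account for 80 percent of the failures; the remaining seven would be the 20/80 systems. The same function could rank by cost, maintenance labor hours or downtime hours by passing those figures in place of failure counts.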

RCM Analysis for Bad Actor Systems – MSDGC WWT personnel chose classical RCM as the methodology for bad actor systems. They are the roughly 20 percent of all WWT systems where 80 percent of the problems affecting environment, safety and overall plant performance and cost occur. Classical RCM analysis on a major system may take two to three weeks to perform using an experienced facilitator (i.e., a consultant specializing in RCM until an employee gains enough experience to perform the task) and a team of in-house subject matter experts in maintenance, operations and reliability.

ECM Analysis⁴ – ECM was adopted for the better to well-behaved systems (i.e., the 20/80 systems where only about 20 percent of all problems develop). This was done to minimize the impact on MSDGC WWT personnel engaged in the overall reliability improvement initiative and to reduce the financial impact of the initiative on the plant’s budget. Each system requires a quick look, about two to four days, by an in-house, multidiscipline analysis team, led by a specialist in facilitation, that answers the following questions:

  1. Are current tasks (if any) performed on the system really worth it in terms of applicability (i.e., they work and actually do or can find failure modes) and is each task cost-effective relative to some alternative, such as allowing the asset to run to failure?
  2. Could any of the corrective maintenance events performed on the system in the past five years (more or less) have been avoided or mitigated if a proper or applicable preventive or predictive maintenance task had been in place?
  3. Can the team hypothesize any failure modes not already covered in the first two questions that could potentially produce severe consequences, such as affecting safety or causing substantial downtime and/or a forced outage for maintenance?

RTI for 20/80 Systems That May Never Be Subject to RCM – MSDGC WWT personnel conduct a short RTI analysis, taking one day or less, on the well-behaved 20/80 systems. Under some circumstances of unacceptable risk due to two or more risk factors, RTI may trigger a more in-depth look using ECM analysis. RTI results are used to prioritize the systems identified for ECM analysis.
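The way these three analyses divide up the systems can be summarized in a few lines of Python. This is a simplified sketch: the function name is hypothetical, and treating two or more risk factors as a fixed trigger for ECM is an assumption, since RTI is described only as something that may trigger ECM under some circumstances.

```python
def select_analyses(is_bad_actor: bool, unacceptable_risk_factors: int = 0) -> list[str]:
    """Return the analyses a system would receive under this simplified rule."""
    if unacceptable_risk_factors < 0:
        raise ValueError("risk factor count cannot be negative")
    if is_bad_actor:
        # 80/20 systems go straight to classical RCM.
        return ["RCM"]
    # 20/80 systems get the short RTI screen first.
    plan = ["RTI"]
    if unacceptable_risk_factors >= 2:
        # Assumed trigger: two or more unacceptable risk factors found by RTI.
        plan.append("ECM")
    return plan


print(select_analyses(is_bad_actor=True))                                # ['RCM']
print(select_analyses(is_bad_actor=False, unacceptable_risk_factors=0))  # ['RTI']
print(select_analyses(is_bad_actor=False, unacceptable_risk_factors=2))  # ['RTI', 'ECM']
```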

RCA for All Systems When There Is Uncertainty – All plant systems are subject to root cause analysis when problems occur that demand attention to mitigate or eliminate them and there is uncertainty about all the contributing causes. MSDGC chose the cause and effect approach to RCA. Courses on the methodology were held for prospective investigation team participants and facilitators.⁵

RCA is needed because RCM and ECM can’t be expected to identify all the contributing and/or latent factors that result in failures occurring in a system’s life cycle, especially those involving humans. Poorly documented maintenance processes, procedures and work instructions all can be contributing causes to failures even after the most thorough RCM or ECM analyses are completed. New personnel lacking proper training and experience who are rushed into a crisis situation can overlook critical steps in completing repairs, resulting in infant failures shortly after resumption of operations.⁶ Other causes that contribute to a failure include changes in replacement parts, lubricants and other items from the organization’s supply chain and changes in operating conditions, system configuration, or practices.

RCM and ECM deal with asset conditions and system configuration as they stand at the time of analysis, so RCA is also needed as a tool for mitigating or eliminating significant, unexpected failures that arise after those analyses are completed. It is likewise needed when neither analysis method has been performed, for whatever reason, and a significant event occurs. RCA investigations may take several days and the report of findings and recommendations for corrective action can be lengthy.

Defect Elimination – This is the last tool applied in the MSDGC WWT Division’s maintenance reliability strategy. The rationale for including DE in addition to RCA is to eliminate known defects caused by aging, wear and tear, careless or poor work habits, changed operating conditions requiring more robust components, or inadequate replacement parts that don’t meet current stress levels present in an asset. DE analysis meetings can typically be completed in a day because they deal with known defects. The report of findings and recommended action(s) is intentionally limited to one page.⁷

RTI Methodology

At the point where it was decided to employ RTI, a total of eight classical RCM analyses had been conducted at MSDGC WWT. This was at a rate of only two analyses per year due to many constraints, including, but not limited to, the flow rate of funds available to hire a facilitator and a lack of available man-hours for in-house subject matter experts to serve on the analysis teams. In addition, the implementation of results from the RCM analyses was lagging. Demands for man-hours to attend to other maintenance reliability initiatives being introduced at the same time, including a different approach to asset condition monitoring, were making life difficult for staff personnel who still had to deal with day-to-day workloads of corrective and preventive maintenance. The desire to account for all systems that might affect wastewater treatment goals and objectives in a way that minimized the impact on personnel and maximized the impact of their efforts led to the decision to look again at how to prioritize staff activity.

The issue relates directly to the subject of risk. Conceptually, risk is thought of in terms of failure (e.g., hardware and/or software) in a plant system and its equipment and is measured in terms of failure probability and failure consequence. Unacceptable risk is considered to be a combination of significant consequence and a realistic chance (i.e., probability) that it could actually happen. Some industries, for example nuclear power generation and petrochemical processing, spend large sums of money to quantitatively define risk factors and then take great care to design and operate plants so as to mitigate the risk of accidents. Many of the same design features are found in the MSDGC and most other WWT plants. These include equipment redundancies, adequate design margins and backup operating capabilities that can prevent personnel casualties, spills and releases of untreated wastewater into the environment.

MSDGC WWT Division personnel had recognized for several years the need to define an effective strategy to eliminate, or at least mitigate, the effects of unexpected failures in the plants and systems they operate. In 2007, the organization employed a component criticality scoring method in an attempt to identify just where such risks and criticalities reside, and to then use that process to specify where selected predictive tasks should be employed to reduce these risks. The component criticality approach used a multidimensional criticality index applied at the component level. Criticality was assessed using a composite weighted score based on the answers to 23 questions about safety, environment, maintenance and operations consequences of failure. A component’s score was then used to assign predictive maintenance (PdM, now also called asset condition monitoring or ACM) strategies from a consultant-provided library. This produced some benefits, as evidenced in Figure 2 by an increase in proactive maintenance for years 2008 and 2009. Proactive maintenance increased from a little over 30 percent to over 40 percent, but then leveled at the higher figure in 2010.
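To make the scoring idea concrete, here is a minimal Python sketch of a composite weighted score. The 23 questions, their weights, the answer scale and the cut-off used in the actual study are not given here, so every number below is a made-up assumption chosen only to show how such a score behaves.

```python
def criticality_score(answers: list[int], weights: list[int]) -> int:
    """Composite weighted score: sum of answer times weight over all questions."""
    if len(answers) != len(weights):
        raise ValueError("one weight is needed per answer")
    return sum(a * w for a, w in zip(answers, weights))


NUM_QUESTIONS = 23              # number of criteria questions in the study
WEIGHTS = [10] * NUM_QUESTIONS  # assumed: equal weights
CRITICAL_CUTOFF = 500           # assumed cut-off for "critical"

# Hypothetical component A: moderate answers (3 on an assumed 0-5 scale) everywhere.
component_a = [3] * NUM_QUESTIONS
# Hypothetical component B: one severe answer (5) and nothing else.
component_b = [5] + [0] * (NUM_QUESTIONS - 1)

score_a = criticality_score(component_a, WEIGHTS)
score_b = criticality_score(component_b, WEIGHTS)
print(score_a, score_a >= CRITICAL_CUTOFF)  # 690 True
print(score_b, score_b >= CRITICAL_CUTOFF)  # 50 False
```

With these invented numbers, component A is labeled critical without any single severe answer, while the one severe consequence on component B barely registers. In this toy scheme, the summed score alone cannot show whether any individual consequence is serious.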

In 2009, management looked into a different process to get at this issue, namely, a reliability program that uses Pareto analysis to identify plant systems (not components) that were the major culprits causing excessive corrective maintenance and system downtime costs, and then applies a combination of methodologies, starting with classical RCM, to define a maintenance strategy to eliminate or mitigate these failures. By the end of 2013, this approach was showing significant additional results, adding more proactive maintenance beyond the gains from the earlier approach based on component criticality. This is illustrated in Figure 2 by the increase in the share of maintenance labor hours that were proactive, from just over 40 percent in 2010 to just over 70 percent in 2013.

Figure 2: Results of applying ACM strategies

So, the MSDGC WWT Division used the classical RCM methodology for the 80/20 systems, identified through Pareto analysis of system failures and maintenance labor costs. Risks for the 20/80 systems are addressed using the RTI approach. RTI results may trigger more detailed analysis, such as ECM. Therefore, the RTI approach is considered to be in the RCM family of methodologies because it addresses systems, their functions and specific failure modes that can defeat those functions or cause other safety or environmental problems.⁸

Failure events requiring corrective action that had not previously been anticipated through the right PM or PdM tasks, along with overall failure events, started to decline, resulting in a reduction in the cost of reactive maintenance in excess of $1.2 million in 2011, $528,000 in 2012 and $752,000 in 2013.⁹

The RTI process utilizes the experience and judgment of a team of experienced technicians and/or first-line maintenance supervisors and an operator to do the analysis. It is facilitated by a specialist in RTI methodology conducting the five-step process shown in Figure 3. These steps mimic, to a certain degree, the analysis employed in classical RCM studies, but stop far short of the comprehensive information collected and the deliberation conducted in classical RCM.

Figure 3: The five risk threshold investigation steps

The five steps of RTI are:

  1. Select the 20/80 system. This can be determined from a Pareto analysis.
  2. For the selected 20/80 system, reach a collective team decision on the boundaries of that system. The purpose of this step is to establish a common team understanding of what is included or excluded during the analysis. Usually, this will be quickly achieved.
  3. List the system functions for the selected 20/80 system. Listing functions helps identify impacts tied to functions with specific consequences worth considering. The functions are listed on a board or flip chart by the analysis facilitator and captured with a camera when complete. This data is used for the RTI report.
  4. List the major components that reside inside the boundary, including instrumentation, if necessary, for control or safety.
  5. The team, using its collective experience, discusses each component inside the boundary to determine whether it had or could have any kind of reasonably possible problem or failure that would lead to one or more of the seven consequences in Table 1. The team tries to answer the question:

Could such a problem or failure be the source of an unacceptable risk in terms of system function?

If the answer is yes, the discussion then turns to whether that consequence would cross an agreed-upon threshold from acceptable to unacceptable and, if so, whether it deserves further attention to eliminate or mitigate the risk before an actual event occurs.
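Step 5 can be pictured as a simple worksheet, sketched here in Python. The seven consequence areas are the ones RTI considers; the class and field names, the component names and the judgments recorded are hypothetical, and the team's qualitative decisions are represented as plain true/false values for illustration only.

```python
from dataclasses import dataclass, field
from enum import Enum


class Consequence(Enum):
    """The seven RTI consequence areas."""
    SAFETY = "safety"
    ENVIRONMENT = "environment"
    DOWNTIME = "downtime"
    OPERATIONS = "operations"
    REGULATORY = "regulatory requirement"
    SINGLE_POINT_FAILURE = "single point failure"
    ECONOMICS = "economics"


@dataclass
class Finding:
    """One team judgment about one component (field names are hypothetical)."""
    component: str
    problem: str
    realistic: bool  # team agrees the problem could reasonably occur
    over_threshold: set[Consequence] = field(default_factory=set)  # areas judged unacceptable

    @property
    def unacceptable(self) -> bool:
        # Unacceptable risk: a realistic chance AND at least one consequence over threshold.
        return self.realistic and bool(self.over_threshold)


def flag_for_attention(findings: list[Finding]) -> list[Finding]:
    """Keep only the findings the team judged to be unacceptable risks."""
    return [f for f in findings if f.unacceptable]


# Hypothetical worksheet entries, not taken from the case study.
worksheet = [
    Finding("Pump P-1", "seal leak", realistic=True),
    Finding("Valve V-2", "fails open", realistic=True,
            over_threshold={Consequence.DOWNTIME, Consequence.ECONOMICS}),
    Finding("Pipe section 3", "rupture", realistic=False,
            over_threshold={Consequence.ENVIRONMENT}),
]
flagged = flag_for_attention(worksheet)
print([f.component for f in flagged])  # ['Valve V-2']
```

Only the entry that is both realistic and over threshold in at least one area is carried forward, which mirrors the two-part question the team works through for each component.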

Table 1 – Failure Criteria Areas Matrix

The identification of problems and/or failures and their potential to produce one or more severe consequences is best done by tapping the experience and judgment of ACM/PdM specialists who monitor the system, craft technicians and supervisors who maintain it, and an operator who runs it. Best results will come if the most experienced personnel participate in the analysis.

Actual RTI Results

Case Study of Two MSDGC WWT Systems

System Names:

  • Return Activated Sludge (RAS) at Mill Creek Wastewater Treatment Plant
  • Waste Activated Sludge (WAS) at Mill Creek Wastewater Treatment Plant

Date Conducted: March 19, 2013

Team Members

  • Plant supervisor for liquid stream
  • Mechanical crew leader
  • Instrumentation crew leader
  • Operator
  • RTI specialist and MSDGC WWT reliability engineer (facilitators)

System Boundary

  • Starts With: Thin wall overflow out of aeration tank
  • Ends With: Return pipe at entrance to aeration tank
  • WAS: Output of pumps to secondary thickening and eventually the incinerator

System Functions

  1. Proper return of activated sludge to the aeration tanks
  2. Proper wasting of activated sludge to secondary thickening
  3. Flow signals to programmable logic controller (PLC)

Table 2 provides the results of this analysis.

Table 2 – Return Activated Sludge (RAS) and Waste Activated Sludge (WAS) Systems Analysis

System Results

  • The team spent about four hours developing the information summarized in Table 2.
  • From a risk point of view, two unacceptable problems were identified in this system: The WAS discharge valve with medium criticality and the RAS discharge piping to aeration. Both could result in prohibitive costs and system and/or plant downtime.
  • A third potentially unacceptable problem could develop in the manual flow control valves (RAS to aeration tank manual flow control valve, medium criticality index 1200) and bypass valves (low criticality index 450) if a PM task was not initiated to periodically exercise the valves to preclude a failed open condition.

All three problems should be further reviewed by the plant supervisor of maintenance and appropriate technicians, and results reported to the reliability engineer.

Observations of an RTI Team Participant

Eric Stevens, CMRP, CRL

Just like an RCM study, RTI puts all the players into a room and everybody learns information about how the system operates and how it is maintained. Operations and maintenance do not know enough of what the other crafts are doing. Getting everybody together in one room and discussing the functions of the system gets everybody on the same page. At the Mill Creek WWT plant, when we did the RAS system, most of the team did not think that we would find anything that would be a showstopper. We were wrong and found the few issues identified in Table 2. These were addressed in the MSDGC WWT continuous improvement program.

The RTI process is also a great way for a plant to address the 20/80 systems. It is faster and cheaper than doing a full-blown RCM on each system. While a company may want to do a full RCM on every system, that takes time and money. An organization should do an RCM first on the 80/20 systems and in parallel, if possible, do an RTI on the 20/80 systems. Pareto analysis is used to prioritize the systems. RTI quickly points out any assets that could be showstoppers of the entire system. Once the RTI process is done, we could do an ECM study on the systems that had anything that would defeat a function of the process.

After sitting through the criticality process and being asked 23 questions that seemed to focus more on the asset than on the function of the system, I found the RTI process a welcome change. It relates directly to system functions, such as safety, and gives more concrete answers as to which assets are really more critical from a risk standpoint.

References

  1. A version of the chart in Figure 1 was presented by MSDGC WWT and supporting contractor personnel during a workshop at the 2014 SMRP Annual Conference. The strategy depicted was originated by John Shinn, Jr., P.E., who at the time was Maintenance Manager at MSDGC WWT.
  2. Khan, F.I. “Bad Actor Program.” Uptime magazine, Aug/Sept 2019, pp. 56-60.
  3. Koch, Richard. The 80/20 Principle: The Secret to Achieving More with Less, Fourth Edition. Boston: Nicholas Brealey Publishing Ltd., January 2, 2014.
  4. Hinchcliffe, Glenn R. and Smith, Anthony M. RCM: Gateway to World-Class Maintenance. Waltham: Butterworth-Heinemann, 2003.
  5. The specific cause and effect methodology chosen was that of ThinkReliability, Inc., Houston, Texas. www.thinkreliability.com
  6. Nicholas, Jack. Secrets of Success with Procedures. Fort Myers: Reliabilityweb.com, 2014; See Chapter 7.
  7. Ledet, Winston P., Ledet, Winston J. and Abshire, Sherri M. Don’t Just Fix It, Improve It! A Journey to the Precision Domain. Fort Myers: Reliabilityweb.com, 2009.
  8. Mitchell, John S. Physical Asset Management Handbook, Fourth Edition. Fort Myers: Reliabilityweb.com, 2012.
  9. Cost reductions were included in a handout by MSDGC WWT personnel and supporting contractors, including the authors of this article, for a workshop conducted at the 2013 SMRP Annual Conference.