This article does not present original theorems in mathematics or RCM, but is an attempt to consolidate some existing but distinct concepts relating to maintenance periodicity selection, and to provide some guidance on the best way to apply them.
An RCM analysis should result in the optimum failure-management strategy for a piece of equipment or a system. A failure-management strategy means that the analyst will choose whether to prevent a failure mode (preventive maintenance), modify the equipment to preclude a failure mode (alterative maintenance), or simply allow it to occur and accept the consequences (fix when fail). When the choice is made to prevent the failure mode, a preventive (PM) task is chosen. But the tough question after a task is chosen is how often should it be done?
We will attempt to answer that question.
There are three fundamental types of PM tasks within RCM: tasks performed based solely on time (not condition), tasks based upon condition, and tasks designed to detect a hidden functional failure. Going back to the original DoD RCM manual by Nowlan & Heap, time-based tasks were called Scheduled Discard or Scheduled Rework. These tasks replace or overhaul a component regardless of condition at a specified interval, and are often today called Hard-Time tasks.
In this study we will group these tasks and call them Time-Directed (TD). The TD task can be the most economically punishing, as it doesn't take into account the condition of the component. By replacing an item on a hard-time schedule, we may be giving up useful service life, or we run the risk of the item failing before it is replaced.
When we choose to perform maintenance based upon the as-found condition of the component, we are performing an On-Condition type task, or CD task. CD tasks are typically more economical, if a cost-effective trigger to perform them can be identified.
Finally, when we perform a task to identify whether a functional failure of a component has occurred which is hidden to the operators under normal operations, we are performing what we call a Failure Finding (FF) task. FF tasks are in place to prevent the occurrence of secondary damage resulting from the loss of the hidden function, e.g. testing of a fire alarm, or test starting an emergency generator.
As a result of the RCM logic-tree analysis, when we choose to prevent a failure mode, we will choose to use the TD, CD, or FF (or combination) tasks. Once we have defined our PM tasks, and how they will be performed, we must then decide upon initial PM periodicities or frequencies. This can be a challenging endeavor, because up to this point we have been following a rigorous RCM logic flow. When the RCM analysis is complete, we are not told how to logically derive an initial task periodicity.
One way is to use best engineering judgment or past experience. However, for each type of RCM task, there are mathematical models, based upon statistical distributions, that can be used to "engineer" the initial periodicity. It is important to state that the models carry data requirements and assumptions. It also takes more time and research to derive a periodicity for each task, so the analyst must weigh the importance or risk associated with the PM task and determine whether deriving a periodicity mathematically is worth the effort.
It seems most prudent to use such methods for higher risk or higher cost failure modes and PM tasks. What follows is a description of mathematical models for deriving initial periodicities for each type (TD, CD, FF) of RCM task.
Typically the most costly choice for any PM is the Time-Directed task. Because we are replacing or rebuilding a component based not on its condition, but on a calendar, hourly, or usage basis, we are often giving up useful life for the component. The most practical reason for choosing a TD task is that a CD task is just not applicable or cost-effective. However, RCM tells us that at a minimum there must be a point in the item's life where the conditional probability of failure shows a marked increase, i.e. the object has a "wear-out" age.
Typically simple items wear out at a certain age and complex items and systems break down randomly. If the analyst has good data for the failure mode in question, then the data can be analyzed to determine whether the item has a wear-out age. It is important to note that the wear-out age should correspond to a specific failure mode.
One method that some use to pick a TD maintenance periodicity is Mean Time Between Failures (MTBF). The simple method is to divide the total operating time for an item by the number of failures and thus arrive at the MTBF. The MTBF can then be used to set a maintenance interval. This is an approach that, though simple, invites error. Consider Figure 1.
Three different items (A, B, and C) are placed in service and each is operated for 10,000 hours. Coincidentally, each item fails a total of five times over the 10,000 hours. Thus the MTBF for each item is 10,000 hrs / 5 failures = 2000 hrs. If we treat each item equally because they all have the same MTBF, we are missing out on the failure trends that the data show us.
It can be seen by inspection that each item has a different failure pattern. Item A appears to have a random or non-time-dependent failure rate, certainly not a good candidate for a TD task. The non-time-dependent failure rate is best represented by the exponential failure distribution. Item B appears to have a decreasing failure rate indicative of "wear-in." This makes it an even worse candidate for a TD task because we actually increase the probability of failure for the item after each renewal. The only item that displays the failure pattern conducive to a TD task is item C which shows a significantly higher failure rate near the 10,000 hour point.
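The comparison can be made concrete with a small sketch. The failure times below are hypothetical (the values behind Figure 1 are not given in the text); they are chosen only so that each item fails five times in 10,000 hours. The helper names are illustrative as well.

```python
# Hypothetical failure times in hours; Figure 1's actual values are not given.
OBSERVATION_HOURS = 10_000

failure_times = {
    "A": [1_500, 3_800, 5_200, 7_900, 9_400],  # scattered: no obvious trend
    "B": [300, 900, 1_800, 3_500, 7_000],      # clustered early: wear-in
    "C": [6_000, 8_000, 9_000, 9_600, 9_900],  # clustered late: wear-out
}

def mtbf(times, observation_hours=OBSERVATION_HOURS):
    """Total operating time divided by the number of failures."""
    return observation_hours / len(times)

def gaps(times):
    """Operating hours between successive failures (first gap starts at 0)."""
    previous = [0] + times[:-1]
    return [t - p for t, p in zip(times, previous)]

for item, times in failure_times.items():
    print(item, "MTBF =", mtbf(times), "gaps =", gaps(times))
```

All three items report the same MTBF of 2,000 hours, yet the gaps between failures grow for item B and shrink for item C. That trend is exactly the information a single MTBF figure hides.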
The following illustration (Figure 2) shows graphically how these failure rates appear when plotted as continuous functions.
If the maintainer is thinking about using a time-directed approach to a PM task, he really should be sure the failure pattern supports this decision. To accomplish this, the failure data should be analyzed to determine if there is a trend or pattern in the rate of failure. One way to analyze the failure data for an item is the Weibull method. Weibull analysis has been in use since the 1950s and there are many software packages (some are free or very inexpensive) available that can be used to develop the Weibull plot.
There are also many books and articles written about Weibull that provide a great deal of in-depth information. Weibull is applicable to systems where the component is replaced after failure, rather than repaired in-service. Weibull assumes that the system is as good as new after replacement. If this is not the case other methodologies such as the Power Law Method (developed by AMSAA in the 1970s) are more appropriate.
Weibull is a continuous failure distribution, one of many that can be used to model reliability. Others include the Gaussian (normal), log-normal, and exponential distributions. What makes Weibull popular is that its parameters can change the shape of the distribution to resemble these other distributions. It is important to note, however, that Weibull is intended to model specific failure modes. The inputs to the analysis are, for the specific failure mode, the times to failure, and whether the data timeframe ended before all units failed, i.e. the analysis is suspended at a specific time.
So one must know the size of the population of components under analysis, the length of the analysis timeframe, how many failures occurred, and at what times they occurred. After input of these values into the Weibull software (or on the old-fashioned Weibull graph paper), two distribution parameters will be generated: the shape parameter (β) and the scale parameter (η).
The value of β will tell you what type of failure pattern you have. If β is much less than 1.0, then you have a decreasing failure rate (burn-in) similar to line B in Figure 2. When β is greater than 1.0, there is an increasing rate of failure for the failure mode in question. This corresponds to line C of Figure 2. When β is equal, or nearly equal, to 1.0, then there is no trend.
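For readers without Weibull software, the following is a minimal sketch of one common way to estimate β and η: median rank regression with Bernard's approximation. It is not taken from the article. It assumes complete data (every unit ran to failure, no suspensions), and the failure times and the tolerance used to call β "nearly equal to 1.0" are hypothetical.

```python
import math

def fit_weibull_mrr(failure_times):
    """Estimate (beta, eta) from complete failure data by median rank regression."""
    times = sorted(failure_times)
    n = len(times)
    xs, ys = [], []
    for rank, t in enumerate(times, start=1):
        median_rank = (rank - 0.3) / (n + 0.4)  # Bernard's approximation
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - median_rank)))
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta = sxy / sxx                         # slope of the Weibull plot
    eta = math.exp(x_mean - y_mean / beta)   # from intercept = -beta * ln(eta)
    return beta, eta

def describe_trend(beta, tolerance=0.1):
    """Classify the failure pattern; the tolerance band is an assumption."""
    if beta < 1.0 - tolerance:
        return "decreasing failure rate (wear-in)"
    if beta > 1.0 + tolerance:
        return "increasing failure rate (wear-out)"
    return "no clear trend (random failures)"

# Hypothetical times to failure, in hours, for one failure mode.
hypothetical_hours = [4200, 5600, 6400, 7300, 8100, 9500]
beta, eta = fit_weibull_mrr(hypothetical_hours)
print(f"beta = {beta:.2f}, eta = {eta:.0f} h: {describe_trend(beta)}")
```

Data sets with suspensions need adjusted ranks or a maximum likelihood fit, which dedicated Weibull packages handle.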
The scale parameter (η) in a Weibull distribution is also known as the "characteristic life" of the distribution. It provides a frame of reference: the point at which about 63.2% of failures should have occurred. In this fashion, it could be used as a point for when to perform a TD maintenance task. The higher the value of β, the higher the rate of age degradation is. However, the best way to determine a TD periodicity is to plot the Weibull reliability vs. time (or usage wear) for the failure mode with β > 1.0. Then choose a Pf point, where reliability becomes lower than what is desired, to trigger the maintenance.
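Instead of reading the point off a plot, the same choice can be computed directly from the standard two-parameter Weibull reliability function R(t) = exp(-(t/η)^β). The sketch below is illustrative only: the parameter values (β = 2.5, η = 8,000 hours) and the 90% reliability target are hypothetical, and the function names are not from the article.

```python
import math

def weibull_reliability(t, beta, eta):
    """Probability that the item survives to age t without this failure mode."""
    return math.exp(-((t / eta) ** beta))

def td_interval(target_reliability, beta, eta):
    """Age at which reliability has dropped to the target (inverse of R(t))."""
    return eta * (-math.log(target_reliability)) ** (1.0 / beta)

# Hypothetical wear-out failure mode.
beta, eta = 2.5, 8000.0

interval = td_interval(0.90, beta, eta)
print(f"Replace at about {interval:.0f} h to hold reliability at 0.90")

# At t = eta the fraction failed is 1 - exp(-1), whatever the value of beta.
print(f"Fraction failed at eta: {1 - weibull_reliability(eta, beta, eta):.3f}")
```

The second print shows why η is called the characteristic life: the fraction failed at t = η is always 1 - e^(-1), about 0.632.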
The Power Law Method (PLM) is similar to Weibull, but is designed for systems that break down, are repaired, and are then placed back in service. This continues until the system is either replaced or overhauled. Like Weibull, a β parameter is computed from failure data, and the value of β determines the trend in failures. β values less than 1.0 indicate a wear-in period, values equal or nearly equal to 1.0 indicate no trend, and values greater than 1.0 indicate the item is experiencing wear-out. The PLM methodology can be used to calculate the most economically feasible overhaul frequency for items that wear out. The TD task periodicity is given by:
λ and β are computed values based on failure data and times to fail for the equipment. Calculating these values is beyond the scope of this text, but may be found in the literature. The important point is to use Weibull analysis for failure modes where the component is replaced with a new component upon failure and PLM where the component is repaired and returned to service after repair. Both the Weibull and Power Law methodology are described in detail in literature, some of which are included in the references section at the end of this article.
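As a hedged illustration of how λ and β feed an overhaul interval, the sketch below uses a form that commonly appears in the power law literature. It is an assumption, not necessarily the article's equation. With an expected failure count of λt^β, a cost per in-service repair, and a cost per overhaul, minimizing the average cost per operating hour gives T = [overhaul_cost / (λ(β - 1) · repair_cost)]^(1/β). The cost symbols and all numeric values are hypothetical.

```python
def plm_overhaul_interval(lam, beta, repair_cost, overhaul_cost):
    """Overhaul interval that minimizes average cost per hour (assumed model).

    Assumes expected failures by age t equal lam * t**beta and that an
    overhaul returns the system to as-good-as-new condition.
    """
    if beta <= 1.0:
        raise ValueError("No economic overhaul interval unless beta > 1 (wear-out).")
    return (overhaul_cost / (lam * (beta - 1.0) * repair_cost)) ** (1.0 / beta)

def cost_rate(t, lam, beta, repair_cost, overhaul_cost):
    """Average cost per operating hour when overhauling every t hours."""
    return (repair_cost * lam * t ** beta + overhaul_cost) / t

# Hypothetical inputs: lam and beta would come from a power law fit.
lam, beta = 2e-7, 1.8
repair_cost, overhaul_cost = 5_000.0, 40_000.0

t_opt = plm_overhaul_interval(lam, beta, repair_cost, overhaul_cost)
print(f"Overhaul about every {t_opt:.0f} h, "
      f"cost rate {cost_rate(t_opt, lam, beta, repair_cost, overhaul_cost):.2f} per h")
```

When β is 1.0 or less, the function raises an error rather than return an interval. This mirrors the point above that only items that wear out are candidates for a TD task.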
Failure Finding Tasks:
FF tasks cannot prevent functional failures. They are intended to discover hidden failures that may result in the loss of a protective function, e.g. testing of a fire alarm. Nowlan & Heap described a method to compute the interval for a FF task that relies upon some basic assumptions. First, the failure mode must be time independent and thus follow an exponential failure distribution (like item A in Figure 1). Second, the analyst must know the MTBF for the failure mode. Last, the analyst must input a desired reliability level R (as a percentage) for the equipment. For example, we could require a 95% (0.95) probability that the equipment operates when needed. The equation to use is a variation of the familiar exponential reliability function:

$$R(t) = e^{-\lambda t}$$
where R is the required reliability, λ is equal to 1/MTBF and t is time. For a FF task interval we solve the equation for t (the FF task interval) and thus have:

$$t = -\frac{\ln R}{\lambda} = -\mathrm{MTBF} \cdot \ln R$$
As an example, if we have a switch with an MTBF of 15,000 hours, and a required reliability of 0.95, then solving for t we get a task interval to test the switch of 769 hours, or approximately 32 days of continuous operation. So a FF task to test the switch on a monthly periodicity will give us roughly a 95% probability that it will work when needed.
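The switch example can be checked with a few lines of code. This is a minimal sketch of the exponential relationship described above, solved for t as t = -MTBF · ln(R); the function name is illustrative.

```python
import math

def ff_interval(mtbf_hours, required_reliability):
    """Failure-finding interval from R = exp(-t / MTBF), solved for t."""
    if not 0.0 < required_reliability < 1.0:
        raise ValueError("required_reliability must be between 0 and 1")
    return -mtbf_hours * math.log(required_reliability)

interval = ff_interval(15_000, 0.95)
print(round(interval), "hours, about", round(interval / 24), "days")
# prints: 769 hours, about 32 days
```

Raising the required reliability shortens the interval quickly: with the same MTBF, a 0.99 requirement gives an interval of roughly 150 hours.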
Condition Directed Tasks:
Condition Directed (CD) tasks are periodic tests or inspections to compare the existing conditions or performance of an item with established standards to determine the need for a follow-on renewal, restoration or repair to prevent the loss of function. There are two factors that are relevant with respect to choosing a CD inspection interval: the characteristics of the failure mode in question, and the accuracy and consistency of the inspection method.
A CD task only works if a characteristic related to the failure mode is detectable, and it can be measured with accuracy and consistency, and there is a sufficient and relatively consistent interval from detection of potential failure until actual functional failure. This concept is illustrated in Figure 3.
The inspection interval (I) is the P-F interval divided by the number of inspections:

$$I = \frac{PF}{n}$$

where PF is the P-F interval and n is the number of inspections performed within the P-F interval. If there is a confidence or probability the inspection will identify a potential failure when it exists, we define the probability of success for the inspection as θ. Consequently, the probability of not detecting the potential failure is (1 - θ). If each inspection has a (1 - θ) probability of not detecting the P point, then for n independent inspections, the total probability of not detecting P is (1 - θ)^n. If the acceptable probability of not detecting P is given as Pa, then the minimum number of inspections is found by setting Pa = (1 - θ)^n. Solving this equation for n yields Equation 2:

$$n = \frac{\ln P_a}{\ln(1 - \theta)}$$
The equation is undefined when θ = 1, because it requires the logarithm of (1 - θ) = 0. Therefore we cannot assume a 100% confidence in our inspection detecting P, which is in agreement with practical experience. Once n is calculated, it can be used to determine the required inspection interval (I).
The greater the value of n, the smaller the inspection interval and thus greater confidence in early detection of potential failure. CBM technologies such as online real-time monitoring effectively decrease the inspection interval to seconds or less, and fully satisfy this equation.
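A short sketch shows how the two quantities combine. It assumes that Pa is the acceptable probability of missing the potential failure (which is what the relation Pa = (1 - θ)^n implies), that inspections are independent, and that the interval is the P-F interval divided by n. The numeric values are hypothetical.

```python
import math

def inspections_required(theta, p_acceptable_miss):
    """Smallest whole n with (1 - theta)**n <= p_acceptable_miss."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must be strictly between 0 and 1")
    if not 0.0 < p_acceptable_miss < 1.0:
        raise ValueError("p_acceptable_miss must be strictly between 0 and 1")
    n = math.log(p_acceptable_miss) / math.log(1.0 - theta)
    return math.ceil(n)  # round up: a fraction of an inspection is not possible

def inspection_interval(pf_interval, theta, p_acceptable_miss):
    """Inspection interval I = P-F interval / n, in the units of pf_interval."""
    return pf_interval / inspections_required(theta, p_acceptable_miss)

# Hypothetical case: 80% detection per inspection, 1% acceptable chance of
# missing the potential failure, and a 600-hour P-F interval.
n = inspections_required(0.80, 0.01)
print(n, "inspections, one every", inspection_interval(600, 0.80, 0.01), "hours")
# prints: 3 inspections, one every 200.0 hours
```

Rounding n up keeps the realized miss probability at or below the target; here three inspections give (0.2)^3 = 0.008, just under the 0.01 allowed.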
The RCM process is the optimum means to establish a scheduled maintenance program. Initial task periodicities often present a challenge to the analyst. For significant failure modes, or failures where we have good data established, initial maintenance periodicities can be calculated reasonably well. However, these periodicities should not remain static. RCM is a living, continuous improvement process and task intervals should be revisited and adjusted based on as-found condition, failure data, and operator feedback. Such feedback is an integral part of an age-exploration program to optimize the cost-effectiveness of maintenance intervals.
References:
Nowlan, F. Stanley, and Heap, Howard F., "Reliability Centered Maintenance," U.S. Dept. of Defense, 1978.
NAVAIR 00-24-403, "Guidelines for the Naval Aviation Reliability Centered Maintenance Process," U.S. Navy, 1996.
Crow, Larry H., PhD, "Practical Methods for Analyzing the Reliability of Repairable Systems," Reliability Edge, Volume 5, Issue 1, Reliasoft Publishing, 2004.
National Institute of Standards and Technology, "Engineering Statistics Handbook," published online at http://www.itl.nist.gov/div898/handbook/index.htm, 2010.
Mr. Berneski is a Certified Reliability Engineer with over 15 years experience in reliability engineering and RCM. He is a technical program manager at CACI Inc. www.caci.com