Analyzing Repairable System Failures Data
by Ziad Ali Al-Zahrani
Reliability engineers have long been concerned with the reliability of repairable systems, and researchers have proposed a number of techniques for calculating or estimating it. This article explains the mean cumulative function (MCF), a powerful and simple technique for estimating and monitoring repairable system reliability.
When assessing reliability, it is important to make the distinction between non-repairable components and repairable systems. Figure 1 summarizes the commonly used techniques for reliability measurements for both repairable systems and non-repairable items.
Figure 1: Commonly used techniques for reliability measurements
Parametric methods require a high degree of statistical knowledge, including the ability to solve complex equations and to verify distributional assumptions. These equations cannot be solved analytically and require an iterative procedure or special software. As a result, parametric approaches are computationally intensive and unintuitive for most practitioners, and the distributional assumptions they rest on are rarely justified in practice.
Mean Cumulative Function
Given a set of failure times for a repairable system, the simplest graph that can be constructed is a cumulative plot, which shows the cumulative number of failures versus the age of the system. A cumulative plot can be constructed for all failures, for outages, for system failures due to specific failure modes, etc. Likewise, a cumulative plot can be constructed for just one machine, for all machines, or for a group of machines in a population. The average of several cumulative plots is called the mean cumulative function (MCF).
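The averaging step can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `mean_cumulative_function`, the input layout, and the sample fleet are assumptions made for the example. It also assumes each machine has a known end-of-observation age, so that a machine no longer under observation stops counting toward the average. When every machine is observed over the same period, the result is simply the average of the individual cumulative plots.

```python
def mean_cumulative_function(histories):
    """Estimate the MCF from per-machine failure histories.

    histories: list of (failure_ages, end_age) tuples, one per machine.
    Returns a list of (age, mcf) points, one per failure event.
    """
    # Pool all failure ages across machines and process them in age order.
    all_ages = sorted(age for ages, _ in histories for age in ages)
    points = []
    mcf = 0.0
    for age in all_ages:
        # Machines still under observation at this age.
        at_risk = sum(1 for _, end_age in histories if end_age >= age)
        # Each failure adds 1 / (machines at risk) to the average count.
        mcf += 1.0 / at_risk
        points.append((age, mcf))
    return points


# Hypothetical fleet: three machines, each observed for 100 days.
fleet = [
    ([10, 40], 100),
    ([25], 100),
    ([30, 60, 90], 100),
]
for age, value in mean_cumulative_function(fleet):
    print(f"age {age:>3} days: MCF = {value:.2f}")
```

With six failures spread over three machines, the final MCF value is two failures per machine, which matches the average of the three individual cumulative counts (2, 1 and 3).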
Recurrence Rate vs Age
Since the MCF is the cumulative average number of failures versus time, one can take the slope of the MCF curve to obtain a rate of occurrence of events as a function of time. This slope is called the recurrence rate to avoid confusion with terms like failure rate.
The recurrence rate can be calculated by a simple numerical differentiation procedure that estimates the slope of the curve. This is easily implemented in a spreadsheet using the SLOPE(Y1:Yn; X1:Xn) function, where the MCF is on the Y axis and time is on the X axis. One can take five or seven adjacent points, calculate the slope of that section of the curve, much as one would with a ruler, and plot the slope value at the midpoint. The recurrence rate tends to amplify sharp changes in the MCF. If the MCF rises quickly, a sharp spike appears in the recurrence rate. Similarly, if the MCF is linear, the recurrence rate is a flat line. When the recurrence rate is constant, the data are consistent with a homogeneous Poisson process (HPP), which allows metrics such as mean time between failures (MTBF) to be used to describe the reliability of the population.
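For readers working outside a spreadsheet, the same sliding-window slope can be sketched in Python. The helper names `slope` and `recurrence_rate` and the sample data are assumptions for illustration. The slope is a least-squares fit over each window, which is the same quantity a spreadsheet slope function returns, and the sketch assumes the ages within a window are not all identical.

```python
def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def recurrence_rate(ages, mcf, window=5):
    """Slope of the MCF over a sliding window of adjacent points.

    Returns (midpoint_age, rate) pairs; window should be odd (5 or 7)
    so that each window has a well-defined middle point.
    """
    rates = []
    for i in range(len(ages) - window + 1):
        xs = ages[i:i + window]
        ys = mcf[i:i + window]
        rates.append((xs[window // 2], slope(xs, ys)))
    return rates


# Hypothetical linear MCF: one failure per machine every 20 days.
ages = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
mcf = [0.05 * a for a in ages]
rates = recurrence_rate(ages, mcf, window=5)

# A flat recurrence rate is what an HPP would produce; only then is
# MTBF = 1 / rate a meaningful summary of the population.
mtbf = 1.0 / rates[0][1]
print(f"rate ~ {rates[0][1]:.3f} per day, MTBF ~ {mtbf:.1f} days")
```

A wider window gives a smoother rate curve at the cost of blurring short spikes, so the choice between five and seven points is a trade-off between noise and responsiveness.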
All parametric methods apply primarily to “counts” data. That is, they provide an estimate of the expected number of events as they are generalizations of counting processes. However, the MCF is far more flexible than just counts data. It can be used in availability analysis by accumulating average downtime instead of just average number of outage events. MCFs can be used to track service cost per machine in the form of mean cumulative cost function. They also can be used to track any continuous cumulative history in addition to counts, such as energy output from plants, amount of radiation dosage in astronauts, etc.
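The extension from counts to cost or downtime only changes what each event contributes. The sketch below is an illustration under assumptions: the function name `mean_cumulative_cost`, the input layout, and the two-machine data set are invented for the example. Each event adds its cost, divided by the number of machines still under observation, to the running average.

```python
def mean_cumulative_cost(histories):
    """Mean cumulative cost per machine versus age.

    histories: list of (events, end_age) tuples, one per machine, where
    events is a list of (age, cost) pairs. Passing downtime hours in
    place of cost gives a mean cumulative downtime function.
    """
    events = sorted((age, cost) for evs, _ in histories for age, cost in evs)
    points = []
    total = 0.0
    for age, cost in events:
        # Machines still under observation at this age.
        at_risk = sum(1 for _, end_age in histories if end_age >= age)
        total += cost / at_risk
        points.append((age, total))
    return points


# Hypothetical service costs for two machines observed for one year.
machines = [
    ([(50, 200.0), (120, 100.0)], 365),
    ([(80, 300.0)], 365),
]
print(mean_cumulative_cost(machines))
# [(50, 100.0), (80, 250.0), (120, 300.0)]
```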
Consider a plant for which failure cost data due to three different causes, A, B and C, are available for a period of one year, as depicted in Table 1.
Table 1: Assumed plant production loss cost data due to causes A, B and C
A cumulative failure cost plot can be constructed from the data in Table 1. Figure 2 shows the overall cumulative failure cost, as well as the contribution of each cause to that cost. Figure 3 shows the cumulative number of failures and the rate of occurrence of failure (ROCOF). Two observations from Figures 2 and 3 are important:
Figure 2: Plant A failure causes cumulative cost plot
Figure 3: Plant A cumulative plant failures and rate of occurrence of failure (ROCOF)
- The cumulative cost plot in Figure 2 shows that the failure cost spiked on day 141 because of a single failure related to Cause A, which added no further cost after that. Cause B added failure cost between days 90 and 120 and none thereafter. Cause C, on the other hand, started adding failure cost after day 140 and continued to do so periodically.
It can be concluded that Cause A was a major cost failure and should be investigated to prevent recurrence. However, it did not happen again during the year, which may mean that Cause A has already been resolved. Cause B stopped adding failure cost and appears to be stable, so it needs no immediate attention. Cause C, however, keeps adding failure cost and needs to be investigated and stopped.
- The cumulative failures plot in Figure 3 rises quickly in the period from 100 days to 150 days, which can be seen by a sharp spike in the recurrence rate (ROCOF) plotted in the same figure. This means system reliability degraded quickly in that period and needs to be investigated. After that, system reliability improves in the period from 150 days to 240 days, which is reflected by a decreasing ROCOF. System reliability degraded again from 240 days to 310 days and finally improved thereafter. Figure 3 gives a clear, instant trend of system reliability.
By using Figure 2, one can direct the focus and efforts to the area where more cost impact is coming from, while using Figure 3 will give a clear instant trend of system reliability.
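A per-cause cumulative cost plot of the kind described for Figure 2 can be built from a simple list of cost records. The following sketch is illustrative only: the function names and the records are invented, and the values are not those of Table 1. They merely follow the same qualitative pattern of one large Cause A event, an early cluster for Cause B, and recurring Cause C events.

```python
from collections import defaultdict


def cumulative_cost_by_cause(records):
    """records: list of (day, cause, cost).

    Returns {cause: [(day, running_cost), ...]} for plotting one
    cumulative curve per cause.
    """
    running = defaultdict(float)
    curves = defaultdict(list)
    for day, cause, cost in sorted(records):
        running[cause] += cost
        curves[cause].append((day, running[cause]))
    return dict(curves)


def overall_cumulative_cost(records):
    """Running total cost over all causes as (day, total) points."""
    total = 0.0
    curve = []
    for day, _, cost in sorted(records):
        total += cost
        curve.append((day, total))
    return curve


# Invented records (day, cause, cost); NOT the article's Table 1 data.
records = [
    (95, "B", 10.0), (110, "B", 15.0),
    (141, "A", 200.0),
    (150, "C", 20.0), (210, "C", 20.0), (300, "C", 25.0),
]
curves = cumulative_cost_by_cause(records)
for cause in sorted(curves):
    last_day, total = curves[cause][-1]
    print(f"Cause {cause}: total {total:.0f}, last event on day {last_day}")
```

A cause whose curve has gone flat has stopped adding cost, while a curve that is still climbing near the end of the period points to where investigation effort is best spent.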
These two observations may not be obtainable with other analysis techniques. For example, growth analysis will indicate only the final status of system reliability as improving, but will not show the earlier changes in reliability. Another example is trending the periodic MTBF (Figure 4), which shows an upward-trending system with no indication of the periods in which system reliability degraded. A production Weibull plot will estimate the process reliability and the cost of unreliability, but will show neither the system reliability trend nor the failure cost of each cause.
Figure 4: A 180 day time span MTBF trend with a 60 day shift
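To see why a periodic MTBF trend can hide short periods of degradation, it helps to look at how such a trend is typically computed. The sketch below assumes a single continuously operating system, ignores repair time, and takes the MTBF of each window as the window length divided by the number of failures in it. The function name `rolling_mtbf` and the failure days are invented for illustration, and the article does not state how Figure 4 was calculated.

```python
def rolling_mtbf(failure_days, horizon, span=180, shift=60):
    """MTBF over windows of `span` days, advanced by `shift` days.

    Returns (window_end_day, mtbf) pairs; mtbf is None when a window
    contains no failures, since the estimate is undefined there.
    """
    results = []
    start = 0
    while start + span <= horizon:
        end = start + span
        # Count failures in the half-open window (start, end].
        count = sum(1 for day in failure_days if start < day <= end)
        results.append((end, span / count if count else None))
        start += shift
    return results


# Invented failure days over a 360-day period.
failure_days = [30, 100, 120, 140, 200, 260, 280, 300, 350]
print(rolling_mtbf(failure_days, horizon=360))
# [(180, 45.0), (240, 45.0), (300, 36.0), (360, 36.0)]
```

Because each value averages over 180 days, a cluster of failures lasting a few weeks moves the MTBF only slightly, whereas the same cluster shows up as a clear spike in the recurrence rate.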
The analysis of repairable systems does not have to be difficult. A simple graphical technique can provide excellent estimates of the expected number of failures without resorting to solving complex equations or justifying distributional assumptions.
The example demonstrates the advantages of using the MCF and ROCOF over other methodologies to monitor production line reliability. These advantages include, but are not limited to, simplicity, timely reflection of actual system reliability, and flexibility, since the same plots can be constructed for downtime, cost, etc. Other analysis methods do not offer this combination of advantages.
- Trindade, David and Nathan, Swami. "Statistical Analysis of Field Data for Repairable Systems." http://www.trindade.com/2006RM-035_draft.pdf