MTBF – Misinterpreted and Misused

ABSTRACT

This paper examines the current misuse and misinterpretation of the reliability parameter Mean Time Between Failure (MTBF) and its many variants. In both commercial and defense industries, MTBF is often misunderstood and mismanaged, resulting in inaccurate values of product reliability. MTBF is mostly considered representative of a product’s operating or useful life in the field. Unfortunately, this is not always the case. This paper will provide a proper definition of traditional MTBF and discuss how this value is generated. In addition, factors that have led to the confusion and erroneous use of MTBF will be considered. In order to highlight the misunderstanding, an overview of previous and current studies will be presented to demonstrate significant discrepancies between predicted, observed, and fielded MTBFs. It is important that MTBF is correctly determined and applied in order to ensure and maintain reliability throughout a product’s lifecycle.

I. INTRODUCTION

Product reliability is important in a variety of modern and complex engineering systems in both commercial and defense industries. As a result, knowledge and proper application of reliability principles is mandatory. The Reliability Engineering (RE) discipline should play a dominant role at the beginning of any program, specifically during the design, test, and integration phases. It also serves an important function during production and maintenance of the product in the field. In general, RE should be a major supporting discipline to engineering, management, and logistics throughout the lifecycle of any viable product.

In general, RE comprises a group of analyses and tests used to determine the product’s ability to perform its functions in the intended environment. One of the major parameters utilized in many reliability tasks is MTBF. Typically, a predicted MTBF is first determined prior to the development of hardware and/or any knowledge of product interconnections (i.e., series or parallel system relationships). Often, all that is available are parts lists and a known operating environment. As the product and system mature during the design phase, the calculated failure rate or predicted MTBF can be more refined. A well-known example of this is initially calculating failure rates based on the MIL-HDBK-217 Parts Count method and then, as design information becomes available, creating a prediction based on a Parts Stress analysis. Note that MTBF and failure rate will be used interchangeably to some extent, given that they are the inverse of one another (failure rate = 1/MTBF).

In the last few decades, RE has emphasized its role throughout the product lifecycle to include research, design, production, and logistics. Hence, MTBF as a general measure of reliability can be considered vital, from initial product trade studies to system operational effectiveness and logistics support. Throughout the lifecycle, the actual product’s MTBF value will change from a simple inherent component value to one that considers the product’s application and maintenance (i.e., from predicted to operational). It is generally assumed that, by continuously updating the predicted MTBF as more information becomes available, it will get closer to the operational MTBF. However, this is not always the case, and further, comparisons between the two are misleading.

A general misunderstanding of MTBF (both predicted and operational) and its application has been a problem throughout the commercial and defense industries. MTBF is still not well understood by the majority of those using the term. This is a serious concern because MTBF, as previously stated, is a measure of product reliability and is used throughout the product lifecycle. For example, it can be a factor in determining which supplier to use or in logistics planning (e.g., for spare parts of repairable items or replacement of non-repairable items). In addition, MTBF has been used to support part obsolescence, warranty issues, etc. In order for MTBF to be value-added, the term must be used appropriately.

Altogether, an understanding of MTBF is needed to make informed decisions affecting product development and maintenance. Therefore, four basic questions will be addressed in this paper:

What is MTBF?
How is MTBF determined?
How to use MTBF?
Why is MTBF misinterpreted?

Further, a general discussion of MTBF and its misapplications is presented as well as an abbreviated review of MIL-HDBK-217 and other more widely used prediction methods. Additionally, data analysis will emphasize the differences between predicted and operational MTBF or any derivatives thereof.

II. BACKGROUND

A. General Reliability

Reliability is defined as the probability that a device operates for a specified period of time under stated conditions. In practice, reliability probabilities can be used in evaluating new designs, comparing competing designs, supporting test, tracking reliability growth, etc. Reliability is quantitatively expressed by the well-known exponential equation:

$$R(t) = e^{-\lambda t} \qquad (1)$$

In the above formula, e is the base of the natural logarithm, λ is a constant failure rate, and t is an arbitrary time period. Thus, reliability, according to equation (1), can be defined as the probability that a device or product with a constant failure rate operates successfully during a specified time period. Also, the failure rate must be expressed per unit of the time used for t (usually per hour) and, as previously stated, is the reciprocal of MTBF. The equation below shows the simple relationship between failure rate and MTBF.

$$\text{MTBF} = \frac{1}{\lambda} \qquad (2)$$

Note that the constant failure rate is only valid during the useful life of the product. Determining the reliability of a product using equation (1) with time t outside its useful life is not acceptable.
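As a minimal sketch of equations (1) and (2), the Python snippet below evaluates the reliability of a constant-failure-rate item over a mission time. The function names and the example MTBF of 1,000 hours are illustrative assumptions, not values from this paper, and the result is meaningful only for times within the useful life.

```python
import math


def failure_rate_from_mtbf(mtbf_hours: float) -> float:
    """Constant failure rate (failures per hour) as the reciprocal of MTBF."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 / mtbf_hours


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of failure-free operation for t_hours: R(t) = exp(-lambda * t)."""
    lam = failure_rate_from_mtbf(mtbf_hours)
    return math.exp(-lam * t_hours)


# Hypothetical item: MTBF of 1,000 hours, 24-hour mission.
mission_reliability = reliability(24, 1000)
print(f"R(24 h) = {mission_reliability:.4f}")
```

Keeping the MTBF-to-rate conversion in its own function makes the unit convention (hours and failures per hour) explicit in one place.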

Figure 1 shows the well-known reliability bathtub curve, where the useful life is that part of the curve characterized by a constant failure rate. The first part of the curve has a decreasing rate where weaker units die off (infant mortality) and the last part of the curve with an increasing failure rate is considered the wear-out phase.

Fig. 1. Example Reliability Bathtub Curve

B. Failure Rate Prediction

More specifically, reliability predictions are concerned with developing component or product failure rates (λ) used in the above reliability equation. The first reliability prediction method, developed in 1961, was Military Handbook 217 (MIL-HDBK-217). This handbook was developed to predict the failure rates of mature systems and provide a basis of comparison between competing designs.

MIL-HDBK-217 Section 1.1 states, “The purpose of this handbook is to establish and maintain consistent and uniform methods for estimating the inherent reliability (i.e., the reliability of a mature design) of military electronic equipment and systems. It provides a common basis for reliability predictions during acquisition programs for military electronic systems and equipment. It also establishes a common basis for comparing and evaluating reliability predictions of related or competitive designs. The handbook is intended to be used as a tool to increase the reliability of the equipment being designed” [1].

MIL-HDBK-217 is based on a statistically empirical approach that was last updated in 1995. The handbook, originally developed for military programs, provides failure rate models and parameters based on field failure data. There are failure rate models for capacitors, connectors, inductive devices, microcircuits, relays, resistors, semiconductors, etc. Clearly, as of today, the field data utilized in the handbook is antiquated and does not consider newer commercial parts. Additionally, the failure rate is based on inherent component failures and does not take into account other factors such as software, systems design, or manufacturability [1],[2].

More specifically, the failure rate model in MIL-HDBK-217 is of the form in equation (3). These failure rates are expressed in failures per $10^6$ operating hours. The exact part failure model depends on the part type.

$$\lambda_p = \lambda_b \prod_i \pi_i \qquad (3)$$

The base failure rate (λb) is developed from only electrical and temperature stresses on the part. The other π factors modify the base failure rate depending on the part and application [1].
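The multiplicative structure described above can be sketched in a few lines of Python. All numeric values and factor names below are placeholders chosen for illustration; actual base failure rates and π factors must be taken from the handbook tables for the specific part type.

```python
import math


def part_failure_rate(base_rate_fpmh: float, pi_factors: dict[str, float]) -> float:
    """Multiply a base failure rate (failures per 10^6 hours) by its pi factors."""
    return base_rate_fpmh * math.prod(pi_factors.values())


# Placeholder values for illustration only, not handbook data.
example_pi = {"quality": 3.0, "environment": 4.0}
lam_p = part_failure_rate(0.002, example_pi)  # failures per 10^6 hours
mtbf_hours = 1e6 / lam_p                      # convert the rate to an MTBF in hours
print(f"lambda_p = {lam_p:.4f} per 10^6 h, MTBF = {mtbf_hours:,.0f} h")
```

Because every factor multiplies the result, a single poorly chosen π value shifts the prediction by that same ratio.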

The handbook specifies two methods of performing reliability predictions. The first is the Parts Count method found in Appendix A. This method is intended to be used during a bid proposal or in the early stages of design. It requires no information on design detail and can be completed in a reasonable amount of time. The information required to complete a Parts Count prediction is part type, part quality, and operating environment. Given this information, one would look up the generic failure rate from one of the tables in Appendix A and multiply it by the correct quality factor. Note that the quality factor used in the Parts Count method is not necessarily the same value when performing a Parts Stress prediction.
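A hedged sketch of the Parts Count roll-up follows: each part type contributes its quantity times a generic failure rate times a quality factor, and the contributions are summed under the usual series assumption. The part list and every number in it are invented for illustration and are not taken from Appendix A.

```python
from dataclasses import dataclass


@dataclass
class PartLine:
    name: str
    quantity: int
    generic_rate_fpmh: float  # generic failure rate, failures per 10^6 hours
    quality_factor: float     # quality factor for the part's quality level


def parts_count_rate(parts: list[PartLine]) -> float:
    """Sum quantity * generic rate * quality factor over all part types."""
    return sum(p.quantity * p.generic_rate_fpmh * p.quality_factor for p in parts)


# Hypothetical bill of materials (illustrative values only).
bill_of_materials = [
    PartLine("ceramic capacitor", 40, 0.004, 3.0),
    PartLine("film resistor", 60, 0.002, 3.0),
    PartLine("digital microcircuit", 5, 0.050, 2.0),
]

total_rate = parts_count_rate(bill_of_materials)  # failures per 10^6 hours
predicted_mtbf = 1e6 / total_rate                 # hours
print(f"Total rate = {total_rate:.2f} per 10^6 h, predicted MTBF = {predicted_mtbf:,.0f} h")
```

Only part type, quantity, quality level, and environment enter the calculation, which is why this method can be run from a parts list alone.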

The second method is a Parts Stress prediction that is typically performed when the design is near or at completion. This will require a detailed parts list and knowledge of the applicable part stresses. Sections 5 through 23 of the handbook contain the failure rate models used when performing a Parts Stress prediction. Given the more detailed nature of a Parts Stress prediction, its failure rate is usually lower than that of a Parts Count prediction. The Parts Count method tends to err on the side of a more conservative estimate [1]. Other failure rate prediction methodologies have been developed, all stemming from the original MIL-HDBK-217. Table 1 shows a synopsis of the more widely used prediction methodologies in the defense and commercial industries.

TABLE 1: Prediction Methodology Summary

| Methodology | Key Aspects | Developer | Date Last Updated |
| --- | --- | --- | --- |
| MIL-HDBK-217 (1995) | Original failure rate prediction method utilizing statistical models developed from field failure rates | Rome Laboratory | 1995 |
| Telcordia SR-332 (2016) | Simple to use with fewer factors than MIL-HDBK-217; uses the gamma distribution for calculating failure rates | Bellcore/Telcordia | 2016 |
| PRISM® (2000) | Uses process assessment factors to adjust the inherent part failure rate and can account for thermal cycling and dormancy (SW) | Reliability Analysis Center (RAC) | 2000 |
| 217Plus™ (2015) | An update to PRISM with more part models (SW) | Reliability Information Analysis Center (RIAC) | 2015 |
| FIDES (2009) | Method developed from field and manufacturing failure rates; incorporates process grade factors | Consortium of French military and aeronautical companies | 2009 |

Telcordia SR-332, last updated in 2016, was developed in response to the inability of MIL-HDBK-217 to precisely estimate failure rates in the telecommunications industry. MIL-HDBK-217 was developed for military applications and tended to be more conservative in its prediction estimates. At the time, Bell Labs (Telcordia was derived from the Bellcore standard) was struggling to use MIL-HDBK-217 as applied to its line of products. They decided to develop their own prediction methodology that was simpler to use and also took into account the development of some new technologies, specifically the advances in Integrated Circuit (IC) technology [3].

Another well-known prediction methodology is the PRISM® SoftWare (SW) tool, which was created in 2000 and released its last updated version in 2005. This methodology stemmed from the work done at the Reliability Analysis Center (RAC) in the late 1990s, where new part failure rate models were developed for Plastic Encapsulated Microcircuits (PEM). These PEM models were developed for capacitors, diodes, ICs, resistors, thyristors, and transistors. Additionally, a process grading factor was developed that included more than 450 process assessment questions, of which 129 are required. PRISM® then uses an environmental and operational profile along with the results of the process assessment to adjust the inherent failure rate. The PRISM® SW tool also employs the Electronic and Non-electronic Parts Reliability Databooks (EPRD and NPRD, respectively) [4],[5].

The successor to PRISM® was 217Plus™, developed by the Reliability Information Analysis Center (RIAC) and released in 2006. 217Plus™ added six new models for connectors, inductors, optoelectronic devices, relays, switches, and transformers. Like PRISM®, this methodology utilizes process grade factors to account for system level effects. These effects are Design, Manufacturing, Parts Quality, System Management, Can Not Duplicate (CND), Induced, and Wear-out. Additionally, 217Plus™ provides a basic SW reliability prediction tool [6].

Developed by several French aeronautical and defense companies, the FIDES methodology was first released in 2004. FIDES is not an acronym, but the Latin word for faith and the root of the word fidelity. In 2005, it was approved as a French standard, UTE C80-811, FIDES guide 2004 issue A – Reliability Methodology for Electronic Systems. FIDES was updated in 2010, and in 2011, it was selected as a best practice in the European Handbook for Defence Procurement [7]. The FIDES models were developed with failure data from the aeronautical and military industries as well as from component and sub-assembly manufacturers. The main aspect of the FIDES approach is that it considers technology, processes, and use when determining a part’s failure rate. These are considered throughout the product lifecycle, from manufacturing to fielding [5],[8].

III. DISCUSSION

A. MTBF Defined

To be more thorough and complete, a mathematical approach to defining the failure rate is presented. For the sake of brevity, conditional probability or derivations of reliability parameters and corresponding life distributions will not be included. There are numerous resources that cover the mathematics leading to the common equations used in the reliability discipline. Instead, a summary of key formulas is presented.

First, the Probability Density Function (PDF) must be defined. The PDF describes how the probability of a continuous random variable is distributed over its possible values: the probability that the variable falls within an interval is the area under the curve over that interval, and the total area under the curve is equal to 1.

$$\int_{-\infty}^{\infty} f(x)\,dx = 1 \qquad (4)$$

Next, we can define the Cumulative Distribution Function (CDF). The CDF is the probability that a random variable X takes a value less than or equal to x.

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(s)\,ds \qquad (5)$$

In RE, the exponential distribution is the most commonly used failure density. It is concerned with the time until a specific event occurs. In RE, that event is a failure. The exponential distribution is memoryless, meaning the probability of a failure occurring in the next interval remains the same regardless of the amount of time that has elapsed. Hence, the failure rate is constant during the useful life. Note that the exponential distribution is the only memoryless continuous distribution. The well-known PDF of the exponential is as follows.

$$f(t) = \lambda e^{-\lambda t}, \quad t \ge 0 \qquad (6)$$

If f(t) is the exponential PDF, then the CDF and its solution are as below. Given that we are dealing with failure times, the lower bound of the CDF will be 0. Thus, the CDF gives the probability that a component or unit randomly drawn from the population has failed by time t.

$$F(t) = \int_{0}^{t} \lambda e^{-\lambda s}\,ds = 1 - e^{-\lambda t} \qquad (7)$$

Further, the reliability function is defined as R(t) = 1 - F(t), and thus, reliability can be expressed as $R(t) = e^{-\lambda t}$, which is equation (1). Furthermore, using the well-known Hazard Function, one can solve for the failure rate (λ) and consequently the MTBF.

$$h(t) = \frac{f(t)}{R(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda \qquad (8)$$

The MTBF may also be determined directly as the expected value (mean) of the exponential distribution.

$$\text{MTBF} = E[T] = \int_{0}^{\infty} t\,\lambda e^{-\lambda t}\,dt = \frac{1}{\lambda} \qquad (9)$$

This mathematically derived MTBF can be defined as the average of all failure times in the population. As can be seen, the use of the exponential equation in RE is derived from the basic definition of probability [9].
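As an informal numerical check of the statement that the mean of exponentially distributed failure times equals 1/λ, the sketch below draws simulated failure times and compares the sample statistics with the closed-form values. The failure rate of 0.01 per hour, the sample size, and the seed are arbitrary assumptions made for illustration.

```python
import math
import random
import statistics


def simulate_failure_times(failure_rate: float, n_units: int, seed: int = 1) -> list[float]:
    """Draw n_units failure times from an exponential distribution."""
    rng = random.Random(seed)
    return [rng.expovariate(failure_rate) for _ in range(n_units)]


lam = 0.01  # assumed constant failure rate, failures per hour
times = simulate_failure_times(lam, 100_000)

sample_mtbf = statistics.fmean(times)  # should be close to 1 / lam = 100 hours
fraction_failed_by_50 = sum(t <= 50 for t in times) / len(times)
theoretical_cdf_at_50 = 1 - math.exp(-lam * 50)  # exponential CDF evaluated at 50 hours

print(f"sample MTBF = {sample_mtbf:.1f} h (theory {1 / lam:.1f} h)")
print(f"fraction failed by 50 h = {fraction_failed_by_50:.3f} (theory {theoretical_cdf_at_50:.3f})")
```

The simulation is only a sanity check on the formulas; it says nothing about whether a real product actually follows the exponential model.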

In practice, failure rates can be determined by testing products. In a typical test scenario, the failure rate is calculated by dividing the number of failures by the total unit test time. The following equation is used to estimate the product failure rate in these instances.

$$\hat{\lambda} = \frac{\text{number of failures}}{\text{total unit test time}} \qquad (10)$$

For example, suppose 10 units are placed on test simultaneously for 10 hours, two failures occur at the 5-hour mark, and the other eight units complete the test with no failures. The total unit test time is (8 × 10) + (2 × 5) = 90 hours, so the failure rate is 2/90 ≈ 0.022 failures per hour, or an MTBF of 45 hours.

Note that the total test time is not 100 hours but the sum of the total operational hours for each unit. Unfortunately, this mistake—i.e., not accounting for the correct unit test time—is commonly encountered in commercial and defense industries.
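The example above can be reproduced with a short sketch that sums the hours each unit actually accumulated rather than multiplying units by test duration. The function name is hypothetical; the numbers are those of the 10-unit example.

```python
def estimate_failure_rate(unit_hours: list[float], failures: int) -> float:
    """Point estimate of the failure rate: failures divided by total unit test time."""
    total_time = sum(unit_hours)
    if total_time <= 0:
        raise ValueError("total unit test time must be positive")
    return failures / total_time


# Eight units ran the full 10 hours; two units failed at the 5-hour mark.
unit_hours = [10.0] * 8 + [5.0] * 2

lam_hat = estimate_failure_rate(unit_hours, failures=2)
mtbf_hat = 1.0 / lam_hat

# The common mistake: crediting every unit with the full test duration.
naive_mtbf = (10 * 10.0) / 2

print(f"total unit test time = {sum(unit_hours):.0f} h")
print(f"failure rate = {lam_hat:.4f} per hour, MTBF = {mtbf_hat:.1f} h")
print(f"MTBF if 100 h were wrongly used = {naive_mtbf:.1f} h")
```

Passing per-unit hours instead of a single total forces the bookkeeping that the text identifies as the usual source of error.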

Given the nature of product testing, only an estimate of the true MTBF can be obtained. Hence, confidence intervals should be utilized when publishing MTBF values determined by test. For these types of test scenarios, the Chi-Square (χ²) distribution is typically used to calculate the confidence bounds for MTBF. One-sided and two-sided confidence intervals can be determined for both time-truncated tests (based on predetermined test times) and failure-truncated tests (based on a predetermined number of failures). In most cases, the time-truncated lower bound of MTBF provides the most useful information. An MTBF reported from testing without confidence bounds is therefore incomplete and can be misleading.
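A minimal sketch of a one-sided lower confidence bound for a time-truncated test is given below. It assumes the commonly cited form, lower bound = 2T / χ²(C, 2r + 2), where T is the total unit test time, r the number of failures, and C the confidence level; the 90% level is an arbitrary choice for illustration. To stay within the Python standard library, the chi-square quantile is found by bisection on the closed-form CDF that holds for even degrees of freedom; a statistics library routine could be substituted.

```python
import math


def chi2_cdf_even_dof(x: float, dof: int) -> float:
    """Chi-square CDF for even dof, via the Poisson-sum identity."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("dof must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return 1.0 - math.exp(-half) * total


def chi2_quantile_even_dof(p: float, dof: int) -> float:
    """Invert the CDF by bisection (p is the cumulative probability)."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even_dof(hi, dof) < p:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf_even_dof(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def mtbf_lower_bound(total_hours: float, failures: int, confidence: float) -> float:
    """One-sided lower MTBF bound for a time-truncated test (assumed form, see lead-in)."""
    dof = 2 * failures + 2
    return 2.0 * total_hours / chi2_quantile_even_dof(confidence, dof)


# 10-unit example: 90 unit-hours, 2 failures, 90% one-sided confidence.
point_estimate = 90.0 / 2
lower_90 = mtbf_lower_bound(90.0, 2, 0.90)
print(f"point estimate = {point_estimate:.1f} h, 90% lower bound = {lower_90:.1f} h")
```

With only two failures, the lower bound falls to roughly 17 hours against a point estimate of 45 hours, which illustrates how little a short test constrains the true MTBF.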

MTBF or any of its derivatives can also be determined by collecting and analyzing field failure data. This type of MTBF can be difficult to quantify and is fundamentally based on the definition of “what constitutes a failure” and also on the ability to verify each failure. Different disciplines, industries, or specialties use terms besides MTBF to designate how they define field failures and/or time (calendar or operational). For example, what is deemed a failure for an airborne platform may not be a failure on a shipboard system. Also, the ability to verify failures is critical to the fidelity of determining MTBF or any of its defined variants. If defined failures are not verified, the resulting data are inconsistent and produce erroneous failure rates. Therefore, it is key that failures be defined appropriately and be verifiable when determining MTBF or any of its derivatives.

B. MTBF Misinterpretation

Product testing at a component manufacturer is typically performed on a large number of units simultaneously (see equation 10). The failure rate determined from this type of testing is a point estimate of the population failure rate; it does not describe the life of any individual unit. For instance, 10 capacitors are tested and have an MTBF of 100 hours. This MTBF is not the average life or time between failures of one capacitor in the field. It is a parameter to be used with the reliability equation in order to determine the probability that the capacitor will last for a given time period during its useful life. In general, this characterization may also apply to MTBF values determined from other types of testing and field data.

Assume again a capacitor with an MTBF of 100 hours. What is the likelihood that this capacitor will have a failure-free operating time of 100 hours? Using equation 1, the probability that a product operates without failure for a time t equal to its MTBF is approximately 37% (in this case, t = 100 hours). Thus, this capacitor, with an MTBF of 100 hours, has only a 37% chance of operating for 100 hours without failure.

Figure 2 demonstrates the relationship between MTBF and reliability for a product with an MTBF of 100 hours. It can be seen that the reliability decreases significantly before reaching 100 hours. Consider the following question: What is the reliability at 10 hours? From equation 1, or approximating from the chart below, it is shown that a capacitor has a 90% probability of operating for 10 hours. Another point that must be made is that the MTBF applies at any point during the useful life, meaning the product can be at the start or end of its useful life, and its probability of lasting 10 hours will be 90% in both cases. In industry, many outside of RE, and even some within, assume that a predicted MTBF is simply a product’s average operating life. This misapplication leads to a misuse of MTBF. Further, a mistrust of MTBF values occurs when comparisons are made to very dissimilar field failure rates.

Fig. 2. Example Reliability Function (MTBF=100 hours)
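The two probabilities quoted above follow directly from equation (1) with a failure rate of 1/100 per hour, and the statement that the 10-hour result holds at any point in the useful life is the memoryless property of the exponential distribution. A short worked version is given below; T (the random time to failure) and s (the hours already survived) are symbols introduced here for illustration.

```latex
\begin{align*}
R(100) &= e^{-100/100} = e^{-1} \approx 0.368 \\
R(10)  &= e^{-10/100} = e^{-0.1} \approx 0.905 \\
P(T > s + 10 \mid T > s) &= \frac{R(s + 10)}{R(s)}
  = \frac{e^{-(s + 10)/100}}{e^{-s/100}} = e^{-0.1} \approx 0.905
\end{align*}
```

The conditional probability does not depend on s, so under the constant-failure-rate assumption a unit that has already survived has the same chance of lasting the next 10 hours as a new one.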

C. MTBF Analysis & Review

Various studies support the fact that predicted MTBF values are inconsistent across products and do not regularly match test or field data. The study in [9] developed reliability predictions for three different Circuit Card Assemblies (CCA) using MIL-HDBK-217 and PRISM®. This study focused on three digital CCAs designed primarily with plastic surface mount components (i.e., Plastic Encapsulated Microcircuits or PEMs). Table 2 provides a synopsis of this analysis. For CCA-1, it was shown that the MIL-HDBK-217 prediction with a πQ factor of 2 most closely matched the field failure rate. For CCA-2, it was shown that the MIL-HDBK-217 prediction with a πQ factor of 3 most closely matched the field failure rate, and for CCA-3, the MIL-HDBK-217 prediction with a πQ factor of 2 most closely matched the field failure rate [9].

This study used different πQ factors to produce varying results. As can be seen, this highlights another issue with reliability predictions. The reliability prediction models depend on user input, and if the inputs are not selected correctly, the prediction results will not match the true inherent failure rate. Specifically, Table 2 shows that, depending on the πQ factor chosen, the predicted value could differ from the field failure rate by a multiple.

TABLE 2: Predicted vs. Field Failures [9]

Study [5] used MIL-HDBK-217, PRISM®, and the FIDES method to compare predicted values against field failure rates for three different CCAs (Digital Correlator, RF Synthesizer, and +/-12V Power Supply). Table 3 shows a summary of the predicted values and field failure rates. Here, the FIDES method most closely matched the field failure rates for both the Digital Correlator and the RF Synthesizer. The PRISM® prediction was the closest to the observed rate for the +/-12V Power Supply (PS). In other words, the FIDES predictions closely matched the field failure rates for two of the CCAs but not for the +/-12V PS. This again demonstrates that predicted failure rates do not consistently agree with values observed in the field [5].

TABLE 3: Predicted vs. Field Failures [5]

A current study of avionic assemblies compared predicted MTBFs against observed field failure rates. The observed field failure rates were measured in Mean Flight Hours Between Failure (MFHBF), and the assemblies were at the Weapons Replaceable Assembly (WRA) level. Table 4 shows the predicted and observed values, along with the percent decrease from MTBF to MFHBF. The selected WRAs in Table 4 had the largest percent decrease of all WRAs within the database. These WRAs were chosen to again demonstrate the variability that could exist when comparing predicted failure rates to observed values.

TABLE 4: Predicted vs. Field Failures for WRAs

Note that the definition of failure for observed rates is significant when comparing to a predicted value. In all the above cases, the definition of a test or field failure was not considered or discussed. It was assumed that the failure in each case was appropriate and relevant to the individual study.

Moreover, further clarification is required for the specific MFHBF values in Table 4. The MFHBF values tend to be lower for several reasons, two of which are inconsistent uploads to the reliability database and the scoring or adjudication of what constitutes a failure. Inconsistent uploads can affect the MFHBF because the current flight hours or records may be missing when the MFHBF is calculated. Additionally, due to the failure scoring procedure, the MFHBF values in these cases are biased conservatively. In other words, not all failures are verified failures, but they are still counted as such. Furthermore, MTBF is measured in total operational or test time, whereas MFHBF is measured only in flight hours. Significant discrepancies could exist between the two time measures. Note that time and duty cycle are extremely important factors to consider in comparisons between reliability parameters.

In any case, given the large number of variables (e.g., environment, maintenance, logistics), when producing any observed failure rate, there will always be discrepancies. The MFHBFs in Table 4 are the best current estimates of field failure rates for the WRAs in the study.

IV. CONCLUSION

Altogether, it has been shown that most well-known reliability prediction methods have pitfalls and are not consistently accurate when predicting field failure rates. However, industry skepticism over the accuracy of the MTBF parameter is not warranted. The failure of reliability predictions is due not to the inaccuracy of the models or calculations but to the interpretation of the reliability prediction’s objective. The prediction methods are a design tool and should not be used in programs or projects as an absolute measure of a product’s reliability. Recall that the original intent of reliability prediction was not to develop an absolute product failure rate or MTBF but to provide an adequate means to compare competing designs or measure a product’s reliability as it matures through the design process.

Many of the novel prediction methods are based on adjusting a component’s base failure rate with multiplicative adjustment factors. These factors are based on both hardware and process attributes. The need to generate these factors is due to the desire to perfectly predict a product’s MTBF in the field. Unfortunately, this ongoing effort is an extremely difficult task. Predicted MTBFs are developed on the basis of an inherent component failure rate and then adjusted via assumed environmental, maintenance, and process factors. Attempting to account beforehand for the numerous considerations (other than inherent) that affect a product’s reliability in the field must be done with a total understanding of the system’s operational and maintenance processes. Given the variability in these processes across different types of systems and environments, it has been difficult to develop a truly accurate, consistent, one-size-fits-all prediction methodology.

Furthermore, new systems may require or specify fielded reliability requirements, creating the need to truly bridge the gap between predicted and field failure rates. This can be better accomplished through analyzing historical data and deriving adjustment factors against specific prediction methodologies. Adjustment factors will become more accurate as the analysis of fielded or historical data becomes more specific. In other words, developing one top-level adjustment factor for all system assemblies would not provide the best estimate. Developing adjustment factors at the lowest possible level for each Line Replaceable Unit (LRU), WRA, or Shop Replaceable Assembly (SRA) would be optimal. Developing a more individualized, product-specific approach to determining adjustment factors through field data is more appropriate than utilizing the existing standard prediction methods.
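One simple way to picture a per-assembly adjustment factor is as the ratio of the observed field value to the predicted value for that assembly, applied later to a new prediction for a similar item. The sketch below is an illustration under that assumption only; the assembly names and all numbers are invented, and it presumes the predicted and observed values have already been put on the same time basis.

```python
def adjustment_factor(predicted_mtbf: float, observed_mtbf: float) -> float:
    """Ratio of observed to predicted MTBF for one assembly."""
    if predicted_mtbf <= 0:
        raise ValueError("predicted MTBF must be positive")
    return observed_mtbf / predicted_mtbf


def percent_decrease(predicted_mtbf: float, observed_mtbf: float) -> float:
    """Percent decrease from the predicted to the observed value."""
    return 100.0 * (1.0 - adjustment_factor(predicted_mtbf, observed_mtbf))


# Hypothetical history: assembly -> (predicted MTBF, observed field MTBF), in hours.
history = {
    "WRA-A": (5000.0, 1250.0),
    "WRA-B": (12000.0, 9000.0),
}

factors = {name: adjustment_factor(p, o) for name, (p, o) in history.items()}

# Apply the assembly-specific factor to a new prediction for a similar design.
new_prediction_hours = 6000.0
adjusted_estimate = factors["WRA-A"] * new_prediction_hours

for name, (p, o) in history.items():
    print(f"{name}: factor = {factors[name]:.2f}, decrease = {percent_decrease(p, o):.0f}%")
print(f"adjusted estimate for new WRA-A-like design = {adjusted_estimate:.0f} h")
```

Keeping one factor per assembly, rather than a single system-wide factor, mirrors the recommendation to derive adjustments at the lowest level for which field data exist.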

This paper provided a summary of MTBF and highlighted some of the more well-known prediction methodologies. It is by no means an exhaustive literature review or research analysis; instead, it provides a basic understanding for utilizing the MTBF parameter correctly.

REFERENCES

[1] Military Handbook 217 - Reliability Prediction of Electronic Equipment, Department of Defense, 1995

[2] McLeish, J. G., “Enhancing MIL-HDBK-217 reliability predictions with physics of failure methods,” 2010 Proceedings - Annual Reliability and Maintainability Symposium (RAMS), 2010, pp. 1–6, doi: 10.1109/RAMS.2010.5448044

[3] Mou, H., Hu, W., Sun, Y., and Zhao, G., “A comparison and case studies of electronic product reliability prediction methods based on handbooks,” 2013 International Conference on Quality, Reliability, Risk, Maintenance, and Safety Engineering (QR2MSE), 2013, pp. 112–115, doi: 10.1109/QR2MSE.2013.6625547

[4] PRISM®, Users’ Manual Version 1.4, Reliability Analysis Center, 2002

[5] Marin, J. J., and Pollard, R. W., “Experience report on the FIDES reliability prediction method,” Annual Reliability and Maintainability Symposium, 2005. Proceedings., 2005, pp. 8–13, doi: 10.1109/RAMS.2005.1408330

[6] Nicholls, D., “What is 217Plus™ and where did it come from?,” 2007 Annual Reliability and Maintainability Symposium, 2007, pp. 22–27, doi: 10.1109/RAMS.2007.328101

[7] Bourbouse, S., Giraudeau, M., and Briard, H., “Adapting FIDES for reliability predictions aimed at space applications,” 2019 Annual Reliability and Maintainability Symposium (RAMS), 2019, pp. 1–6, doi: 10.1109/RAMS.2019.8768925

[8] Carton, P., Giraudeau, M., and Davenel, F., “New FIDES models for emerging technologies,” 2017 Annual Reliability and Maintainability Symposium (RAMS), 2017, pp. 1–6, doi: 10.1109/RAM.2017.7889686

[9] Brown, L.M., “Comparing reliability predictions to field data for plastic parts in a military, airborne environment,” Annual Reliability and Maintainability Symposium, 2003, 2003, pp. 207–213, doi: 10.1109/RAMS.2003.1181927.