The RELIABILITY Conference: 2 Days of Learning, Networking and Reliability Excellence

reliability engineering

Cost Of Unreliability

The cost of unreliability is a big picture view of system failure costs, described in annual terms, for a manufacturing plant as if the key elements were reduced to a series block diagram for simplicity. It looks at the production system and reduces the complexity to a simple series system where failure of a single item/equipment/system/processing-complex causes the loss of productive output along with the total cost incurred for the failure. If the system IS sold out, then the cost of unreliability must include all appropriate business costs such as lost gross margin plus repair costs, scrap incurred, etc. If the system is NOT sold out, and make-up time is available in the financial year, then lost gross margin for the failure cannot be counted. The cost of unreliability is a management concern connected to management's two favorite metrics: time and money.
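As a rough illustration of the sold-out rule above, the following Python sketch totals an annual cost of unreliability for a simple series system. The class name, function name, field names, and dollar figures are hypothetical, chosen only to show the arithmetic, and the not-sold-out case assumes make-up time is available within the financial year.

```python
from dataclasses import dataclass


@dataclass
class FailureCosts:
    """Annual failure costs for one block of the series system (hypothetical fields)."""
    name: str
    downtime_hours: float  # production hours lost per year to failures
    repair_cost: float     # annual repair cost, $
    scrap_cost: float      # annual scrap incurred because of failures, $


def cost_of_unreliability(blocks, gross_margin_per_hour, sold_out):
    """Sum annual failure costs over every block in the series system.

    Lost gross margin is counted only when the plant is sold out; otherwise
    the lost output is assumed to be made up later in the year.
    """
    total = 0.0
    for block in blocks:
        total += block.repair_cost + block.scrap_cost
        if sold_out:
            total += block.downtime_hours * gross_margin_per_hour
    return total


# Made-up numbers for illustration only.
blocks = [
    FailureCosts("compressor", 40.0, 120_000.0, 5_000.0),
    FailureCosts("reactor", 10.0, 60_000.0, 15_000.0),
]
cost_sold_out = cost_of_unreliability(blocks, 2_000.0, sold_out=True)
cost_not_sold_out = cost_of_unreliability(blocks, 2_000.0, sold_out=False)
print(cost_sold_out, cost_not_sold_out)
```

With these made-up inputs the sold-out case adds 100,000 of lost gross margin to the 200,000 of repair and scrap costs.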

Life Cycle Cost

Life cycle cost (LCC) is the total of all costs associated with the acquisition and ownership of a system over its full life. The usual figure of merit is net present value (NPV). Projects are considered most favorable when they have large positive NPVs. However, for many individual cost cases, decisions are made for the least negative NPV. In all cases, the default position for accounting is to know the NPV for making no change, and this is usually the last alternative considered by most people associated with change.
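A minimal sketch of the NPV comparison, assuming end-of-year cash flows and a single discount rate. The alternatives, cash flows, and the 10% rate are invented for illustration; a real LCC study would include many more cost elements.

```python
def net_present_value(cash_flows, discount_rate):
    """Discount a list of cash flows; cash_flows[0] occurs now, cash_flows[k] at end of year k."""
    return sum(cf / (1.0 + discount_rate) ** year for year, cf in enumerate(cash_flows))


# Hypothetical cost-only alternatives (negative values are costs, in $).
alternatives = {
    "do nothing": [0.0, -50_000.0, -50_000.0, -50_000.0],
    "upgrade": [-60_000.0, -20_000.0, -20_000.0, -20_000.0],
}

npvs = {name: net_present_value(flows, 0.10) for name, flows in alternatives.items()}

# With cost-only cases every NPV is negative, so the preferred choice is the least negative one.
best = max(npvs, key=npvs.get)
for name, value in npvs.items():
    print(f"{name}: NPV = {value:,.0f}")
print("preferred:", best)
```

Note that the "do nothing" case is evaluated alongside the change, so the baseline NPV is always known.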

Simultaneous Testing

For inexpensive components and inexpensive tests, simultaneous tests involve many components under test loads/conditions at the same time for the purpose of quickly acquiring data and producing test analysis as the failures occur. In simultaneous testing the suspensions (censored data) become important details for use in the statistical analysis. Most simultaneous tests are accelerated to generate the data in a short period of time although this carries the risk of introducing unexpected failure modes (but this can also be useful information for anticipating field failures).

The Reliability Engineering Toolbox: Fault Tree Analysis

Fault Tree Analysis

Fault tree analysis (FTA) is a top-down process of defining the top-level problems and, through a deductive approach using parallel and series combinations of possible malfunctions, finding the root of the problem and correcting it before the failure occurs. This reliability tool can be used as a qualitative or quantitative method.
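For the quantitative use of a fault tree, a minimal sketch might combine basic-event probabilities through AND and OR gates. It assumes the basic events are statistically independent; the tree structure and probabilities below are hypothetical.

```python
from math import prod


def and_gate(probabilities):
    """Output event occurs only if every input event occurs (independent events assumed)."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if at least one input event occurs (independent events assumed)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical tree: the top event "no flow" occurs if power is lost
# OR both redundant pumps fail.
p_power_loss = 0.02
p_pump_fails = 0.10

p_both_pumps_fail = and_gate([p_pump_fails, p_pump_fails])
p_top_event = or_gate([p_power_loss, p_both_pumps_fail])
print(f"P(top event) = {p_top_event:.4f}")
```

The AND gate corresponds to a parallel (redundant) arrangement and the OR gate to a series arrangement of possible malfunctions.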

The Reliability Engineering Toolbox: Failure Rates

Failure Rates

Failure rates, in the simplest form, are Σ(number of failures)/Σ(time in use), which is the reciprocal of mean time to/between failure.
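A small sketch of this calculation, taking the failure rate as the reciprocal of the mean time between failures. The function name and the operating records are made up for illustration.

```python
def failure_rate(hours_in_use, failure_counts):
    """Total failures divided by total time in use, in failures per hour."""
    total_hours = sum(hours_in_use)
    total_failures = sum(failure_counts)
    if total_hours <= 0:
        raise ValueError("total time in use must be positive")
    return total_failures / total_hours


# Hypothetical records for three similar machines over one year.
hours = [8_000.0, 7_500.0, 8_500.0]
failures = [2, 1, 3]

rate = failure_rate(hours, failures)  # failures per hour
mtbf = 1.0 / rate                     # mean time between failures, hours
print(f"failure rate = {rate:.6f} per hour, MTBF = {mtbf:.0f} hours")
```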

Reliability Tools - Reliability Engineering

Reliability engineering is a strategic job of preparing plans to reduce failures and the cost of failures, as a preventive measure to reduce the cost of unreliability. The reliability engineer acquires failure data and analyzes it to quantify the financial impact and prepare long-term solutions that prevent recurrences and improve reliability and uptime. The engineer determines the cost advantages, proposes alternatives for solving the problem, and recommends the alternative with the lowest long-term cost of ownership. The purpose of these actions is to prevent failures.

Reliability Growth Models

Reliability growth models are important management concepts for making reliability visual with simple displays. Simple log-log plots of cumulative failures on the Y-axis against cumulative time on the X-axis often form straight lines, and the slope β of the trend line tells whether failures are coming faster (β>1), which is undesirable; slower (β<1), which is desirable; or with neither improvement nor deterioration (β=1), which usually drifts toward undesirable results. The reliability growth models are frequently called Crow-AMSAA plots in honor of Larry Crow's proof of why the charts work, as described in MIL-HDBK-189, when he worked with AMSAA.
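The slope of the trend line can be estimated with an ordinary least-squares fit on the logarithms of the data. The sketch below is one simple way to do that, assuming the power-law form N(t) = λ·t^slope; it is a graphical-style fit rather than a maximum likelihood estimate, and the failure times are invented so that the result is easy to check.

```python
import math


def loglog_trend(cumulative_times, cumulative_failures):
    """Fit log(N) = log(lambda) + slope * log(t) by least squares; return (slope, lambda)."""
    xs = [math.log(t) for t in cumulative_times]
    ys = [math.log(n) for n in cumulative_failures]
    count = len(xs)
    x_bar = sum(xs) / count
    y_bar = sum(ys) / count
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    scale = math.exp(y_bar - slope * x_bar)
    return slope, scale


# Hypothetical cumulative operating hours at which the 1st..5th failures occurred.
times = [1.0, 4.0, 9.0, 16.0, 25.0]
failure_numbers = [1, 2, 3, 4, 5]

slope, scale = loglog_trend(times, failure_numbers)
if slope > 1:
    trend = "failures are coming faster"
elif slope < 1:
    trend = "failures are coming slower"
else:
    trend = "no improvement or deterioration"
print(f"slope = {slope:.2f}: {trend}")
```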

Sudden Death Testing

For expensive components and expensive tests, sudden death tests involve a few components that tie up a test frame as they are heavily loaded under the same test loads/conditions, with several items being run at the same time. When one of the items fails, the entire test frame is shut down, so that you have 1 failure (this is the sudden death!) and several suspensions, because the unfailed units are survivors. The test is halted until the test frame is loaded with new samples for resumption of the life test. Opening the test frame (instead of tying up the frame until all samples have failed) is cost effective. If three units can be tested simultaneously and each test is halted on the first failure, then 12 samples will yield only 4 failures and 8 suspensions for preparing the Weibull analysis. Will the data set of 4 failures + 8 suspensions be different than if all 12 samples had been run to failure? Yes, the results will be different, but they will not be significantly different. So, as with simultaneous testing, the suspensions (censored data) become important details for use in the statistical analysis. Most sudden death tests are accelerated to generate the data in a short period of time, although this carries the risk of introducing unexpected failure modes (but this can also be useful information for anticipating field failures).
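To make the bookkeeping concrete, the sketch below builds a sudden death data set from 12 samples tested three at a time. The lifetimes are hypothetical "true" lives that would never be observed in practice for the suspended units; they are used here only to show which units become failures and which become suspensions.

```python
def sudden_death(lifetimes, units_per_frame):
    """Return (time, status) records when each frame is stopped at its first failure."""
    records = []
    for start in range(0, len(lifetimes), units_per_frame):
        frame = lifetimes[start:start + units_per_frame]
        stop_time = min(frame)              # the frame is halted at the first failure
        failed_index = frame.index(stop_time)
        for position in range(len(frame)):
            status = "failure" if position == failed_index else "suspension"
            # Survivors are suspended at the time the frame was stopped.
            records.append((stop_time, status))
    return records


# Hypothetical lives in hours for 12 samples, loaded three per frame.
lives = [410, 520, 300, 650, 280, 700, 390, 450, 610, 330, 480, 560]
data = sudden_death(lives, units_per_frame=3)

failure_times = sorted(t for t, status in data if status == "failure")
suspension_count = sum(1 for _, status in data if status == "suspension")
print(failure_times, suspension_count)
```

The resulting failure and suspension times are what would be passed to a Weibull analysis that handles censored data.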

Software Reliability

Software does not wear out, but it does fail. Most failures are due to specification errors and code errors, with only a few due to errors in copying or use. The only software repair is reprogramming, and adding safety factors is almost impossible. Software reliability improves by finding and fixing errors, but estimating the number of errors that cause failures is extremely difficult because many branches of software code may lie dormant and unused until special events make the latent failures obvious. Software failures are not often time related but are more software code page dependent. Software reliability is improved by extensive testing to disclose failures, fixing them, and then repeating the tests to validate that the fix did not generate more failures and to continue the search for other latent defects.

Configuration Control

Configuration control is involved with the management of change by providing traceability of failures back into the design standard. If the design details are not specified, the design will not contain the requirements and thus implementation of the project will be hit or miss for achieving the desired end results beginning with the conceptual design and resulting in the operating facility.

Reliability Tools - Maintenance

All actions necessary, both technical and administrative, for retaining an item in or restoring it to a specified condition so it can perform a required function. The actions include servicing, repair, modification, overhaul, inspection, reclamation, and restored condition determination.

Load-Strength Interactions

For reliability successes, loads must always be less than strengths. When loads are greater than strengths, failures occur. The issue is determining the probability of load-strength interference, which is the joint probability that loads exceed strengths. The loads should include expected conditions plus the foolishness of people who violate rules and overload equipment, plus the vagaries of Mother Nature, which impose unexpected static and dynamic loads from hurricanes, tornadoes, earthquakes, wildfires, and so forth.
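One common way to estimate the interference probability is to assume that load and strength are independent and normally distributed, in which case the margin (strength minus load) is also normal. The sketch below uses that assumption; the means and standard deviations are hypothetical, and real loads or strengths may follow other distributions.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interference_probability(load_mean, load_sd, strength_mean, strength_sd):
    """P(load > strength) for independent, normally distributed load and strength."""
    margin_mean = strength_mean - load_mean
    margin_sd = math.hypot(load_sd, strength_sd)  # sqrt(load_sd**2 + strength_sd**2)
    return normal_cdf(-margin_mean / margin_sd)


# Hypothetical stresses in MPa.
p_failure = interference_probability(load_mean=300.0, load_sd=30.0,
                                     strength_mean=420.0, strength_sd=40.0)
print(f"P(load exceeds strength) = {p_failure:.4f}")
```

Widening the load distribution to cover abuse or extreme weather raises the load standard deviation and therefore the interference probability.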

The Reliability Engineering Toolbox

Weibull Database

The smartest way to maintain a reliability database is in Weibull format and Weibull databases are available.

Lognormal

Lognormal distributions are continuous life functions that have long tails to the right (display positive skewness) in time or usage. A lognormal distribution plotted on semi-log paper would appear as a normal curve.

Overall equipment effectiveness (OEE)

Overall equipment effectiveness (OEE) is a manufacturing index to reduce complexity of discrete systems for problem solving and benchmarking.
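The text above does not spell out a formula, but OEE is commonly computed as the product of an availability rate, a performance rate, and a quality rate. The sketch below assumes that convention; the function name and the three rates are hypothetical.

```python
def oee(availability, performance, quality):
    """Overall equipment effectiveness as the product of three rates, each between 0 and 1."""
    for label, value in (("availability", availability),
                         ("performance", performance),
                         ("quality", quality)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{label} must be between 0 and 1")
    return availability * performance * quality


# Made-up rates for one production line.
score = oee(availability=0.90, performance=0.95, quality=0.99)
print(f"OEE = {score:.1%}")
```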

Reliability-Centered Maintenance

Reliability-centered maintenance (RCM) is a systematic planning process used to determine the maintenance requirements for a system. RCM assumes the system has an inherent reliability; maintenance requirements are imposed upon that baseline of inherent safety and inherent reliability, which can be no better than what was designed into the system.

Reliability

Reliability is the probability that a device, system, or process will perform its prescribed duty without failure for a given time when operated correctly in a specified environment.

Dependability

The International Electrotechnical Commission (IEC) defines dependability as follows: "Dependability describes the availability performance and its influencing factors: reliability performance, maintainability performance and maintenance support performance." MIL-HDBK-338 defines dependability differently, as a measure of the degree to which an item is operable and capable of performing its required function at any (random) time during a specified mission profile, given that the item is available at mission start. (Item state during a mission includes the combined effects of the mission-related system R&M parameters but excludes non-mission time; see availability.) Dependability is related to reliability, with the intention that dependability would be a more general concept than the measurable issues of reliability, maintainability, and maintenance.

Reliability Block Diagrams

Reliability block diagram (RBD) models are graphical representations of a calculation methodology for reliability systems.
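The calculation behind a simple RBD can be sketched as follows, assuming independent blocks arranged only in series and parallel. The block names and reliability values are hypothetical.

```python
from math import prod


def series(reliabilities):
    """A series arrangement works only if every block works."""
    return prod(reliabilities)


def parallel(reliabilities):
    """A parallel (redundant) arrangement works if at least one block works."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical system: motor -> two redundant pumps -> valve.
r_motor, r_pump, r_valve = 0.95, 0.90, 0.98

r_pump_pair = parallel([r_pump, r_pump])
r_system = series([r_motor, r_pump_pair, r_valve])
print(f"pump pair = {r_pump_pair:.4f}, system = {r_system:.5f}")
```

Nesting the two functions in this way mirrors how blocks are grouped on the diagram.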

Availability is part of the Reliability Strategy Development toolbox

Availability

A tool for measuring the percentage of time an item or system is in a state of readiness where it is operable and can be committed to use when called upon.
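In its simplest form this percentage is uptime divided by total time. The sketch below assumes that definition; the hours are made up and simply add up to one calendar year.

```python
def availability(uptime_hours, downtime_hours):
    """Fraction of total time the item is operable and ready for use."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total time must be positive")
    return uptime_hours / total


# Hypothetical year: 8,322 hours ready for use, 438 hours down.
ratio = availability(uptime_hours=8_322.0, downtime_hours=438.0)
print(f"availability = {ratio:.1%}")
```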