Reliability and maintainability management is the management of failure. By using specific approaches and tools, one can obtain optimized, cost-effective solutions to the design, assembly and use of a product.

by Fred Schenkelberg

Reliability is the probability of a product successfully functioning as expected for a specific duration within a specified environment. Figure 1 shows the four key elements of reliability: function, probability of success, duration and environment. Maintainability is a characteristic of design, assembly and installation: the probability that failed equipment or systems are restored to their normal operating state within a specific timeframe, using specified repair techniques and procedures. Maintainability is related to reliability because when a product or system fails, there may be a process to restore it to operating condition.

Figure 1: The four key elements of reliability

The fundamental expectation from a customer’s point of view is for the product to work as promised. However, failure happens. Anticipating and dealing with failure are the cornerstones of a successful reliability and maintainability (R&M) management program.

R&M engineering pulls together resources from across many fields, including design, materials, finance, manufacturing, failure analysis and statistics. R&M management requires knowledge of product specifications, apportioning reliability, an understanding of feedback mechanisms and a consideration of maintenance requirements. These aspects will be addressed in turn.


Everyone desires products that offer more features, provide higher value, cost less and last longer. These four aspects drive the development of any product. The set of product functions or features defines the operating state and, conversely, what a system failure may include. Although not required, a set of functions is often detailed at the outset of a product development program. During product development, the design is regularly evaluated or tested and compared to the desired set of functions.

Cost may or may not be the most important consideration over the product lifecycle, yet it is often known and tracked during development and maintained for products during use. Cost includes the cost of goods sold and may include the cost of service and repairs. It often does not directly include the cost of failure to the customer, yet that cost may be known. For example, when a deep-sea oil exploration rig has to retract a drill head because of a part failure, it may cost close to $1 million per day. Many products have a cost target during design, and this target is carefully monitored.

Time to market is another common requirement placed on the product development team. This is especially true for products with a short season for sales (e.g., the holiday market). Setting milestones and deadlines is a management tool that helps get the product to market in a timely and coordinated manner. Like functions and costs, shipping a product on time is routinely measured.

Clearly stating the complete reliability goal is not difficult to do at the beginning of a design program. Moreover, once stated, the goal provides a common guide for development decision-making, along with reliability test planning, vendor and supply chain requirements, and warranty accrual. The goal certainly may change over the development process, as may product features, cost targets, or deadlines. The reliability goal is like any other product specification; it just concerns the performance over time after the product is placed into service.

It is important to state the reliability goal so it includes all four elements of the reliability definition. For example: a wireless router provides 802.11n connectivity with the features specified in product requirements document HWR003, in a U.S. factory environment, with a 96 percent probability of still operating after five years of use.
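As a rough illustration, the four elements of the example goal can be kept together in one record so that none is dropped when the goal is passed to design teams and suppliers. The sketch below is hypothetical: the `ReliabilityGoal` class and its field names are invented for illustration, and the conversion to a yearly failure rate assumes a constant failure rate (an exponential life distribution), which the goal statement itself does not imply.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReliabilityGoal:
    """Hypothetical record holding the four elements of a reliability goal."""
    function: str          # what the product must do
    environment: str       # where and how it is used
    duration_years: float  # how long it must keep working
    probability: float     # probability of still working at that duration

    def implied_failure_rate(self) -> float:
        """Failures per year, ASSUMING a constant failure rate: R(t) = exp(-rate * t)."""
        return -math.log(self.probability) / self.duration_years

    def reliability_at(self, years: float) -> float:
        """Reliability at another duration under the same constant-rate assumption."""
        return math.exp(-self.implied_failure_rate() * years)


goal = ReliabilityGoal(
    function="802.11n connectivity per requirements document HWR003",
    environment="U.S. factory",
    duration_years=5.0,
    probability=0.96,
)

print(f"Implied failure rate: {goal.implied_failure_rate():.4f} per year")
print(f"Reliability at 1 year: {goal.reliability_at(1.0):.4f}")
```

If the expected failure mechanisms are dominated by wear-out or early-life failures, the constant-rate assumption does not hold and a different life distribution would be needed.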


An extension of reliability goal setting is to break down the goal to cover the individual elements of the product, thus providing a meaningful reliability objective for each component. In a series system, every element must work for the system to work, so the probability of failure for each element has to be lower than that for the overall system. The opposite is true for elements in parallel: the system fails only when all of the redundant elements fail, so each element may have a higher probability of failure than the system. For complex systems, the apportionment calculation may become more complex, yet the concept still applies.
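The series and parallel relationships can be checked with a short calculation. This sketch assumes the elements fail independently of one another; the function names and example values are illustrative only.

```python
from math import prod


def series_reliability(element_reliabilities):
    """All elements must work: system reliability is the product of the elements."""
    return prod(element_reliabilities)


def parallel_reliability(element_reliabilities):
    """Any one element suffices: the system fails only if every element fails."""
    return 1.0 - prod(1.0 - r for r in element_reliabilities)


# Three elements in series, each 99 percent reliable: the system is worse than any element.
series = series_reliability([0.99, 0.99, 0.99])

# Two redundant elements in parallel, each 90 percent reliable: the system is better than either.
parallel = parallel_reliability([0.90, 0.90])

print(f"Series system:   {series:.4f}")    # about 0.9703
print(f"Parallel system: {parallel:.4f}")  # about 0.9900
```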

Assigning a clear and concise reliability objective to each of your design teams and suppliers provides a means to make reliability-related decisions local to the element under consideration. This may influence design margins, material selection and validation techniques.

Keep in mind that all four elements are part of the apportioned reliability goal. Often, the environment and use profile will be different for different product elements. For example, the power supply may operate full time, but the hard drive may often be idle and partially powered down. The location within the product may alter the temperature the elements experience. One must localize the apportioned goal, or at least provide sufficient information, to fully articulate and act upon an apportioned reliability goal.

The process used to create the apportionment may be simply an equal allocation to each element or a weighting on expected or known reliability performance (e.g., predictions, models, etc.). Rarely is there enough information to provide perfect apportionment from the start. Apportionment will be a work in progress, evolving as the design matures, new information becomes available and the design is evaluated.
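The two allocation approaches can be sketched for a simple series system. The example below is a minimal illustration, not a prescribed method: it assumes independent elements in series, and the weighted version assumes each weight represents an element's expected share of system failures. The element names and weights are made up.

```python
from math import prod


def apportion_equal(system_goal: float, n_elements: int) -> list[float]:
    """Give every series element the same target: R_i = R_sys ** (1 / n)."""
    return [system_goal ** (1.0 / n_elements)] * n_elements


def apportion_weighted(system_goal: float, weights: dict[str, float]) -> dict[str, float]:
    """Give each series element R_i = R_sys ** (w_i / sum(w)).

    A larger weight (a larger expected share of failures) yields a lower,
    easier target; the product of all targets still equals the system goal.
    """
    total = sum(weights.values())
    return {name: system_goal ** (w / total) for name, w in weights.items()}


system_goal = 0.96  # e.g., 96 percent probability of operating after five years

equal_targets = apportion_equal(system_goal, 3)

# Hypothetical weights based on prior predictions or field data.
weighted_targets = apportion_weighted(
    system_goal, {"power supply": 3.0, "hard drive": 2.0, "main board": 1.0}
)

print([round(r, 4) for r in equal_targets])
for name, target in weighted_targets.items():
    print(f"{name}: {target:.4f}")
```

As the design matures and better estimates become available, the weights can be revised and the targets recomputed, which matches the work-in-progress nature of apportionment.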



For system development, setting a reliability goal or any specification requires measurement of the performance compared to the desired performance. The difference may require changing the design or adjusting the goal.

What will fail is a core question facing nearly any product development or maintenance team. How much is already known about the expected failure mechanisms plays a crucial role in the steps followed to determine them. In situations with known failure mechanisms and only minor design changes, there is little need to “discover” failure mechanisms. The focus may shift to those areas related to the changes and validation of existing failure mechanisms. Another situation may include many uncertainties related to failure mechanisms. A design change to eliminate a specific mechanism may reveal another, previously hidden, mechanism. A new material may involve exploration of how the material will react over time to its shipping and operating environment.

Discovery includes a range of tools available to R&M professionals. This may include literature searches, failure mode and effects analysis (FMEA), and discussions with suppliers or researchers. Discovery may involve a wide range of tests, including material characterization, step stress to failure testing, and highly accelerated life testing.

The intent is to find the weaknesses within a design and take steps to minimize failures. For example, a team may discover a material color fades quickly in sunlight, but adding a stabilizing agent may ensure color fastness. A highly accelerated life test (HALT) may expose a faulty layout and require a redesign of the printed circuit board. Understanding and characterizing the failure mechanisms enable the designer to avoid surprises in later product testing or during use.

FMEA is a tool used to merge the ideas and knowledge of a team to explore the weaknesses of a product. To some, this may seem like a design review; to others, it is an exploration of each designer’s knowledge of the boundary to failure. Depending on the team and how much is already known, FMEA may or may not be a fruitful tool to discover product failures. However, it nearly always has the benefit of effectively communicating the most serious and likely issues across the team.

HALT is a discovery tool in which sufficient stress or multiple stresses are applied to a product to cause failure. Starting at nominal stress levels, the HALT approach then steps up increasing amounts of stress until the product no longer functions as expected. Careful failure analysis may reveal design weaknesses, poor material choices, or unexpected behavior. The failures provide knowledge on areas for improvement. A product that has its detected weaknesses resolved is more robust and, thus, able to withstand normal stresses and the occasional abnormal stress load without failure.

FMEA and HALT provide information about product design and materials, yet both rely, to some extent, on previous knowledge about the expected failure mechanisms. Within the FMEA team, this knowledge is shared or a new question may be explored, possibly revealing new information. HALT applies stresses that are expected to cause failure, but a new product design or material may have an unknown response to an unexplored stress. Both tools serve a purpose and have proven very useful in the failure discovery process, yet acquiring more information about possible failure mechanisms may enhance both tools and the product.

Most materials and components undergo development and characterization. Modern products may have hundreds of materials and thousands of components, yet each has some failure mechanism history of exploration and characterization. As a minimum, new materials or components must be researched to understand the known failure mechanisms and how they manifest within the chosen design and environment. Published literature in scientific and engineering journals is a good place to start. Researchers should be engaged in a discussion about how the material may behave in the chosen design. Many component and material suppliers have intimate knowledge of the component or material weaknesses and are willing to share that with their customers.


Repairing a product presumes the product is repairable. Creating a product that is repairable is part of the design. Some products are not repairable simply because the repair process costs more than the value of the product. Products such as an escalator or an automobile have design features that make them economical to repair. The design, the supply chain for spare parts and tools, and the execution of repairs are all part of maintainability.

There are many metrics related to the time to repair: diagnostic time, spare part acquisition, technician travel time, equipment repair time, etc. Combining mean time to repair (MTTR) and mean time to failure (MTTF) information provides a measure of availability. Availability is related to the concept that the equipment is ready to work when expected. Concepts of throughput, capacity and readiness are related to availability.
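One common way to combine the two measures is steady-state (inherent) availability, MTTF / (MTTF + MTTR). The sketch below assumes that definition and uses made-up numbers; it counts only active repair time, so logistics delays such as spare part acquisition or technician travel would have to be added to MTTR to get an operational figure.

```python
def inherent_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the equipment is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR must be non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical equipment: fails on average every 1,000 hours, takes 10 hours to repair.
availability = inherent_availability(mttf_hours=1000.0, mttr_hours=10.0)
expected_downtime_per_year = (1.0 - availability) * 8760.0  # hours in a 365-day year

print(f"Availability: {availability:.4f}")                         # about 0.9901
print(f"Expected downtime: {expected_downtime_per_year:.1f} h/yr")  # about 86.7
```

Halving the repair time or doubling the time to failure gives nearly the same availability gain here, which is one reason design and maintenance trade-offs are considered together.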

In the design process, the designer needs to consider access, disassembly, assembly, calibration, alignment and numerous other factors when creating a system that is repairable. For example, a car’s oil filter has standard fittings, permitting the use of existing oil filters as a replacement. The design of the system may involve trade-offs between design features and aspects of maintainability, such as the cost of spare parts and the time needed to actually accomplish a repair.

For the team maintaining equipment, considerations include understanding equipment failure mechanisms, symptoms and MTTF expectations. Stocking tools and spare parts can be expensive, but the cost can be minimized if the system behavior over time is understood. The team may require specialized training and certifications; these also may increase maintenance costs.

There are a couple of basic approaches to maintenance: time-based or event-based. If you change your car’s oil every three months, you are using a time-based approach; if you are changing your oil every 5,000 miles, then you are using an event-based approach. Both require some knowledge about the failure mechanism involved to set the triggering time or event criteria so the maintenance is performed before either significant damage to the system or failure occurs.
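The oil change example can be expressed as two simple trigger checks. This is only a sketch: the function names are invented, and treating "three months" as 90 days is an assumption made for illustration.

```python
from datetime import date


def time_based_due(last_service: date, today: date, interval_days: int = 90) -> bool:
    """Time-based trigger: due once enough calendar time has passed (90 days assumed)."""
    return (today - last_service).days >= interval_days


def event_based_due(miles_at_last_service: float, miles_now: float,
                    interval_miles: float = 5000.0) -> bool:
    """Event-based trigger: due once enough usage has accumulated."""
    return (miles_now - miles_at_last_service) >= interval_miles


# A lightly driven car: the calendar trigger fires before the mileage trigger.
due_by_time = time_based_due(date(2024, 1, 1), date(2024, 4, 15))
due_by_miles = event_based_due(miles_at_last_service=42000, miles_now=45500)

print(f"Due by time:  {due_by_time}")   # True
print(f"Due by miles: {due_by_miles}")  # False
```

In practice the interval itself is the hard part: it has to be set from knowledge of the failure mechanism so that maintenance happens before significant damage occurs.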

Another approach is to monitor indicators of the amount of wear or damage that has occurred and repair the unit just as it approaches the end of its own useful life. For example, periodically testing an oil sample may reveal when the oil is about to become ineffective as a lubricant. Monitoring and maintenance can be very sophisticated or very simple, such as having a brake wear indicator that causes a squealing sound. Prognostic health management is a relatively new field focused on measurement techniques that, like the wear indicator in brake pads, assist the maintenance team in maximizing the useful life of a product and effecting repairs and maintenance only as needed to prevent failure.
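A very simple form of condition monitoring fits a straight line to periodic measurements and estimates when the trend will reach a limit. The sketch below assumes a roughly linear degradation trend and a known limit, neither of which is guaranteed in practice; the function name, readings and limit are hypothetical.

```python
def estimate_time_to_limit(times, readings, limit):
    """Fit a least-squares line to (time, reading) pairs and return the time
    remaining after the last measurement until the line reaches `limit`.

    Returns None if the readings show no upward (degrading) trend.
    """
    n = len(times)
    if n < 2 or n != len(readings):
        raise ValueError("need at least two paired measurements")
    mean_t = sum(times) / n
    mean_y = sum(readings) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("measurement times must not all be equal")
    slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, readings)) / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_t
    time_at_limit = (limit - intercept) / slope
    return max(0.0, time_at_limit - times[-1])


# Hypothetical oil samples: contamination reading taken every 100 operating hours.
hours = [0, 100, 200, 300]
contamination = [10.0, 20.0, 30.0, 40.0]

remaining = estimate_time_to_limit(hours, contamination, limit=60.0)
print(f"Estimated hours until the limit is reached: {remaining:.0f}")  # about 200
```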


The various tasks and activities commonly associated with R&M are not accomplished without purpose. They add value to making decisions, provide direction and feedback, help to prevent expensive mistakes and avoid excessive repairs. The tools and resources of R&M engineering provide a means to efficiently achieve one’s R&M goals.


Fred Schenkelberg

Fred Schenkelberg is a leading authority on reliability engineering. He is the reliability expert at FMS Reliability, a reliability engineering and management consultant firm he founded in 2004. Fred has a Bachelor of Science in Physics from the United States Military Academy and a Master of Science in Statistics from Stanford University.
FMS Reliability
