by Paul R. Casto, CMRP
To optimize both maintenance risk and cost, the interrelationships between reliability, maintenance and operations must be considered and leveraged to capitalize on the strengths of each. Reliability Centered Operations (RCO) is an approach that optimizes these relationships through the application of a maintenance strategy built from failure analysis that will yield more expansive and cost-effective risk reduction tasks.
A high percentage of the results from traditional RCM/FMEA analyses are condition monitoring tasks, and the majority of these are tied to operator rounds. This approach links the operators into the development and execution of this strategy. Two key elements are required to effectively break down the walls between operations and maintenance: empowering operators with technology and handheld devices to execute reliability-focused operator rounds, and utilizing asset performance enterprise systems to manage data flow, thus tying operators into the work management system already utilized by maintenance. This article discusses the technical solution for a systematic, technology-based approach to developing such a strategy, along with results from typical projects using the RCO approach.
The goal of Asset Performance Management (APM) is to safely maximize predictable production at the lowest sustainable cost while addressing the risk profile of the business. This is primarily achieved by implementing APM best practices, which use data-driven reliability methods and sustainable maintenance practices. The objective of a reliability and maintenance strategy is to provide the minimum amount of maintenance work required to meet the business needs of the operation. To achieve this, reliability and maintenance must be focused on and be structured to support and understand the changing goals and requirements of the business units.
Reliability is about failure elimination, and maintenance is about work processes. These functions are inextricably woven together with the fate of one being dependent on the success of the other. While there are many tools and techniques available for the plant maintenance manager in the quest to improve APM performance, none is more powerful than the effective and efficient use of the maintenance workforce. This workforce is the primary means to execute repair and improvement tasks. Further, the workforce labor and material usage make up the majority of the plant maintenance cost. Optimization of this group’s work plan is paramount to lowering cost, improving reliability and meeting the risk profile of the business.
Current economic conditions have changed the risk profile that most businesses must manage. These new profiles include lower levels of production, greater variability in production schedules, and shorter windows for product delivery. These conditions result in risk profiles where lower costs are needed and slower repairs can be tolerated, but the need to respond to orders dependably is required. These changing risk profiles are manifested in maintenance as “cost cutting” while still requiring critical equipment to be operational at the right time, further emphasizing the need for a cohesive approach to managing the relationship, information flow and interdependence between reliability and maintenance.
The interdependence between reliability and maintenance is best illustrated when considering that they intersect at the failure mode level. That is, maintenance is either repairing a failure that has occurred or performing work to predict and prevent failures that have not yet occurred. Thus, all maintenance activity should be focused on the failure mitigation principles of:
Developing a maintenance strategy based on these four failure mitigation principles requires that equipment failure modes are analyzed in detail. This analysis is central to creating actionable maintenance plans that contain optimized risk mitigating tasks, which can then be applied based on the risk vs. cost profile the business units can support. The optimized integration of reliability and maintenance at the failure mode level is a strong tool and is one aspect of maximizing APM results.
These failure mitigation strategies can be leveraged for improved results by expanding the ownership of equipment reliability to include operations. Operators are knowledgeable about the equipment operating characteristics and circumstances surrounding many of the failure modes, and have an acute understanding of the failure consequences. This experience equips operations to play a key role in the development, execution and sustainability of maintenance and reliability strategies.
Linking operator knowledge into a maintenance strategy based on failure analysis will yield more effective risk reduction tasks. In addition, by participating in task creation, operators will feel “ownership” of the solution, which is a pivotal factor in program sustainability.
This cross-functional failure mitigation strategy requires the identification of conditions, processes and upsets which can lead to equipment damage before the damage occurs and the failure mechanism has begun to deteriorate the equipment’s performance. This is illustrated on the Installation, Potential Failure, Functional Failure (IPF) curve shown in Figure 1. The objective of this strategy is to extend the IPF curve over time (to the right) to realize longer equipment life.
Creating maintenance plans based on failure analysis is a basic element of improving asset performance. Properly designed maintenance plans integrate reliability principles with actionable maintenance tasks, and through the application of these tasks maintenance can impact asset performance. Developing these maintenance plans is best done using a systematic approach implemented by a cross-functional team. The steps of this approach can be summarized as follows:
1. Forming the Failure Analysis Team – The formation of the analysis team is an important part of the overall strategy development. This team should provide the technical and operational expertise, the interface to their maintenance, operations and supervisory peers, and development and debugging of the plans. The team is typically comprised of a Reliability Engineer/Facilitator, an Operator, a Mechanic, and a Team Leader (Foreman). Other resources can be brought into the team on an ad hoc basis.
These teams can sometimes be reduced to three members: operator, mechanic and team leader. This approach reduces the resource requirements to perform the analysis. In this case, the team leader should have extensive hands-on plant experience and demonstrated experience applying reliability tools and strategies, and should be skilled in failure analysis facilitation.
2. Mapping the Equipment and the Process – The next step in the process is to map the process flow and the layout of the supporting manufacturing equipment. The mapping process is similar to the Six Sigma and lean mapping process. This is a powerful tool and the resulting map will help the team visualize the process and understand the equipment. The map is also useful in identifying system constraints as well as production bottlenecks, and it will facilitate equipment criticality analysis. It is also helpful in getting the team members to learn aspects of the equipment and processes that they may not already be familiar with.
3. Analyzing Criticality – Criticality ranking is an important step in identifying what type of maintenance strategy should be applied to individual equipment. Criticality is determined by integrating the probability and consequence of failure. Factors such as safety, environmental impact, risk to production loss, replacement cost and maintenance cost are typically included as part of failure consequence. The consequences of failure can also be weighted by the factors mentioned above and used to determine overall risk ranking.
There are known and accepted processes to perform criticality ranking, which is usually done by a cross-functional team operating with clear ranking guidelines. The criticality analysis is typically done at the equipment level, and the output will rank the equipment from most to least critical. This may be represented as high, medium and low, or with a numbering system such as 1-5. The highest criticality should contain no more than 5 to 10% of the equipment, the important (medium level) equipment 30% to 60%, and lower levels should cover 30% to 50% of the equipment.
Using these guidelines, the bulk of the analysis work will be done on 40% to 60% of the plant equipment, and the amount of equipment using a run-to-failure strategy is driven by the criticality analysis. A problem occurs when organizations want to rank the vast majority of their equipment at the highest level, which hinders the development of effective R&M plans and leads to excessive (non-value-added) maintenance work. Discipline must be exercised to adhere to criticality guidelines.
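As a rough illustration of the criticality ranking described above, the following Python sketch combines a probability score with weighted consequence factors and then assigns bands in line with the percentage guidelines. The 1-to-5 scoring scales, the weights, the default band shares, and the names Equipment, criticality_score and assign_bands are assumptions made for illustration; they do not come from any particular ranking standard.

```python
from dataclasses import dataclass

# Hypothetical weights for the consequence factors named in the text.
# The article lists the factors but does not prescribe their weights.
WEIGHTS = {
    "safety": 0.35,
    "environment": 0.25,
    "production": 0.25,
    "replacement_cost": 0.05,
    "maintenance_cost": 0.10,
}


@dataclass
class Equipment:
    tag: str
    probability: int     # likelihood of failure, 1 (rare) to 5 (frequent)
    consequences: dict   # factor name -> severity, 1 (minor) to 5 (severe)


def criticality_score(item):
    """Probability multiplied by the weighted consequence of failure."""
    consequence = sum(WEIGHTS[f] * item.consequences[f] for f in WEIGHTS)
    return item.probability * consequence


def assign_bands(equipment, high_share=0.10, medium_share=0.50):
    """Rank equipment by score and cap the share placed in each band."""
    ordered = sorted(equipment, key=criticality_score, reverse=True)
    n_high = int(len(ordered) * high_share)
    n_medium = int(len(ordered) * medium_share)
    bands = {}
    for position, item in enumerate(ordered):
        if position < n_high:
            bands[item.tag] = "high"
        elif position < n_high + n_medium:
            bands[item.tag] = "medium"
        else:
            bands[item.tag] = "low"
    return bands
```

Capping the share of each band, rather than banding on an absolute score, is one way to enforce the discipline the guidelines call for: equipment can only be moved into the highest band by displacing something else.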
Using these rankings, a comprehensive asset maintenance strategy can be applied. These strategies will contain a mix of tasks that are optimized to mitigate failure based on the criticality of the equipment. The strategies applied to equipment with higher criticality rankings require more analysis and often result in more complex maintenance strategies. Some maintenance strategies which could be associated with criticality ratings are:
1. Reliability Centered Maintenance (RCM) methodologies
2. Equipment-centric failure mode and effects analyses (FMEAs)
3. Analysis of existing maintenance plans (often called PM optimization)
4. Application of predefined maintenance plans (based on equipment class and service)
5. Basic equipment care
4. Analyzing Failure – To build maintenance plans based on failure analysis, a systematic work process should be followed. For the critical and important ranked equipment this may be done using classical RCM, RCM Blitz® or basic equipment-centric FMEAs. These processes are well known and accepted and can be applied directly to the equipment to evaluate failure modes, causes and effects. For any of these methodologies, FMEA is at the heart of failure analysis and development of the associated mitigating tasks. FMEA focuses on failure modes and causes that lead directly or indirectly to equipment failures. Some of these are:
• Gradual equipment deterioration
• Process upsets that damage equipment
• Human error
• Variation in operational parameters
• Variation from standard operating procedures
• Improper equipment installation, repair and maintenance
The resulting tasks are designed to mitigate the causes of the failures in a proactive manner. That is, many of the tasks will address issues on the IPF curve (Figure 1) prior to the potential failure point (P) being reached. The resulting actions prevent equipment damage and lengthen the operating life of the equipment by extending the IPF curve to the right as shown in Figure 2.
5. Analyzing Risk – One of the outputs of the initial FMEA process is an unmitigated risk index, which can be modified for user preference but, at a minimum, accounts for the likelihood and consequence of the failure. This index may be used to filter low-risk failure modes out of the analysis, which focuses the plan on the highest-risk failures. Typically, a cut-off value for the risk index is established, and risk reduction tasks are developed for those items which exceed this value.
These tasks are then evaluated and a new risk index (mitigated risk index) is developed to measure the impact of the tasks on risk reduction. The effectiveness of the tasks can be evaluated vs. implementation cost, and, based on the risk reduction value, a further filtering process can be applied. This process will focus the results of the analysis to create the highest value maintenance plan.
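The two-stage filtering described above can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the risk index is taken as likelihood times consequence on 1-to-5 scales, a task is assumed to lower only the likelihood, and the names FailureMode, risk_index and select_tasks, as well as the "risk points per unit cost" threshold, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    likelihood: int            # before mitigation, 1 (rare) to 5 (frequent)
    consequence: int           # 1 (minor) to 5 (severe)
    mitigated_likelihood: int  # estimated likelihood once the task is in place
    task_cost: float           # cost of the proposed task


def risk_index(likelihood, consequence):
    """Simplest form of the index: likelihood times consequence."""
    return likelihood * consequence


def select_tasks(modes, risk_cutoff, min_reduction_per_cost):
    """Return (name, unmitigated, mitigated) for tasks that pass both filters."""
    selected = []
    for mode in modes:
        unmitigated = risk_index(mode.likelihood, mode.consequence)
        if unmitigated <= risk_cutoff:
            continue  # first filter: low-risk failure modes are dropped
        mitigated = risk_index(mode.mitigated_likelihood, mode.consequence)
        reduction = unmitigated - mitigated
        # second filter: risk reduction must justify the implementation cost
        if reduction >= min_reduction_per_cost * mode.task_cost:
            selected.append((mode.name, unmitigated, mitigated))
    return selected
```

In this sketch a task that does not lower the index is rejected whenever it has a cost, however high the unmitigated risk, which is the point of evaluating tasks against the mitigated index rather than the original one.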
6. Developing the Mitigating Tasks – The mitigating tasks are developed as part of the FMEA process and they are filtered based on the risk and cost profile for the business. The goal of these mitigating tasks is to proactively eliminate process, operating and maintenance problems that lead to equipment damage. Typical mitigating tasks resulting from the analysis include:
1. Development of standardized work procedures
2. Time-based maintenance actions
a. Preventive maintenance
d. Minor repairs
3. Condition monitoring
a. Predictive technologies
c. Diagnostic systems
d. Process monitoring
4. Equipment design
a. Poor equipment design requiring upgrades
b. Redesigns to improve reliability
c. Redesign to improve maintainability
Typical project results indicate that the task breakdown by group may be as shown in Figure 3 [1]. This chart indicates that 60% of the tasks (by number of tasks) belong to the operations group, due to the number of operator inspections that come from the FMEA. It is worth noting that, if operations had not been included on the failure analysis team and had not participated in the design of the maintenance strategy, almost 60% of the mitigating tasks (risk reduction tasks) would not have been included in the final plan. This aspect of building maintenance plans using a cross-functional failure analysis team is not seen in traditional maintenance approaches and is often overlooked. Clearly, this method offers a significant advantage over traditional plan development.
A breakdown of mitigating tasks by type is shown in Figure 4 [2], indicating that the proactive condition monitoring tasks make up 60% of the total tasks by type. This is in alignment with the proactive strategy illustrated in Figure 1.
7. Integrating Proactive Operator Inspection Tasks – Operators know the operating parameters of their processes and equipment and recognize when they aren’t running correctly. Using this knowledge in building failure-based maintenance plans will identify mitigation tasks which are typically included in the operator rounds (inspections). These inspection tasks utilize the operator’s senses and knowledge base to detect process disturbances and equipment problems that are much harder to capture using sensor technology. These inspections can comprise up to 50% of the R&M program’s proactive tasks. Further, due to the proactive nature of these inspections, critical data is obtained either prior to equipment damage occurring or very early in the failure process. Using handheld technology (Figure 5) as an enabler, the operator can easily and efficiently input conditions to create alarms. These alarms provide the basis for actions to be taken prior to the onset of performance degradation and equipment damage.
The proactive operator inspections are a direct result of the FMEA process. Examples include, but are certainly not limited to:
1. Visual inspection such as looking for leaks
2. Inspections dependent on the operator’s senses (for example, “it sounds different”)
3. Quantitative reading of values, such as the level of lubricant
4. Use of basic tools, such as vibration pen, heat tape, etc.
5. Equipment settings
The execution of these tasks will also vary by time periods, which may include every shift, once a day (or multiple days), once a week (or multiple weeks), once a month, etc. Taken together, these tasks must be integrated into an executable operator round. This is a major responsibility of the operator as a member of the analysis team. The operator will organize the routes, do the initial runs and testing of the routes, review and test the routes with the other operators, lead the rollout of the routes, and provide the necessary training for the users.
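One simple way to picture how tasks with different frequencies come together in a single round is sketched below in Python. The example assumes every interval can be expressed as a whole number of shifts (for instance, 3 for daily and 21 for weekly on a three-shift schedule) and that grouping the due tasks by asset is an acceptable walking order; InspectionTask, tasks_due and build_round are hypothetical names, and a real route would be sequenced by the operator as described above.

```python
from dataclasses import dataclass


@dataclass
class InspectionTask:
    asset: str
    check: str
    interval_shifts: int  # 1 = every shift, 3 = daily on three shifts, etc.


def tasks_due(tasks, shift_number):
    """Tasks whose interval divides the shift count (shift 0 is the first)."""
    return [t for t in tasks if shift_number % t.interval_shifts == 0]


def build_round(tasks, shift_number):
    """Due tasks grouped by asset so each stop is visited once per round."""
    return sorted(tasks_due(tasks, shift_number), key=lambda t: t.asset)
```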
It is also worth noting that using handhelds as an enabling technology provides the opportunity to set up routes and gather data on functions that are not related to reliability or operations, such as safety inspections, housekeeping, environmental inspections and maintenance PM inspections.
8. Linking Proactive Alarms with the Work Management System – Operators have been performing inspections for years, often recording their findings on paper and, more recently, using handhelds to perform standard process checks. The integration of operator inspections into the maintenance strategy is not new either. However, the process as described in this article is innovative in two areas:
1. The operator inspections were developed based on failure analysis. This results in proactive tasks which utilize the operator’s senses and knowledge to identify process and equipment variations that can lead to equipment damage and, thereby, extend the life of the equipment.
2. Properly designed, the handhelds provide a portal to link the operator’s work identification efforts into the work management system. This allows the efficient identification and dispatch of failure-causing disturbances. This is illustrated in Figure 7.
Proactive work identification is driven by (1) allowing the operator to create alarms on out-of-limit conditions at the point of detection and (2) processing these alarms into the work management system (an EAM such as SAP, or a CMMS). The operator normally runs his/her route, creating alarms on the handheld screens. At the conclusion of the route, the handheld device can be connected to the operator’s computer, the alarms reviewed, and any other relevant information added. The alarms are then processed to the work management system for work order creation, planning and scheduling. Wireless technology could also be applied and the alarms sent directly to the work management system from the field if desired. It should be noted that if, during the inspection, the operator finds a condition that requires immediate attention, this is handled by the normal emergency work notification process.
This process links the operator rounds input into the work selection process. This overall maintenance strategy and work selection process is normally managed by an Asset Performance Management (APM) system, which manages the data, creates alarms and passes the information to the work management system. Work management is typically done in the CMMS or EAM system. This process is shown in Figure 6.
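The flow from an out-of-limit reading to a work request can be sketched as follows. This is a minimal Python illustration, not the interface of any particular handheld, APM, EAM or CMMS product: Checkpoint, Alarm, evaluate and to_work_requests are hypothetical names, and the work requests are plain dictionaries standing in for whatever record the work management system actually expects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Checkpoint:
    asset: str
    check: str
    low: Optional[float] = None   # lower alarm limit, if any
    high: Optional[float] = None  # upper alarm limit, if any


@dataclass
class Alarm:
    asset: str
    check: str
    reading: float
    note: str = ""                # added by the operator during review


def evaluate(checkpoint, reading):
    """Return an Alarm if the reading is outside the limits, otherwise None."""
    below = checkpoint.low is not None and reading < checkpoint.low
    above = checkpoint.high is not None and reading > checkpoint.high
    if below or above:
        return Alarm(checkpoint.asset, checkpoint.check, reading)
    return None


def to_work_requests(alarms):
    """Convert reviewed alarms into simple work-request records."""
    return [
        {
            "asset": a.asset,
            "description": f"{a.check}: reading {a.reading} out of limits. {a.note}".strip(),
            "source": "operator round",
        }
        for a in alarms
    ]
```

Keeping the review step (the note field) between alarm creation and work-request creation mirrors the process described above, in which the operator reviews the alarms and adds relevant information before they are passed on.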
Recently, one major U.S. chemical producer implemented an integrated approach between reliability and maintenance focused on operator rounds that included the right mix of people, education, goals and technology. At this company, more than 70 operators embraced the use of a technology-supported initiative that yielded a 40% reduction in reactive maintenance within one year, clearly a fulfillment of the RCO approach’s driving principle to positively impact the bottom line [3].
9. Developing Metrics for Evaluating Strategy Performance – This work process should be monitored closely in order to measure (1) progress of strategy development, (2) completeness of mitigating tasks, and (3) bottom line results. These measures should focus on three areas:
1. Completion of FMEA Tasks - There will be many tasks resulting from the analysis that will not be completed immediately and, depending on the nature of the tasks, these can take months to complete. It is critical that the leadership team monitor the progress against the task completion schedule in order to provide needed resources and support. The measures which should be considered are:
a. Percent FMEA tasks complete in total
b. Percent FMEA tasks complete by task category
c. Percent FMEA tasks complete by group responsible for completion
2. Effectiveness of the Proactive Strategy - The different elements of the proactive strategy developed by this work process should be measured to understand the strengths, weaknesses and effectiveness of the strategy. Some measures which should be considered are:
a. Percent of maintenance work identified proactively
b. Percent of work from condition monitoring tasks
c. Percent of maintenance work identified through handhelds
d. Percent of inspection routes completed
3. Bottom Line Results - The results of the strategy must be measured in order to understand the enterprise-wide, bottom-line value that has been created, both short term and long term. Some measures for consideration are:
a. Percent reactive work
b. Percent downtime
c. Increased availability
d. Additional revenue
e. Maintenance cost
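Several of the measures listed above reduce to simple ratios, as the following Python sketch shows. It assumes each work order records its labor hours, whether it was reactive, and how it was identified, and that percentages are taken on labor hours rather than on work order counts; the field names and the function strategy_metrics are hypothetical.

```python
def percent(part, whole):
    """Percentage of part in whole; returns 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def strategy_metrics(work_orders, routes_scheduled, routes_completed):
    """Summarize reactive, proactive, handheld and route-completion measures."""
    total = sum(wo["hours"] for wo in work_orders)
    reactive = sum(wo["hours"] for wo in work_orders if wo["reactive"])
    handheld = sum(wo["hours"] for wo in work_orders if wo["source"] == "handheld")
    return {
        "percent_reactive": percent(reactive, total),
        "percent_proactive": percent(total - reactive, total),
        "percent_from_handhelds": percent(handheld, total),
        "percent_routes_completed": percent(routes_completed, routes_scheduled),
    }
```

Whichever basis is chosen (hours or counts), it should be held constant over time so that the trend in reactive work remains comparable from period to period.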
One of the most telling of all the measures is the percentage of reactive work. There are various ways to define “reactive work” but generally it is work that must be done immediately, is unplanned, and breaks into the weekly schedule. It is normally the result of a breakdown, safety, or environmental issue. As the failure-based maintenance plans are implemented, changes in the levels of reactive work should be seen quickly. The first step in any improvement program is to stabilize maintenance in the targeted area by lowering reactive work. This is done by addressing failures and responding to the proactive work that is identified from the maintenance plans and especially operator rounds. These actions will lower reactive work because the work is identified in time to correct defects before the failure has occurred and, in some cases, before equipment degradation has begun. As these proactively identified items are addressed, the level of reactive work will drop, and maintenance resources can be utilized more effectively.
It should be noted that once the plan is implemented, and proactive work is being identified, the single most important element of success is performing the proactive work and addressing the failure before it occurs. If the operations and maintenance organizations fail to take advantage of the prior knowledge of impending failures, the program will not achieve the desired results.
Central to the approach described in this paper is the ability to obtain condition monitoring data, process this data, and efficiently convert this information into proactive maintenance actions. A critical part of this is the management of the overall information flow. As can be seen from Figure 4, the majority of mitigation tasks can be categorized as condition monitoring. Figure 3 indicates that, due to the large number of operator inspection tasks that are identified in the FMEA process, the majority of the tasks (by number) are implemented by operations. This complete set of data must be integrated and communicated to the users via asset health indicators. This information, when properly structured and displayed, can be reviewed by users to quickly understand the alarms and assess the overall health of the assets.
The ideal information-flow system should integrate all data from predictive technology, process historians, diagnostic systems, and engineering analysis. The data flow should be bi-directional where applicable and linked to the asset performance management system where it can be reviewed, converted to a work request, and sent to the work management system or otherwise be dispositioned. The objective of this data structure is to maximize information effectiveness with as little manual intervention as possible. This optimal information flow is illustrated in Figure 7.
One of the primary tools in the manufacturing plant to improve equipment reliability and improve asset performance is the maintenance strategy. Maintenance and reliability intersect at the failure mode and this interrelationship is exemplified in the creation of improved maintenance plans using failure analysis. A key result of these improved maintenance plans will be a reduction of reactive work. A secondary effect of this reduction will be observed in work planning and scheduling. As the reactive work is reduced, this will free up resources to focus on planning and executing proactive work, which will lead to a further reduction in reactive work. The impact of this cycle can be dramatic.
Another area that is substantially impacted by cross-functional failure analysis is safety. Ten to twenty percent of recommendations resulting from this process will be focused on correcting unsafe conditions [4]. Further, it has been shown that there is a positive correlation between the amount of immediate corrective and reactive work and total injuries [5]. This positive correlation suggests that lowering the amount of reactive work will also tend to lower the total injuries incurred. This is a very important benefit that should not be overlooked by the plant leadership team.
Overall results from implementation of this strategy include the following:
• Significant increase in utilization and availability
• Reduction in reactive work by up to 50% [6]
• Increase in operators’ knowledge of their equipment
• Substantial improvement in management of backlog as more work is ready to schedule [7]
• Significant increase in planned maintenance work
• Reduction in maintenance cost [8] (The largest impact in this reduction will be seen after all of the mitigating tasks are complete.)
• Improvement in MTBF and MTTR
Many reliability and maintenance solutions have been implemented in manufacturing plants in recent years, and these have seen varied success. Leveraging the interactive relationship between maintenance and reliability in order to improve overall asset performance can be a key differentiator and force multiplier in developing a successful APM initiative. The likelihood that mitigating tasks achieve the desired results can be increased by integrating operations into the development and execution of the technical solution. Capitalizing on the knowledge and sensory capability of the operators will increase the options available to address failure causes, thereby multiplying the improvements of this approach versus a traditional maintenance program. Technology and information flow must be viewed as a work process enabler, and as a necessary step to execute failure-driven maintenance strategies efficiently.
The results of this approach have proven to be impressive and far exceed those that are seen from many traditional R&M programs. However, this approach does require discipline and strong leadership to be successful. Further, it must be emphasized that after implementation has begun and proactive work is being identified, the single most important element of success is completing the proactive work before equipment degradation begins or the failure occurs. When the operations and maintenance organizations take advantage of the prior knowledge of an impending failure, the program results can be outstanding.
Paul Casto, CRE, CQE, CSSBB, CMRP, VP Value Implementation, Meridium, is a leading practitioner in reliability and maintenance improvement methodologies. He has hands-on experience in reliability, maintenance, operations and engineering in the chemical, steel, aluminum, automotive, aerospace, consumer goods and construction industries. Paul holds a Bachelor’s degree in Electrical Engineering from West Virginia University, a Master’s degree in Engineering Management from Marshall University Graduate College, an MBA from Clemson University, and a Master’s in Maintenance Management and Reliability Engineering from the UT/Monash University program. He is currently doing additional graduate work related to R&M improvement methodologies at the University of Tennessee. Paul is an ASQ certified Six Sigma Black Belt, holds ASQ certification in Reliability Engineering and Quality Engineering and is an SMRP Certified Maintenance and Reliability Professional. He is a member of the University of Tennessee’s Maintenance and Reliability Center’s advisory board, serves on the SMRP Best Practices committee and the SMRPCO Advisory Council, and is an active member of ASQ and IEEE.
1. Kathy Light and Steve Powers, “Managing Change in a Major Reliability Improvement Effort”, MARCON 2010 Proceedings (2010), presentation slide 15.
3. APM Advisor, Feb 2008 issue, http://www.
4. Kathy Light and Steve Powers, “Managing Change in a Major Reliability Improvement Effort”, MARCON 2010 Proceedings (2010), p. 13.
5. Ron Moore, “Reliability Leadership for Manufacturing Excellence”, December 2008 Workshop, slide 18 in presentation.
6. Mark Mitchell, “Improving Asset Strategies Using Handhelds”, Meridium Conference 2008, slide 53.
8. Kathy Light and Steve Powers, “Managing Change in a Major Reliability Improvement Effort”, MARCON 2010 Proceedings (2010), p. 13.