The Analysis Advantage - Reliability Incident Management and Strategic Data Collection

20 April 2009

Over the past few years, the author has researched the potential benefits, the methods used and difficulties experienced by a variety of industries and companies. Research indicates that a strong approach to incident management carries twice the benefit of implementing a CMMS and performing a maintenance strategy review.

However, in practice, it is found that incident management works far better after a maintenance strategy review has been performed and the preventive maintenance program is well executed.

Research also shows that the better incident management systems are supported by focused plant downtime recording systems significantly advanced from what is normally deployed.

The combination of these four elements is named Reliability Assurance by the author.

This paper also discusses how any organisation can embark on a Reliability Assurance program and move quickly to a position where it can better capitalise on its defect elimination and reliability incident management programs.

Aim

The aim of this paper is to develop an understanding of, and some tips about, the following:

The Reliability Assurance program,
The process of Incident Management, and
Setting up and managing strategic data collection systems.

The paper also covers the relationships of RCM / PMO and the deployment of a CMMS as part of a holistic Reliability Assurance program.

Preamble

Some Definitions

The following acronyms are used throughout the paper.

CMMS - Computerised Maintenance Management System

OEE - Overall Equipment Effectiveness

RCA - Root Cause Analysis

RIMS - Reliability Incident Management System based on RIMSys ®

RCM - Reliability Centered Maintenance according to SAE standard JA 1011

PMO - Planned Maintenance Optimisation - In this paper, PMO refers to our own version, which is PMO2000™. This version is not the same as the PMO programs developed and used in the US nuclear power industry. The difference is that PMO2000™ produces the same results as RCM, whereas the latter may not.

Notes to Readers

The content of this paper assumes a basic knowledge of the RCM task selection criteria according to SAE standard JA1011, titled Evaluation Criteria for Reliability Centered Maintenance Processes.

Progress in the world of Industrial Maintenance and Reliability

The Past

Maintenance is a practice conducted across the span of the animal kingdom. As humans have built more sophisticated and dangerous machinery, and the world has become increasingly reliant on machines to deliver wealth, maintenance and reliability practices have continued to evolve. Changes in philosophy seem to be accelerating as is the thirst for better methods amongst practitioners.

In the industrial arena, many observers see the following as being significant changes in approaches.

Early 1900’s - Traditional methods based on overhaul.
1978 – Nowlan and Heap Report on RCM written for United Airlines. In this report, Nowlan and Heap coined the name Reliability Centered Maintenance. This was, and still is, a process of maintenance analysis developed for use in the design phase of the asset life cycle. (Moubray 1997)
Mid 1980’s – Promotion of RCM as a maintenance analysis tool in industries outside of airlines. In the absence of any tool better suited to developing asset maintenance strategy, RCM was used widely in plants that were already operating.
Mid 1990’s – Development and refinement of Planned Maintenance Optimisation (PMO) techniques. Because RCM could not adequately cater for organisations that required a maintenance strategy review and rationalisation process, PMO became popular and was considered by regulators to be a major strength in some nuclear power stations. (Johnson 1995)
Recent years – Further refinement of PMO methods that provide the same maintenance program as RCM but significantly faster. PMO2000™ being one of those processes.

The Future

RCM has definitely taken a grasp on the maintenance community as a tool considered to be best practice. However, the failure of RCM to become commonplace in industry and the high number of incomplete analyses are testament to the problems with its application in industry. The major problem with RCM applied to existing plant is that it is a very time consuming process. The rise in popularity of PMO processes is because they are considerably faster than RCM. PMO2000™ is a process that retains the speed while achieving the same maintenance program (Turner 2001). For these reasons the spread of PMO2000™ has been rapid. Users of PMO2000™ now do not get tied up with analysis; they quickly move into implementation and begin to look for greater opportunity.

In the author’s opinion, the future for asset intensive companies now lies in the integration of the four independent systems depicted in Figure 1.

Defining Reliability Assurance

Reliability Assurance is a term used by a variety of industries in many different ways. Web searches reveal uses in the following applications:

Software development
Electronic component design and manufacturing process design
Hardware designs both in computer and general applications.

All the above applications seem to be launched in the design phase of the product life cycle. Few, if any, processes of Reliability Assurance have been developed for existing plants. The following is the author’s model of Reliability Assurance designed specifically for assets in use or for new assets where there are similar assets operating elsewhere and / or the vendor has provided a maintenance schedule.

My definition of Reliability Assurance is “Reliability Assurance is a process that determines the inherent reliability and performance of an asset in its operational context and serves first to increase the level of performance to the inherent level by the application of preventive and predictive maintenance, and then, by data collection and problem solving, increase the inherent performance level through the introduction of modifications to the machine design, the operating methods or conditions.”

The Four Quadrants of Reliability Assurance

From a process perspective, a Reliability Assurance (RA) program contains the four quadrants, or elements, shown in the model at Figure 1 and described below:

A Maintenance Management System

A maintenance management system is the core of the maintenance administration function. It allows effective planning and scheduling, cost management and other administrative and execution capabilities. These systems are often computerised and known as Computerised Maintenance Management Systems (CMMS). Almost all modern CMMS products are very good at administration and execution of maintenance and perform some Reliability Assurance functions. However, in reality, these are very limited and unsuitable for efficient deployment of the remaining three elements of an effective and efficient Reliability Assurance program.

Strategy Review Systems

A strategy development and review system is a system that contains the list of all the plant failure characteristics and the strategy developed to eliminate or best-manage the consequence of unexpected failure. Typical methods of strategy development include Reliability Centered Maintenance (RCM) and Planned Maintenance Optimisation (PMO).

Performance Loss Data Collection System

A performance loss data collection system is a system that records a variety of downtime categories, rate loss and quality loss at component and failure mode level. Such a system often collects data electronically through DCS or SCADA systems, but then relies on links to strategy development to code and turn the data into valuable information.

Reliability Incident Management system

A reliability and incident management system is a system which controls the Root Cause Analysis and the “Plan Do Check Act” cycle of continuous improvement and failure elimination.

Relating Management Systems to the Reliability Assurance Model

To approach maintenance performance improvement, it is useful to understand the relationships between the four quadrants within the model and what returns each provides in terms of improved machine performance. The best independent information known to the author comes from the DuPont model of Up-Time (Ledet 1994) featured in the Manufacturing Game. This model illustrates the relationships between the various elements of the Reliability Assurance four quadrant model. The table below illustrates how DuPont has modelled the relative effect of various strategies on plant uptime.

Ledet’s analysis suggests that if companies focus on planning only they will improve their uptime by 0.5%. If they focus only on maintenance scheduling, uptime will improve by 0.8%. If they focus on preventive and predictive maintenance only, uptime will actually get worse by 2.4%. If organisations focus on all of these three aspects, they will gain a 5.1% improvement in availability.

These results may well sound appealing in their own right, but subsequent to the report, Ledet found that by adding defect elimination to the initiatives undertaken, a further 9.7% improvement in availability (taking the total improvement to 14.8%) may be achieved in their plants. This information is provided in the table at Figure 2.

Figure 2 - Table showing the effect of different reliability engineering activities on plant availability taken from the Manufacturing Game
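The arithmetic behind these figures can be checked with a few lines of Python. The sketch below only restates the percentages quoted above from Ledet; the variable names are illustrative and are not part of the original model. It shows that the three initiatives taken in isolation would sum to a net loss, whereas the quoted combined effect is a gain.

```python
# Uptime changes (in percent) as quoted in the text from Ledet (1994).
single_focus = {
    "planning only": 0.5,
    "scheduling only": 0.8,
    "preventive/predictive only": -2.4,
}
all_three_combined = 5.1        # planning, scheduling and PM/PdM together
defect_elimination_extra = 9.7  # further gain from adding defect elimination

# What the isolated effects would add up to if they were simply additive.
naive_sum = round(sum(single_focus.values()), 1)

# Total improvement once defect elimination is added to the other three.
total_with_defect_elimination = round(
    all_three_combined + defect_elimination_extra, 1
)

print(naive_sum)                      # -1.1
print(total_with_defect_elimination)  # 14.8
```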

The relationships between the four quadrants of Reliability Assurance and the process elements studied by Ledet are shown in Figure 3.

This relationship suggests that improving planning and scheduling by implementing a CMMS without having a focussed PM program will not generate significant returns. Similarly, working hard at developing a focussed PM program without a good planning and scheduling system will not generate significant returns either. The suggestion is that organisations should work on their CMMS planning and scheduling systems and their maintenance strategy development as well.

The other important factor is that the defect elimination process is the process that provides the most improvement opportunity. If this is the case, then the intuitive approach to secure improvement would be to focus on defect elimination first and then work on the other elements.

This approach will not work without a strong foundation of preventive maintenance. This is because without good preventive maintenance, reactive maintenance will prevail. In reactive mode, a high percentage of the failures will be caused by a lack of maintenance, not inherent problems with machinery design or operating problems. In this situation, any program to work on defects will, in all likelihood, be unable to determine if the failure was due to lack of maintenance or design. In addition, the volume of defects to analyse will probably be too high to cover, making the defect elimination program exhausting and ineffective.

The proposed starting point therefore must be to get the fundamentals of effective PM in place, which means that the first step in launching a Reliability Assurance program is to review the maintenance strategy.

Describing the Elements in Detail

Maintenance Management Systems
Some of the functions of a Maintenance Management System include the following:

Administration and Execution
Work orders
Work history
Spare parts
Cost control
Contractor management
Planning and Scheduling.

There is considerable literature available that provides information on setting up and using a maintenance management system. Because of this, the focus of this paper is not in this area. There are, however, some factors about the implementation of maintenance management systems that make effective Reliability Assurance difficult to accomplish. These factors are discussed in the following paragraphs.

Perception Issues

One of the problems in getting a Reliability Assurance program established is that many people in organisations believe that the CMMS can do more than it is capable of. To put things bluntly, few, if any, CMMS products have the data fields and data relationships necessary to provide information in the manner required for maintenance strategy development, plant performance management, and incident management. The realisation that all of these systems are distinctly different yet need to be integrated is very important. The purchase and use of a CMMS alone will leave a large void in Reliability Assurance infrastructure.

To be effective at Reliability Assurance, an organisation needs systems, software and work processes that deal with the other three quadrants.

Use of the system

Many CMMSs are poorly used and not set up properly. This may be for many reasons. One of them is that in setting up a CMMS, the organisation has seen it as a tool to manage maintenance administration alone. They have not understood the concepts of Reliability Assurance. In computerising their systems they have failed to realise that setting up a system to administer maintenance is not going to work well if the underlying programs are not well defined or do not add value. The startling result of most PMO2000™ analyses completed over an eight-year period is that barely fifty percent of the maintenance strategies contained in the CMMS remain unchanged after review. It is common to delete 15% of the maintenance as it adds no value, and it is common to add the same amount to manage failures that are preventable but have no PM. The remainder of the changes become interval extensions or reductions, or moves from time based overhaul maintenance to condition based maintenance or vice versa.

Trying to plan and schedule a PM program that is only 50% effective cannot be good management.

Incident Management

Reliability incidents can be defined as failures of plant and equipment that lead to any kind of loss, or increased risk to the business. In capital-intensive industries, the categories vary in terms of exposure and likelihood. Typically however, they include the following categories:

Threat to safe operation
Threat to the environment
Threat to the commercial viability of the company
Loss of customer satisfaction
Loss of production or failure to complete the mission
Breach of security
High repair cost.

The process of incident management is to identify and resolve plant or human failures that result in greater exposure or the loss of any of the above.

Ideally, organisations would take steps to remove such risks before they occur. In practice, however, predicting every risk and reducing each of them to acceptable levels is a very difficult thing to do.

Many organisations have found that they can create a focussed maintenance strategy for all equipment (critical and non critical) within 12 months by taking a review and rationalisation approach. The problem with most maintenance strategy development activities is that information is never perfect and assumptions are made. This means that the maintenance strategy is a living program which needs incident management to keep it current and developing as better information comes to hand.

In addition to incorrect assumptions, there are other factors that could cause unexpected equipment failure. Some of these factors are as follows:

Temporary repairs installed and not removed
Maintenance error caused by poor training or lack of adherence to procedures
Maintenance not being done on time
Incorrect operation of equipment
Faulty parts installed.

The process undertaken to review reliability incidents is relatively simple and quite commonplace. It follows a typical investigation cycle found in many problem solving techniques. At a high level, the generic process that we prefer to use has seven steps which are listed below:

Originate
Allocate Analysis Responsibility
Analysis and Recommendations
Approve
Implement
Review
Close
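As an illustration only, the seven steps listed above can be represented as an ordered workflow. The sketch below assumes the steps are followed strictly in sequence; that is a simplifying assumption, since a real system may send an incident back from Approve to Analysis, for example. The names IncidentStep and next_step are hypothetical and do not come from RIMSys.

```python
from enum import IntEnum


class IncidentStep(IntEnum):
    """The seven generic steps, numbered in the order given in the text."""
    ORIGINATE = 1
    ALLOCATE_ANALYSIS_RESPONSIBILITY = 2
    ANALYSIS_AND_RECOMMENDATIONS = 3
    APPROVE = 4
    IMPLEMENT = 5
    REVIEW = 6
    CLOSE = 7


def next_step(current):
    """Return the step that follows `current`, or None once the incident is closed."""
    if current is IncidentStep.CLOSE:
        return None
    return IncidentStep(current + 1)


step = IncidentStep.ORIGINATE
while step is not None:
    print(step.name)
    step = next_step(step)
```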

As this approach is so well known, it is not considered necessary to discuss each step. However, it is worthwhile expanding on one unique aspect that pertains to Reliability Assurance. When reviewing equipment failure, there is a specific process flowchart that we recommend should be followed. The process is shown below in Figure 4.

The starting point [F] is any unexpected failure that has occurred in the plant. The first step is to define the failure mode or mechanism of failure. Following this [Failure Analysed?], it needs to be determined if this failure mode has been analysed previously using RCM / PMO logic. If it has not [N], then it should be put through an RCM / PMO analysis [Apply RCM / PMO]. If it has been reviewed [Y], then the validity of the previous review needs to be assessed against the fact that the failure has now occurred unexpectedly [Failure Prevented?]. The previous analysis may have recommended a "No Scheduled Maintenance" policy, in which case the outcome was expected and no further action need be taken unless the failure has now become more of a problem than originally thought [Increasing problem?]. In that case, modifications and a revision of the RCM / PMO analysis should be undertaken based on the decreased reliability.

If, however, the recommendation was for PM and the PM has failed [System Downfall], then the source of the problem needs to be identified and rectification action taken.
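The decision logic described in the two paragraphs above can be sketched as a small function. This is a reading of the text only, since Figure 4 is not reproduced here; the function name, parameter names and returned strings are hypothetical.

```python
def review_unexpected_failure(previously_analysed,
                              strategy_was_nsm=False,
                              increasing_problem=False):
    """Return the recommended action for a failure that has just occurred.

    previously_analysed: has this failure mode been through RCM / PMO?
    strategy_was_nsm:    did that analysis recommend No Scheduled Maintenance?
    increasing_problem:  is the failure now worse than originally thought?
    """
    if not previously_analysed:
        return "Apply RCM / PMO"
    if strategy_was_nsm:
        # The failure was expected; act only if it is becoming a bigger problem.
        if increasing_problem:
            return "Modify and revise RCM / PMO"
        return "No further action"
    # PM was recommended but did not prevent the failure.
    return "System downfall: identify source and rectify"


print(review_unexpected_failure(previously_analysed=False))
print(review_unexpected_failure(True, strategy_was_nsm=True))
print(review_unexpected_failure(True, strategy_was_nsm=False))
```

Note that every branch after the first needs the stored maintenance strategy for the failure mode, which is why the strategy review must exist before the incident management system can work.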

Clearly, to undertake this work, the organisation needs to have an efficient means of retrieving the maintenance strategy for any given failure mode. Once again, the need to conduct either RCM or PMO2000™ before deploying an incident management system is shown.

Strategic Data Collection

Data Types

Data collection in the maintenance environment can take many forms. It is important to collect data about plant condition and what maintenance work has been done. However, that data is not the data most needed for Reliability Assurance. The data required for Reliability Assurance is data relating to equipment failure and the circumstances surrounding that failure. There needs to be a clear distinction between these different types of data when setting up a strategic data collection system.

Even though the data required for Reliability Assurance varies between sites, the following generalisations apply to the vast majority of cases.

Data Elements

Machines exist 24 hours of every day they are on the company register. During this time, they may be in a number of states. Some of these states are listed below:

Having upgrades or modification
Not required for production
In transit or being changed to different products
In production
In planned maintenance, and / or
In breakdown maintenance after having suffered a failure and being repaired or running at a reduced rate.

Many companies track these states and establish a figure that compares production time to total time. This figure, when de-rated with quality and throughput losses is often called Asset Utilisation or Total Effective Equipment Productivity.
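A minimal sketch of such a figure follows. It assumes the de-rating is multiplicative, in the style of an OEE calculation; the text does not state the exact formula, so the function and its parameter names are illustrative assumptions.

```python
def asset_utilisation(production_hours, total_hours,
                      rate_factor=1.0, quality_factor=1.0):
    """Production time over total time, de-rated for throughput and quality.

    rate_factor and quality_factor are fractions between 0 and 1
    (1.0 means no loss). Multiplying them is an assumption of this sketch.
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= production_hours <= total_hours:
        raise ValueError("production_hours must lie between 0 and total_hours")
    for factor in (rate_factor, quality_factor):
        if not 0 <= factor <= 1:
            raise ValueError("factors must lie between 0 and 1")
    return (production_hours / total_hours) * rate_factor * quality_factor


# Example with made-up numbers: a year of 8760 hours, 6000 of them in production.
print(asset_utilisation(6000, 8760, rate_factor=0.95, quality_factor=0.98))
```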

This paper is primarily concerned with machine reliability and is therefore concerned only with the latter two points on the above list. It should be noted that this paper is restricted to analysis of evident failures, as hidden failures by definition do not of themselves cause operational loss.

These two reliability elements can be expanded as follows:

Planned Maintenance
- Preventive Maintenance, or
- Corrective Maintenance
Breakdown Maintenance
- Expected Failure - Equipment breakdowns that have been assessed as “No Scheduled Maintenance”, or
- Unexpected Failure - Equipment breakdowns that should have been predicted or prevented.

Inherent in these elements are some concepts that need to be understood clearly in order that the Reliability Assurance approach makes sense. These concepts are explained with the assistance of the models shown in Figures 5 and 6.

Inherent Capability Loss

The reliability and performance of any machine is determined by two factors. These are as follows:

The way the machine was designed, and
The way it is operated

Expected Failures

Failure characteristics and economics are such that for some failures, the defined maintenance strategy is “No Scheduled Maintenance” (NSM). This may be because of the two scenarios described below:

The failure is random and the P-F interval is too short to be of any use, or
The cost of prevention is more than the cost of the failure.

This reality means that there will be a certain level of unavailability inherent in the design and operating conditions. Failure modes which have NSM strategies will inevitably become breakdowns and result in capability loss. We call such failure modes Expected Failures. This is because over the life of the asset, it is expected that such failures will occur and result in loss of production.

Planned Maintenance

While some failures or breakdowns will be accepted as being inevitable, others will be prevented either through condition monitoring or fixed time replacement. Where these preventive actions require that the plant is taken off line, then the preventive maintenance is another loss that is inherent.

In addition, condition monitoring may detect the onset of failure. The rectification action taken in such cases may require the plant to be taken off line.

All of these losses combine to form the Inherent Capability Loss shown in Figure 5. The Inherent Performance Level is therefore the total time less the Inherent Capability Loss.
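The relationship just stated can be written out as a short calculation. The sketch below treats the three inherent losses (expected failures, preventive maintenance taken off line, and planned corrective work following condition monitoring) as hours that are simply added; the function and parameter names are illustrative.

```python
def inherent_performance_level(total_hours, expected_failure_hours,
                               preventive_hours, corrective_hours):
    """Return (inherent capability loss, inherent performance level) in hours."""
    inherent_capability_loss = (
        expected_failure_hours + preventive_hours + corrective_hours
    )
    if inherent_capability_loss > total_hours:
        raise ValueError("losses cannot exceed total time")
    return inherent_capability_loss, total_hours - inherent_capability_loss


# Example with made-up numbers for one year of operation.
loss, level = inherent_performance_level(
    total_hours=8760,
    expected_failure_hours=200,  # breakdowns on "No Scheduled Maintenance" modes
    preventive_hours=300,        # PM that requires the plant off line
    corrective_hours=100,        # planned rectification after condition monitoring
)
print(loss, level)  # 600 8160
```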

Maintenance Capability Loss

If the PMO/RCM maintenance analysis was done correctly, and the machine is maintained and operated according to the approved process, then it should suffer no unexpected failures. This is not to say that the plant will not fail; it means that all the failures the plant experiences will be expected. The reality in most organisations is that some failure modes that receive PM will still fail unexpectedly, so failures that the preventive maintenance program was meant to prevent occur during production.

These losses are shown in Figure 6.
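One way to quantify this loss from breakdown records is sketched below. It assumes each record carries the maintenance strategy assigned to its failure mode, so that any breakdown on a failure mode covered by PM counts as unexpected. The record layout and names are hypothetical.

```python
def maintenance_capability_loss(breakdowns):
    """Sum the downtime of breakdowns on failure modes that have a PM task.

    breakdowns: iterable of dicts with keys
        'failure_mode', 'strategy' ('PM' or 'NSM') and 'hours'.
    """
    return sum(b["hours"] for b in breakdowns if b["strategy"] == "PM")


records = [
    {"failure_mode": "bearing seizure", "strategy": "PM", "hours": 6.0},
    {"failure_mode": "lamp failure", "strategy": "NSM", "hours": 0.5},
    {"failure_mode": "belt wear", "strategy": "PM", "hours": 2.0},
]
print(maintenance_capability_loss(records))  # 8.0
```

The lamp failure is excluded because it was assessed as No Scheduled Maintenance and therefore belongs to the inherent capability loss instead.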

Data Collection Systems
The steps suggested are discussed in the following paragraphs.

Step 1 – Setting up a Generic Data Collection System

Setting up a generic data collection system is not difficult in theory. In practice, data collection usually involves people in the collection of the data, input into computers, and its use. The following is a list of important factors that should be considered early in the development of the data collection system.

Data collection systems often have a large number of interested parties or stakeholders. For this reason, data collection strategies should not be created by a single person with only one agenda in mind.
Data is often collected by people who operate machines. The degree of literacy and numeracy should be assessed and considered. Avoiding written notes by using codes is a good idea.

Consider what types of codes are applicable as the reporting loss codes determine the reports that can be generated.
In most cases, someone has to enter the data into a database. It is important to minimise the effort involved in keying in information.
In some cases it is important to collect data about rate variance and quality loss. If these are important, then there should be plans put in place to account for these losses. Sometimes this data can be difficult to obtain accurately. However, it is often worth making some assumptions and implementing something rather than waiting for the perfect solution to be found.
The system should be reconcilable. This means that the actual output plus the losses should amount to the standard rate multiplied by the time in production.
Information about plant failure is best input after the fault has been corrected. The system should be such that codes are not entered at the time the fault occurs. They should be entered after the rectification is complete.
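The reconciliation requirement in the list above lends itself to a simple check. The sketch below assumes every loss has already been converted into units of output, and that a small tolerance is acceptable; the function name and the tolerance value are assumptions.

```python
def reconcile(actual_output, losses, standard_rate, production_hours,
              tolerance=0.01):
    """Check that actual output plus losses equals standard rate x time.

    losses: dict mapping loss category to lost output (same units as output).
    Returns (is_reconciled, unexplained_gap).
    """
    expected = standard_rate * production_hours
    accounted = actual_output + sum(losses.values())
    gap = expected - accounted
    return abs(gap) <= tolerance * expected, gap


# Example with made-up numbers: 100 units/hour over an 8 hour shift.
ok, gap = reconcile(
    actual_output=700,
    losses={"breakdown": 60, "rate": 30, "quality": 10},
    standard_rate=100,
    production_hours=8,
)
print(ok, gap)  # True 0
```

A non-zero gap indicates output that is neither produced nor explained by a recorded loss, which usually points to missing or miscoded records.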

Step 2 – Establish Current and Inherent Performance

Establishing the current performance of an asset can be done when the data collection system is implemented. It may take some time to generate sufficient data to understand the average performance levels as there could be quite a bit of variation over time.

The establishment of the inherent performance level assumes that RCM or PMO2000™ has been undertaken. The process commonly used to determine inherent performance level is to collect all the loss data from a recent period long enough to be valid and consider what failure modes are treated with PM and which ones will be left to repair when they fail. By making the assumption that the failure modes that now have the PM done at the correct interval will have been planned maintenance activities rather than breakdowns, then it is possible to predict what the performance would have been under the new maintenance strategy.
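The prediction described above can be sketched as a re-costing of historical breakdowns. The sketch assumes that each breakdown on a failure mode now covered by PM is replaced by one planned intervention of an assumed duration, and that failure modes with no recorded strategy are left to fail. These are simplifications; the names and durations are hypothetical.

```python
def predicted_loss_under_new_strategy(history, strategy, planned_hours):
    """Re-cost historical breakdowns under the reviewed maintenance strategy.

    history:       list of (failure_mode, breakdown_hours) from a recent period.
    strategy:      dict failure_mode -> 'PM' or 'NSM' from the RCM / PMO review.
    planned_hours: dict failure_mode -> assumed off-line hours for the planned task.
    """
    total = 0.0
    for mode, breakdown_hours in history:
        if strategy.get(mode) == "PM":
            # Would have been a planned activity rather than a breakdown.
            total += planned_hours.get(mode, 0.0)
        else:
            # Left to repair on failure, so the historical loss stands.
            total += breakdown_hours
    return total


history = [("bearing", 10.0), ("belt", 4.0), ("fuse", 1.0)]
strategy = {"bearing": "PM", "belt": "PM", "fuse": "NSM"}
planned = {"bearing": 2.0, "belt": 1.0}
print(predicted_loss_under_new_strategy(history, strategy, planned))  # 4.0
```

Subtracting this predicted loss from the total time in the period gives an estimate of the inherent performance level for comparison with current performance.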

Step 3 – Determine Defects and Quantify Them

From the data gathered, or from discussions with people close to the plant, determine what the main causes of downtime are and attempt to quantify them.
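A common way to quantify the main causes is a Pareto ranking of downtime by cause. The sketch below is one possible implementation using made-up event data; the cause names are illustrative.

```python
from collections import defaultdict


def pareto(events):
    """Rank causes by total downtime and attach the cumulative share.

    events: iterable of (cause, hours).
    Returns a list of (cause, hours, cumulative_fraction), largest first.
    """
    totals = defaultdict(float)
    for cause, hours in events:
        totals[cause] += hours
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    result, running = [], 0.0
    for cause, hours in ranked:
        running += hours
        result.append((cause, hours, running / grand_total))
    return result


events = [("conveyor jam", 5.0), ("seal leak", 2.0),
          ("conveyor jam", 3.0), ("sensor fault", 2.0)]
for cause, hours, cumulative in pareto(events):
    print(f"{cause}: {hours} h, cumulative {cumulative:.0%}")
```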

Step 4 – Develop and Implement an Improvement Plan

Conduct workshops to establish causal relationships and create an improvement plan. It is highly recommended that the workshops involve the people who operate the plant and collect the data as this will build a sense of ownership in the improvement plan.

In this step, formal RCA workshops can be used.

Step 5 – Revise Data Gathering System

Once the improvement plan is created, the categories for improvement need to be reviewed to ensure that it can be determined whether the improvement plan is working.

Step 6 – Track the Success of the Plan and Revise if Necessary

The data collected should indicate if the improvement strategy is working. If the strategy is not working then a new one needs to be created or the problem needs to be listed as inherent in the system.
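A simple way to judge whether a plan is working is to compare the average loss in a category before and after implementation. The sketch below uses weekly downtime hours and made-up numbers; it ignores normal variation, so in practice enough periods should be collected before drawing conclusions.

```python
from statistics import mean


def improvement_ratio(before, after):
    """Fractional reduction in average loss (positive means improvement).

    before, after: non-empty lists of loss per period, e.g. weekly downtime hours.
    """
    baseline = mean(before)
    if baseline == 0:
        raise ValueError("baseline loss is zero; nothing to improve")
    return (baseline - mean(after)) / baseline


weekly_before = [10, 12, 8]
weekly_after = [5, 5, 5]
print(improvement_ratio(weekly_before, weekly_after))  # 0.5
```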

Common problems with Incident Management Systems

The most common problems organisations have with incident management systems are as follows:

The foundation of PMO work has not been done.
There is no-one in the organisation that is responsible for administering the system.
The system is fragmented and cumbersome. Reports get lost and nobody is keeping track of things.
Getting information is time consuming. CMMS, OEE or PMO systems are not integrated.
Too many incidents are being investigated at once.
Data collection strategy is too generic and lacks definition.
- There is little knowledge of what problems are being looked at and for what reasons.
The systems are too cumbersome to get data into and get data out.
Too much data is being collected.
There is more than one system collecting the same data. This:
- Frustrates the people who collect the data, and
- Leads to arguments about the data rather than a focus on solutions.
No-one tracks whether the improvements worked or not; they are not integrated with the OEE system.
The people that collect the data are not involved in using it to solve problems.
- This results in poor data quality.
- The data is collected at the wrong level and cannot be interrogated according to the necessary parameters. For example, in a manufacturing plant, data may be collected against the line, and so investigations comparing performance of different products cannot be easily done.
Managers’ belief that their CMMS systems can be configured to perform Reliability Assurance.
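The point in the list above about data being collected at the wrong level can be illustrated with a record layout. In the sketch below, each loss record carries the line, product, component and failure mode, so the same data can be totalled by any of them. The field names are assumptions for illustration, not a prescribed schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LossRecord:
    line: str
    product: str
    component: str
    failure_mode: str
    hours: float


def total_hours_by(records, field):
    """Total lost hours grouped by any attribute of LossRecord."""
    totals = defaultdict(float)
    for record in records:
        totals[getattr(record, field)] += record.hours
    return dict(totals)


records = [
    LossRecord("Line 1", "Product A", "filler", "nozzle blocked", 3.0),
    LossRecord("Line 1", "Product B", "filler", "nozzle blocked", 1.0),
    LossRecord("Line 1", "Product A", "capper", "jam", 2.0),
]
print(total_hours_by(records, "line"))     # {'Line 1': 6.0}
print(total_hours_by(records, "product"))  # {'Product A': 5.0, 'Product B': 1.0}
```

Had the product field been omitted at collection time, the second query could not be answered from the data at all.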

References:

Moubray J M (1997) “Reliability-centred Maintenance”. Butterworth-Heinemann, Oxford.

Nowlan F S and Heap H (1978) “Reliability-centred Maintenance”. National Technical Information Service, US Department of Commerce, Springfield, Virginia.

Ledet W (1994) “Rational Considerations – Systems Dynamics Model (The Manufacturing Game)”. Goal/QPC Conference, Boston, MA, USA, November 1994.

Turner S J (2001) “PM Optimisation – Maintenance Analysis of the Future”. ICOMS Annual Conference, Melbourne, 2001.