A business succeeds by maximizing its assets' life-cycle net present value.
In the short term, every Production Plant, Distribution Company or Service Facility has to ensure that it is profitable. For this, it must be able to work when it is required. Stated differently, it has to be available on demand.
In the long term, an additional requirement is that its Technical Integrity (TI) and longevity are ensured. TI is defined as the absence of foreseeable risk of failure during specified operation that endangers the safety of people, asset value or the environment. Implied in this definition is the possible harm to the reputation of the Company. Its reputation and TI determine whether society will provide the Company with its licence-to-operate.
The design must be such that both these objectives can be met with reasonable effort and cost.
Outputs of Maintenance
We can expect to achieve these results with an effective maintenance effort. Achieving them is the primary role of, and the justification for, any maintenance. The raison d'être of maintenance is to minimize the quantified risk of loss of TI or production capacity that can reduce the profitability or viability of the Company [1].
Single and Multiple Failures
Some failures have an immediate and direct effect on performance. Thus a seal leak from a pump can have an immediate impact on safety, environment and/or production. A punctured tire means that you cannot use your car; if the puncture occurs while you are driving, your safety is at risk.
Other failures have no immediate effect. Thus, if your car's brake light bulb has failed, something else has to happen before there is a consequence. At the time you use the brake, there must be a driver behind you who does not notice you are slowing down and hence crashes into your car. Had the brake light worked, the person would have known you were braking and would have automatically started slowing down.
Failures that have no direct and immediate effect are often much more hazardous and could lead to serious events. Imagine that your spare tire has a very low pressure and also that you suffer a tire puncture while driving. Luckily you are able to pull over safely, but you may still face several levels of problems. If you are on a city street, there is the inconvenience and delay of getting your spare tire checked and re-pressurized before you can proceed. If you are in a deserted and lonely place at night, your safety may be at risk. Obviously there can be many other, more unpleasant scenarios.
Similarly, if the over-speed trip device on your steam or gas turbine fails, by itself this may not matter. If the load drops off suddenly, the turbine will accelerate rapidly. Now the failure of the over-speed device matters; the turbine may throw its blades, causing major damage and perhaps serious injury.
The second type of failure is termed hidden or unrevealed, as we do not know that the item has failed until it is called upon to work. Such failures relate to items that remain dormant for most of their life, and are called upon to work only when something else happens.
These are devices, equipment or systems that protect other systems or equipment from hazardous situations. Protective systems may be designed to work at the equipment, process system or Plant level. Examples of devices or systems providing protection at the equipment level include Pressure Relief Valves (PRV) on vessels, over-speed trip devices on high speed machines, relays protecting generators or transformers, centrifugal brakes on elevators and axial-displacement trips on large rotating machinery. At the system level we have, for example, pressure control or blow-down systems, smoke or fire detection systems and emergency shutdown systems. At the Plant level we have, for example, emergency depressurizing or shutdown systems and fire protection systems.
Protective systems lie dormant for most of their life. They are only required to operate during an emergency situation. During their idle periods, the moving parts may become stuck by, e.g., gumming or due to deposits from process fluids, electrical contacts may be shorted by moisture or insects, or in other cases, there may be leakage of pressurized actuating fluids. Loss of function of these protective devices or systems exposes the protected equipment or system to single-event failures, since the protective barrier has been lost.
Protective devices or systems are installed when the hazards are high. If they do not work on demand, the consequences can be extremely severe. For example, in the Bhopal disaster, the scrubber system which was designed to prevent toxic vapours from escaping into the atmosphere had been left out of service. By itself, this action did not cause a problem. However, when toxic vapours were generated on December 3rd 1984, these were released to atmosphere, because the scrubber system was not available [2]. Several thousand people died and many more were seriously injured.
The design of equipment is increasingly complex, and many items have software based control systems. Operator intervention and surveillance tasks have reduced substantially. Many of the operators' data logging and other routine tasks have been transferred to the control and supervisory systems. Equipment productivity as well as costs have gone up with this increased complexity. As a result, downtime and repair costs have also increased significantly.
One of the features of modern equipment is the level of protection provided to guard it from damage. Similarly, modern Plant designs are more automated and require fewer operators. For these reasons, we rely increasingly on protective devices and systems to safeguard our assets. Earlier we noted that failures of protective devices and systems are hidden, so the operator does not know at any point in time whether they will work on demand.
Assuring Technical Integrity
There are many inputs required to ensure TI. One of the important elements is the availability of protective devices and systems.
A new or completely overhauled device or system (to As Good As New or AGAN standards) has a survival probability of 100% on installation. Thereafter the survival probability (or reliability) falls. We can never know its reliability at any point in time, but we can estimate its value. We call this its expected value and to estimate it, we have to make certain assumptions. One of them is that the reliability at any time t is represented by a continuous curve. Another is that this curve follows a known distribution; often, we assume it is exponential.
If we accept this basis, certain conclusions follow, namely,
- The reliability is 1 only at the time of installation
- It falls thereafter, the rate of decline depending on the distribution curve
- At some point in time, the reliability will be unacceptable; this level is determined by the risk we are willing to tolerate
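As a minimal sketch of these conclusions, assuming the exponential distribution mentioned above with a constant failure rate, the expected reliability can be written as R(t) = exp(-λt). The function name and the failure rate of 0.05 per year are illustrative assumptions, not values for any particular device.

```python
import math


def reliability(t_years: float, failure_rate_per_year: float) -> float:
    """Expected survival probability R(t) = exp(-lambda * t).

    Assumes an exponential distribution, i.e. a constant failure rate.
    """
    if t_years < 0 or failure_rate_per_year < 0:
        raise ValueError("time and failure rate must be non-negative")
    return math.exp(-failure_rate_per_year * t_years)


if __name__ == "__main__":
    ASSUMED_FAILURE_RATE = 0.05  # failures per year; illustrative only
    for months in (0, 6, 12, 24, 48):
        r = reliability(months / 12, ASSUMED_FAILURE_RATE)
        print(f"{months:>2} months after installation: R = {r:.4f}")
```

The printed values start at 1 on installation and fall steadily, which is all the three conclusions above require; a different distribution would change the shape of the decline but not the argument.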
Repairable and non-repairable items
We classify items that are replaced as a whole as non-repairable. Strictly speaking, some of them may in fact be dismantled and repaired off-site later. Others will be discarded and replaced with new items. In both cases, the replacement is AGAN and has a reliability of 1. Examples of such items include light bulbs, printed circuit boards, gas detectors and ball or roller bearings.
We treat another group of items as non-repairable, those that are subject to hidden failures. In this case, the operator cannot know whether the item is working or not under normal circumstances, since it is lying dormant. We can find out its condition by testing it periodically. As discussed earlier, such items affect TI, so when we know that one of them is defective, we will replace it immediately with a new one. For example, once you know that your car's brake light is burnt out, you will replace it fairly quickly.
For such items, the availability on demand, i.e., the probability that the item will work at any time t, is identical to its reliability at that time: a failed item stays failed until a test reveals it, so the item is available only if it has survived up to time t. Thus the availability of an item or system subject to hidden failures is the same as its reliability. If we know its failure distribution curve, we can compute the reliability and hence its availability. Note that this argument applies in the special case of hidden failures.
The first question we have to ask is about the level of risk we are willing to tolerate. Quantitative risk has two components, the probability and the consequence of failure. The latter depends on the specific situation, each with a different exposure level. For example, a gas turbine (GT) blade fracture is a very serious failure. It could cause significant equipment damage, injure people and may result in multiple fatalities. If the GT was driving a gas compressor in a remote and unattended compression station, the consequences may be limited to equipment damage and production loss. If it was located in a Process Plant, it could injure or kill people in its vicinity. If it was fitted on a Jumbo jet aircraft, hundreds of passengers may lose their lives due to hull damage. Such an incident occurred with United Airlines flight UA 232 in July 1989. In this accident, the third engine of the DC-10 aircraft, located above the tail-plane, shed its blades. The shrapnel cut through the redundant (3 x 100%) hydraulic lines, causing loss of all flight controls [3]. In all, 112 people died, but as a result of some extraordinary disaster management, 184 people survived the crash-landing at Sioux City.
Clearly the risk levels will vary, depending on the circumstances. The expectations of protective system availability must match our evaluation of risks. For protective systems, we can expect their availability to be from about 97.5% to over 99.5% depending on the circumstances. We must carry out the maintenance work (whether preventive, predictive, corrective or detective) on or before the time the value of survival probability reaches this level.
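To illustrate the last point, the sketch below assumes an exponential reliability curve, R(t) = exp(-λt), and finds the time at which the survival probability falls to the required level. The function name and the failure rate of 0.1 per year are hypothetical; only the 97.5% and 99.5% targets come from the text.

```python
import math


def latest_maintenance_time(required_availability: float,
                            failure_rate_per_year: float) -> float:
    """Years after installation at which exp(-lambda * t) drops to the target.

    Assumes a constant failure rate (exponential distribution).
    """
    if not 0.0 < required_availability < 1.0:
        raise ValueError("required availability must be between 0 and 1")
    if failure_rate_per_year <= 0.0:
        raise ValueError("failure rate must be positive")
    return -math.log(required_availability) / failure_rate_per_year


if __name__ == "__main__":
    ASSUMED_FAILURE_RATE = 0.1  # failures per year; illustrative only
    for target in (0.975, 0.995):
        t = latest_maintenance_time(target, ASSUMED_FAILURE_RATE)
        print(f"target {target:.1%}: act within {t * 12:.1f} months")
```

The tighter the availability target, the sooner the work falls due, which is why the target has to be matched to the evaluated risk and not set uniformly.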
The only way to find out if protective devices or systems will work is to call on them to operate. If there is a real demand and the item works, then of course we know it is in good order. But we cannot wait for a real demand to find out its state, so the alternative is to test them.
In their seminal work on Reliability Centered Maintenance (RCM), Nowlan and Heap [4] called these tests Failure Finding Tasks. Moubray [5] uses the term detective tasks. Adopting the latter terminology, we call the strategy which employs detective tasks Detective Maintenance.
Detective maintenance is a necessary (but not sufficient) activity required to guarantee TI. Detective tasks include, for example:
- Testing of smoke, gas and fire detectors
- Periodically starting fire pumps
- Testing the ejector seats of fighter aircraft
- Building evacuation tests
- Stroking of valves that stay in one position for most of the time
- Annual vehicle inspection
- Testing of emergency disconnect/release systems on cargo ships
- Pre-overhaul testing of PRVs
- Testing control loops of safety devices
- Testing of relays protecting electrical equipment
- Over-speed trip tests
- Furnace/Boiler trip tests
- Periodic testing of (fire-protection) deluge valves and sprinkler systems
This is not an exhaustive list, and you will recognize many of them from your own experience. The testing may be done by operators or maintainers, but the work itself is a maintenance activity. The test reveals the following:
- Whether the item is in a working or failed state, and
- The failure mode
Detective tasks are not condition-monitoring activities. The latter track ongoing failures, measure trends, and predict the time-to-failure. Examples of condition-monitoring tasks include vibration monitoring, oil-debris analysis and thermography. In these cases, the degradation process has commenced, but the item has not yet functionally failed. Detective maintenance tasks are only applicable to items in one of two discrete states. They are either working or have already failed.
Protective devices or systems can fail in one of two ways. The first is when the item does not work when there is a demand. This is why the item was installed in the first place, so this failure is likely to cause an unsafe condition. We call them fail-to-danger events. The second situation is when the item works when there is no demand. It may then cause, for example, a spurious trip resulting in loss of production. We call these fail-to-safe or nuisance events. If there are frequent spurious events, there is a good chance that operators will routinely ignore all such events, whether they are genuine or spurious, and restart equipment without proper checks. This in turn can lead to unsafe situations.
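The two failure types can be read as a truth table over two questions: was there a demand, and did the device act? The sketch below encodes that table; the enum and function names are hypothetical labels chosen for illustration.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT_ACTION = "demand present, device worked"
    FAIL_TO_DANGER = "demand present, device did not work"
    FAIL_TO_SAFE = "no demand, device worked (spurious or nuisance trip)"
    CORRECT_STANDBY = "no demand, device stayed dormant"


def classify(demand: bool, actuated: bool) -> Outcome:
    """Classify one event for a protective device."""
    if demand:
        return Outcome.CORRECT_ACTION if actuated else Outcome.FAIL_TO_DANGER
    return Outcome.FAIL_TO_SAFE if actuated else Outcome.CORRECT_STANDBY


if __name__ == "__main__":
    for demand in (True, False):
        for actuated in (True, False):
            outcome = classify(demand, actuated)
            print(f"demand={demand!s:5} actuated={actuated!s:5} -> {outcome.name}")
```

Only the fail-to-safe case announces itself during normal operation; the fail-to-danger case shows up only on a real demand or a test, which is what makes it a hidden failure.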
Detective maintenance only reveals the state of the item; with this knowledge we have to take further action if the item has failed. So this task can generate further corrective maintenance tasks. Failed items are usually replaced immediately, to minimize downtime. The replacements are AGAN. (As an exception, instrument drifts or span changes are often corrected on detection, without generating corrective work orders, since the actual corrective work is quite small.) These actions bring the item back to a 100% reliable condition.
Practical difficulties with Detective Maintenance
Some protective devices can be tested during normal operation, without requiring a shutdown of the whole Plant or Facility. The sub-system or system may be isolated for short periods while the tests are in progress. Thus smoke detectors or fire pumps can be periodically tested without direct impact on production or safety.
Trips and shutdown systems have three active elements. The sensor or detector identifies unacceptable deviations from the norm. The output from the sensor(s) goes to a logic unit. This uses a given set of algorithms or software codes to compute an output signal. The output signal is sent to the actuator of the executive element. Some examples of executive elements are:
- Emergency shutdown valve
- Trip valve of a turbine
- Deluge valve in a sprinkler system
- Ejector seat of a fighter aircraft
Testing the sensors or logic unit usually does not pose a problem. The output from the logic unit can be defeated during the test so that the executive element is not actuated. The problem lies with the testing of the executive elements. Actuating them during normal operation will result in a system or Plant shutdown with direct production losses. Such trips may also create unwanted safety hazards.
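The sensor, logic unit and executive element chain, and the effect of defeating the logic output during a test, can be sketched as follows. This is a toy model under stated assumptions: the class names, the single high-level setpoint and the readings are hypothetical, and a real logic unit would normally be far more elaborate.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExecutiveElement:
    """Final element, e.g. a trip valve (hypothetical model)."""
    name: str
    actuated: bool = False

    def actuate(self) -> None:
        self.actuated = True


@dataclass
class TripLoop:
    """Sensor reading -> logic unit -> executive element."""
    trip_setpoint: float
    element: ExecutiveElement
    output_defeated: bool = False  # True while a test is in progress
    log: List[str] = field(default_factory=list)

    def logic_unit(self, sensor_reading: float) -> bool:
        # Assumed logic: trip when the reading reaches the setpoint.
        return sensor_reading >= self.trip_setpoint

    def process(self, sensor_reading: float) -> bool:
        """Return True if the logic unit generated a trip signal."""
        trip_signal = self.logic_unit(sensor_reading)
        if trip_signal and self.output_defeated:
            self.log.append("trip signal generated; output defeated for test")
        elif trip_signal:
            self.element.actuate()
            self.log.append(f"trip signal sent to {self.element.name}")
        return trip_signal


if __name__ == "__main__":
    valve = ExecutiveElement("trip valve")
    loop = TripLoop(trip_setpoint=110.0, element=valve, output_defeated=True)
    print(loop.process(115.0), valve.actuated)  # sensor and logic exercised only
    loop.output_defeated = False
    print(loop.process(115.0), valve.actuated)  # full chain exercised
```

With the output defeated, a passing test says nothing about whether the executive element itself would have moved, which is exactly the gap the paragraph above describes.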
Executive elements can always be tested during planned shutdowns. These are not very frequent, so waiting for them may not give us the required protective system test interval to assure its availability. Plant shutdowns are getting even less frequent, so there are fewer opportunities to carry out detective maintenance.
This of course raises a practical difficulty: how can we ensure TI without being able to carry out detective tasks at the right frequency? One answer is to use any short cleanout shutdowns that have to be planned anyway; these provide an excellent opportunity. Any trip of the Plant can also be used, provided we are adequately prepared. For these, the Planning system must be nimble!
The traditional answer has been to carry out function tests. In these tests the executive element is defeated or gagged. As long as the actuator moves or the appropriate hydraulic oil pressure is seen, it is assumed that the executive element will work. Large rotating machinery such as centrifugal compressors or steam turbines have trip valves which stop the machines in an emergency. Such trip devices often have 'dead-man' controls, which require a hydraulic or mechanical force to be continually applied to keep the power source live. On receiving a trip signal, the hydraulic oil is 'dumped' to sump, or a trigger releases the mechanical force holding the valve open. On this basis, it is assumed that the steam, gas or electrical power source will be isolated, stopping the machine.
This last assumption is not always correct. In practice, a number of factors may prevent the steam or electrical power source from being disconnected. Common faults include sticking of the valve or trigger mechanism due to gumming, dust and other deposits; bent or distorted stems; and welded contacts in electrical circuits.
Over-speed devices have an important safety function. It is not enough to function-test them. The machine must be run up to the trip speed, and this may require the coupling to be removed.
Function tests may be acceptable as intermediate tests, since they prove that most of the system works - only the executive element does not move, so the machine does not actually stop. We can reduce the risks by keeping the trigger mechanism, stems etc. dry, clean and lubricated, thus increasing the chances that the executive element works. Remember however that the only guarantee is when the machine actually trips!
Using installed spares and on-line testing methods
We are legally required to test PRVs at an acceptable frequency. If there are two full-capacity PRVs installed with one in operation (1 out of 2) with interlocked isolation valves, we can take one out during normal operation for testing.
Commercial on-line relief valve testing methods are also available. These can be used as intermediate tests when shutdowns are not possible. They do test the operation of the PRVs under simulated conditions and verify that the valve lifts and reseats properly. However, PRVs have to be visually inspected internally for deposits, and cleaned if dirty. For this reason, we cannot avoid removing them periodically from their location.
Similarly, we can start and load standby equipment such as pumps, compressors or emergency power generator sets. Such tests confirm that the equipment does start and that it will take up the full load.
We test individual gas detectors in-situ by exposing their sensor units to a known concentration of test gas. Similarly smoke and fire detectors are exposed to appropriate tests. Control panel lights are checked by lighting up all of them using a test switch. In all these cases, the tests are under controlled conditions, isolating signals from reaching the executive elements.
If the plant comes down for any reason, be it a trip or cleanout shutdown, we can use the opportunity. For example:
- We can test PRVs at that time. They don't need an overhaul, only a test, so we confirm that the mechanism operates and gets de-gummed. We reset only those relief valves that need adjustment. This way, we can re-install them quickly, and not prolong the plant outage.
- We do over-speed tests at the beginning of planned shutdowns, so that if there is a fault, we get some time to fix it.
- Whenever a Plant or Unit has to be shut down for planned work, we use a different trip system each time to do so, thus using the opportunity to test each one in turn.
Such work needs careful planning effort and rapid response.
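Part of that planning is knowing which trip system is next in line when a planned shutdown comes up. A minimal sketch, assuming the planner keeps the date each trip system last shut the Plant down for real; the system names and dates are hypothetical.

```python
from datetime import date
from typing import Dict, Optional


def next_trip_system(last_tested: Dict[str, Optional[date]]) -> str:
    """Pick the trip system that has gone longest without a real trip.

    Systems that have never been used (None) are chosen first.
    """
    if not last_tested:
        raise ValueError("no trip systems registered")
    return min(last_tested, key=lambda name: last_tested[name] or date.min)


if __name__ == "__main__":
    # Hypothetical register of trip systems and their last real trip dates.
    register: Dict[str, Optional[date]] = {
        "low lube-oil pressure trip": date(2004, 3, 15),
        "high discharge temperature trip": None,
        "manual emergency stop": date(2003, 9, 1),
    }
    chosen = next_trip_system(register)
    print("Use for the next planned shutdown:", chosen)
    register[chosen] = date(2005, 2, 3)  # record the test once it is done
    print("Then:", next_trip_system(register))
```

Choosing the longest-untested system instead of following a fixed rotation keeps the scheme fair even when an unplanned trip has already exercised one of the systems in between.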
Partial closure/opening tests
Emergency blow-down or shutdown valves and similar mechanical devices can be operated partially, by limiting the movement of their shafts to, say, 2 to 3 mm. This can often be done by inserting a mechanical stop to prevent further movement. For large hydraulically operated valves, special devices are available to allow controlled partial movement. These tests ensure that the valve actually moves slightly, breaking off any gumming or jamming deposits. Partial closure tests give some, but not complete, assurance of integrity, so they are useful intermediate tests. Whenever a plant has to be shut down on a planned basis, we should operate these emergency valves to full stroke, so that we have confidence in their integrity.
It can be shown that for a given failure rate of a protective device, the required availability can be obtained by adjusting the test frequency (see pages 34-40, reference 1). The failure rates must be for the appropriate failure modes. For example, if we wish to compute the test frequency for a trip valve closing and stopping a machine, the corresponding failure mode is 'fail to close'.
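One common way to put numbers on this, offered here as a sketch and not necessarily the derivation used in reference 1, is to assume an exponential failure distribution and average the availability over one test interval T, giving (1 - exp(-λT)) / (λT), roughly 1 - λT/2 when λT is small. The function names and the 'fail to close' rate of 0.1 per year are illustrative assumptions.

```python
import math


def mean_availability(test_interval_years: float,
                      failure_rate_per_year: float) -> float:
    """Average of exp(-lambda * t) over one test interval [0, T]."""
    x = failure_rate_per_year * test_interval_years
    if x < 0:
        raise ValueError("interval and failure rate must be non-negative")
    if x == 0:
        return 1.0
    return -math.expm1(-x) / x  # (1 - exp(-x)) / x, accurate for small x


def required_test_interval(required_availability: float,
                           failure_rate_per_year: float) -> float:
    """Longest test interval (years) that still meets the mean availability."""
    if not 0.0 < required_availability < 1.0:
        raise ValueError("required availability must be between 0 and 1")
    if failure_rate_per_year <= 0.0:
        raise ValueError("failure rate must be positive")
    lo, hi = 0.0, 1.0
    # Mean availability falls as the interval grows, so widen until it is too low.
    while mean_availability(hi, failure_rate_per_year) > required_availability:
        hi *= 2.0
    for _ in range(100):  # bisection
        mid = (lo + hi) / 2.0
        if mean_availability(mid, failure_rate_per_year) >= required_availability:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    FAIL_TO_CLOSE_RATE = 0.1  # per year; illustrative, for this failure mode only
    for target in (0.975, 0.995):
        interval = required_test_interval(target, FAIL_TO_CLOSE_RATE)
        print(f"mean availability {target:.1%}: test every {interval * 12:.1f} months")
```

This uses the average availability across the interval, which is a less conservative criterion than requiring the survival probability never to fall below the target at any moment; which one applies is a risk decision, not a property of the formula.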
A successful business must generate profits while operating safely. The latter needs an acceptable level of Technical Integrity (TI). Hidden failures are major contributors to the loss of TI. Detective maintenance strategies help identify such failures, and are therefore important.
Equipment and Plant designs are increasingly more complex. They are generally larger, more efficient and require less operator attention. If they are down for any reason, the cost of lost production can be very high. They are therefore equipped with protective devices.
Protective devices are generally dormant for much of their life. The operator does not know if they will work on demand, and their failure modes are hidden. In order to manage risks properly, we have to be sure that their availability is acceptable. We do this by testing the item at the right frequency, and call this strategy 'Detective Maintenance'.
There are some practical difficulties in implementing this strategy. These have several possible solutions, none of which is perfect, though each is suitable in specific applications. They do not replace the normal test procedures, but can be used to extend the interval between the normal tests.
What matters is that we recognize the importance of detective maintenance strategies and demonstrate our commitment to reaching the required TI levels. Our licence-to-operate depends on a successful detective maintenance program.
1. Narayan V. 2004. Effective Maintenance Management - Risk and Reliability Strategies for Optimizing Performance. New York. Industrial Press Inc., ISBN 0-8311-3178-0
2. http://www.bhopal.org/whathappened.html (accessed 3rd February 2005)
3. http://www.airdisaster.com/eyewitness/ua232.shtml (accessed 3rd Feb 2005)
4. Nowlan, F.S., and H.F. Heap. 1978. Reliability-Centered Maintenance. Washington D.C. U.S. Department of Defense. Unclassified, MDA 903-75-C-0349.
5. Moubray, J. 2001. Reliability-Centered Maintenance. New York. Industrial Press Inc., ISBN 0-8311-3146-2