Reliability Reality in Process Plants - The Archimedean Leap from the “Bathtub”
Archimedes was born and lived in Syracuse, Sicily, at that time a Greek city-state. A brilliant scientist, engineer and mathematician, he is credited with many famous concepts, inventions and statements, including, "Give me a lever long enough and a place to stand, and I will move the world".
Other credits include the Archimedean screw for pumping water, the compound mechanical pulley, hinged mirrors that could focus the sun's rays like a modern laser, and mechanical catapults for throwing boulders. In the field of mathematics, he anticipated the methods of calculus and was among the first to calculate an accurate approximation of the value of Pi.
He is, however, most famous for jumping out of his bathtub shouting, "Eureka! I have found it!" when he realized that the volume of water that he had displaced weighed the same as his floating body. Or, could there have been another reason why he made his flying exit from the bathtub?
Fig. 1- Archimedes Contemplates Taking a Bath
Concepts and Confusions
To paraphrase a quotation from the ‘Reliability Edge' web site, "The concept of flat earth is not now widely defended, but the unsupported assumption that most reliability engineering problems can be modeled well by the exponential distribution is still widely held. In a quest for simplicity and solutions that we can grasp, derive and easily communicate, many practitioners have embraced simple equations derived from the underlying assumption of an exponential distribution for reliability prediction, accelerated testing, reliability growth, maintainability and system reliability analyses. This practice is perpetuated by some reliability authors and lecturers, some reliability software makers, and most military standards that deal with reliability".
In our experience, these assumptions are alternately over-simplified and then over-complicated with "Polynomial Contrasts", Latin Squares and Weibull Analyses, with the result that the bathtub curve is transformed into a powerful source of misinformation.
It must be added that, until fairly recently, Polaris consultants were also unwitting "parties to these accepted fictions", using them to explain reliability theory, a practice we now recognize as a mistake.
Automotive Analogy 1: Could the bathtub curve describe the risk of failure against time for a car driven at reasonable speeds on good roads, serviced regularly, and having small defects identified and fixed promptly?
However convenient for mathematicians, and however relevant it might be to single items of equipment operating in steady state, the bathtub curve certainly does not represent what happens in typical process units. In fact, it does not even represent what happens in a typical automobile operated under perfect conditions.
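The exponential assumption criticized above can be made concrete with a short sketch (our illustration, not from the original paper). The exponential distribution has a constant hazard (instantaneous failure) rate, so it implicitly treats equipment as "as good as new" at every instant; a Weibull distribution with a shape parameter other than 1 does not. The parameter values below are arbitrary, chosen only to show the contrast.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) for a Weibull distribution.

    shape < 1: infant mortality (falling rate); shape == 1: constant
    rate, i.e. the exponential special case; shape > 1: wear-out.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)

# The exponential model (shape == 1) predicts the same risk at every
# age; a wear-out model (shape == 3 here) predicts risk that grows.
for t in (0.5, 1.0, 5.0):
    const_rate = weibull_hazard(t, shape=1.0, scale=2.0)
    wear_out = weibull_hazard(t, shape=3.0, scale=2.0)
    print(f"t={t}: exponential h={const_rate:.3f}, wear-out h={wear_out:.3f}")
```

The constant-rate column never changes with age, which is exactly the property that makes the exponential model convenient to manipulate and misleading to apply.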
To explain the development of a more realistic representation, we first produced the graph below, the "modified stepped version". Each step in the illustration represents permanent damage and a reduction in useful life caused by a period of upset operating conditions.
Before fully developing the explanation, it is necessary to accept that in most plants these process upsets occur on an irregular but persistent basis. They also often go unreported unless they cause major equipment damage or product loss.
One of the many ways to recognize that one of these upsets has occurred is the simultaneous damage, or "cluster failure", of minor items of equipment. Frequently, they do not bring production to a halt and are typified by pipe and equipment leaks, failures of groups of mechanical seals in pumps, filter failures, etc.
In simple terms, each step represents a process upset involving a damaging change in pressure, temperature, flow or chemical composition, each in some way reducing the life of the unit. The root cause can be some external failure, but, in our experience, it is much more likely to be a lack of operational discipline, i.e. a failure to stay inside the operating envelope or to follow operating procedures accurately. A lack of "operational discipline" does not imply an intentional act or malevolent behavior; rather, it refers to errors caused by deficient operating procedures, inadequate training in those procedures, conflicting priorities, inadequate labeling of equipment and instruments, or a lack of effective administrative controls.
Automotive Analogy 2: The stepped graph describes the probability of failure vs. time for a vehicle driven outside normal conditions: it is driven too fast so that it overheats, driven in very dusty conditions, given the wrong grade of fuel for extended periods, or misses a series of scheduled services, yet does not break down. Sometimes a hose will blow or a cooling-water pump seal will fail, but, most often, there are no visible effects of the abuse for a couple of hours.
The third diagram shows the already modified "stepped bathtub curve" with the addition of a series of spikes.
Each spike superimposed on the steps represents a short period of greatly increased risk of failure. Typically, this occurs while the unit is being shut down or started up, but it can also be caused by a sudden loss of utilities, such as a power surge or lightning strike, or by a raw-material interruption.
Modern continuous production processes are designed to run without stopping for many years (two to seven years is typical) and, while such incidents are few in number, they are very significant. These crash shutdowns and hard starts play havoc with prospective equipment reliability, and every effort should be made to avoid them.
The concepts hold true for batch processes, but they see far fewer damaging "steps" and many more, though much smaller, "spikes" of increased risk.
Comparing the idealized bathtub curve with this new stepped-spiked curve, we notice several things:
- The unit's total life (reliability) is frequently reduced by 50% or more.
- In the final months it is possible to get the "Tipping Point Failure" where a small spike superimposed on a step causes a crash shutdown.
- As the overall condition of the plant deteriorates due to the "steps", it takes a smaller and shorter spike to tip the unit into such a shutdown.
- Understanding the situation allows both a reduction in the frequency of these events and mitigation of their effects. It is then possible, using sophisticated condition-monitoring techniques, to know how much remaining life all the major items of equipment have left. This is critically important for planning and scheduling the next outage.
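The stepped-and-spiked curve and the "Tipping Point Failure" can be sketched with a toy risk model (our illustration, with arbitrary numbers, not from the original paper): each upset adds a permanent step of risk, each hard start or crash stop adds a short transient spike, and the unit "tips" when the combined level crosses a threshold.

```python
def risk(t, upsets=(), spikes=(), base=0.01, step=0.005,
         spike_height=0.05, spike_width=0.1):
    """Toy risk level at time t: a baseline, plus a permanent step for
    every past upset, plus a transient spike around each hard start
    or crash stop. All parameter values are illustrative."""
    level = base + step * sum(1 for u in upsets if u <= t)           # permanent steps
    level += spike_height * sum(1 for s in spikes                    # transient spikes
                                if s <= t < s + spike_width)
    return level

TIP = 0.08  # hypothetical failure threshold

# Early in life, a start-up spike stays below the threshold...
print(risk(1.05, upsets=(), spikes=(1,)) >= TIP)
# ...but after six accumulated upsets, the very same spike tips the unit.
print(risk(40.05, upsets=range(6), spikes=(40,)) >= TIP)
```

The second case is the "Tipping Point Failure" described above: the spike has not grown at all; the accumulated steps have simply left less margin beneath it.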
Automotive Analogy 3: The graph describes the probability of failure vs. time for a vehicle given "hard starts" and "emergency stops" in bad conditions, such as street drag racing.
Suggestions for Dealing with the Steps
(Archimedes - the Siege of Syracuse and Long levers)
As the Romans laid siege to Syracuse, Archimedes used all the war machines in his armory. These included catapults for long range defense and crane-mounted swinging grappling hooks to capsize the Roman boats at short-range. He used his knowledge of "mathematics and mechanical advantage" in all its many forms. Similarly, we must use a combination of short and long-range approaches to minimize, if not get rid of, the "steps" and "spikes".
Fig. 5 - Suggestions for reducing the size and frequency of the "Steps of Increased Risk"
To a very large extent, a much-improved level of "operational discipline" is required. This can also be described as the ability of the operating crew to keep the unit within a prescribed operating envelope. As a first step, this involves an improved method for compiling clear instructions on how to operate and troubleshoot the plant under all conditions, i.e., the "standard operating procedures". (We recommend the "T"-bar or two-column format for its simplified architecture and easy reference.) The procedures should be well engineered, cross-functionally prepared, and easy to read; they should identify the consequences of deviation and the corrective actions to be taken, and include a separate troubleshooting section. They should also be based on tightening operating limits and be set in a background of continuous improvement, meaning they can be improved by the agreement of the group at any time.
Monitoring the process very closely using detectable symptoms other than process instrumentation is, again, essential. There is almost always some indicator, or grouping of observable defects, that will provide earlier warning of deviation and potential failure. Be mindful of the chaos that can occur without good alarm-management strategies: simply lowering the alarm levels on equipment is not the way to do it. Search instead for specific, critical small changes in product quality, noise, smell, etc. By combining the knowledge of operators who have been around the process for years with that of appropriately directed engineers, it is possible to identify some early or incipient change that will allow pre-emptive action to be taken.
There is a need to create a participative, open culture in which prolific incident and near-miss recording and follow-up are routine. Formal root-cause analysis for larger, potentially high-dollar incidents is one way to achieve the improvement needed, with medium-sized, single-discipline issues being handled by expert engineers. All of this is predicated on not being afraid to report and investigate these "process variations", which is not the case in many plants.
The diagram of Stress Level vs. Error Rate illustrates two essentials. First, if a plant operation is very stable for long periods, operators can be mentally asleep with their eyes wide open. Second, as stress levels increase, only those actions learned by repetition, or those supported by clear and simple "Directions and Signs", will be executed effectively. In dealing with an "upset" there is little time for discussion, multiple references or involved calculation.
We are aware that this can sound like "motherhood and apple pie". Our experience has been that it takes a series of very long (Archimedean) levers to exert enough force at all levels of the organization to implement these necessary changes.
Suggestions for Dealing with Spikes
Spike removal involves a similar but two-part solution. For the 60% of spikes that occur at scheduled start-ups and planned shutdowns, more detailed procedures and comprehensive checklists are needed. Plants should take a lesson from the aviation industry, where even a small spike can be fatal and where checklists are therefore used extensively and effectively.
This approach should not be new to the process industry, as it is an OSHA requirement for plants operating under the Process Safety Management code: OSHA mandates the Pre-Start-Up Safety Review process, which is a similar use of "checklists". Take the time to write the checklists, and to train and re-train operators in their use. Making sure that these thorough checks are made is critical to dealing with the spikes efficiently. The written procedures, with individual responsibility for each set of required actions, must be set out and rehearsed beforehand.
For each of the remaining 40% of spikes, due to sudden utility or raw-material interruptions, a "detailed and rehearsed mitigation action plan" is needed. A well-written procedure and a rehearsed response should be worked out for each type of spike before it happens. Comprehensive training and routine drills are essential. The safest plants have routine safety drills; why not drills for predictable emergency operating conditions?
What Not to Do!
In the past, many plants used the simplistic approach that, if moving slowly reduces the chance of error (spikes) in "pressure situations", then moving very slowly is best. This is, however, patently untrue. Holding a process at low flow for extended periods can leave heat exchangers and filters 50% fouled before the plant is back up to full operating rates.
As another example, the materials of construction for machinery (pumps, fans, turbines) are selected for normal operating conditions. Operating the machines away from their design conditions usually means misalignment and excessive corrosion rates, which frequently damage them.
Automotive Analogy 4: No matter how badly you drive, if you drive slowly enough, you will never have a serious accident.
Defect Elimination and Reliability Improvement
Led by DuPont in the 1980s, extensive research into the sources of problems leading to inefficient operation has produced some surprising results. Based on thousands of "Root-Cause Analyses" of plant problems, root-cause defects, whether measured numerically or by total financial impact, are distributed as:
Maintenance Practices 18%
Maintenance Materials 7%
Raw Materials 5%
Design 25%
Operating Discipline 45%
We have several examples of companies in the process industries equating a reliability expert with a "good maintenance guy". This mind-set is then taken to its illogical conclusion by recruiting reliability engineers to fix "maintenance problems" when the individuals should instead be working on "reliability problems". These management groups have the engineers work on maintenance practices and materials problems, which comprise only 25% of the causes of reliability problems. Typically, after 2 - 3 years of hard work, these engineers will have reduced the maintenance number by 50%, thereby solving or removing 12.5% of the total number of defects or reliability problems.
Meanwhile, they will have missed the opportunity to go after the fundamental problem area of operating discipline, primarily because it involves people, communication skills and cultural change rather than equipment. Had these engineers expended the same effort over the same period on solving 50% of the operating-discipline problems, they could have made a much more effective improvement in overall plant performance by removing 22.5% of the defects. While all this is understandable, much like "The Emperor's New Clothes", operating management often believes that "it can't possibly be that bad" when the evidence clearly indicates that it is.
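The arithmetic behind this comparison is worth making explicit. Halving a defect category removes half of that category's share of total defects, so the same 50% reduction is worth very different amounts depending on where it is applied. A quick check using the DuPont-derived percentages quoted above:

```python
# Root-cause defect distribution quoted in the text (percent of total).
defects = {
    "maintenance practices": 18,
    "maintenance materials": 7,
    "raw materials": 5,
    "design": 25,
    "operating discipline": 45,
}
assert sum(defects.values()) == 100  # the categories cover all defects

# A 50% reduction in the combined maintenance categories (18 + 7 = 25)...
maintenance = defects["maintenance practices"] + defects["maintenance materials"]
print(maintenance / 2)                      # 12.5 -> 12.5% of all defects removed

# ...versus a 50% reduction in operating discipline alone.
print(defects["operating discipline"] / 2)  # 22.5 -> 22.5% of all defects removed
```

The same engineering effort, pointed at the largest category, removes nearly twice as many defects.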
Fig 6 - With the Steps and Spikes removed a Happy Archimedes takes his bath
Working on reliability improvement programs in over 20 plants, we have found that in dealing with "design defects" the 80/20 rule applies. Many can be removed for a very small amount of money (the low-hanging fruit). However, rectifying fundamental errors in the original choice of essentially unreliable equipment and systems invariably takes major capital expenditure.
Conclusion
Understanding the possible savings and the safety improvement that can be made through improved operational discipline should be one of the first steps, not the last, in achieving world-class operating performance. Each plant must have a periodically reviewed, accurate, and easily referenced set of operating procedures, and the operators must be trained and routinely retrained in their use.
In addition, as a means of administrative control, most plants need daily formal reporting and investigation of even small process variations (process deviation reports, or PDRs). Even so, we see errors covered up, and the "Steps of Increased Risk" go unreported.
What is needed is a culture in which minor defects are identified and fixed earlier, and a situation is created where even minor changes in process performance are measured and reported. (Using the daily "process deviation report")
If the difference between expectation and reality is happiness, then the more we understand, the happier we will be. We can now see that Archimedes was jumping out of the reliability bathtub because it wasn't the smooth shape he expected it to be. He, like many other reliability professionals, was in the process of being impaled on a "spike" of ignorance that his fellow mathematicians and theorists hadn't told him about. Simply knowing that these spikes and steps are there, and having the whole production team work on removing them, will improve the plant's operating performance.
Originally presented by Bernie Price, Polaris Veritas Inc., at IMC-2004 - The 19th International Maintenance Conference.