Don Kuenzli, as a plant manager, successfully transformed two oil refineries to highly reliable and sustainable performance by creating a culture of Continuous Improvement through defect elimination. He had a vision to create “a world class facility with pacesetter performance” for these refineries. When Don shared his vision with his boss at one of these sites, he was told that they could not afford to create this level of performance. Managers often make the assumption that the best performance is gained by paying a high price. There are many benchmark studies that demonstrate quite the opposite.
In the DuPont benchmark study, the most reliable performance in the world was achieved with Total Productive Maintenance. It was also the least expensive to achieve and sustain. The assumption that high reliability is expensive comes from the fact that the vast majority of efforts to gain high reliability are misguided. The conventional wisdom is that maintenance best practices are planning, scheduling, preventive maintenance, optimized procurement, and predictive maintenance. While these practices do make maintenance much more efficient and effective, they do not address the most important question in reliability: why is the equipment failing in the first place? Other best practices such as reliability-centered maintenance are a step in the right direction but still do not address the largest root cause. In the ABC’s of Failure (TMG News, April 2008), we concluded that approximately 84% of the defects that lead to failures are in fact created randomly by careless work practices throughout the entire organization.
For those who have not seen our earlier article on the ABC’s of Failure, we concluded that 4% of the defects are due to aging of equipment and 12% are due to basic wear and tear, which leaves 84% due to careless work processes. Someone who starts an initiative to improve reliability based on the conventional wisdom might expect to improve maintenance practices by implementing more preventive maintenance. However, by scheduling more frequent preventive tasks on equipment that already receives some preventive maintenance, or by expanding the preventive maintenance program to equipment that was not included before, we can only succeed in removing the defects that are created by the passage of time. That only includes the aging and basic wear and tear defects, and they represent only 16% of the defects. People get very frustrated when they go beyond that 16% because it becomes apparent that the probability of adding a defect while overdoing the preventive maintenance is higher than the probability of removing one. This also becomes very expensive and wasteful because work is being done to change parts that are in fact not defective. During 27 years of experience at DuPont, we went through about seven cycles of increasing preventive maintenance to the point of frustration and then abandoning most of the preventive work. Although important, preventive maintenance can’t solve all of our problems, and we are wrong to expect it to be more than it is.
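As a rough illustration of that arithmetic, the short Python sketch below adds up the defect shares quoted above. The category labels and the grouping of aging and wear as the only time-based causes are assumptions made for this example; the percentages are the ones from the ABC’s of Failure article.

```python
# Defect shares quoted in the article (percent of all defects).
DEFECT_SHARES = {
    "aging": 4,
    "basic wear and tear": 12,
    "careless work processes": 84,
}

# Assumption for illustration: only these two categories depend on the
# passage of time, so only they can be removed by scheduled preventive work.
TIME_BASED = {"aging", "basic wear and tear"}


def preventive_ceiling(shares, time_based):
    """Return the largest percent of defects that time-based PM could remove."""
    return sum(pct for cause, pct in shares.items() if cause in time_based)


ceiling = preventive_ceiling(DEFECT_SHARES, TIME_BASED)
remainder = sum(DEFECT_SHARES.values()) - ceiling

print(f"Reachable by preventive maintenance: {ceiling}%")
print(f"Out of reach of preventive maintenance: {remainder}%")
```

However much preventive work is added, the reachable share in this simple model never exceeds 16%, which is why the extra effort tends to end in frustration.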
The other best practice that everyone recognizes as having merit is predictive maintenance. In this case, it is recognized that failures are not predictable with time alone but depend on how long it takes for a defect to propagate to a failure event. It also depends on how good the technology is at detecting that defect before it becomes a failure. Reliability programs for predictive maintenance concentrate on getting the requisite variety of detection technologies to find defects soon enough to allow for orderly planning and scheduling. The computer model of the DuPont benchmark facilities found that the number of inspections required to ensure that >90% of the defects would be detected before the failure event occurred was so high that 97% of the time an inspection did not detect a defect at all. The difficulty with this approach is that it is demoralizing to sustain that kind of diligence over long periods of time except where the consequences of failure are catastrophic. Nuclear power plants are a great example of facilities that warrant this level of diligence and a good place to see how predictive maintenance can be very effective.
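To put the 97% figure in perspective, here is a minimal Python sketch of what it implies for the people doing the inspections. It uses only the percentage quoted above; the function name and the sample of 1,000 inspections are hypothetical, chosen for illustration.

```python
def inspections_per_find(no_find_rate):
    """Expected number of inspections carried out for each defect found."""
    find_rate = 1.0 - no_find_rate
    if find_rate <= 0:
        raise ValueError("no_find_rate must be below 1")
    return 1.0 / find_rate


NO_FIND_RATE = 0.97  # share of inspections that detect nothing (from the article)

per_find = inspections_per_find(NO_FIND_RATE)
sample = 1000  # hypothetical number of inspections
empty = round(sample * NO_FIND_RATE)

print(f"About {per_find:.0f} inspections are needed per defect found")
print(f"Out of {sample} inspections, about {empty} find nothing")
```

An inspector in this position comes up empty roughly 32 times for every defect found, which helps explain why the diligence is so hard to sustain.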
The problem in other facilities is that the cost of this kind of diligence is not competitive because the processes and equipment are much more complex than the simple process of boiling water to generate electricity. The experience with predictive maintenance at DuPont was similar to the experience with preventive maintenance. We started many predictive maintenance initiatives and succeeded in getting a routine operation going, but then someone looked at the results of the inspections and decided that inspections were being overdone because we only found defects 3% of the time. This led to abandoning many predictive maintenance technologies. I once admired the fact that a mechanic in DuPont knew ten different technologies for predictive maintenance. I asked him how he learned so many technologies. He said that he had been doing predictive maintenance for fifteen years. I replied, “But this initiative is only one year old.” He said, “Yes, but this is my ninth initiative.” A few years later, we declared victory, dissolved the corporate maintenance leadership team that was leading the predictive maintenance initiative, and completed the cycle once again.
So why did these initiatives consistently fail? The problem is not that they were pursuing the wrong best practices; it is that they failed to attack the larger problem of the randomness in the failure rates. When 84% of the failures are caused by a random lack of discipline in operating, maintaining, designing, procuring, and/or improving equipment, there is no efficient way to deal with the defects that these careless work habits generate. To cope with these defects, many companies add spare equipment. This just adds to the amount of equipment that has to be procured and maintained, and therefore increases the expense. One of the worst ways to do this is to keep a piece of equipment that has been replaced by a new one. We have seen many sites where the old piece of equipment is kept as a spare. In this case the maintenance cost is very high compared to maintaining the new piece of equipment, and the old piece is simply there to use when the new one is out for repairs. At DuPont, the plant that had the best pump life had zero spares. This decision caused them to treat the pumps like the precious assets they were.
As many of you have seen before, we use the stable domains to depict how reliability is generated by the behavior of the people. Below is the diagram in a simple form to illustrate another dimension to the picture.

People generally agree with this way of looking at how operations and maintenance of a facility can be classified in one of these domains. Over the last fifteen years, we have endeavored to show why the successful sites have skipped going to the planned domain. That domain is inherently unstable due to the randomness of the defects that exist and the other factors mentioned above. Although the planned domain makes the work more efficient, it does nothing to reduce the amount of work that must be done. In order to see more clearly why this happens, it is better to look at these domains from a different perspective. This other dimension is the amount of activity and therefore cost that is required to attain and sustain each domain. The figure below shows that view.

This diagram is representative of the improvement realized at the Lima refinery. The number of work orders was reduced by 67% over an eight-year period. This transformation, however, did not go through the planned domain to get to the precision domain. As the diagram shows, the extra work required in the planned domain would have made it much more probable that they would return to the reactive domain than progress to the precision domain. If they had undertaken the extra work to get to the planned domain, it would be logical to assume that another increase in cost and work would be needed to get to the precision domain. Fortunately for us, we had seen the data from the DuPont benchmark plants in Japan that had won the Total Productive Maintenance awards. These plants showed us that the amount of work, and therefore the cost of maintaining a highly reliable plant, was in fact even less than the work and cost of remaining in the reactive domain. In the 3D view these points have been combined to show that the precision domain has both the highest uptime and the lowest cost.

For equipment uptime an even better view is to look at this diagram as the 3D bar chart below.

Article submitted by Winston Ledet, Co-Author, Don’t Just Fix It, Improve It! A Journey to the Precision Domain

Comments (9)
Good Job!
1) Posted 9:00 am, 19 May 2010 by Robert Wegner
There is one thing that I would clarify. You use the term Planned Maintenance as a facility strategic approach and give some evidence to support by-passing this step. As I read, I assume you are referring to a facility that makes Planned Work the "do all and be all of maintenance." Our experience is the same. Making the planning effort and beefed-up PM and PdM your entire maintenance strategy will not guarantee high productivity and low overall cost. However, as a component within a comprehensive reliability strategy, it has a very important role.
Perhaps it is just semantics, but the category of reliability activities that we call "Work Management" includes Work ID & Control, Planning, Scheduling, Shutdown Management, and Supply Chain Management. In our model this is one strategic category. This area has been demonstrated to have a strong correlation with reducing overall costs at our plants. Mills that perform well in the work management areas are also mills that have a low cost structure. At the same time, mills that are weak in this area are higher in cost. Based upon our data there is not a clear correlation between planned work and OEE. The expected result of excellence in planning is increased employee (usually maintenance) efficiency. It is not necessarily improved effectiveness.
To your point, we have found that Precision Practices, which we include in another strategic category, do have a direct correlation with OEE. And in keeping with your findings concerning TPM effectiveness, our plants that practice Operator Care (TPM) to a high level have much higher OEE than those that do not. However, there is not the clear correlation between Precision and cost that you have described.
For us, a holistic approach to reliability that pursues excellence in six categories (management / culture, work management, basic cares, reliability tools, measures, and competency) is necessary to achieve the twin goals of high reliability at a competitive cost in a sustainable manner. My suspicion is that your term "Precision" in actuality includes all of these areas.
2) Posted 9:44 am, 19 May 2010 by Dan Moss
I frequently use Winston's breakdown of how defects are introduced to illustrate how we are doing things wrong. His percentages are quite close to Ron Moore's.
Unfortunately, Maintenance Managers are almost always engineers and, engineers being engineers, they want to focus on the equipment and not the people.
The advice I always try to instill is to understand where the defects come from and then use Weibull or Crow-AMSAA (thanks to Paul Barringer) to see if there have been any changes. We have eliminated thousands of PM's and proven statistically that they were not doing any good, and for essentially every category we are seeing a reduction in the number of Corrective WO's.
But Winston is flying directly into the face of all of those that teach classic RCM and so many do not embrace his teachings. When it is those in my industry and our competitors, I just smile because I know that we can outperform them!
Keep the faith Winston!
3) Posted 1:19 pm, 19 May 2010 by Tom
4) Posted 6:11 pm, 19 May 2010 by Graham Chevis
What we discovered was that Dr. Deming was correct… the workers are just working within a complex system that is an interactive by-product of many factors… factors that only management can change.
This was so profoundly illustrated as we watched “The Manufacturing Game” in action. Operating supervisors and maintenance supervisors swapping roles yet taking on behaviors driven by the dynamics of their new positions within the system of a simulated and condensed 35 week period.
It was an eye opening experience to say the least! The Refinery Manager jokingly accused me of “brainwashing our people”… as operators returned expounding the benefits of teamwork and “defect elimination”. It was the paradigm shift needed to “kick-start” the reliability improvement effort and it was extremely effective.
Our strategy was integrated around the Operator, Mechanic, Equipment and work management system relationships. Based on the new science of Complex Adaptive Systems, we set up the conditions for personal pride/ownership, installed user friendly systems, provided the feedback loops and watched the excellence EMERGE from within.
Winston talked above about the high percentage of random failures “the 84% of failures due to careless work practices…” Well I have an issue with the word “careless”… but the percentage is closely confirmed by Moubray’s RCM failure mode work to be about 82% random. The word “random” however meaning we can’t find a single “assignable cause” but we can start to monitor and watch the conditions that can reduce the “probability” of failure.
So my point is that even though 84% of failures originate from “unassignable causes”, there are effective and holistic ways to impact the “system” to increase reliability.
By the way, the northeast refinery I am speaking of is currently operating at a 98.6% mechanical availability (including turnarounds..) ..one of the most reliable in the country.
5) Posted 9:58 am, 21 May 2010 by Chuck Wallace
6) Posted 8:03 am, 26 May 2010 by Ron Wallace
What we discovered was that Dr. Deming was correct… the workers are just working within a complex system that is an interactive by-product of many factors…. factors that only management can change. This was so profoundly illustrated as we watched “The Manufacturing Game” in action. Operating supervisors and maintenance supervisors swapping roles yet taking on behaviors driven by the dynamics of their new positions within the system of a simulated and condensed 35 week period.
It was an eye opening experience to say the least! The Refinery Manager jokingly accused me of “brainwashing our people”… as operators returned expounding the benefits of teamwork and “defect elimination”. It was the paradigm shift needed to “kick-start” the reliability improvement effort and it was extremely effective.
Our strategy was integrated around the Operator, Mechanic, Equipment and work management system relationships. Based on the new science of Complex Adaptive Systems, we set up the conditions for personal pride/ownership, installed user friendly systems, provided the feedback loops and watched the excellence EMERGE from within. From a “defect elimination” standpoint we focused on empowering the operator to identify defects early on the “P-F” curve and set up a maintenance response process that actually built teamwork in the process.
Winston talked above about the high percentage of random failures... “the 84% of failures due to careless work practices…” Well I have an issue with the word “careless”… but the percentage is closely confirmed by Moubray’s RCM failure mode work to be about 82% random. The word “random” however meaning we can’t find a single “assignable cause” but we can start to monitor and control the conditions that can reduce the “probability” of failure. So even though approximately 84% of failures originate from “unassignable causes”, there are effective and holistic ways to impact the “system” to increase reliability. By the way, the northeast refinery I am speaking of is currently operating at a 98.6% mechanical availability (including turnarounds..) ..one of the most reliable in the country.
7) Posted 8:19 am, 26 May 2010 by Chuck Wallace
Reliability is defined as the probability that a component, device, or system will perform satisfactorily for the designated period of time under design conditions. The concept of reliability as a probability means that any attempt to quantify it must involve the use of statistical methods. Whether an item works for a particular period is a question that can be answered as a probability.
Identification of the reasons for failure of process plant machines is the first step in obtaining increased reliability. “ The only satisfactory arbiter of reliability is performance in the field. If the user is to help himself, there is no alternative to the adequate collection of failure records. In order to get increased reliability, the causes of failure must be identified and appropriate actions taken to eliminate or reduce them.”
The combination of reinforcing reliability-driven behaviors and utilizing the practical techniques of data analysis and tracking will guide the organization to the optimum reliability evolution stage.
8) Posted 4:42 pm, 25 May 2011 by Woodrow Roberts
All the predictive testing in the world cannot predict when an operator will run a pump for hours with the discharge valve closed.
As long as Operations sees themselves as the customer and the Maintenance team as the ones to replace the broken equipment, it will not get better.
And too many Maintenance Teams like to see themselves as "Mighty Mouse" rushing in and proclaiming "Here I am to save the DAY!"
When an organization's sense of worth comes from "saving the day" and providing "good customer service" then things will not change.
Winston's Manufacturing Game will be an eye opening experience to many who think they are doing the right things.
9) Posted 9:54 am, 26 May 2011 by Ben Thayer