The ABC’s of Failure – Getting Rid of the Noise in Your System

For the past 40 years, I have observed many companies, including DuPont (where I spent 27 years), pursuing planned maintenance with the standard tools of planned maintenance (inspections, planning, scheduling, materials procurement, CMMS systems, etc.) with the same results. They succeed for a while and get their percentage of planned and scheduled maintenance up to 80% or more, only to see it drop back later to 60%. I am amazed how many of the companies we work with have had this experience.

This pattern of behavior has led us to conclude that the reason for this experience is that only 60% of the normal work of maintenance is inherently plannable. The rest of the work is created by random acts of what we are now calling carelessness. The sites where we see people break this pattern and achieve 92% to 96% planned maintenance for the long term without regressing are the ones that eliminate the inherently unplannable work. Of course, this cannot be done by maintenance alone. Everyone who does work at a site contributes to the defects that create the unplannable work, and therefore everyone must participate in eliminating the defects that create the 40% of the work that is unplannable.

In the diagram below we attempt to more clearly articulate the true significance of Defect Elimination by outlining the ABC's of failure. Fundamentally, failures happen because things that exist are not perfect. To reduce failures we must eliminate as many imperfections as possible. One of the classes of imperfections is "defects". Our studies have concluded that all failures of equipment and processes can be traced back to defects (we use bugs to symbolize defects). Therefore, defects are the basic cause of all of our failures and can be classified into three sources: A, B, and C.

A. Stands for Aging, which typically generates about 4% of the defects that eventually become failures over long periods of time (25-50 years). This is easiest to see in site infrastructure such as steel and concrete. These defects are generated even if the equipment is not operated at all.

B. Stands for Basic Wear And Tear of the equipment when we operate it, which typically generates about 12% of the defects that become failures over shorter periods of time (1 month to 7 years depending on the quality and appropriateness of the design). This may be easiest to observe in the most critical equipment in a site. This equipment usually has a long Mean Time Between Repair and has everyone's attention to any small deviations.

C. Stands for Care-less Work Habits, which contributes the remaining 84% of the defects that become failures over random periods of time. Care-less is not the same as irresponsible. While irresponsible habits would be included in the care-less category, they are a small part of the total. By care-less we mean "not providing the care" that the equipment needs to run perfectly.
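The three shares above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary and function names are hypothetical, and it assumes, as the Aging description states, that Aging is the only source that keeps generating defects while equipment is not operated.

```python
# Shares of defect generation by source, in percent, as quoted in the text.
DEFECT_SHARES = {
    "A: Aging": 4,
    "B: Basic Wear And Tear": 12,
    "C: Care-less Work Habits": 84,
}


def generation_when_idle(shares):
    """Percent of normal defect generation that continues when equipment is idle.

    Assumption: only Aging defects are generated while nothing is operated.
    """
    return shares["A: Aging"]


def reduction_when_idle(shares):
    """Percent reduction in defect generation achieved by shutting equipment down."""
    return sum(shares.values()) - generation_when_idle(shares)


print(sum(DEFECT_SHARES.values()))         # 100
print(generation_when_idle(DEFECT_SHARES))  # 4
print(reduction_when_idle(DEFECT_SHARES))   # 96
```

The three sources account for all defects, and idling equipment leaves only the 4% Aging stream running.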

The ABC's of Failure

Most people do not realize that the vast majority of defects are created by Care-less Work Habits and underestimate the significant impact that defects can have. People waste much of their time and energy on trying to prioritize a long list of failure repairs that should never have happened in the first place. People refuse to accept that they have Care-less Work Habits because they equate carelessness with irresponsibility.

Most people fail to understand this way of looking at defect generation because it is hard to accept that we could operate that badly and still survive. The validation for this insight is well documented in Total Productive Maintenance (TPM) award winners in Japan who tell the story of eliminating 90% to 98% of their failures as far back as 1991. The data to support this has been around a long time. It is simply hard to believe. This is the blindness created by our assumptions. The BP Lima refinery eliminated 87% of their pump failures when they implemented defect elimination. One assumption that blinds us is that the failures are avoided by doing a repair early as a preventive task. This is clearly not what we saw at the TPM award sites. The defects that cause the failures are removed without need for a repair or the defects are never generated in the first place.

In our experience, we have observed three modes of behavior that create performance in three different Stable Domains. The domains are Reactive, Planned, and Precision. We concluded that there are three potential modes of action that an organization can take when dealing with defects.

  • It can React to the difficulties it faces without growing capability,
  • It can Plan, seeking to impose its will on anticipated future circumstances, or
  • It can choose to create a culturally based system of Purposeful action that embeds new ways of working into the everyday actions of workers.

The Planned Domain is a step above the Reactive Domain, but it is unstable. It creates the need for a parallel organization to plan improvement activities like equipment maintenance and personnel training. We now look at the previous chart on Stable Domains in a different manner from just a few years ago. We do not believe that it is necessary to go through the Planned Domain to achieve the Precision Domain. Instead, we see this as a bifurcation when moving from Reactive: do you intend to try to plan unplannable work, or do you intend to eliminate unplannable work? By rotating the axis of the previous chart, it is easy to understand that trying to attain the Planned Domain will cause you to work harder. Instead, we recommend achieving the third domain through a program of eliminating unplannable work by defect elimination, with a core logic at every level of the company such as "Don't just fix it, improve it." How, then, can such a change be led without becoming yet another initiative? This is a topic for creative thinking on all our parts, and such change is sufficiently rare that there are no pat answers. One thing we are sure of is that it is necessary to pursue a higher purpose to achieve the Precision Domain.

So we ask, "Can each and every employee at a site answer the question, 'How do I personally contribute to the company's vision?'" Each and every employee who cannot answer that question is unwittingly contributing to the carelessness category of defects. There is no way companies can achieve inspirational visions unless they become great in their core business. So how can we advance up the Stable Domains?

The Life of a Defect

In order to understand why Careless Work Habits are such a large part of the failures category, it is helpful to consider a typical life of a defect. A defect is born when something is done that is not perfect. For example, the nut on a bolt that holds down the base plate of a pump is not tightened to the right torque. Life begins as a loose nut, which tends to loosen further if there is any vibration in the pump. As long as the pump is not running, this defect is Aging, and the fact that it is loose allows moisture to get under the base plate, where corrosion begins to form. Thus, our defect has spawned another defect that is also growing. Both of these defects can have long lives, since it would take 25 to 30 years for the base plate to corrode through. However, an operator decides to use this pump, so he starts running it, which throws some strain into the bolt because of the motor torque and creates some vibration. The vibration causes the nut to become even looser, and now the base plate can warp slightly under the strain. This lets in more moisture and creates more corrosion, throwing the pump slightly out of alignment and putting an off-balance load on the bearings. So you can see where this is going. As the pump operates, defects of the Basic Wear and Tear category are generated. On top of that, we now have Careless Work Habit defects being generated from the loose nut, because the operator closes down the suction valve to quiet the shaking, which creates some cavitation at the impeller, leading to erosion. Therefore, the saying "Defects Beget Defects" sums up the typical life of a defect.

The rest of the life of this defect is determined by how much people care about the pump. If the operator notices that the nut is loose and uses a wrench to tighten it to the proper torque, the defect has a very short life and never becomes a failure. If no one takes care of the loose nut, it continues to beget more defects until a failure or multiple failures occur in the pump. This brings us to another principle: "Failure Events create extra Defects". If this pump is allowed to run until the bearings seize, the power from the motor could slightly bend the shaft in the act of failing, and the bent shaft becomes the defect that causes the next failure. So the total number of failures that result from a defect is determined by how carefully the pump is operated and maintained.

Careful Work Habits can be defined as noticing defects when they are very small and removing them before they generate other defects or cause failure events. This is the essence of Total Productive Maintenance as practiced in Japan.

How Do Defects Affect Safety

If our equipment was perfect and we operated and maintained the equipment perfectly, there would be very few safety problems. Failures inevitably direct energy into places that cannot bear the amount or intensity of the energy and therefore cause collateral damage. This misdirected energy then is the source of hazards to the people and equipment. So the principle is Defects create Failures that cause Hazards that can hurt people and/or damage equipment. Personal Safety programs create capacity in the organization to cope with the hazards when they happen. Process Safety Management programs create capacity in the organization to eliminate the defects that are the root causes of the failures and hazards. The Personal Safety programs are typically designed for the immediate reactions and short term dealings with defects. The Process Safety Management programs have to deal with all three classes of defects. To properly deal with Aging defects the Process Safety Management programs must look ahead 25 to 50 years. This long term perspective is the reason that Process Safety Management requires very senior management attention to be effective. They are typically the only ones who can support programs that have such a long life.

The Remedies

First, we expanded the operating domains chart to include two more domains that were not in the original benchmarks we conducted at DuPont. The diagram below shows 5 domains to account for a state below Reactive, which we call Regressive, and a state above Precision, which we call World Class. The Regressive Domain is below Reactive because the organization, for whatever reason, is not capable of reacting to all of the needs of the facilities. The World Class Domain is added to recognize the opposite end of the spectrum where the inevitable progress happens through human innovation.

Determine which Stable Domain your organization is in

In order to decide what should be done, first it is advisable to determine which domain your organization is currently in. In this evaluation it is wise to consider the history of the organization to make a good judgment. For example, the defects generated by the Aging sources of defects have long lives before they become failures. Therefore, even though Aging is only 4% of the total defects being generated, if they are left unattended for 40 years, they can produce failures equal to more than 100% of normal failure rates. At this point, your organization has fallen out of the Reactive Domain into the lower one, which we call the Regressive Domain. In this domain, at least twice the number of failures as normal must be dealt with. This is the case at many older facilities where the infrastructure has been neglected. In this domain, the organization usually does not have the resources to deal with the situation.
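The claim that a 4% source can come to dominate is easier to see as arithmetic. The sketch below is one possible reading, not the author's own calculation: it assumes Aging defects accumulate linearly at 4% of a normal year's defects per year, that none are removed, and that the whole backlog eventually surfaces as failures. The function names are hypothetical.

```python
AGING_SHARE_PCT = 4  # Aging's share of yearly defect generation, from the text


def aging_backlog_pct(years_unattended):
    """Accumulated Aging defects as a percent of one normal year's defects.

    Assumption: linear accumulation, with no Aging defects removed.
    """
    return AGING_SHARE_PCT * years_unattended


def total_load_pct(years_unattended):
    """Normal yearly load (100%) plus the accumulated Aging backlog."""
    return 100 + aging_backlog_pct(years_unattended)


def years_until_backlog_equals_normal():
    """Years of neglect before the backlog alone matches a normal year."""
    return 100 / AGING_SHARE_PCT


print(aging_backlog_pct(40))                 # 160
print(total_load_pct(40))                    # 260
print(years_until_backlog_equals_normal())   # 25.0
```

Under these assumptions the backlog passes 100% of a normal year after 25 years of neglect, which is consistent with the "at least twice the number of failures as normal" that defines the Regressive Domain.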

For sites in the Regressive Domain

If you end up in this Regressive Domain, your people are not able to react to all of the failures adequately. At this point, there are several choices that can be made.

  1. You can decommission the facilities that are in the Regressive Domain.
  2. You can double your resources for dealing with failures.
  3. You can reduce the rate of operating the facility so that the failures from the Basic Wear And Tear are reduced to a level that can be handled by existing resources. (This is a good choice since Basic Wear And Tear is 12% of the failures and the reduction in rates usually also reduces the defects from the Careless Work Habits because there is less activity going on.)
  4. You can improve the efficiency of the process for dealing with failures which might be a better planning and scheduling process, parts procurement, predictive techniques, management of work, etc.
  5. You can lower your standards of performance and let the resources you have do the best that they can and hope nothing catastrophic happens.
  6. You can sell the property to an organization that has the resources to restore it to sustainable performance.

All of these options for getting out of the Regressive Domain are survival techniques and do not add to the organization's capacity to sustain the performance over long periods of time. They are just ways of mitigating the risk of having a catastrophic event. These choices basically return you to the starting line in the Reactive Domain.

For sites in the Reactive Domain

Option 1. Pursue the Planned Domain

Once you have succeeded in restoring performance to the Reactive Domain and are meeting your standards, the question of what domain you want to pursue can be addressed. If you decide to pursue the Planned Domain, some means of dealing with the fact that this domain is unstable must be established. The only facilities we have observed that have stayed in the Planned Domain for long periods of time have implemented some version of the "900 pound Gorilla" strategy. The essential element of this domain is that your organization will impose its will on the facilities by force of some personalities. Therefore, either those particular people have to remain in the power positions at the site for long periods of time to make sure the infrastructure is maintained, or you have to have a very strong discipline of succession in those jobs to avoid the inevitable back-sliding into the Reactive Domain.

Option 2. Pursue the Precision Domain

The Precision Domain is very stable and resilient to change. The BP Lima and Premcor Port Arthur sites have remained in the Precision Domain for 10 or more years despite multiple changes in ownership and management. One simple explanation for the difference in stability is that rather than relying on a "900 pound Gorilla" to maintain performance, a site has "900 one pound Gorillas" maintaining performance.

If you choose to pursue the Precision Domain, there is a 3-stage program that must be achieved before the defect elimination culture is created. There are also three essential processes to make the transformation.

  • The first is a process to create urgency to change that is based on a strong business case for change.
  • The second is a process to empower the workforce to make decisions in their work based on a culture of defect elimination.
  • The third is a strong leadership process to guide the change process long enough for the defect elimination habit to be self-sustaining in the form of a work culture.

This program can be accomplished in 12 to 18 months but must go through three stages. The first stage is to unfreeze the organization by establishing that the status quo is no longer acceptable. The second stage is to change the processes, structures, and systems so that they support the defect elimination culture. The third stage is to refreeze the organization into the new culture and end the change process. In the third stage, the organization continues to improve and people learn that continuous improvement does not mean continuous change, but comes when improvement is accepted as a normal part of the everyday work and involves constant vigilance to provide the care needed to improve the performance as the environment changes.

Recommendations

We recognize that most organizations are stretched and having difficulty meeting the many demands being placed on them. Therefore, we have recommendations for what should be discontinued to free up resources as well as what should be done.

What to do

Based on our observations, we believe that many companies have assets that have fallen into the Regressive Domain and consequently we make the following recommendations.

  1. Significantly reduce the defect generation rate in equipment and processes in the Regressive Domain through the use of equipment proprietors. A well-designed proprietor system restores deteriorated infrastructure to acceptable standards. The proprietor system should be designed as a widely distributed system to match the distributed nature of the defects. Basically, every cubic inch of property should be assigned to a proprietor who is in a position to visually inspect the real estate and hardware personally. The proprietor should be the voice of the needs of the property he is assigned. Each proprietor should be assigned no more property than he could inspect in one day using his normal mode of transportation. The role of proprietor is simply to run equipment within standards or shut it down. When a property is shut down, only the Aging defects accumulate, which is a reduction of 96% in defect generation rate, so the property can remain relatively stable until the resources are available to deal with the higher defect generation rate. The proprietor should not be the budget holder for his property. That way he can concentrate on maintaining standards and not have them overruled by monetary constraints.
  2. Engage the entire workforce in defect elimination using cross functional Action Teams as a means of creating a culture that assumes equipment improvement as a normal part of the everyday job. Once a particular proprietor determines his facility to be out of the Regressive Domain and into at least the Reactive Domain, a systemic process of defect elimination must be created within the workforce. The basic need is to learn how to work cross functionally instead of in silos and to learn to treat contractors the same as employees in this process. Much of the randomness created in the Careless Work Habits category comes from the lack of appreciation for the needs of the other functions that impact common equipment and work processes. By using cross functional teams to eliminate defects, people learn how to be a team while doing their normal work. We have observed an incredible amount of organizational learning take place among people in these teams. Launching cross functional Action Teams must continue until this cross functional way of working becomes a habit and the generation rate from Careless Work Habits is reduced by about 30% of total work orders.
  3. Create a leadership process for the culture change based on boundary setting that creates freedom for the workers and proprietors to make decisions aligned within standards established through reflection and dialogue. Management must learn how to create boundaries for cross functional teams so that the teams are free to make decisions on their own that are within the level of risk that is tolerable to the management. The focus of management work should be to avoid two other kinds of imperfections -- excesses and recycles. This requires conscious and creative energy. The best questions a manager can ask are "Where did the excess energy come from that created these defects?" and "How can we keep this kind of a failure from happening again in the future since we can't change the past or the present?"
  4. Standards should be set on the tolerance of imperfections in the outcome of work and not on controlling the process for attaining these results. Process controls only deal with the functional aspects of an operation and ignore the "will" and "being" of the situation. Therefore, process controls only deal with one third of what the people are experiencing. While many processes have universal application to many different situations functionally, they do not deal with the will of the situation or the being of the organization dealing with the situation. Attempts to apply universal processes to all situations tend to get so complicated that people are not able to use them. Many of these processes are based on the assumption that Aging and Basic Wear And Tear sources of defects are the only defects present and therefore ignore 84% of the situation.

What to stop doing

  1. Do not treat each initiative as if it were independent of all others. Use defect elimination potential as the key principle in rationalizing all initiatives into one to maximize the probability of success and minimize the risk of catastrophe along the way.
  2. Do not use internal change agents. Using outside change agents is advised to preserve the talent of employees to concentrate on the main line work of pursuing the corporate vision. The change agents will only be required for a few years to make the culture change and will not be needed after that so it is better to rent them rather than own them. Also once the change reaches Stage 3, it is necessary to remove all change agents so that the organization can refreeze into an efficient operation of the new culture.
  3. Do not focus on implementing systems. Implementation of systems to institutionalize the new culture should not take place until you reach Stage 3 of the change process. If you implement systems first, you do not have the culture to use them so the systems get perverted to accommodate the old culture, which takes an incredible amount of time and distracts the most talented people from the main work of changing the culture. Implementation in Stage 3 is better because the work practices match the system being implemented, and by this time as much as 84% of the defect generation rate has been eliminated so the job is one-sixth the size as it would be in Stage 1 of the change.
  4. Do not focus on team building. Most people attack this change in work culture by training people how to be a team. In many of these approaches, there is no common work for the people to do together in the training so they learn how to be a team when doing training exercises but not how to do their normal work as a team. Recommendation #2 of what to do above will create the desired teamwork and is an example of how resources can be made available by integrating initiatives.
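The "one-sixth the size" figure in item 3 above follows from the 84% share if the size of the system-implementation job is taken to scale with the defect generation rate. That proportionality is an assumption made here for illustration, and the function name is hypothetical.

```python
def job_size_ratio(eliminated_pct):
    """How many times larger the Stage 1 job is than the Stage 3 job.

    Assumption: job size is proportional to the remaining defect generation rate.
    """
    remaining_pct = 100 - eliminated_pct
    return 100 / remaining_pct


print(job_size_ratio(84))  # 6.25
```

With 84% of defect generation eliminated, 16% remains, so the Stage 1 job is about six times the Stage 3 job.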

Summary

In July of 1998, the BP Lima refinery experienced the successful completion of a Hero's Journey change effort.

This Hero's Journey change began earlier, in the 1980s. Lima employees did not realize at the time that, through their hard work to keep the plant profitable, they would ultimately save their own jobs and save their plant from closure.

Guided by Continuous Improvement Forums, everyone at the site participated in defect elimination practices that improved operational profits by over $40 million a year, which tripled their earnings in three years. At the same time, safety improved as well. OSHA recordables dropped from five per 200,000 man-hours to two, and they experienced 5 years without a lost workday case. Environmental impact was reduced, which also contributed to profit by reducing crude losses from 1.3% to 0.3%, a pacesetter performance for a refinery. Mean time between pump failures changed from 12 months to 50 months, which accounted for $1.5 million in cost savings, but more importantly, it reduced the number of work orders on pumps by 78%. The total workload for the maintenance people dropped by 37.5% in those 3 years. But that is only half of the story: in the next 5 years, they cut the maintenance workload by another 50%, for an overall reduction of 70% in maintenance work orders. Their budget for maintenance dropped 60% in the 8-year period, and they were running the refinery with 20% less manpower.
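Two of the figures above can be cross-checked with simple arithmetic. The sketch below uses hypothetical helper names and two stated assumptions: successive workload cuts compound multiplicatively, and the pump failure rate is inversely proportional to mean time between failures for a fixed pump population.

```python
def combined_reduction_pct(*step_reductions_pct):
    """Overall percent reduction after applying successive percent cuts.

    Assumption: each cut applies to the workload left after the previous one.
    """
    remaining = 1.0
    for pct in step_reductions_pct:
        remaining *= 1 - pct / 100
    return (1 - remaining) * 100


def failure_rate_reduction_pct(mtbf_before, mtbf_after):
    """Percent drop in failure rate implied by a change in MTBF.

    Assumption: failure rate is proportional to 1 / MTBF.
    """
    return (1 - mtbf_before / mtbf_after) * 100


print(combined_reduction_pct(37.5, 50))             # 68.75
print(round(failure_rate_reduction_pct(12, 50), 1))  # 76.0
```

A 37.5% cut followed by a further 50% cut gives about 69%, in line with the quoted overall reduction of 70%. The MTBF change from 12 to 50 months implies roughly 76% fewer pump failures under these assumptions, close to the reported 78% drop in pump work orders.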

Contrary to other cost cutting experiences in the past, these improvements are continuing after 13 years without outside help. This was sustained through four changes of ownership and many managers. The first sale of the refinery in 1998 was for $215 million. In 2007, the Lima refinery was sold for the third time, to Husky Energy, for $1.9 billion, which the Wall Street Journal pointed out was a 51% premium over any other refinery that was sold in the USA that year.

The results Lima experienced are staggering. What was the formula for success at Lima? And what can people do to make this formula successful for them?

As stated earlier, three things are required to make a culture change of this type.

  • An urgent business need. If your site needs to improve safety by at least 40%, reduce maintenance costs by 30% to 40%, and improve earnings while improving the quality of life for everyone on site, then perhaps you have the urgent business need.
  • An empowered work force to make decisions in their work. See the NPRA paper "A case study in effectively implementing corporate change initiatives" in the 2007 NPRA Maintenance and Reliability Conference proceedings to learn more about how this is done.
  • A strong leadership process to guide the change. Since initiative overload is common today, the first step leaders should consider is rationalizing all initiatives into one with defect elimination as the key principle.


Watch a Video Presentation on the ABC's of Failure by Winston Ledet