Human Error in Maintenance and Reliability and What to Do About It

  • A long-term strategy and set of tactics to assure that errors are progressively minimized and/or mitigated for the duration of an organization's existence,
  • A simple analytical method to detect recurring problems that cause seemingly small delays or reductions in throughput or delivery of a service,
  • Who to turn to first for solutions, especially when safety and/or equipment reliability are concerned, and how to get their attention,
  • What types of policies get the most cooperation from all levels of an organization, when the goal is to minimize human error and maximize profits in an increasingly competitive, global economy,
  • A systematic root cause analysis technique that focuses first on the human elements rather than on the technical elements of the overall problem.

Introduction: In observations from leadership positions over the past 50 years, I have seen many "new" approaches set forth as the ultimate answer to maintenance and reliability (M & R) improvement. Some of the promoters of these new methodologies or approaches have attempted to appeal to all organizations in commerce, academia and government. These ideas come and go - mostly go - into obscurity. However, a number of fundamental truths and fairly simple concepts seem to me to work best in all venues. This presentation is an attempt to summarize those that I have found to work best and to provide some guidance as to where to find more information on the basics that seem to me to work universally.

These observations are presented in the form of "concepts." The concepts are presented to help illuminate the most common human errors. Errors are committed because of omission or ignorance of the best ways to get consistent results from people, the overwhelming majority of whom, in my opinion, want to do the best possible job at whatever they are assigned or choose to do. The concepts are interrelated and cannot be treated in isolation. Concentrating on one while neglecting the others means missing something and applying each of them less well.

Concept #1 - Exercise and practice leadership as well as management.

Starting at the nominal "top" of any organization, its "management" group or individual, I find that many problems could be solved easily and most effectively if attention is paid to the functions that those in top positions of organizations should be performing. In my mind the most important of these functions can be summarized as follows:

  • Providing direction, objectives and goals through effective (two-way) communications,
  • Obtaining (and keeping) the resources needed for people to do their jobs most effectively and efficiently,
  • Removing impediments to reaching the ultimate objectives of the organization,
  • Adding or removing constraints that are needed to keep the organization focused,
  • Thoroughly understanding and constantly refining the processes that the organization has to execute to serve its customers,
  • Providing leadership.

Unfortunately for those subject to management with little or no leadership, one finds that a great deal of time is spent on organization and reorganization, a waste of time for anything other than communications, in my opinion.

In the course of executing management functions, however, "managers" many times do not (or do not know how to) exercise leadership. The goal of this paper is not to teach leadership, but merely to call attention to the absence of it in many instances and where to find information on what it is and how to learn about it.

One of the key ways of exercising leadership is to learn to listen. Listen to those in positions above (who must serve all below), co-equal persons alongside you and employees in positions below yours (those you serve). Combining the two key elements of leadership and continuous process refinement mentioned above and discussed in detail in the references listed in footnotes below, the best way to obtain and maintain the processes related to maintenance and reliability (and the overall business processes from which the latter flow) is to engage key employees in technical and first line supervisory positions of the organization in developing them.

The initial process diagrams should be complete with the identification of existing impediments, needed (and unnecessary) constraints and resources needed to make them most effective and efficient (clean and clear) in actual practice. It's the manager's job (and should be his or her "agenda") to make this happen, once the list of impediments, constraints and resources is prepared. He or she must get busy and act as decisively and rapidly as possible to clear any obstruction to success of the processes needed to serve both internal and external customers. In so doing, the smart manager goes far beyond obtaining "buy-in" and creates a sense of "ownership" in the processes among all who participate in development and refinement, a far more valuable attribute.

Concept #2 - Look first at "programmatic" rather than "technical" solutions to reliability problems.

In the decade after the 1979 Three Mile Island incident involving a nuclear reactor core meltdown, the organization responsible for regulation of commercial nuclear power plants and related facilities, the U.S. Nuclear Regulatory Commission (USNRC), searched for ways and means to avoid future events of this nature. Many of the solutions involved redesign of power plant control rooms and instrumentation as well as plant safety systems for mitigating and controlling any malfunction at its early stages. This was nothing new to those in the regulatory organization. It was revisiting reactor and safety system designs that it had concentrated on from the USNRC's beginning, even before separation from the Atomic Energy Commission and the Department of Energy. What was new to them was the need to regulate the operation and maintenance of existing plants, something that was in the early stages of development and implementation when the 1979 incident occurred. In fact, the way the applicable law and regulations were written in 1979, the USNRC had only limited authority to regulate maintenance practices. It was not until the mid-1990s that, despite intense lobbying and resistance by nuclear utilities and their various industry organizations, the "Maintenance Rule" became law, thus giving the USNRC the authority its Commissioners felt was needed to assure safety.

Among the many studies performed under USNRC sponsorship was one concerning "programmatic root cause analysis." A model was developed addressing the essential precepts of the causes of human errors in preventive and repair (often called "corrective") maintenance. The model was tested for several years at a nuclear power plant which had new and open-minded management (for reasons which will become obvious as you read ahead). The model provided a set of four (4) "diagnostic query diagrams" titled as follows:

  • The Training Query
  • The Procedures/Documentation Query
  • The Quality Control (QC) Query
  • The Management Query

Some surprising results came from this study, including but not limited to the following:

  • Although technicians performing maintenance were not eliminated as root causes of defective maintenance, their inadequate performance was found to be most likely the effect rather than the root cause of subsequent (infant or premature) equipment failures.
  • The queries focused on isolating such root causes as line and upper management performance (including their attitude towards their responsibility for craftsperson performance), procedures and documentation (both the product and the process), training (both delivery and the process), the managers of such program elements, and quality control.
  • Quality control was found not to be a primary cause, but a co-cause.
  • Management was found to be both a primary as well as a co-cause of inadequate maintenance by technicians.

The landmark finding that management performance was often the root cause of premature or infant equipment failure after maintenance had monumental impact. Often, the continuation of or granting of new operating licenses for nuclear generating plants and related facilities rested upon the assessment by regulators of the attitude of nuclear utility managers at all levels towards this cause. This led in some cases to major shakeups in management teams until those in place could satisfy regulators that they understood their shared responsibility for equipment failures alongside those whose hands were actually on the equipment.

This finding concerning management's direct involvement in equipment failures is seldom, if ever, applied outside the commercial nuclear power industry.

Only now is it being considered for application (along with many other initiatives) to the British Petroleum (BP) Refinery at Texas City, Texas, which was involved in a fatal accident on 23 March 2005 that resulted in 15 deaths and a much larger number of serious injuries.

The costs were not limited to those on site and to BP stakeholders worldwide. The long outage that was needed for disaster recovery (along with maintenance problems at other plants for a variety of reasons) caused a shortage in refinery products that everyone in the country paid for in terms of a spike in fuel prices for many months after the accident.

Another finding that ultimately came to light from the USNRC-sponsored study was that the majority of root causes of infant or premature equipment failures could be eliminated or at least mitigated, and reliability improved, less expensively and more rapidly by programmatic solutions than by technical (re-design) solutions. That's why the concept is stated as it is.

Those who have developed approaches to root cause analysis of equipment failures all claim that their methods address the queries listed above, and there is no doubt that this is true in theory. However, in actual practice, management and many other programmatic causes of failure are considered coincidental, if not prohibited, areas of investigation requiring corrective action. That is, unless and until a major fatal accident occurs and an outside, independent panel conducts an investigation, as happened in the BP Texas City case.

Indeed, the programmatic root causes of failures may well go beyond the industries that suffer from them. The rash in 2006 and 2007 of children's toys and costume jewelry items containing lead that had to be removed from store shelves certainly has as one of its root causes the lack of oversight of foreign manufacturers by their U.S. partners and the significant reduction in staff and other resources for testing such items at the U.S. Consumer Product Safety Commission (USCPSC). At least one of the major toy distributors in the USA apologized to the American people in testimony before a Congressional hearing committee on the subject in mid-2007. On 11 September 2007, the USCPSC and its mainland Chinese government counterparts announced an agreement on work plans concerning safety of toys, fireworks, cigarette lighters and electrical products.

Findings of contamination in a large number of imported food products from several countries, which got heavy media attention during 2007, may be traced to the lack of U.S. Department of Agriculture and/or customs inspectors at ports of entry and in U.S. processing plants. Certainly these may also be traced to lack of management attention or poor attitude, but a co-cause may well be lack of resources resulting from political decisions concerning regulatory agencies charged with inspection and enforcement of the rules. Countries of origin may also share some of this responsibility. In China, a whole new regulatory regime is being created to address this deficiency and many marginal producers have already been shut down as a result of early action.

Concept #3 - Look for indicators of small, seemingly insignificant but repetitious reliability problems and act on the findings.

Another feature of the report on programmatic root causes of equipment failure was the description of an easily implemented analysis method for determining where to apply limited resources to solve many problems of equipment maintenance and reliability. The method was given the title "Cluster Analysis."

Cluster Analysis is explained in just three pages of the report and consists of the following steps:

  • Sort the data
  • Identify clusters
  • Determine which clusters are relevant
  • Group the clusters into categories
  • Determine the consequences of relevant clusters
  • Determine technicians involved, when necessary
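The steps listed above can be sketched in code. The report's own data layout and relevance criteria are not given here, so everything in this sketch is an assumption for illustration: the `Event` record and its fields, the idea that a "cluster" is a repeated equipment/problem pair, the `min_count` threshold for relevance, and the `categories` lookup (which borrows category names from the four diagnostic queries only as an example).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One logged delay or defect (hypothetical record layout)."""
    equipment: str
    problem: str
    delay_hours: float
    technician: str


def cluster_analysis(events, categories, min_count=3):
    """Summarize recurring problems, grouped by category.

    categories maps a problem description to a category name (assumed).
    min_count is an assumed relevance threshold, not taken from the report.
    """
    # Step 1: sort the data so repeated items sit next to each other.
    ordered = sorted(events, key=lambda e: (e.equipment, e.problem))

    # Step 2: identify clusters (same equipment, same problem).
    clusters = defaultdict(list)
    for event in ordered:
        clusters[(event.equipment, event.problem)].append(event)

    # Step 3: keep only clusters that recur often enough to matter.
    relevant = {k: v for k, v in clusters.items() if len(v) >= min_count}

    # Steps 4-6: group by category, total the consequences,
    # and list the technicians involved.
    report = defaultdict(list)
    for (equipment, problem), members in relevant.items():
        report[categories.get(problem, "uncategorized")].append({
            "equipment": equipment,
            "problem": problem,
            "occurrences": len(members),
            "total_delay_hours": sum(m.delay_hours for m in members),
            "technicians": sorted({m.technician for m in members}),
        })
    return dict(report)


if __name__ == "__main__":
    sample = [
        Event("pump A", "seal leak", 0.5, "T1"),
        Event("pump A", "seal leak", 0.25, "T2"),
        Event("pump A", "seal leak", 0.25, "T1"),
        Event("fan B", "belt slip", 1.0, "T3"),
    ]
    print(cluster_analysis(sample, {"seal leak": "procedures"}))
```

Listing technicians is included only because the method names it as a step "when necessary"; in keeping with the rest of this paper, that information is meant to point towards training or procedure gaps, not towards blame.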

The table that follows provides an indication of how statistics on relevant clusters look over time.

[Table not reproduced: statistics on relevant clusters over time]

This is the sort of analysis that can be carried out at any level of an organization and should lead to triggering some root cause analysis activity and follow-up. The numbers presented reflect the reduction actually experienced at the plant where this method was tested. As observers get more familiar with the steps of cluster analysis, new clusters will emerge as earlier ones are dealt with.

Many of these items in and of themselves may cause small delays in production, but their cumulative effect over time can have quite a substantial effect on the bottom line and/or mission of the organization. Cluster analysis is a relatively easy method of identifying problems needing action.

Concept #4 - Don't be afraid of mistakes; learn from them.

Typically, after an incident involving substantial cost to recover, injury or death, the search is started to find the "guilty" parties so that they can be held accountable. This is the wrong (management) approach in all but those cases where malicious intent is apparent initially or determined to be a cause in the course of investigation of the incident.

Managers sometimes contract for third party investigators to find the root cause of serious events. This may be OK for the overall look at what happened and to prepare a professional report of findings. However, it has been shown to cause those involved to take a defensive approach that impedes the full story being revealed. This in turn inhibits appropriate action being taken to eliminate or mitigate the true root cause, liability claims and related court cases notwithstanding.

Policies and practices that have been proven to avoid repetition of the problems described above are discussed below.

Adopt a No-fault Policy - Adopt a no-fault policy regarding apparent accidents and incidents. The policy should have a corollary provision that emphasizes the need to learn and not suffer unnecessarily from undesirable events. Stopping any attempt to "blame" someone will aid in more quickly getting to the truth of what happened and the ultimate solution. Learn from the mistake, correct the problems, and get on with the business of serving customers and providing all stakeholders with the fruits of their investments in the organization. Those who fail to learn from mistakes and repeat them should be assigned where they can't continue to cause harm to people or equipment or, as a last resort, be let go.

Adopt a Compliance Policy - Implement a compliance policy that applies to the use of all operating and maintenance procedures as written or, if found deficient in some way, as modified by competent personnel following the approved procedures management process. (See Concept #5 - This assumes the organization has adopted a goal of becoming a "Procedure Based Organization" which in turn has a formal feedback and follow-up process in place that assures prompt action on all recommendations for changes.)

Practice Peer Review - When a major equipment failure and/or personnel injury/fatality incident occurs involving one or more personnel, and there is an opportunity for interview of those involved, institute a practice of "peer review." The purpose of this practice is to fully identify what happened and what should be done to prevent or mitigate a repeat of the incident in the future. The chances of those involved producing an accurate picture of what happened and coming to a conclusion as to what to do to prevent such incidents in the future are greatly increased when they are talking to their peers, without managers or other "outsiders" present. The intent of protecting co-workers and other stakeholders from repeating any errors must be the central goal of such a practice. It can be effective, however, only if the practice is backed by the no-fault policy stated above. Ultimately it may be found, as indicated in the concept describing programmatic root cause analysis, that management needs to do something to eliminate or mitigate the problems revealed. This may include providing more training, better documentation or even changing their attitude concerning provision of other resources needed to ensure no repeat of the incident. The no-fault policy applies to management, also, as viewed by those who are subject to its leadership.

Focus on the Incident at Hand - Practice focusing on the incident at hand while it is being investigated. Anyone who has been involved in root cause analysis or reliability centered maintenance analysis knows how easy it is to have the process prolonged or even sidetracked by unrelated issues that arise. There is always a strong desire by those assembled to perform such tasks to discuss all the perceived, current problems of the organization. Those assigned to facilitate such analyses must acquire and liberally apply the skill of diverting such discussions and re-focusing attention on the matter at hand. One very effective way of doing this is to start listing, by title only, "Other Items of Interest" for presentation along with the report concerning the incident at hand. Thus the group discussion can be refocused on the incident, along with assurance that the list of Other Items of Interest is prepared for presentation to management along with (or included in) the root cause report.

Concept #5 - Become a Procedure Based Organization, but don't overdo it.

In a variety of presentations I have written, co-authored or contributed to, the emphasis has been on becoming a "PBO," a Procedure Based Organization. In this text an example is provided where such advice was taken too far. A Procedure Based Organization produces or receives and complies with detailed written instructions for conducting not only maintenance, but also operations and routine checks. This seems so basic that it is overlooked in most organizations and for all the wrong reasons! It's so much easier than it used to be, given availability of low cost word processing and scanning and image insertion equipment. There is hardly any excuse for not doing it, given the benefits derived in terms of increased reliability and consistent delivery by the operators and maintainers of the maximum possible capacity of a production line. The fundamental approach is depicted in the diagram below.

[Diagram not reproduced: Procedure Based Organizations]

Not only does an activity have to declare that it has a Procedure Based Organization, but it has to back it up with a working process for procedure and checklist origination, dissemination, feedback and follow-up. The idea of feedback and follow-up is reinforced in the diagram above by arrows that imply two-way paths for communications. It is not enough just to disseminate an initial set of procedures and checklists. Users must have on-going evidence that their ideas for improvement are being received, considered and acted upon promptly. Changes that are concurred in must be seen to be incorporated in revised procedures and checklists coming out of a process that functions as well as is expected of all the maintenance and operations processes it supports. Otherwise, enforcement of a policy requiring compliance will quickly become impossible, because of a perception that management support for the process and related policies is weak or non-existent.

In July 2004 I conducted a one-day seminar in response to a query concerning what it took to become the "world's best maintenance organization." The activity where the seminar was held had been operational for only 18 months after rejuvenating a portion of a steel plant that had a hundred year history before shutting down and going out of business three years earlier. The new organization was doing quite well, having returned the equivalent of 80% of its new owner's investment in the short time it had been operating under new management and carefully selected staff. However, all there knew that world steel prices, then inflated due to the "China Bubble," could very quickly deflate to where they might not be competitive with foreign suppliers of the products they manufactured. They saw maintenance as an area where their equivalent profit margin (return on investment to their owner) could be improved and their own jobs kept securely in the USA. After attending the seminar, which stressed, among other things, use of detailed procedures and checklists for both operations and maintenance, management decided to apply the principles to startup of one of their most complex manufacturing processes. The operating and maintenance staff prepared a check-off list for start up of all systems needed to roll steel bars into coils of wire ready for shipment. Typically this evolution, which occurred every Monday morning, was fraught with multiple delays while the systems involved were aligned correctly and adjusted to the required level of throughput.

About two weeks after the seminar, I followed up with the company president. He volunteered that they had applied the rolling line startup check-off list for the first time that week. They decided to run the check-off twice before the first bar of steel was introduced to the line. They found in the first check that they had missed two items. After correcting these items before the second run-through of the checklist, the startup went without any delay or incident, a first for that plant under the new staff. If ever there was a "Hallelujah Moment" for one preaching the benefits of detailed procedures and checklists, that was it for me.

In the summer of 2005 I conducted a procedure and checklist workshop for Gallatin Steel Company in Kentucky, which is owned jointly by Brazilian and Canadian firms. Following the lead of one of its owner companies (Dofasco of Hamilton, ON, Canada, which that year had been declared by The Wall Street Journal the most profitable steel company in the world) the management decided to embrace a key element of the parent company's success -- use of detailed procedures and checklists for maintenance. In the course of the workshop conducted for key technicians and supervisors (with managers present only for the beginning and ending sessions) a detailed process was developed for origination and on-going support of procedures and checklists. A format and detailed outline were decided upon for the actual documents and the decision was made to produce all of them in-house, using overtime to pay those craftspersons who volunteered to write the procedures.

Two years later, following up with the project manager, I found that the organization had produced over 500 detailed preventive and repair maintenance procedures and checklists. Asked for his opinion on the major benefit of all this effort, he said it was the significant increase in confidence that the work force had gained in performing maintenance. Delays and frustration from not having the correct tools or replacement parts were radically reduced.

The company has been rated by the Kentucky Chamber of Commerce and the State Council of the Kentucky Society for Human Resource Management as one of the best to work for in the state. Forbes Magazine ranked Gallatin 16th overall as best large company to work for in the USA in 2006. Its parent, Dofasco, has consistently received similar recognition in the Province of Ontario and in Canada overall as one of the best places to work.

On the downside of this concept, it is possible to demand too much of the craftspersons who are required by a compliance policy to use procedures and checklists. Recently I was requested to participate in a conference call with the representative of a major corporation which on any given day operates about 1100 facilities world-wide. Also on the conference call were representatives of one of their contract maintenance suppliers. The craftspersons of the contractor were resisting the imposition of mandatory check-offs (by initialing) for each step of every maintenance procedure they were required to conduct. In addition, a rigorous audit procedure with punitive provisions for non-compliance by maintenance personnel had been prepared for implementation as part of the customer's compliance policy. The craftspersons who were pushing back had, in my opinion, a good case for doing so. The client had gone way beyond the best practice in use of procedures and imposition of a companion compliance policy.

In organizations engaged in this best practice several types of procedures (and checklists) are commonly used. These types are summarized in the table below. The basic ones are given titles like Standard Operating Procedures (SOPs), Special Operating Procedures (SpOPs), Critical Operating Procedures (COPs), Standard Maintenance Procedures (SMPs), Special Maintenance Procedures (SpMPs), Critical Maintenance Procedures (CMPs), Preventive Maintenance (PM) or Predictive Maintenance (PdM) procedures. Standard, PM and PdM procedures define common, often repeated, operations, maintenance or condition monitoring tasks.

All but critical procedures may be written in "two-tier" format. The first tier is an abbreviated version of the second tier that provides a more in-depth explanation and additional steps for use in training of new personnel or occasional review by experienced personnel who may not have performed the standard task for some time.

Operating procedures in organizations following current best practice often contain many routine preventive maintenance tasks which are assigned to operators for completion. Note that individual sign-off on each step is required only for safety or critical task procedures (and checklists).

Typical Procedure and Checklist Categories

[Table not reproduced]

During the conference call, I emphasized the need for "trust" in the client-contractor partnership that extended to the conduct of operations and, in this case, maintenance. I recommended the audit requirement be abandoned completely and that the procedures be categorized per the definitions in the table above with individual steps required to be checked off only for safety and critical maintenance tasks.
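The sign-off rule recommended here is simple enough to write down as a small check. The sketch below is illustrative only: the category abbreviations come from the text, but the `Procedure` record, the `safety_related` flag and the function names are hypothetical, and the full category table is not available to confirm the details.

```python
from dataclasses import dataclass

# Critical categories named in the text: Critical Operating Procedures
# and Critical Maintenance Procedures.
CRITICAL_CATEGORIES = {"COP", "CMP"}


@dataclass(frozen=True)
class Procedure:
    """A procedure or checklist (hypothetical record layout)."""
    title: str
    category: str                 # e.g. "SOP", "SpOP", "COP", "SMP", "SpMP", "CMP", "PM", "PdM"
    safety_related: bool = False  # assumed flag; the text does not say how safety tasks are marked


def requires_step_signoff(procedure: Procedure) -> bool:
    """Individual step sign-off applies only to safety or critical tasks."""
    return procedure.safety_related or procedure.category in CRITICAL_CATEGORIES


def may_use_two_tier_format(procedure: Procedure) -> bool:
    """All but critical procedures may be written in two-tier format."""
    return procedure.category not in CRITICAL_CATEGORIES
```

Under this rule a routine lubrication PM would carry no per-step initials, while a critical maintenance procedure, or any task flagged as safety related, would.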

Concept #6 - Eliminate as much maintenance as possible and increase emphasis on reliability.

In the past the traditional view was that the two goals stated in the concept statement above are contradictory and impossible to achieve. However, this is not the case. More maintenance does not produce more reliability per se. In fact it can be a root cause of reduced reliability. If the organization has created the optimum maintenance program and knows exactly what maintenance to perform (that which is cost-effective and applicable, i.e., it works, a result of proper application of Reliability Centered Maintenance (RCM) methodology) and exactly how maintenance should be done (a result of proper application of Total Productive Maintenance (TPM) principles), then the stage is set for concentration on the goals stated in the concept above.

A major pillar of TPM and one that is often neglected may be stated as "Manage equipment in order to prevent maintenance." Much can be done at the design stage to eliminate, reduce or at least minimize the hours spent maintaining equipment through application of maintainability principles and choosing components with generous service factors. However, most of the M & R world is faced with the equipment already in place and in production, acquired on a lowest purchase and installation cost basis. Thus, the challenge is to improve the reliability and maintainability of the equipment we have, not the equipment we'd like to have. Some texts on TPM refer to this as "Corrective Maintenance," a term that means, in the context of TPM, modifying the equipment in service to improve its design and by extension its capacity to reliably produce a product or service at lowest possible overall conversion cost.

The cost reduction from increased reliability and decreased maintenance can be significant. It affects the overall conversion cost of a product or service. The reduction in cost directly affects the profit margin, and/or makes it possible for a company to offer cost savings to customers, thus improving competitive position in the marketplace.

Unfortunately, the management error that is often committed is to mandate maintenance cost reduction without compensating by providing a comparable improvement in reliability or maintainability, both of which require labor hours on a never-ending, continuous basis. This is exactly the opposite of what should be done. More often than not the decision maker gets away with it in the short term. This is because the "easiest" target for cost reduction is most often maintenance personnel (layoffs of "excess" personnel). Typically such action causes a pullback from proactive maintenance and a fallback to reactive maintenance (high priority, if not emergency repairs). This results in a more costly approach as time goes on, especially when lost opportunity costs are considered. The full impact may not be felt for many months, and in some cases up to 2 years. When the percentage of inoperative equipment reaches an intolerable point, and maintenance personnel are again augmented so labor hours can again be devoted to proactive measures, it takes about two more years to fully recover to the high point of performance where the layoffs began. A study performed at Massachusetts Institute of Technology (MIT) shows that these cyclic events do, in fact, occur.

What should be done when excess labor hours become available after proactive maintenance practices take effect? The nature of the jobs experienced maintenance personnel are performing should be changed! Emphasis should be placed on the following:

  • Maintenance prevention and elimination
  • Reliability improvement and sustainment
  • Capacity enhancement

This is done by acquiring and putting in place and using:

  • Rules - related to best practices in maintenance and reliability
  • Tools - acquisition, application and continual updating for maximum productivity
  • Schools - to teach the new skills needed by modern maintenance organizations

For more on rules, tools and schools, see the discussion under the next concept for avoiding human error in maintenance and reliability.

A true story involving a 30-year-old aluminum production plant reflects this cyclic effect. In the mid-1990s the plant changed hands from foreign to U.S. owners. The new owners hired new plant, maintenance and reliability managers to see what they could do to improve profitability of the aged but still reasonably profitable plant. New management's initial assessment showed that the throughput under the maintenance strategy they inherited was only about 50% of the designed-in capacity. Given that the company could sell everything it could produce, the team set out to increase throughput by changing the strategy to a more proactive one. A vigorous predictive maintenance program was instituted. Root cause analysis and reliability improvements were made to existing equipment. Within about two years the throughput had been increased to about 75% of projected maximum capacity. The owners were making so much money they purchased three more aluminum plants in a distant state. The plant manager was promoted to vice president and his replacement recruited from another aluminum producer, with the promise of autonomy in his running the operation.

Within the first week, the newly hired plant manager made his views concerning further improvement well known. He told the predictive maintenance team that he didn't understand what they were doing and recommended they bid for jobs where they had "real" tools in their hands to "fix" things. Reliability improvement initiatives were put on "hold."

The reliability manager quickly found a new position in the expanding corporate office. His assistant, hired to oversee RCM projects, found employment at a nuclear power plant. The predictive maintenance team leader (in a salaried position) was stuck for a time while he finished a master's degree program at a local university, but he, too, left for a job managing contract maintenance at a new steel plant.

For months, throughput remained where it was when the new plant manager took over. But in about a year there was a distinct downturn in production. The vice president paid a visit to review performance and found only the original maintenance manager from his "dream team" still in place -- but scheduled within two weeks to move to another company where he had accepted an offer of a maintenance manager position.

The vice president found out from the soon-to-depart maintenance manager what had happened, confronted the plant manager and fired him, assuming the plant manager's duties in addition to his own. The maintenance manager was promoted to be plant manager at one of the newly acquired facilities.

The vice president has been trying to reverse the downward trend in production ever since by building a new team and restoring confidence in the union staff members who remain in or returned to their hard-won, higher paying predictive maintenance positions.

Concept # 7 - Don't forget the roots of your M & R program initiatives for improvement.

It is not uncommon, with so many new initiatives being offered in the field of M & R, to see earlier, even highly successful principles and methodologies abandoned or forgotten with the promotion, retirement and transfer of those who implemented them.

One of the earliest adopters of RCM methodology developed for commercial aircraft was the U.S. Navy. In the 1970's and 1980's vigorous efforts were undertaken to change maintenance from a more costly, shipyard-based strategy to one anchored in RCM and operating base support.

By the 1990's, most of those engaged in implementing the "new" RCM-based approach had retired or moved on to other jobs, due in part to the post-Cold War draw-down in naval forces and related support facility manning.

By the late 1990's the Navy found that its maintenance programs were in need of overhaul and revitalization in order to ensure reliability in the face of an apparent return of intrusive maintenance requirements that had been superimposed on the RCM-based strategies (which differed from class to class of ship and submarine). In addition, while the specifications for building ships still contained the Department of Defense mandated requirement to provide an RCM-based maintenance program, new methods of contracting for ships often resulted in these efforts being under-funded and inadequately implemented. The ship builders often simply implemented original equipment manufacturer (OEM) recommendations, which the studies done decades earlier had determined to be heavily tilted towards regular "overhaul," requiring heavy life-cycle replacement parts costs. The OEMs did what the ship builders asked and were benefiting handsomely as a result.

Luckily for the Navy and U.S. taxpayers, some "old-timers" still remained in civil service who had by this time achieved positions with sufficient clout to rectify this problem. They devised a revitalization initiative to avoid inapplicable and ineffective maintenance and reduce maintenance costs without sacrificing reliability. The initiative was based on three parallel efforts:

  • Rules - Improving maintenance requirements and plans (including reliability improvements)
  • Tools - Using computer and diagnostic technology (i.e., Condition-based maintenance)
  • Schools - Educating all levels of maintenance decision makers in reliability and condition-based maintenance principles

Commercial organizations suffer from the same problems that those in the Navy did in the 1990's. Consumers, and those who promote keeping as many core industries as possible in our country, pay the price for this error by humans engaged in maintenance and reliability. The error is that they forgot (or never learned about) the past.


The paper has expounded on seven (7) concepts that are essential to minimizing error by those engaged in Maintenance and Reliability. These are summarized below:

  • Exercise and practice leadership as well as management.
  • Look first at "programmatic" rather than "technical" solutions to reliability problems.
  • Look for indicators of small, seemingly insignificant but repetitious reliability problems and act on the findings.
  • Don't be afraid of mistakes; learn from them.
  • Become a Procedure Based Organization, but don't overdo it.
  • Eliminate as much maintenance as possible and increase emphasis on reliability.
  • Don't forget the roots of your M & R program initiatives for improvement.

Having heard about those listed above, I'm certain those who read this text or hear it presented can come up with many more ideas on reducing the occurrence and impact of human error in maintenance and reliability. However, concentrating on these will make a big difference in achieving the goals and objectives of your organizations.

The mantra for modern maintenance and reliability programs everywhere could well be "Rules, Tools and Schools!"

Article submitted by Jack R. Nicholas, Jr., P.E., CMRP

Jack Nicholas Jr.

Jack R. Nicholas, Jr., P.E., CMRP, CRL, CAPT USNR (Ret.), has been an internationally experienced and recognized author, workshop leader, advisor and consultant on reliability and maintenance, asset management and related subjects since retiring in 1988 after 35 years of U.S. government service. He holds a certificate in Asset Management from the Institute of Asset Management in the United Kingdom.
