“Your system is perfectly designed to give you
the results that you get.”
W. Edwards Deming, PhD
How good is your organization at identifying failures? Of course you see failures when they occur, but can you identify when recurring failures are creating serious equipment reliability issues? Most companies begin applying RCA or RCFA to “high value failures”. While this is not wrong, I prefer to either not see the failure in the first place, or at the least, to reduce the failures to a controllable level.
Failure Reporting, Analysis, and Corrective Action System (FRACAS) is an excellent process that can be used to control or eliminate failures. In this process you identify the reports from your CMMS/EAM or specialized reliability software that can help you eliminate, mitigate, or control failures. These reports could include cost variance, Mean Time Between Failures, Mean Time Between Repairs, dominant failure patterns in your operation, and common threads between failures such as “lack of lubrication” (perhaps because the lubricator is not following known industry standards). A recent poll covered 80 large companies. Shockingly, none of them were capturing the data required to understand and control equipment failures.
Answer the following questions honestly before you go any further to see if you have any problems with identifying failures and effectively eliminating or mitigating their effects on total process and asset reliability.
1. Can you identify the top 10 assets which had the most losses due
to a partial or total functional failure by running a report on your
maintenance software?
2. Can you identify the total losses in your organization and separate
them into process and asset losses for the past 365 days?
3. Can you identify components with a common thread due to a
specific failure pattern, such as the one shown in Figure 1?
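The first two questions come down to a simple aggregation over loss records. The sketch below is one minimal way to express them in Python, assuming each loss record exported from your maintenance software carries an asset ID, a date, a loss category ("asset" or "process"), and a cost. The record layout and field names are hypothetical and do not reflect any particular CMMS/EAM schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class LossRecord:
    """One loss event; field names are hypothetical, not a CMMS/EAM schema."""
    asset_id: str
    occurred_on: date
    category: str   # "asset" or "process"
    cost: float     # loss in your reporting currency


def top_assets_by_loss(records, as_of, n=10, window_days=365):
    """Return the n assets with the largest asset-related losses in the window."""
    cutoff = as_of - timedelta(days=window_days)
    totals = defaultdict(float)
    for r in records:
        if r.category == "asset" and cutoff < r.occurred_on <= as_of:
            totals[r.asset_id] += r.cost
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


def losses_by_category(records, as_of, window_days=365):
    """Split total losses in the window into process and asset losses."""
    cutoff = as_of - timedelta(days=window_days)
    totals = defaultdict(float)
    for r in records:
        if cutoff < r.occurred_on <= as_of:
            totals[r.category] += r.cost
    return dict(totals)
```

If your system cannot produce the inputs for even this small calculation, the answer to questions 1 and 2 is probably "no".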
Many times, the cost of unreliability remains unknown because the causes of unreliability are so many. Whether you want to point the finger at maintenance, production (operations) or engineering, each functional area plays a role in unreliability. Here are a few examples of those losses:
1. Equipment Breakdown (total functional failure)
A. Causes of Equipment Breakdown
1. No Repeatable Effective Repair, Preventive Maintenance,
Lubrication, or Predictive Maintenance Procedure
2. No one following effective procedures
2. Equipment not running to rate (partial functional failure)
A. Causes of Equipment not Running to Rate
1. Operator not having an effective procedure to follow
2. Operator not trained to operate or troubleshoot equipment
3. Management thinking this is the best rate at which the
equipment can operate because of age or condition
3. Off-Quality Product that is identified as “first pass quality”
(could be a partial or total functional failure)
A. Causes of Quality Issues
1. Acceptance by management that “first pass quality” is not
a loss because the product can be recycled
4. Premature Equipment Breakdown
A. Ineffective or no commissioning procedures. We are talking
about parts or equipment replaced by maintenance, or installed
by engineering or a contractor, that fail prematurely because
no one checked whether a defect was present after the equipment
was installed, repaired, serviced, etc. (See Figures 2 and 3.)
(If you have ever seen equipment break down or fail to run to rate immediately after a shutdown, you know what we are talking about.)
The Proactive Workflow Model
Eliminating unreliability is a continuous improvement process much like the Proactive Workflow Model in Figure 4. The Proactive Workflow Model illustrates the steps required to move from a reactive to a proactive maintenance program.
What the Proactive Workflow Model Really Means to Your Organization
Implementing the Proactive Workflow Model is the key to eliminating failures. The built-in continuous improvement processes of Job Plan Improvement and the Failure Reporting, Analysis, and Corrective Action System (FRACAS) help ensure that maintainability and reliability are always improving. All of the steps and processes have to be implemented in a well-managed and controlled fashion to get full value out of the model.
The foundational elements of Asset Health Assurance are key because they ensure that all of the organization’s assets are covered by a complete and correct Equipment Maintenance Plan (EMP). These are requirements (not options) for a sustainable proactive workflow model.
You cannot have continuous improvement until you have a repeatable, disciplined process.
The objective of the Proactive Workflow Model is to provide discipline and repeatability to your maintenance process. The inclusion of FRACAS provides continuous improvement for your maintenance strategies. There are fundamental items you must have in place to ensure that you receive the results you expect.
Think of FRACAS this way. As you have failures, you use your CMMS/EAM failure codes to record the part-defect-cause of each failure. Analyzing part-defect-cause on critical assets helps you begin to make serious improvements in your operation’s reliability. Looking at the FRACAS Model in Figure 5, we begin with Work Order History Analysis, and from this analysis we decide whether we need to apply Root Cause Analysis (RCA), Reliability Centered Maintenance (RCM), or Failure Modes and Effects Analysis (FMEA) to eliminate or reduce the failures we discover. From the RCA, we determine the maintenance strategy adjustments needed to predict or prevent failures. Even the most thorough analysis doesn’t uncover every failure mode. Performance monitoring after the strategy adjustments may reveal new failure modes that your strategy does not cover. You can then create a new failure code for each new failure mode so that additional failures can be tracked and managed when you review work order history. As you can see, this is a continuous improvement loop that never ends.
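The loop depends on two habits: every failure is recorded against a part-defect-cause code, and a new code is created when a failure mode turns up that the existing codes do not cover. A minimal sketch of that bookkeeping follows. The class and method names are hypothetical stand-ins for whatever failure code table your CMMS/EAM provides, not a real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureCode:
    """A part-defect-cause triple as picked from CMMS/EAM drop-down lists."""
    part: str
    defect: str
    cause: str


class FailureCodeRegistry:
    """Hypothetical in-memory stand-in for the failure code table."""

    def __init__(self, codes=()):
        self._codes = set(codes)

    def record(self, code, history):
        """Append a failure to the history; register the code if it is new.

        Returns True when the code was new, which signals that the
        maintenance strategy should be reviewed for this failure mode.
        """
        is_new = code not in self._codes
        self._codes.add(code)
        history.append(code)
        return is_new
```

A return value of True is the cue to go back to the analysis step and decide whether the maintenance strategy needs another adjustment.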
Steps to Implementing an Effective FRACAS
Let’s back up a little. The foundational elements of an effective FRACAS are an effective validated equipment hierarchy, criticality analysis, failure modes analysis, and equipment maintenance plans.
FRACAS Checklist:
Equipment Hierarchy should be built and validated so that similar failures on like equipment can be identified across an organization.
Criticality Analysis is developed and validated so that equipment criticality is ranked based on Production Throughput, Asset Utilization, Cost, Environment, and Safety.
Failure Modes Analysis is completed on all critical equipment using FMA, FMEA, or RCM.
Equipment Maintenance Plans are developed on all critical equipment to prevent or predict a failure.
Effective Equipment Hierarchy – An Asset Catalog or Equipment Hierarchy must be developed to provide the data required to manage a proactive maintenance program that includes failure reporting, or FRACAS. To eliminate failures, this first step must succeed. Once the hierarchy is established, you can find similar failures in one area of an operation or across the total operation. The equipment hierarchy must be validated against the organization’s established equipment hierarchy standard.
Figure 6 (on the following page) displays the findings from a plant with 32 total “Part – Bearing” failures from different size electric motors. One type, “Defect – Wear”, occurred in 85% of the failures. In 98% of the cases, “Cause” was found to be “Inadequate Lubrication”. (“Part”, “Defect”, and “Cause” are each selected from a CMMS/EAM codes drop-down screen.) With a common thread like this identified, it is time to perform a Root Cause Failure Analysis on it.
We are looking for “Part” – “Defect” – “Cause”. Maintenance personnel may not have the training or ability to determine the “Defect”, although a predictive maintenance technician could identify it. The “Cause” can typically be identified by a maintenance technician, maintenance engineer, reliability engineer, or predictive maintenance technician.
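Finding a common thread like the bearing example is a counting exercise once part-defect-cause is recorded consistently. The sketch below assumes failures are available as (part, defect, cause) tuples pulled from closed work orders; that layout is an assumption for illustration. It also assumes the cause share is measured within the dominant defect, which is one reasonable reading of the percentages above but not the only one.

```python
from collections import Counter


def common_threads(failures, part):
    """Summarize defect and cause shares for one part across all assets.

    `failures` is an iterable of (part, defect, cause) tuples taken from
    closed work orders; the tuple layout is an assumption for this sketch.
    """
    matching = [f for f in failures if f[0] == part]
    if not matching:
        return None
    defects = Counter(defect for _, defect, _ in matching)
    top_defect, defect_count = defects.most_common(1)[0]
    # Cause share is computed within the dominant defect only (assumption).
    causes = Counter(cause for _, defect, cause in matching if defect == top_defect)
    top_cause, cause_count = causes.most_common(1)[0]
    return {
        "part": part,
        "failures": len(matching),
        "top_defect": top_defect,
        "defect_share": defect_count / len(matching),
        "top_cause": top_cause,
        "cause_share": cause_count / defect_count,
    }
```

Because the function filters on the part alone, it picks up like failures across the whole hierarchy rather than on a single asset, which is the reason the hierarchy has to be validated first.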
After a thorough analysis you will find that most failures come from a small amount of equipment. The question is, “Which equipment?”.
Asset Criticality Analysis – Everyone says they have identified their critical equipment. But in many cases, equipment criticality changes based on how upset people are about an equipment problem, or because people are confused about what consequences are associated with a failure and the probability that it will occur if equipment reliability is managed effectively. The purpose of the Asset Criticality Analysis is to identify which equipment has the most serious potential consequences on business performance if it fails. Consequences on the business can include:
• Production Throughput or Equipment/Facility Utilization
• Cost due to lost or reduced output
• Environmental Issues
• Safety Issues
• Other
The resulting Equipment Criticality Number is used to prioritize resources performing maintenance work. The Intercept Ranking Model illustrates this process (Figure 7). The “Y” axis lists asset criticality from none to high. I like using a scale of 0-1000 because not all assets are equal. Using the Intercept line, which is struck down the middle, a planner or scheduler can define which job should be planned or scheduled first, or at least get close to the best answer, because management has already been involved in determining the most critical asset and the equipment has told you (on the “X” axis) which one has the highest defect severity (is in the worst condition).
The only other two factors I would add in determining which job to plan or schedule are work order type (PM, CM, CBM, Rebuild, etc.) and time on backlog. Figure 8 shows the 4-Way Prioritization Model for planning and scheduling.
Identify what equipment is most likely to negatively impact business performance because it both matters a lot when it fails and it fails too often. The resulting Relative Risk Number is used to identify assets that are candidates for reliability improvement.
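The article does not give a formula for the Relative Risk Number, so the sketch below assumes the simplest common form: consequence multiplied by frequency, using the criticality score (0-1000) and observed failures per year. Both the formula and the sample inputs in the note that follows are assumptions for illustration only.

```python
def relative_risk(criticality, failures_per_year):
    """Consequence times frequency; an assumed formula, not the article's."""
    return criticality * failures_per_year


def improvement_candidates(assets, top_n=5):
    """Rank assets by relative risk, highest first.

    `assets` maps an asset ID to (criticality on a 0-1000 scale,
    observed failures per year); both inputs are hypothetical.
    """
    ranked = sorted(
        ((asset_id, relative_risk(c, f)) for asset_id, (c, f) in assets.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:top_n]
```

Under this assumption, a moderately critical conveyor scored 400 that fails six times a year (risk 2400) outranks a highly critical pump scored 900 that fails once (risk 900). That is the point of the paragraph above: the candidates for reliability improvement are the assets that both matter and fail too often.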
A consistent definition for equipment criticality needs to be adopted and validated in order to ensure the right work is completed at the right time. This is the key to the elimination of failures.
Identification of Failure Modes – The goal of most maintenance strategies is to prevent or predict equipment failures. Equipment failures are typically caused by the catastrophic failure of an individual part. These parts develop defects, and when left alone, those defects lead to the ultimate catastrophic failure of the part. The defects are, in turn, caused by “something”. Eliminating that “something” (the cause) will eliminate the failure.
The primary goal of an effective Preventive Maintenance (PM) program is to eliminate the cause and prevent the failure from occurring. The primary goal of a Predictive Maintenance (PdM) or Condition Based Monitoring (CBM) program is to detect the defects and manage the potential failures before they become catastrophic failures. In addition, many program tasks are designed to maintain regulatory compliance. Many companies have PM programs. However, many of the tasks in them do not address specific failure modes.
For example: An electric motor with roller bearings has specific failure modes which can be prevented with lubrication. The failure mode is “wear” caused by “Inadequate Lubrication”. The next question may be why you had Inadequate Lubrication. The Inadequate Lubrication could be identified as a result of no lubrication standard being established for bearings. In other words someone gives the bearing “x” shots of grease even though no one knows the exact amount to prevent the bearing from failure.
The best way to identify failure modes is to use a facilitated process. Put together a small team consisting of people knowledgeable about the equipment, train them thoroughly on the concept of part-defect-cause, and go through the basic equipment types in your facility such as centrifugal pumps, piston pumps, gearboxes, motors, etc. You will find that a relatively small number of failure codes will cover a lot of failure modes in your facility. The failure modes developed during this exercise can later become the basis for the failure modes, effects, and criticality analysis that takes place during Reliability-Centered Maintenance (RCM) projects. In our book, we focus on failure mode identification as an output of FRACAS, which, again, is a strong continuous improvement process.
If, over a period of one year, the dominant failure mode is “wear” for bearings caused by Inadequate Lubrication, then one can change or develop a standard, provide training, and thus eliminate a large number of failures.
The problem is that most companies do not have the data to identify a major problem on multiple assets (no data in equals no effective failure reports out). For example, it isn’t the motor that fails; the motor fails because of a specific part’s failure mode, which then results in catastrophic damage to the motor unless the defect is identified early enough in the failure mode.
Maintenance Strategy – The maintenance strategy should result from a Failure Modes and Effects Analysis, Reliability Centered Maintenance, or failure data collected from your CMMS/EAM.
Elimination Strategy: The best way to eradicate this deadly waste is get a better understanding of the true nature of the equipment’s failure patterns and adjust the Maintenance Strategy to match.
- Andy Page CMRP
So what is a maintenance strategy? Let’s break down the two words: Maintenance is to keep in an existing condition, or to keep, preserve, protect, while Strategy is development of a prescriptive plan toward a specific goal.
So, a Maintenance Strategy is a prescriptive plan to keep, preserve, or protect an asset or assets. Keep in mind that one specific type of maintenance strategy is “run to failure” (RTF). However, RTF is used only if, based on thorough analysis, it is identified as the best solution for specific equipment to optimize reliability at optimal cost. Less invasive maintenance is preferred to more invasive maintenance. This is one of the fundamental concepts of any well-defined maintenance strategy. Specific maintenance strategies are designed to mitigate the consequences of each failure mode. As a result, maintenance is viewed as a reliability function instead of a repair function. This makes Predictive Maintenance or Condition Monitoring the best solution, because it is mainly noninvasive.
Knowing that both systemic problems and operating envelope problems produce the same type of defects, a maintenance strategy that merely attempts to discover the defects and correct them will never be able to reach a proactive state. Technicians will be too busy fixing the symptoms of problems instead of addressing the root cause. To reach a truly proactive state, the root cause of the defects will need to be identified and eliminated. Maintenance strategies that accomplish this are able to achieve a step change in performance and achieve incredible cost savings. Maintenance strategies that do not attempt to address the root cause of defects will continue to see lackluster results and struggle with financial performance.
A Maintenance Strategy involves all elements that aim the prescriptive plan toward a common goal. Key parts of a maintenance strategy include Preventive and Predictive Maintenance based on a solid Failure Mode Elimination Strategy, Maintenance Planning consisting of repeatable procedures, work scheduled based on equipment criticality, work executed using precision techniques, proper commissioning of equipment when a new part or equipment is installed, and quality control using Predictive Maintenance Technologies to ensure no defects are present after this event occurs. The very last part of your maintenance strategy is FRACAS, because it drives the continuous improvement portion of this strategy.
Failure Reporting
Failure reporting can come in many forms. The key is to have a disciplined plan to review failure reports over a specific time period, and then to develop actions to eliminate failure. Following are a few Failure Report examples, which should be included as part of your FRACAS Continuous Improvement and Defect Elimination Process.
1. Asset Health or Percent of Assets with No Identifiable Defect – reported by maintenance management to plant and production management at least monthly (see Figure 9). An asset that has an identifiable defect is said to be in condition RED. An asset that does not have an identifiable defect is said to be in condition GREEN. That is it. It is that simple. There are no “but ifs”, “what ifs”, or “if thens”. If there is an identifiable defect, the asset is in condition RED. If there is no identifiable defect, it is GREEN. The percentage of machines that are in condition GREEN is the Asset Health (as a percentage) for that plant or area.
The definition for defect is: an abnormality in a part which leads to equipment or asset failure if not corrected in time.
Example: the plant has 1,000 pieces of equipment. Of that number, 750 of them have no identifiable defects. The plant is said to have 75% Asset Health. There is an interesting aspect about Asset Health. Once this change is underway, Asset Health, as a metric, becomes what most maintenance managers and plant managers have wanted for a long time — a leading indicator of maintenance costs and business risk.
2. Mean Time Between Failures and Mean Time Between Repairs – reported by maintenance or reliability engineers on a monthly basis on the top 5-20% of critical equipment. The report to management should include recommendations to improve both metrics and should be measured and posted on a line graph for all to see.
3. Cost Variance by area of the plant – reported by each maintenance and production supervisor for their area of responsibility. Cost variance must be reported to maintenance and production management on a monthly basis. The report should not be accepted without a known cause of the variance and a plan to bring it into compliance.
4. Most Frequent Part-Defect-Cause Report – reported monthly by maintenance or reliability engineers. If you do not have maintenance or reliability engineers, you may need to appoint a couple of your best maintenance technicians as “Reliability Engineering” Technicians, even if unofficially, and train them to be key players in this failure elimination process. This one report can identify common failure threads within your operation which, when resolved, can have a quick impact on failure elimination.
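The arithmetic behind the first three reports is simple enough to sketch in a few lines. The Python below is a minimal illustration, not a reporting tool. It assumes Asset Health is computed from a defect flag per asset, that Mean Time Between Failures is operating hours divided by the number of failures in the period, and that cost variance is actual minus budget. The function names and dictionary keys are hypothetical.

```python
def asset_health(defect_flags):
    """Percent of assets in condition GREEN (no identifiable defect).

    `defect_flags` maps asset ID -> True when an identifiable defect
    exists (RED) and False when none exists (GREEN).
    """
    if not defect_flags:
        raise ValueError("no assets to evaluate")
    green = sum(1 for has_defect in defect_flags.values() if not has_defect)
    return 100.0 * green / len(defect_flags)


def mtbf_hours(operating_hours, failure_count):
    """Mean Time Between Failures = operating hours / number of failures."""
    if failure_count == 0:
        return None  # undefined when no failures occurred in the period
    return operating_hours / failure_count


def cost_variance_report(areas):
    """Return (area, actual - budget) rows and the areas lacking a cause or plan.

    `areas` maps area name -> dict with "budget" and "actual" amounts and
    optional "cause" and "plan" text; these keys are assumptions.
    """
    rows, unexplained = [], []
    for name, a in areas.items():
        variance = a["actual"] - a["budget"]
        rows.append((name, variance))
        if variance != 0 and not (a.get("cause") and a.get("plan")):
            unexplained.append(name)
    return rows, unexplained


# The Asset Health example from the text: 1,000 assets, 750 with no defect.
plant = {f"asset-{i}": i >= 750 for i in range(1000)}
print(asset_health(plant))  # 75.0
```

The second list returned by `cost_variance_report` names the areas whose report should be sent back, because a variance without both a known cause and a plan is not acceptable.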
There are many more reports that can be used effectively, but they will not fit in the space of this article. You will be able to find more reports in the book on “FRACAS” written by Ricky and Bill, which will be published by mid-July.
Bill Keeter is currently a Senior Technical Advisor for Allied Reliability. Bill joined Allied in 2006 after serving as President of BK Reliability Engineers, Inc. where he provided training and facilitation services to help facilities improve asset performance using Weibull Analysis, Reliability Centered Maintenance, Availability Simulation, and Life Cycle Cost Analysis. Bill has over 30 years of experience in Maintenance Engineering and Management. He has successfully implemented maintenance improvement programs in a variety of manufacturing and production facilities. Bill’s experience includes maintenance leadership positions in the US Military, the nuclear industry, chemicals, paper converting, and plastic film manufacturing. He has provided training and reliability consulting services to petroleum, process, mining, and defense industries in the United States, Mid-East, and Europe. Bill has developed competency maps for Reliability, Availability, and Maintainability Engineering for the Petroleum Industry’s PetroSkills® program.
Bill has published articles in a variety of internationally recognized maintenance publications, and has presented papers on the practical application of Weibull Analysis at several internationally attended Maintenance and Reliability Conferences. Bill is a Certified Maintenance and Reliability Professional with the Society for Maintenance and Reliability Professionals Certifying Organization. You can contact Bill at bkeeter@gpallied.com
Ricky Smith is currently a Senior Technical Advisor with Allied Reliability. Ricky has over 30 years of experience in maintenance as a maintenance manager, maintenance supervisor, maintenance engineer, maintenance training specialist, and maintenance consultant, and is a well-known published author. Ricky has worked with maintenance organizations in hundreds of facilities, industrial plants, etc., worldwide in developing reliability, maintenance, and technical training strategies. Prior to joining Allied Reliability in 2008, Ricky worked as a professional maintenance employee for Exxon Company USA, Alumax (this plant was rated the best in the world for over 18 years), Kendall Company, and Hercules Chemical, providing the foundation for his reliability and maintenance experience.
Ricky is the co-author of “Rules of Thumb for Maintenance and Reliability Engineers”, “Lean Maintenance” and “Industrial Repair, Best Maintenance Repair Practices”. Ricky has also written for several magazines during the past 20 years on technical, reliability and maintenance subjects. Ricky holds certification as a Certified Maintenance and Reliability Professional from the Society for Maintenance and Reliability Professionals as well as a Certified Plant Maintenance Manager from the Association of Facilities Engineering. Ricky lives in Charleston, SC with his wife. Aside from spending time with his 3 children and 3 grandchildren, Ricky enjoys kayaking, fishing, hiking and archaeology.
If you would like to be notified before the release of the new book, or would like to contact Ricky with questions, send him an email at rsmith@gpallied.com.