Taking Notice of the Telltale Signs

13 December 2018

Hindsight is the ability to understand an event or situation only after it has happened. How many times have you witnessed an asset failure and realized during the root cause analysis (RCA) investigation that there were indicators that a catastrophic failure was about to occur? Have we, as maintenance reliability practitioners, become desensitized to the telltale signs of a failing asset? Is this being accepted as the new normal?

The RCA for an asset failure reveals obvious and not so obvious indicators that the event was likely to occur, with most of these indicators appearing well in advance of the event. These indicators give off signals ranging from weak to strong. We accept these anomalies because they don’t cause problems immediately or because we haven’t noticed them causing problems in the past. We normalize them and accept them. We have become complacent. But complacency has no place when talking about improving asset reliability. So, how can you remove the complacency around these leading indicators?

To reverse this, organizations need to create a sense of unease about the health of their assets at all times, not just when a major failure occurs.

Chronic unease is having discomfort and concern about the management of risks. It is a healthy dose of skepticism about decisions and inherent risks that remain. Simplified, it is the gut feeling that occurs when you are not quite confident in your decisions and your assessment of what is going on. It is the opposite of complacency.

Nothing ever happens out of the blue. There are always warning signs that people choose to ignore, either consciously or unconsciously. Once you start listening to what is happening around you, you will be surprised at just how much those little signs will tell you.

If the production team announces in a morning meeting that it broke the previous production record in the last 24 hours, what would you think? Initially, you might think, “Wow, that’s great!” If you were to add a healthy dose of skepticism, you would think about how they managed to do this. What would come to mind?

Did they operate the equipment within the guidelines of the operating strategy?

Did they place excessive stress on the system?

If they did place stress on the system, how has this affected the health of the equipment and its longevity?

Does this mean early equipment failures can be expected because of how they operated the equipment?

How comfortable do you feel now? Perhaps you are feeling a little uneasy about the health of your assets.

If you take a step back, prevention is the best cure. This can happen as early as the concept and design stage of a project, upon review and identification of a poorly performing asset, or when an asset has failed and potentially caused secondary damage.

Reliability advocates use risk assessment processes and tools to identify and mitigate potential failure modes before they occur. At the design stage, identify asset expectations and performance standards, what would cause them not to be met, and what needs to be done to prevent those failures or mitigate their consequences if they do occur. Then use the results to help develop operating and maintenance procedures. A key aspect of this is identifying the level of risk the company is willing to take. This will help in understanding what is accepted as normal and what is identified as a deviation from the norm.

When prevention isn’t possible, early detection and rigorous RCA processes are essential.

Early detection through condition monitoring and predictive analytics is a way to manage the failure modes you are not able to eliminate, as well as those you have not yet identified or did not think were possible. If an issue is identified early enough, it will save a lot of heartache. Unfortunately, it’s easy to fall into a trap where you think you did a great job identifying an issue early, but then you don’t investigate any further to determine its root causes. You do not need a catastrophic failure in order to do an RCA. An RCA can be done when an asset or system is no longer able to meet its performance standard or is trending toward not meeting its performance standard.
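To make "trending toward not meeting its performance standard" concrete, here is a minimal sketch in Python. It assumes equally spaced readings of a single condition indicator where a rising value is bad (for example, overall vibration), and it assumes a straight-line trend is a reasonable first approximation. The function name, the readings and the limit are all hypothetical and are not taken from any real monitoring system.

```python
from statistics import mean


def periods_until_limit(readings, limit):
    """Estimate how many more sampling periods remain before a rising
    trend reaches `limit`.

    Assumptions (illustrative only): readings are equally spaced in time,
    a higher value is worse, and a least-squares straight line is an
    adequate description of the trend.
    Returns None when there is too little data or the trend is not rising.
    """
    n = len(readings)
    if n < 2:
        return None

    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(readings)

    # Ordinary least-squares slope and intercept.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, readings))
    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or improving: no projected breach

    intercept = y_bar - slope * x_bar
    fitted_now = intercept + slope * (n - 1)

    # Periods left on the fitted line; 0.0 means the limit is already reached.
    return max(0.0, (limit - fitted_now) / slope)


# Hypothetical weekly vibration readings (mm/s) against a hypothetical limit.
weeks_left = periods_until_limit([2.0, 2.5, 3.0, 3.5], limit=5.0)
```

A projection like this does not replace the RCA. It only indicates how much time may be available to plan the repair and to ask why the indicator is rising in the first place.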

For example, the onset of failure for a pinion bearing on a ball mill was identified through vibration analysis and confirmed with oil analysis. The issue was identified early, so the repair could be planned and scheduled with little impact on the maintenance and production teams. However, there was no follow-up as to how this bearing got to be in this condition. Without a doubt, had there been no early indication from vibration or oil analysis and had the bearing failed catastrophically, there would have been a lot of management noise, and an RCA would have been carried out to identify the root causes and the actions needed to prevent recurrence.

RCA identifies what you missed and what you need to do to prevent the undesirable event from happening again. It also can help identify other gaps and deviations that you were previously blind to. For example, an RCA investigation found that the flow and pressure instruments on a hydraulic system were routed incorrectly and reported to the wrong locations. They were not a cause of the failure; however, they are a deviation from the norm. Later on, they could cause a failure or fail to detect the onset of one, so it makes sense to fix them now.

Reliability-centered maintenance (RCM), failure mode and effects analysis (FMEA) and RCA are all tools reliability practitioners must work toward mastering. They provide the methodology for identifying what is holding a company back from achieving its desired level of reliability. Reliability practitioners need to utilize the reliability tools they have for identifying and managing risks.

A few more things need to happen. To regain control, reliability practitioners need to identify the things in their plants that have deviated from what was previously determined as normal. If no such determination has been made, identify an acceptable standard for the business that can be classified as normal and define what falls outside it.
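One simple way to put this into practice is to write down the agreed "normal" as a range for each monitored parameter and compare current readings against it. The sketch below is illustrative only: the parameter names, ranges and readings are hypothetical, and a real plant would set them from its own performance standards and risk tolerance.

```python
def find_deviations(baseline, current):
    """Return readings that fall outside the agreed normal range.

    baseline: dict mapping parameter name -> (low, high) accepted range
    current:  dict mapping parameter name -> latest reading
    A parameter with no reading is also reported (reading is None),
    since a silent instrument is itself a deviation from normal.
    """
    deviations = []
    for name, (low, high) in baseline.items():
        reading = current.get(name)
        if reading is None or not (low <= reading <= high):
            deviations.append((name, reading, low, high))
    return deviations


# Hypothetical baseline and readings for a hydraulic system.
baseline = {
    "oil_temp_c": (40, 70),
    "pressure_bar": (90, 110),
    "flow_l_min": (20, 30),
}
current = {"oil_temp_c": 75, "pressure_bar": 100}

flagged = find_deviations(baseline, current)
```

Each flagged item is a prompt to validate or veto the concern, not proof of an impending failure.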

Reliability practitioners need to listen more closely to the weak signals and that gut feeling that says something isn’t quite right. Take action and look to validate or veto your gut feeling.

People tend to accept what they are willing to walk past. Summon the courage to slow down, take a step back and completely understand what is going on before quickly fixing the issue, pressing the reset button, giving the folks a pat on the back, returning to full production and forgetting anything ever happened. This is the hardest step. Companies have production targets to meet, and slowing things down by being the skeptic will not sit well with a lot of people.

It is also difficult with limited resources, which make it impossible to do an RCA for every event or every time you feel uneasy. Pick your battles, but grasp every opportunity you get.

It’s no good wondering, “If given the option to go back in time, what could I have done to prevent these events from happening?” Time travel is not possible yet, so take the time now before it’s replaced with hindsight.
