The Quest of the 2 Questions - A Basic View of Industrial Reliability & the Things that Hold Program

by Peter Chalich

This is the first of a two-part series. Part 1 covers value and methods for understanding how our equipment is failing. Part 2 will address the value and methods for understanding the services that we may or may not be providing our equipment.



Imagine a chaotic industrial setting where the equipment rules over humans through untimely breakdowns. The results are huge losses to the bottom line and much human energy and toil being expended on inefficient and inconvenient emergency repairs. If this sounds uncomfortably familiar, then perhaps you should read on.

After 25 years of working within a variety of industrial settings, I have concluded that most people facing a highly reactive industrial maintenance environment realize that things could and should be better. They can visualize a world where everything, including organized outages, runs without interruption. They can visualize a world where operations hands over equipment at the agreed upon time and maintenance returns the equipment at the agreed upon time. This is the vision, but how do you get there? Most people can understand where they are and most can understand the vision. However, understanding how to get from one to the other is not so easy.

So how do we plot a course to the visionary state? I have come to believe that clues to this are within our ability to answer two seemingly simple questions: How is the equipment failing? and What are we doing to the equipment? It has been my experience that, outside of industries such as aviation, pharmaceutical and anything nuclear (where the cost of equipment failures is extraordinarily high), many, if not most, organizations cannot really answer either of those two questions.

You see, it’s all about the equipment. The fulfillment of our vision depends on how well we understand what our equipment is experiencing, both in terms of application and the service we provide. Through this understanding, we can work to reduce the likelihood of unanticipated failures, thereby increasing the likelihood of an uninterrupted run to the next planned outage. Both the pursuit and the result can be expressed by a single word: reliability.


If your business is process constrained, the first and most pressing question is: How and why is your equipment failing? This is because failures go straight to the bottom line through production shortfalls, makeup, or rework. Fortunately, there are several tools available that help address the question of how the equipment is failing. Most notable are delay tracking and analysis, breakdown work order history, root cause analysis and condition monitoring.


Delay tracking and analysis is where we look at how the process has failed. The dominant data coming out of delay analysis is time or production units lost and the number of process interruptions. Next is the type of interruptions. Some delays are caused by operations issues, others by equipment issues. The more robust systems may even be able to provide some insight into equipment failure detail. It is clearly a retroactive look, but we can learn a lot by studying how the process failed.

Most organizations have some form of delay tracking mechanism where information about process delays is entered. However, it has been my experience that this information is rarely looked at. There are several reasons for this. Firstly, it’s a tedious, boring task that is viewed by most with the same zeal as filling out one’s tax returns. Secondly, while important, it is never urgent. All urgent tasks supersede all important tasks and the workplace is full of urgent tasks. Thirdly, the data is always suspect. This should come as no surprise. When people realize that nobody ever looks at the data, they soon find the path of least resistance, which usually involves either entering crap or nothing at all. Lastly, the coding used to capture this information is usually poorly thought out and is often a mixture of causes, effects and remedies. As such, on those rare occasions when analysis is actually performed, the result is a Pareto chart showing “blank” and “other” as the top two columns.

Delay analysis is a big topic and there are all kinds of analysis tools out there. The important thing is to get started, use whatever data is available, act on the findings and make all this as visible as possible. As stated earlier, most organizations have some form of delay data and it has been my experience that every data set has a story. It’s your job to find the story. Locate whatever data is available, brew yourself a pot of coffee and lock yourself away so you can have some uninterrupted time to find the story. This is somewhat like turning over rocks. Most rocks have nothing under them, some may have clues that lead to other rocks, but a few will have jewels of opportunity.

For me, basic analysis usually includes building a pivot chart in a spreadsheet program that includes all the pertinent data in the data set. I like to start with six to nine months' worth of data. This tends to smooth out some of the special causes. It is important to remember that we are primarily interested in systemic issues rather than one-offs. If possible, create Pareto charts for the number of occurrences by equipment, the losses (hours, minutes, tons, units, etc.) by equipment and the loss per occurrence by equipment. The number of occurrences by equipment is a measure of unreliability and offers clues as to where efforts at preventing failures might add the most value. Losses by equipment are a mixture of unreliability and the consequence of each occurrence. When this is combined with the loss per occurrence information, we often have clues as to where efforts in the area of maintainability might add the most value.
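If your delay log can be exported to a flat file, the same pivot analysis can be done in a few lines of pandas. This is only a sketch: the column names (`equipment`, `loss_minutes`) and the sample records are hypothetical stand-ins for whatever your delay-tracking system actually exports.

```python
import pandas as pd

# Hypothetical delay records; in practice, export six to nine months'
# worth of data from your delay-tracking system.
delays = pd.DataFrame({
    "equipment":    ["conveyor", "pump", "conveyor", "crusher", "pump", "conveyor"],
    "loss_minutes": [15, 120, 30, 240, 90, 20],
})

# Occurrences, total loss, and loss per occurrence by equipment,
# sorted Pareto-style with the biggest total losses first.
summary = (
    delays.groupby("equipment")["loss_minutes"]
    .agg(occurrences="count", total_loss="sum", loss_per_occurrence="mean")
    .sort_values("total_loss", ascending=False)
)
print(summary)
```

High occurrence counts point at unreliability; high loss per occurrence points at maintainability. Sorting by each column in turn gives the three Pareto views described above.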

In any case, review the results with those closest to the equipment. There is no substitute for face-to-face discussion with the people that operate and maintain the equipment. This communication serves two purposes; it validates or invalidates the results and it shows people that the data is being used. Data that people see being used is more likely to be entered correctly and reported accurately. Do not allow yourself to completely discount the value of the data. Doing nothing because the data is perceived as being inaccurate does not advance anything. Remember, there is always a story in the data, even if the story is that the data systems have to be repaired. As with condition monitoring, act on the findings and then broadcast the value.


Breakdown work order analysis is another tool for taking a retroactive look at how things went wrong. We should be able to look up what was done, when it was done and what it cost. If the system is robust enough, we may have some information on failure modes and frequency. The problem, however, is that breakdown work is often performed without a work order or against a blanket work order. In both cases, the useful information is lost. As with delay tracking, there are several reasons for this. Firstly, while posting work against work orders is important, it is not urgent. Getting the process up and running is urgent. The “important” can never quite compete with the “urgent.” Secondly, entering and closing work orders with proper detail and coding is tedious and boring. Additionally, we usually depend on the same folks to fill out work orders that we have groomed into adrenaline junkies through years of recognition for putting out fires and saving the day. Lastly, as with delay analysis, we rarely look at this data and when we do, it is not visible to the people entering the data.


There are many tools available for performing root cause analysis (RCA). With RCA, we take an in-depth look at the details of individual equipment failures to hopefully identify the most basic causes. If we do this correctly and properly act on the causes, we can either prevent recurrence or at least mitigate the future effects of the failure in question. Furthermore, these root causes often serve as enablers for other potential failures. This means if we truly rectify the causes of the failure in question, we are likely to reduce the risk of potential failures of which we may not even be aware.

Some key points to RCA are to know when to conduct an analysis and preserve the evidence, when to use an organized approach to the analysis and, most importantly, when to act on the findings. However, many organizations suffer from shortfalls in all these key points. For example, timing is everything with RCA, but many organizations don’t know when to perform an RCA. Usually RCAs are invoked as an emotional response to the sting of a significant failure event. Sometimes, they are actually used as a form of punishment for “allowing” this failure to occur. This triggering mechanism, while bad enough on its own merit, suffers further from the lag time involved. By the time the decision is made to conduct an analysis, much of the evidence has been lost. This is exacerbated by the urgency to get the process up and running. Again, we have the competition between the important and the urgent. RCA may be viewed as important, but not urgent. Mixed in with this is the fear that the analysis will assign blame for the failure. Given all this, evidence disappears and weakens the analysis. Even if there is the intention to preserve evidence, there is often no means to do so. Evidence preservation is much more effective with some up-front work. Having a designated and commonly understood place to take and store evidence, as well as commonly understood evidence collection procedures, goes a long way towards preserving evidence.

The analysis methodology is useful in keeping the analysis organized. There are many good, off-the-shelf methodologies available and many of these offer software tools that help with organization and perform much of the administrative work. The key points are to identify a methodology, install the tools, train a core group and then put this stuff to use. Many organizations either don’t have a methodology identified or they are not competent in its use. It’s okay to have more than one methodology if different purposes are truly served, but as a general rule, discourage excursions from the methodology. Competency with the methodology comes only with use and translates into analysis speed and accuracy.

Acting on the findings is probably the weakest area for many organizations. This is because many root causes are systemic or organizational in nature, thus requiring working across department lines. Shortcomings in this area are exacerbated by the RCA trigger likely being an emotional reaction to the failure event. By the time the analysis is complete and the recommendations identified, much of the emotion has subsided and other urgent matters have taken over. Organizational understanding, vision and commitment to improvement are required to get the most out of RCA efforts.


In this context, condition monitoring is used to detect when a failure has begun, but the part, component, equipment, process, etc., is still functional. This concept has been well illustrated in John Moubray’s now famous P-F curve. In this instance, we are talking primarily about predictive maintenance (PdM) technologies, such as vibration analysis, fluid analysis and thermography. Also included in this realm is sensorial inspection, where issues can be seen, felt, or heard. Condition monitoring is different from the previously discussed tools because it is a real-time look, not a retroactive one. Because it operates in real time, we can act on condition monitoring findings soon enough to mitigate some of the effects of the ensuing failure. Here there should be a sense of urgency. However, this urgency is often diluted by issues of faith in either the technology or those deploying it. This is often the case when the use of technology is new to the organization. It also happens when contractors are used to perform the analysis and the results are presented to an organization without enough internal knowledge to engage in any meaningful dialogue on the analysis or its conclusions. Furthermore, there is inherent reluctance within the human species to remove something from service while it still has some life in it. This is because we tend to apply the same logic to things that we apply to ourselves and nobody wants to be removed from service prematurely.

The two best ways that I have found to combat the reluctance to act on findings are competency and visibility. The more competent an organization is in the use of condition monitoring tools, the more likely it is to understand the nature of the risks and opportunities the findings present. This understanding increases the probability of action. Just as competency builds faith in a given technology, visibility builds faith in the overall condition monitoring program. Findings should be tracked, published and discussed whenever possible. It is imperative to keep this information out in front of the organization. In short, make findings difficult to ignore.
