There were already a large number of engineers actively engaged in "reliability" related activities. I thought that most people already understood and supported "reliability" as a core value. I wondered why a "Reliability Workshop" was needed.
Now, twenty years later, I have a much better answer to that question. Many if not most companies still need to have a reliability workshop, and the objectives of each workshop should attempt to answer the question, "What level of reliability do I have a right to expect?" To answer that question, companies must first understand all the elements that affect "reliability" and then evaluate how well they are dealing with each of them.
What do you have a right to expect?
"What you have a right to expect" is the result of the reliability characteristics that were designed and built-in to your systems and equipment, and have been preserved through proper operation and maintenance. If you are able to recognize the things that cause failures, you will be able to make a realistic evaluation of what you have a "right to expect" because you can determine the amount of failure prevention that has already been applied. You will also have a head start on achieving the level of performance you desire.
When a device is designed and manufactured, the included components have a certain level of robustness and the configuration provides a certain amount of redundancy.
The combination of component choices and configuration leads to the characteristic best described as "inherent reliability". No matter how well you operate and maintain a system, it cannot perform better than its inherent reliability. If the system is operated and maintained as well as possible, you will harvest the maximum inherent reliability. If you operate or maintain the system in a manner less than optimum, you will experience a reliability performance that is something less than the inherent reliability.
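As a rough illustration of how component robustness and redundancy combine into inherent reliability, the following minimal sketch uses the standard series and parallel formulas. It assumes component failures are independent; the function names and the component values are hypothetical and chosen only for the example.

```python
from math import prod


def series_reliability(reliabilities):
    """All components must work, so multiply the individual reliabilities."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """Only one component must work: 1 minus the chance that all of them fail."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical pump train: motor, coupling and pump, all required (series).
single_train = series_reliability([0.98, 0.99, 0.95])

# Hypothetical redundant configuration: two identical trains, either one
# able to carry the full load (parallel).
redundant_trains = parallel_reliability([single_train, single_train])

print(f"single train:     {single_train:.4f}")
print(f"redundant trains: {redundant_trains:.4f}")
```

The redundant configuration always scores higher than a single train, which is the sense in which configuration, and not only component choice, sets the ceiling that operation and maintenance can reach but never exceed.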
Each part of an item of equipment tends to deteriorate over time. There are things you can do to minimize the deterioration. Generally, this deterioration will lead to failure at some point in time. If you understand the deterioration rate and the current status of deteriorating components, it is possible to intervene before the failure takes place. These actions that minimize the deterioration, or intervene before failure, are best described as "proactive maintenance". Proactive maintenance that is intended simply to monitor the current condition is a form of "predictive maintenance". Proactive maintenance that is intended to change deteriorated components before they fail is a form of "preventive maintenance". To capture all the available "inherent reliability", you need to implement an optimum program of proactive maintenance.
Despite how "perfect" you believe your proactive maintenance program to be, there are always new defects or forms of deterioration that you did not expect. These new defects will result in unexpected failures, and the way you deal with them through "reactive maintenance" will affect the reliability. If your reactive maintenance system responds properly, does a good job of diagnosing and troubleshooting the problem, repairs it correctly, verifies the repair, records "as found" and "as left" conditions (so that the deterioration rate can be calculated), keeps good records, and installs proactive tasks that will intervene before the next failure, then you will harvest all the available inherent reliability.
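To show why recording "as found" and "as left" conditions matters, here is a minimal sketch of a deterioration-rate calculation. It assumes deterioration is linear between inspections, which is a simplification; the function names, the wall-thickness measurement, and the numbers are hypothetical.

```python
def deterioration_rate(as_left, as_found, elapsed):
    """Average loss per unit time between the last repair and this inspection."""
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return (as_left - as_found) / elapsed


def time_to_limit(as_found, limit, rate):
    """Time left before the measurement reaches the limit, at a constant rate."""
    if rate <= 0:
        return float("inf")  # no measurable deterioration
    return max(0.0, (as_found - limit) / rate)


# Hypothetical pipe wall: 10.0 mm "as left" at the last repair,
# 8.8 mm "as found" 24 months later, minimum allowable thickness 7.0 mm.
rate = deterioration_rate(as_left=10.0, as_found=8.8, elapsed=24)
remaining = time_to_limit(as_found=8.8, limit=7.0, rate=rate)

print(f"rate: {rate:.3f} mm/month, remaining life: {remaining:.0f} months")
```

With these figures the rate works out to about 0.05 mm per month and the remaining life to about 36 months, which gives a basis for scheduling the next proactive task before the limit is reached.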
Stated more succinctly, if you have calculated the expected inherent reliability and you are certain you have good proactive and reactive maintenance programs, then you know "what you have a right to expect" for reliability performance.
"Reliability" or "reliability"
Many people who are not reliability experts tend to group several other characteristics under the heading of reliability. For the sake of discussion, I use the term "Reliability" (with a capital R) to describe the concept of reliability that includes all these elements.
First, reliability (with a small r) is defined as a measure of the likelihood that a system or device will operate without failure for a given period of time. My best analogy for reliability is a die (half of a pair of dice). Assume that the number one is a defect and that a failure will occur when the one comes out on top. The reliability then is five-sixths and the unreliability is one-sixth. As long as the defect exists, there is some likelihood that a failure will occur. The only way to eliminate or reduce the likelihood of failure is to eliminate the defect.
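The die analogy can be extended to several operating periods with a minimal sketch. It assumes each roll stands for one operating period and that rolls are independent; the function name is hypothetical.

```python
from fractions import Fraction


def survival_probability(defect_faces, total_faces, rolls):
    """Chance that no defect face comes up in the given number of rolls."""
    per_roll_reliability = Fraction(total_faces - defect_faces, total_faces)
    return per_roll_reliability ** rolls


one_roll = survival_probability(1, 6, 1)     # reliability for a single roll
ten_rolls = survival_probability(1, 6, 10)   # reliability over ten rolls
no_defect = survival_probability(0, 6, 10)   # defect eliminated

print(one_roll, float(ten_rolls), no_defect)
```

A single roll gives five-sixths, ten rolls give roughly 0.16, and removing the defect face gives a reliability of one no matter how many times the die is rolled. This is the point of the analogy: running longer only exposes the defect more often, and only removing the defect removes the risk.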
Second, many people tend to roll the characteristic of "availability" into their perception of a system's Reliability. Availability is a ratio of "up-time", or time the system or device can perform its intended function, to "total time".
Total "down-time" or out-of-service time is the sum of all planned down-time and unplanned down-time. "Planned availability" is determined by how long a system can operate between planned outages, and how much time is needed to conduct the outage. "Unplanned availability" is determined by how frequently unplanned outages occur (reliability), and the amount of time needed to respond to unplanned interruptions.
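The availability ratio described above can be sketched in a few lines. The function name and the yearly figures are hypothetical and serve only to show how planned and unplanned down-time enter the calculation.

```python
def availability(total_hours, planned_down_hours, unplanned_down_hours):
    """Ratio of up-time to total time, with down-time split by cause."""
    down_hours = planned_down_hours + unplanned_down_hours
    if total_hours <= 0 or down_hours > total_hours:
        raise ValueError("down-time must fit within a positive total time")
    return (total_hours - down_hours) / total_hours


# Hypothetical year: 8760 hours in total, 240 hours of planned outage
# and 120 hours of unplanned outage.
overall = availability(8760, 240, 120)
planned_only = availability(8760, 240, 0)

print(f"overall: {overall:.4f}, planned outages only: {planned_only:.4f}")
```

Comparing the two figures shows how much of the lost availability comes from unplanned interruptions, which is the portion that reliability improvements can recover.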
Third, many people also tend to roll many of the characteristics of "maintainability" into their concept of Reliability. Maintainability is a measure of your capability to return a system or device to full inherent reliability in a predictable period of time. If you were to say, "It will take three hours to fix it, but I don't know how reliable it will be", the device would not be maintainable. Also, if you were to say, "I don't know how long it will take to fix it, but when I finish it will be right", it is also not maintainable. To be "maintainable", you need to be able both to restore the inherent reliability and to do it in a known amount of time.
Not all the elements of Reliability are strictly parts of reliability, but many asset managers tend to include those characteristics when demanding Reliability improvements.
If you are attempting to answer the question, "what do you have a right to expect?" and the question is intended to address aspects of availability and maintainability, you will need to be ready to answer some additional questions.
Drinking out of a Firehose
You are probably asking yourself, "How can this individual expect to explain the whole subject of reliability in such a small text?" The answer is that I am not. I have two objectives for this text. The first is to describe a good starting point for individuals and organizations who are willing to admit they are still new to reliability. Second, I am going to try to fill a gap that most reliability experts have ignored. That gap is the one between being a highly reactive organization (and gathering little information) and having sufficient information to begin the journey to becoming a proactive organization. I will describe an approach that will be useful and valuable to reliability engineers, as well as being worthy of resource investment by asset managers.
A number of years ago I purchased a text entitled "The Little Black Book of Project Management". At the time it was published and for some time thereafter, there were few comparable texts on Project Management. Few individuals seemed to come "equipped" with project management skills. As a result, I repeatedly loaned the book to subordinates to help increase their knowledge and improve their skills. As usual, that practice turned out to be a good way to lose a book, so I was without a copy for several years. A few years ago I spotted another copy on the shelf of a used bookstore and purchased it. Again, I am regularly making the mistake of lending my copy out.
In exchange, I get improved performance by young engineers assigned to manage projects. My objective here is to create a text that is as useful to others who are trying to improve reliability performance as "The Little Black Book of Project Management" has been for me. The key characteristics are:
• The book is relatively short.
• Application of the approach described in the book is straightforward.
• Usefulness is not limited by scale. The book is equally useful to both large and small enterprises.
• You don't need to be an expert to use the knowledge.
Prepare for Change
Like most things of any value, application of the techniques found in this book requires change. The most significant change confronting the individual is a change in roles. The most significant change confronting an organization is a change in the corporate culture. To implement the approaches described in this text, both kinds of change will be required. A number of individuals will need to add tasks and change the way they are performing some of their current tasks. The organization will need to become unwilling to accept sloppiness in gathering facts and using information.
It is important to keep in mind that the tasks discussed here are closely integrated with tasks currently being accomplished within most organizations. Current meetings, current planning and scheduling protocols, and current organizational structures will need to be modified in a thoughtful, integrated manner if optimum results are to be achieved.
One last issue: of all the things that have changed in the last fifteen or twenty years, few have changed as dramatically as those affected by computerization. Before Reliability Centered Maintenance (RCM) became popular, there were a few organizations that seemed to be light-years ahead of everyone else in terms of equipment reliability. One might ask how they achieved their performance. The answer was that those companies exercised the patience and discipline to track failures and their causes. At some point they were able to recognize patterns and relationships that were hidden in the data, and use that information to prevent failures before they occurred.
One cannot overemphasize the dedication needed to record, store and analyze information before computers became available. There were cabinets full of paper files. The files in those drawers were faithfully maintained and the data was transferred to manual graphs that made the patterns and relationships more apparent. Most organizations that successfully accomplished this effort were led by a single-minded individual and staffed over a long period by a group of highly dedicated people.
In today's work environment, individuals are seldom allowed the luxury of single-mindedness. They are expected to dilute their thoughts and standards to fit in with other members of their team. There are far fewer individuals in each work group, and few assignments last more than a few years. So the "corporate memory" must come from some other mechanism.
Fortunately, many of these shortcomings can be addressed by supporting the processes with computerized files. In fact, without the use of a computerized file system, this initiative would be a foolish undertaking. Few organizations have the determination and discipline to make it work without having key functions automated by a computerized filing system.
This is not to say that a well-designed computerized system will eliminate the need for human interaction and administration. There are a wide variety of elements that can go astray if they are not properly managed.
One example we will discuss is "bucketing", a term used for classifying initial failure reports (Failure Notifications) and closing reports (Failure Modes). One way in which many systems become corrupted is by allowing too many people to define classes of failure. If individuals are allowed to create a new class every time they cannot find an exact fit, there will soon be too many classes that are only slightly different from each other. When only a portion of each true failure mode is assigned to each of a number of similar failure descriptions, the final statistics can point to an incorrect failure description as being the most statistically likely. This improper result will cause inappropriate corrective actions to be taken.
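The distortion caused by too many near-duplicate classes can be seen in a minimal sketch. The failure reports, class names, and the mapping to a canonical class are all hypothetical; a real system would hold them in its database and restrict who may change them.

```python
from collections import Counter

# Hypothetical closing reports. Bearing failures have been split across
# three slightly different classes created by different people.
reports = (
    ["seal leak"] * 9
    + ["bearing failure"] * 5
    + ["bearing worn"] * 4
    + ["bearing seized"] * 3
)

# Hypothetical mapping that folds near-duplicate classes into one bucket.
CANONICAL = {
    "bearing worn": "bearing failure",
    "bearing seized": "bearing failure",
}


def top_class(labels, mapping=None):
    """Return the most frequent class and its count, after optional merging."""
    mapping = mapping or {}
    counts = Counter(mapping.get(label, label) for label in labels)
    return counts.most_common(1)[0]


raw_top = top_class(reports)                # fragmented classes
merged_top = top_class(reports, CANONICAL)  # consolidated classes

print("fragmented:  ", raw_top)
print("consolidated:", merged_top)
```

With the fragmented classes, "seal leak" appears to be the leading problem with 9 reports. Once the bearing classes are consolidated, bearing failure leads with 12, so the corrective effort would be directed at a different failure mode.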
The most successful system will combine a well-designed computerized database and the right amount of human interaction to ensure it is not misused or corrupted by individuals who lack an overall understanding of the system design and objectives.
If this is beginning to sound like a lot of bookkeeping, it is. Good reliability management is a matter of understanding how your equipment fails. This understanding must depend on facts, not speculations or beliefs. In many ways, your equipment, and the way it operates and is maintained, is unique. As a result, your reliability information will be unique.