Maintenance Triage: Identifying Sick and Injured Assets to Improve Population Health

by Will McGinnis

Attempting to contextualize generally obscure statistical predictions about machines is an all too common practice. While statistics and predictive analytics are not new concepts, they have not yet been applied deeply enough to mechanical processes to become part of the vernacular. When the focus is on making analysis actionable, similes help to convey the message. This article presents one of them.

Machines As Patients

In terms of failure, the primary difference between biological and mechanical systems is, for the most part, that small variations strengthen the biological and weaken the mechanical. A biological system dropped into an environment for which it is ill suited will eventually adapt to it, while a mechanical system will simply fail. Despite this fundamental difference, the two can be approached the same way on the doctor's table.

Sickness

People and machines both get sick. The chronic propensity to fail can be seen in both terminally ill patients and mis-sized or improperly maintained machinery. Both cases are characterized by a likelihood of future failure that is independent of short-term variation in environment, behavior or process.

Injury

Likewise, both people and machines can be injured. With a person, this could be a broken arm. In most cases, it is a single, unexpected event that is unlikely to be repeated unless the environment consistently reproduces those events (e.g., a professional skateboarder). The mechanical equivalent would be an operator's error. A broken part due to operator error is not necessarily an indication of a long-term misapplication of the machine, but rather an isolated event caused by outside factors. The only ways to avoid injuries are to change the environment or improve the process (e.g., quit skateboarding or train more to fall less).

Sickness vs. Injury

With these similes, it can be pretty simple to visualize and understand the cases of mechanical failure. Sickness is environmentally independent and chronic, while injuries are heavily influenced by the environment and are acute. A machine can be both sick and injured, or just one of the two.

Measurements for both machine sickness (HealthScore) and injury (Warnings) can be seen in Table 1.

Table 1

There are four assets with the four boundary cases of condition:

  1. Total Health (004)
  2. Injured, but well (001)
  3. Intact, but sick (002)
  4. Sick and Injured (003)

The goal is a machine with both a high HealthScore and no warnings (predicted injuries). The opposite of this ideal is a sick and injured machine, with both a low HealthScore, indicating chronic risk of failure, and a warning, indicating imminent likelihood of injury.

Between these two extremes are two mixed cases. A machine with a warning but a high HealthScore indicates an environmental risk but no systemic misapplication. A machine with no warning but a low HealthScore indicates a systemic misapplication that is being well accounted for by the environment (i.e., users and maintainers). Both cases are interesting because they can easily be rationalized as good results. While they are far better than sick and injured, they can be greatly improved by changing the environment or the system, respectively.
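The four boundary cases can be expressed as a simple classification over HealthScore and warning count. The following is a minimal sketch rather than the method behind Table 1: the 0-100 scale, the cutoff HEALTH_THRESHOLD, the Asset class and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the article does not state what counts as a "low" HealthScore.
HEALTH_THRESHOLD = 50.0


@dataclass
class Asset:
    asset_id: str
    health_score: float  # assumed scale of 0-100, higher is healthier
    warnings: int        # number of predicted injuries


def condition(asset, threshold=HEALTH_THRESHOLD):
    """Map an asset to one of the four boundary cases."""
    sick = asset.health_score < threshold
    injured = asset.warnings > 0
    if sick and injured:
        return "Sick and injured"
    if sick:
        return "Intact, but sick"
    if injured:
        return "Injured, but well"
    return "Total health"


# Illustrative values only; these are not the figures from Table 1.
assets = [
    Asset("001", 90.0, 1),
    Asset("002", 30.0, 0),
    Asset("003", 25.0, 2),
    Asset("004", 95.0, 0),
]

for asset in assets:
    print(asset.asset_id, condition(asset))
```

Keeping the two measurements as separate fields, instead of folding them into one score, preserves the distinction between chronic and acute risk that the simile depends on.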

Figure 1 shows an example of how a population of assets might look visualized in this way.

Plants As Populations

After establishing this idea of machines as patients (i.e., biological-esque entities), the natural extension is to look at a plant as a population of such entities. Scholar Nassim Taleb describes all things as being along a scale from fragile to anti-fragile.

If something is fragile, then a small, random variation makes it weaker. Conversely, if something is anti-fragile, then small, random variations make it stronger. Right in the middle of the two is robust or resilient, where variations do not affect the item.
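One rough way to make this definition concrete is to compare how an item performs at a steady baseline with how it performs, on average, when its input swings a little in either direction. The sketch below assumes the item's performance can be written as a function of a single input; the response functions, the baseline and the size of the swing are all hypothetical.

```python
def variation_effect(response, baseline, delta):
    """Average change in output when the input swings by +/- delta around the baseline."""
    swung = (response(baseline + delta) + response(baseline - delta)) / 2
    return swung - response(baseline)


def classify(response, baseline, delta, tolerance=1e-9):
    """Label an item by whether small variations hurt it, help it, or leave it unchanged."""
    effect = variation_effect(response, baseline, delta)
    if effect < -tolerance:
        return "fragile"
    if effect > tolerance:
        return "anti-fragile"
    return "robust"


# Hypothetical response curves, chosen only to illustrate the three labels.
def wears_out(load):
    return 100.0 - load ** 2


def adapts(load):
    return load ** 2


def unaffected(load):
    return 80.0


for name, response in [("wears_out", wears_out), ("adapts", adapts), ("unaffected", unaffected)]:
    print(name, classify(response, baseline=5.0, delta=1.0))
```

With these made-up curves, the swings cost the first item more than they give back, benefit the second, and leave the third unchanged.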

Individual machines are generally fragile by this definition. Likewise, individual people are, in many cases, fragile. A population of people, however, can be extremely anti-fragile, adapting over time to changes in the environment and gradually improving quality and longevity of life. The goal of a plant should be a population of machines that resembles a population of people or animals more so than a house of cards, where a collection of fragile items is more fragile than its constituent parts.

Selection

The first key concept that contributes to the anti-fragility of biological populations is selection. Those animals or bacteria that are least fit for the environment do not continue on into the next generation. Likewise, to increase the anti-fragility of a plant, the fitness (health) of the machines needs to be not only tracked, but also used to inform the replacement of machines. Machines that are chronically unfit (unhealthy) must be replaced with machines that are less predisposed to this condition. Where the machines themselves cannot be changed to be more fit for the environment, the environment must be changed.

Continuous, iterative improvement of both the conditions surrounding the machines in a plant and the fitness of the machines to that environment is critical in building the capacity for the plant to not only withstand unexpected variations, but to benefit from them.
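Using tracked health to inform replacement could look something like the sketch below. It treats a machine as chronically unfit when its HealthScore has stayed under a cutoff for several consecutive readings. The cutoff, the number of readings, the function name replacement_candidates and the sample history are assumptions for illustration, not values from the article.

```python
from statistics import mean


def replacement_candidates(history, threshold=50.0, min_periods=3):
    """Return IDs of assets whose last min_periods HealthScores were all below threshold.

    Candidates are ordered worst first by their average recent HealthScore.
    """
    chronic = []
    for asset_id, scores in history.items():
        recent = scores[-min_periods:]
        if len(recent) == min_periods and all(s < threshold for s in recent):
            chronic.append((mean(recent), asset_id))
    return [asset_id for _, asset_id in sorted(chronic)]


# Hypothetical HealthScore history, oldest reading first.
history = {
    "001": [45.0, 88.0, 90.0],  # one bad reading, then recovered: not chronic
    "002": [35.0, 32.0, 30.0],
    "003": [40.0, 28.0, 25.0],
    "004": [96.0, 94.0, 95.0],
}

print(replacement_candidates(history))  # ['003', '002']
```

Requiring several consecutive low readings is one way to keep a single bad period, which is closer to an injury, from triggering a replacement decision.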

Compartmentalization

The other critical contribution to a population's anti-fragility is compartmentalization. The lower the effect that one failure has on other members of the population, the more anti-fragile that population will be.

Taleb uses the examples of airlines and banks. If a plane accidentally crashes somewhere in the world, other planes are no more likely to crash because of it. In fact, the aviation industry will learn from the accident and improve its processes so future planes will be less likely to crash, the very definition of anti-fragility. A bank, on the other hand, is more likely to collapse if one of its peers collapses. The financial crisis of 2007-2008 is a testament to this fact; the interconnectedness of the global economies (lack of compartmentalization) makes the population of banks fragile.
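Applied to machines, one rough way to gauge compartmentalization is to count how many other assets would be stopped by a single failure. The sketch below assumes the plant can be described as a map from each asset to the assets it feeds, with no redundant paths; the layout and asset names are hypothetical.

```python
from collections import deque


def downstream_of(feeds, failed):
    """Return every asset that depends, directly or indirectly, on the failed asset."""
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for nxt in feeds.get(current, []):
            if nxt not in affected and nxt != failed:
                affected.add(nxt)
                queue.append(nxt)
    return affected


# Hypothetical layout: each key feeds the assets listed for it.
feeds = {
    "conveyor_a": ["press"],
    "press": ["packer"],
    "conveyor_b": ["labeler"],
}

all_assets = sorted(set(feeds) | {d for targets in feeds.values() for d in targets})
impact = {asset: len(downstream_of(feeds, asset)) for asset in all_assets}

for asset, count in impact.items():
    print(asset, count)
```

Assets with the largest counts are the ones where buffers, bypasses or spares would do the most to contain a failure.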

For a plant, this means the goal should be to minimize the impact of failures on downstream processes as much as possible.

Having described the idea of a plant as a population of entities with individual variations in health and injury, what can you actually do in your plant?

You want to have less downtime, lower maintenance costs and fewer accidents, and make more money. In a world of certain variation, you want your plant to have lower risk and to better weather storms. To do this, you must follow these steps:

  1. Compartmentalize failures,
  2. Predict machine injuries,
  3. Replace sick machines,
  4. Repeat.

It is an iterative process that reflects the constantly changing environment in which workers and their machines operate. By focusing on Steps 2 and 3, you can leverage preexisting data sources in your plant and use the power of predictive analytics to predict acute failures and quantify long-term machine health.
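Steps 2 and 3 can be combined into a single triage worklist. The sketch below puts assets with warnings first, since a warning signals an imminent injury, and then orders by HealthScore from lowest to highest. That priority rule is an assumption, as are the function name triage_order and the sample readings.

```python
def triage_order(readings):
    """Sort (asset_id, health_score, warnings) tuples into a triage worklist.

    Assets with at least one warning come first; within each group,
    the lowest HealthScore comes first.
    """
    return sorted(readings, key=lambda r: (r[2] == 0, r[1]))


# Hypothetical readings: (asset_id, health_score, warnings).
readings = [
    ("001", 90.0, 1),
    ("002", 30.0, 0),
    ("003", 25.0, 2),
    ("004", 95.0, 0),
]

for asset_id, health_score, warnings in triage_order(readings):
    print(asset_id, health_score, warnings)
```

With these readings the worklist runs 003, 001, 002, 004: the sick and injured asset first, the fully healthy one last. Rerunning the sort as new data arrives mirrors the "Repeat" step.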

In closing, here is one more simile. If your plant is a population of sick and injured patients, how are you maintaining it? With predictive analytics, you have a means by which you can perform triage in order to build continuous improvement.

Will McGinnis is a Mechanical Engineer from Auburn University working at Predikto Analytics. At Predikto, he uses advanced machine learning techniques to leverage preexisting data in asset-intensive industries to predict failures, identify bad actors, and impact bottom lines. www.predikto.com