Reliabilityweb Confessions of a Risk Matrix Zealot

Confessions of a Risk Matrix Zealot

Merriam-Webster's Dictionary defines risk as the “possibility of loss or injury.” The International Organization for Standardization’s ISO31000 defines a risk matrix as “a tool for ranking and displaying risks by defining ranges of consequences and likelihood.” With these two definitions, you can simply define a risk matrix like the example in Figure 1.

Figure 1: Risk matrix example

The problem with this simple approach is context. Can it really be said with confidence that a rare but catastrophic event is as low a risk as a negligible but frequent event? No, because the matrix in Figure 1 lacks meaningful context. Experience shows that different risk contexts require different risk visualizations.

In past years, you might have thought a risk matrix was the ultimate tool in evaluating criticality and risk for asset management. You also might have thought that all categories of risk could be evaluated congruently in a common risk matrix across an organization. In fact, in the past, an entire product line of asset management tools may have been led upon this principle, thus forcing many subject matter experts in asset reliability, mechanical integrity and process safety to attempt to align to a common risk matrix. Very spirited discussions certainly ensued, usually turning into debates, with alignment not always the result. As poet William Blake once said, “The fool who persists in his folly will become wise.” Indeed, many have learned this lesson, with age and experience comes wisdom and humility.

On the other hand, the risk matrix is a very good tool. It provides a visual representation to communicate risk concepts simply while providing a framework for prioritization. It can be customized for specific organizations with specific definitions of risk. It is no wonder industry so quickly wants to adopt it to understand risk across organizations. Unfortunately, a risk matrix is not a good tool for comprehensive risk quantification, nor its mitigation. It is limited to discrete ranges of probability and consequence. It is often oversimplified, subjectively qualitative and does not consider how risk changes over time.

Let’s consider a couple of common categories of risk in the asset intensive industry of safety and operations.

First and foremost, let’s discuss safety risks, which are focused on consequences from minor injury up to very severe injury, including fatality. Naturally, the protection of people is of utmost importance. A safety risk assessment considers events that could lead to personnel harm or injury. For example: Could the event occur? Can the event cause a chemical leak and/or fire? Is there a possibility that someone could be exposed to the leak or fire? When determining a risk matrix, the consequence categories can scale reasonably from minor first aid to fatality. But in this context, your probability scale actually factors in multiple probabilities for every row (i.e., event, leak, fire and exposure). Thus, the matrix probability could exponentially scale from an occurrence of once per year to once every 10,000 years. It can be difficult to practically understand something occurring only once every 10,000 years, but this makes total sense to a safety engineer or, likewise, to a mechanical integrity engineer. They are responsible for mitigating severe consequences that should never happen, which include fatality, loss of containment, fire and hazards to the environment.

Safety and integrity engineers use methods of risk mitigation dictated by process safety management standards, such as hazards analysis, safety integrity systems (SIS), layers of protection analysis (LOPA) and risk-based inspection (RBI), as well as compliance to jurisdictional standards. These methods are relatively complex methodologies that drive recommended and mandatory actions as part of an overall safety and integrity plan. While you can use a risk matrix as a visual representation of the results of these methods, you cannot easily use the same risk matrix to represent operational risk.

For the context of this article, operational risk is defined as the risk of unplanned production downtime and associated costs, which may include maintenance, overtime, lost production, rework and scrap. Often, unplanned production downtime is caused by asset failures. Reliability engineers seek to mitigate the risk of these asset failures, especially those that impact production. There is typically an immediate cause and effect of a critical production asset failure. The asset fails and production is immediately impacted. Imagine assets that failed every 10,000 years, reliability engineers would not be needed to determine how to improve those failure rates! That is the difference of probability scale for reliability engineers. They are dealing with asset failure consequences of production loss and costs that could occur multiple times a year up to once every several years. It is a completely different context to that of the safety or integrity engineer.

Thus, think of a risk matrix as just a defined set of intersections of probability and consequence chosen to represent a category or context for risk. As previously described, contexts for safety and operations are different, thus it is reasonable for risk matrices to differ. Remember, it is a tool for visualization and prioritization more than assessment and mitigation.

Back to the reliability engineer focused on operational risks, another key aspect to consider for risk assessment is mission time. The mission time for an asset or a group of assets, such as a production unit, can be thought of as the time between major shutdown events where restorative maintenance or asset replacement is performed. However, if a risk matrix is used to assess risk in this context, the assessment will be limited to just the ranges defined for each intersection and it will not have any context of risk over time.

For the reliability engineer, it is best to assess risk with a more quantitative probability estimate over a mission time multiplied by an estimate of the overall cost of the potential failure to the business. This allows the engineer to focus improvement efforts on those assets that will return the most value to the business. These include efforts that leverage reliability methodologies, such as failure mode-based strategy development, maintenance optimization, reliability modeling, asset health monitoring and advanced analytics.

A better risk assessment method is to estimate a failure probability quantitatively for the asset. This can be easily done with an estimate of failure rate experienced (or expected) combined with a desired mission time. A simple calculation can be used to represent the probability over time, such as a random Weibull or exponential distribution. Plotting the distribution will provide a simple visual over time.

Figure 2: Failure probability over mission time

Consequence also can be estimated based on the overall cost of failure, which encompasses all costs, including repair costs and production losses. By combining this cost-based consequence with your failure probability, the overall risk for your mission time can be easily calculated.

Figure 3: Overall risk calculation

Now, if you have a set of production assets to evaluate for a system or unit, you could compare them across a mission time between shutdowns. A comparison might look like the chart in Figure 4; note the riskiest asset is not always the one to focus on during the mission time.

Figure 4: System-level risk assessment

Assets with a higher probability of failure, but lower cost, might need more attention during the mission time than an asset with a much lower probability of failure, but much higher cost. This is a risk comparison tool to evaluate assets in a specific context over a specific time. Also, by estimating failure probability and cost of failure, you are better equipped to leverage other reliability methods to determine the best course of action to improve assets.

For those of you that have assessed asset criticality with a qualitative risk matrix approach, take heart because all is not lost. You are a step ahead. Consider your risk matrix intersection an initial assessment that can be leveraged in the more detailed assessment in Figure 4. You can simply use your ranges as estimates of failure rates and costs to plug into the same formulas and then adjust as needed.

With this comparison, the next steps can be taken to mitigate risk and capture its associated value to the organization. Several mitigation methods can improve an asset’s performance or reduce the cost of unplanned failures, including addressing problems, such as:

Asset is unreliable with low inherent mean time between failures (MTBF);
Asset strategy does not cover or does not effectively cover all failure modes;
Asset strategy interval is either too low (doing too much)or too high (doing too little);
Asset strategy is not being executed properly;
Asset strategy is not addressing the root cause of the failures.

In summary, remember an initial risk assessment is just the starting point to any form of active risk management. You must use the assessment to drive prioritization and improvement of asset performance in support of the operation of your business. A risk matrix is a tool that can be used, but a quantitative approach offers a more technically complete assessment. The better the assessment, the better decisions you will make regarding risk mitigation.