Risk is a fact of life in every area of endeavor, whether at home or in the air; there is always some degree of risk in everything. Managing these risks rests on two factors: the frequency with which a failure or event occurs, and the impact of that failure. Risks and hazards can range from ‘highly improbable’ and ‘of little consequence’ to ‘often occurring with catastrophic consequences’, the latter resulting in the loss of major assets, the death of one or more people, or damage to the environment. Managing these risks must include evaluation of the types of failures, their causes, determination of acceptable probabilities of failure, and the means of mitigation required to bring unacceptable risks to an acceptable level. It requires a systematic evaluation and comparison of potential failures for a system or process. Any well-constructed risk analysis depends on the procedures used to determine the relevant consequence and likelihood levels. The field of risk management/assessment and quality/hazard evaluation is large and varied. The authors' intent is not to teach this discipline but to refer to the methods in general as they apply to management decisions, and to suggest improvements in the current military aviation interpretation, for the T-38 in particular.
The Impact of Failures
The evaluation of the impact of failures (criticality) is often the first step in a risk assessment. This assessment is accomplished in a variety of ways; one commonly used method is Failure Modes, Effects, and Criticality Analysis (FMECA), a bottom-up analysis. In conjunction with a FMECA, a top-down Fault Tree Analysis (FTA) is frequently completed. An FTA is a formal logical analysis that shows, "...the combination of events that results in the occurrence of a specified system level event" (Dodson, 2002, p. 162). At the very least, a team of experienced operators, maintainers, or engineers should make a formal assessment of the likely failure modes and the impact of failures that would cause a loss of the system function or process. The primary factors considered by the team are often cost and severity of the loss (e.g., loss of an aircraft and two hundred people, or loss of three days of production at $40,000 per day). Some intangible factors may also be considered, such as the loss of confidence in the system by operators, maintainers, or the public.
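As a rough illustration of how a fault tree combines events into a system-level event, the sketch below evaluates AND and OR gates numerically. The event names and probabilities are hypothetical, and the arithmetic assumes the basic events are independent, which a real FTA would need to justify.

```python
from math import prod

def and_gate(probs):
    """All inputs must occur: multiply probabilities (independent events assumed)."""
    return prod(probs)

def or_gate(probs):
    """Any input suffices: 1 minus the chance that none occurs (independence assumed)."""
    return 1.0 - prod(1.0 - p for p in probs)

# Hypothetical per-flight-hour probabilities, for illustration only.
p_primary_pump = 1e-4
p_backup_pump = 1e-3
p_line_rupture = 1e-6

# Top event: loss of hydraulic pressure = (both pumps fail) OR (line ruptures).
p_both_pumps = and_gate([p_primary_pump, p_backup_pump])
p_top = or_gate([p_both_pumps, p_line_rupture])
print(f"P(top event) = {p_top:.3e}")
```

In this made-up tree the redundant pumps contribute far less to the top event than the single line rupture, which is the kind of insight a top-down analysis is meant to surface.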
The Frequency of Failure
The frequency of failure is usually determined at the same time that failure mode criticality is considered. For fielded systems, it is determined by evaluating failure data; for designs not yet fielded, generic data for similar components is drawn from failure databases, such as MIL-HDBK-217F (USAF, 1991) for electronics. Failure rates are specified in many forms. Some of the more common are failures per flight hour, number of cycles per failure, number of failures per year, and mean time between failures (MTBF). The appropriate measure should fit the function of the system. For example, landing gear reliability may be better quantified in terms of failures per landing or takeoff than as an MTBF in flight hours.
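The same failure history can look quite different depending on the exposure measure chosen. The following is a minimal sketch with made-up fleet numbers; the function names and the data are assumptions for illustration, not drawn from any T-38 record.

```python
def rate_per_unit(failures, exposure):
    """Failures per unit of exposure (flight hours, landings, cycles, ...)."""
    if exposure <= 0:
        raise ValueError("exposure must be positive")
    return failures / exposure

def mean_between_failures(failures, exposure):
    """Mean exposure between failures; infinite if no failures were recorded."""
    return float("inf") if failures == 0 else exposure / failures

# Hypothetical one-year fleet history for a landing gear component.
failures = 4
flight_hours = 120_000
landings = 90_000

print("failures per flight hour:", rate_per_unit(failures, flight_hours))
print("MTBF (flight hours):", mean_between_failures(failures, flight_hours))
print("failures per landing:", rate_per_unit(failures, landings))
print("mean landings between failures:", mean_between_failures(failures, landings))
```

For a component that is stressed once per landing rather than continuously in flight, the per-landing figures are the ones that track the actual exposure.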
Fidelity of the data source should also be considered when assessing failure rates. When a failure data system is known to have gaps or duplications, conclusions should be expressed as a range of probability rather than a single precise number.
Determination of Acceptable Levels of Risk
Managers must determine what level of risk is acceptable and who has the authority to accept the risks that are defined. Generally, the higher the risk and impact (severity), the higher up the chain of authority the decision on risk acceptability is made. A good example comes from military aviation. A T-38 trainer in the Air Force originally cost ~$800K and there are ~600 of them; replacement value today would be much greater.¹ Operators have a safe ejection capability for when an unrecoverable event occurs. For a safety critical failure, the acceptable probability of failure (Pacc)² is defined as one event in 1,000,000 flight hours (1×10⁻⁶). However, an E-6 (TACAMO) that originally cost $142M should be considered differently. There are only 16 in the inventory, with no safe egress for a crew that often exceeds 20 personnel. Furthermore, this aircraft is needed to coordinate essential communication during battle. The Pacc for the same event on this critical national asset should be much more restrictive than that of the T-38 (e.g., 1×10⁻⁷ or lower); however, the E-6 currently has the same Pacc as the T-38. This equality is due, in part, to the consistent misapplication of the Department of Defense Standard Practice for System Safety, MIL-STD-882D (DoD, 2000), since the standard was first published in 1969.
The information provided here is a suggested tool and set of definitions that can be used. Program managers can develop tools and definitions appropriate to their individual programs. (ibid, p. 17)
The statement above clearly affirms that the example given in the MIL-STD is only a suggested method of accomplishing risk assessment. Nevertheless, every platform that the authors have encountered uses the same criticality limits shown in Table 1. The following note under the suggested mishap severity codes table in the MIL-STD makes it very clear that the table should not be interpreted as a rigid standard:
NOTE: These mishap severity categories provide guidance to a wide variety of programs. However, adaptation to a particular program is generally required to provide a mutual understanding between the program manager and the developer as to the meaning of the terms used in the category definitions. Other risk assessment techniques may be used provided that the user approves them. (ibid, p. 18)
Table 1 - Suggested Mishap Severity Codes
As shown in Table 2, the recommended acceptable probabilities (failure rate) have been used for all aircraft, with no change since 1969.
Table 2 - Suggested Mishap Probability Levels
The matrix used to manage the qualitative risk numbers, shown in Table 3, is also used in several forms (numeric or alphanumeric) throughout the Department of Defense (DoD). This practice should be questioned when comparing assets of such vastly different value and function. One obvious problem with this approach occurs when managers require action for any catastrophic event, regardless of its probability of occurrence. In such instances, another probability category, Extremely Unlikely, should be included with a higher hazard number.
Table 3 - Suggested Mishap Risk Assessment Values
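A risk matrix of this kind is, in effect, a lookup from a probability level and a severity category to a hazard risk number. The sketch below shows that lookup in code; the values are believed to follow the 1 to 20 example matrix in MIL-STD-882D but are entered here from memory as an assumption and should be verified against Table 3 before any use.

```python
SEVERITY = ["Catastrophic", "Critical", "Marginal", "Negligible"]

# Rows: probability level; columns follow SEVERITY. Values are assumed to match
# the MIL-STD-882D example matrix and must be checked against Table 3.
RISK_MATRIX = {
    "Frequent":   [1, 3, 7, 13],
    "Probable":   [2, 5, 9, 16],
    "Occasional": [4, 6, 11, 18],
    "Remote":     [8, 10, 14, 19],
    "Improbable": [12, 15, 17, 20],
}

def risk_value(probability, severity):
    """Look up the mishap risk assessment value (1 = highest risk, 20 = lowest)."""
    return RISK_MATRIX[probability][SEVERITY.index(severity)]

print(risk_value("Remote", "Catastrophic"))
print(risk_value("Frequent", "Negligible"))
```

Seen this way, tailoring the matrix to a platform, or adding a further probability category such as Extremely Unlikely, amounts to editing or extending the table rather than changing the method.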
There are many methods to rank risk, but the best method is the one that meets management decision needs. Los Angeles, CA, city managers developed a means to rank disaster impact (Table 4) of such events as earthquake, flooding, tsunami, drought, etc. This tool was used to help the managers optimize use of available resources and services to protect the population. (LA, 2004, p. 1)
Table 4 - City of Los Angeles Hazard Risk Analysis
The criteria used were defined as follows:
Magnitude
Physical and economic greatness (impact) of the event
Factors to consider:
• Size of event
• Threat to life
• Threat to property
  1. Individual
  2. Public sector
  3. Business and manufacturing
  4. Tourism

Duration
The length of time the disaster and the effects of the disaster last
Factors to consider:
• Length of physical duration during emergency phase
• Length of threat to life and property
• Length of physical duration during recovery phase
• Length of effects on individual citizen and community recovery
• Length of effects on economic recovery, tax base, business and manufacturing recovery, tourism, threat to tax base and to employment

Distribution
The depth of the effects among all sectors of the community
Factors to consider:
• How wide spread across the community are the effects of the disaster?
• Are all the sectors of the community affected equally or disproportionably?

Area Affected
How large an area is physically threatened and potentially impaired
Factors to consider:
• Geographic area affected by primary event
• Geographic, physical, economic areas affected by primary risk and potential secondary effects

Frequency
The historic and predicted rate of recurrence of a risk-caused event (generally expressed in years such as the 100 year flood)
Factors to consider:
• Historic events and recurrences of events in a measured time frame
• Scientifically based predictions of an occurrence of an event in a given period of time

Degree of vulnerability
How susceptible is the population and community infrastructure to the effects of the risk?
Factors to consider:
• History of the impact of similar events
• Mitigation steps taken to lessen impact
• Community preparedness to respond to and recover from the event

Community Priorities
The importance placed on a particular risk by the citizens and their elected officials:
• Willingness to prepare for and to respond to a particular risk
• More widespread concerns over a particular risk than other risks
• Cultural significance of the threat and associated risks
• Opportunity to mitigate for
(ibid. p. 3)
The qualitative assessment above fits the city managers' needs but not necessarily those of military aviation managers, for whom availability to protect the country may be a factor and the value of the assets varies greatly.
Criticality may also include materially intangible considerations as shown in the following historical event.
Logistics support in 1907 through 1909, President Roosevelt had a fleet painted white and a required condition of dress for the crew, and sailed the fleet around the world. It had been determined that the appearance and general condition of the vessels would have a psychological impact that would help avoid confrontation. With the deferral of painting or cleaning the hulls of modern vessels, the projection is more of disrepair and fostering a negative projection. (Penrose, 2007, para. 6)
As this historical example shows, appearance or image can be a critical consideration. Modern military managers must balance far more than that, including support equipment reliability, human error, and logistics support. They must first determine the degree of information accuracy necessary to make these decisions, and they must also prioritize all identified critical events and elements.
Qualitative Risk Assessment is a tool for managers to prioritize risk in broad categories of criticality and probability. Qualitative assessment is often used in the design phase, where no failure data exists (Vose, 2006, p. 6). The categories may be broad, such as a failure that ‘happens occasionally’ and is of ‘marginal criticality’. These categories may have specific value ranges assigned; e.g., occasional may be defined as 3 to 6 times a year per aircraft, or as an MTBF of between 20,000 and 40,000 flight hours for a fleet of aircraft. When numeric ranges are identified, the analysis becomes what is often termed a ‘semi-quantitative’ risk assessment. This method is the one most often used today in military aviation and requires a less intensive analysis; however, a major drawback is that it "...could hide critical information pertinent to the prioritization" (ibid., p. 333) within a category.
Quantitative Risk Assessment, sometimes called QRA in industry, refines the process further by calculating a risk priority number (RPN) from the quantified factors with which management is concerned. The main difference from qualitative risk assessment is that "...each variable is represented by a probability distribution function instead of a single value" (ibid., p. 13). This numeric calculation allows greater fidelity in ranking the risks. The overall purpose of this approach is to provide a prioritized list of failure modes that can or should be mitigated within cost constraints, and a means to justify additional allocation of funds to avoid potential financial pitfalls. The major drawback of this method is that it relies heavily on accurate failure data and probability estimates. These probability estimates should also have a probability density function (PDF) associated with each failure mode to accurately represent the time periods of concern. Accumulating data of this nature can result in a very intensive and expensive analysis.
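To make the idea of representing each variable by a distribution rather than a single value concrete, the sketch below runs a small Monte Carlo simulation. Everything in it is an assumption for illustration: the triangular distributions, their parameters, the fleet flying hours, and the function name `simulate_annual_loss`.

```python
import random
import statistics

def simulate_annual_loss(n_trials=10_000, seed=1):
    """Sample failure rate and cost per event, return the simulated annual losses."""
    rng = random.Random(seed)
    hours = 50_000  # assumed fleet flight hours per year
    losses = []
    for _ in range(n_trials):
        # Hypothetical inputs, each drawn from a triangular distribution.
        # Arguments are (low, high, mode).
        rate = rng.triangular(1e-6, 1e-4, 1e-5)            # failures per flight hour
        cost = rng.triangular(50_000, 2_000_000, 200_000)  # dollars per event
        losses.append(rate * hours * cost)
    return losses

losses = sorted(simulate_annual_loss())
mean_loss = statistics.mean(losses)
p95 = losses[int(0.95 * len(losses)) - 1]
print(f"mean annual loss: ${mean_loss:,.0f}")
print(f"95th percentile:  ${p95:,.0f}")
```

The output is a spread of outcomes rather than one number, which lets a manager see both the expected loss and how bad a poor year could plausibly be. It also shows why the approach is data hungry: each input distribution has to be justified.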
Management of Unacceptable Risks
Inherent in any management position is balancing resources with risk. Risk can be very high indeed when managing a nuclear power plant that could potentially affect many people catastrophically, or an airliner that is carrying hundreds of people with the possibility of billions of dollars in litigation. Military aviation has somewhat different concerns, although those mentioned above are factors as well. Military aviation managers are concerned about the loss of expensive assets that have a long lead time to replace, if the assets are replaceable at all. They are concerned about loss of life and injury as well as the loss of the abilities of highly trained operators and maintainers. Readiness to protect our country is at the top of their charter; however, they must balance many factors when determining acceptable risks. Some factors are very difficult, if not impossible, to gauge, such as the cost of aircraft availability when a mission is lost and another aircraft is used to accomplish it. Political reaction to a catastrophe, national confidence being shaken, and potential loss of project funding (e.g., V-22 Osprey) are just some of the impacts they must consider. An easier risk ranking methodology compares the cost of loss or repair to the cost of redesign or a change in maintenance strategy. Even these seemingly straightforward comparisons can be difficult, because the logistics support may not be there, or the item that mitigates the failure may have an extreme lead time or not be available at all. As a result, military managers place a very high reliability standard on their assets. This increased requirement is very expensive and often beyond budget. Management is essentially looking for the "most bang for the buck" with increasingly limited funding available. One drawback is that quantitative numbers appear more precise than qualitative ones, so there is a tendency to put more weight on them.
Caution must be taken not to rely on these numbers exclusively; the accuracy of the data used and the assumptions made in the quantification must also be considered. For example, the QRA sometimes yields a very low probability of failure, but the analyst may not have considered all the failure modes, human error, or the operating environment in which the asset is used. Additionally, if the assessment rests on the fact that a failure has not happened in millions of flight hours and is therefore judged unlikely, but the component is actually at the end of its useful life after fatiguing for many years, the results could be disastrous. Care must be taken to consider the way the failure happens, its failure characteristics (e.g., wear-out or infant mortality), whether there are clear indications of impending failure, and whether there are adequate backup or safety systems in place. Once the QRA is accomplished, an ongoing assessment should be established that revisits the potential failures at a reasonable frequency (e.g., a 2 or 3 year cycle). A very thorough treatment of these tradeoffs is given in the Air Force System Safety Handbook (USAF, 2000, para. 3.6).
Systematic Processes for Management
Once an improved safety program is established for each major platform, risk management for the entire Air Force must be considered at the highest decision-making levels (Secretary of the Air Force, Joint Chiefs of Staff, Congress, etc.). Risk assessment at these levels is usually conducted by an operations research division for acquisition, force management, etc.; however, few such analysts are trained in the science of failure mode risk mitigation and its management. The field of risk and hazard management is continually improving, and military management must keep abreast of improvements and their implications.
Management of military aviation assets is very challenging because managers face a constantly changing set of political, funding, and tactical pressures that must be balanced to accomplish their assigned objectives. One of their many considerations must be safety and the risks associated with management decisions. Guidance and directives are in place to accomplish risk management; however, from the authors' research and experience, the current guidance is being misinterpreted. Each platform needs to develop a risk management plan that not only fits its unique mission and asset value, but also reassesses the limits of the acceptable probabilities of failure based on current funding, environment, and operational demands. At upper levels of management, there should also be an overarching risk assessment tool based on the criticality of the various missions of the military services as a whole. This overarching risk assessment would, in part, help supply Pentagon managers and Congress with a tool to gauge more accurately where resources should be focused, as well as the risks involved with their decisions.
It is recommended that the System Safety Commands commission the development of very clear metrics, first to guide each of the major commands and then to integrate them into an overarching decision matrix that improves decisions at the top of management. This could then be included in the current safety guidance. The approach may be as simple as ranking each asset by the cost of each aircraft, its criticality to national security, the number of assets available, and its ability to fulfill its current mission. These factors could then be mathematically normalized to a 0 to 1 ranking and multiplied by each other, with the results ranked in ascending order. Figure 1, developed by the authors, is one such example of a global assessment matrix; it uses the military HRI index as it currently stands, together with the value of the asset and the percentage of the fleet that one aircraft represents.
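The normalize-and-multiply ranking described above can be sketched in a few lines. Unit cost and fleet size for the T-38 and E-6 are the figures quoted earlier in this article; the criticality scores, the choice of dividing by the largest value to normalize, and the omission of a mission capability factor are assumptions made only to keep the example short.

```python
def normalize(values):
    """Scale positive values to the 0-1 range by dividing by the largest."""
    top = max(values.values())
    return {name: v / top for name, v in values.items()}

# Cost and fleet size come from the text; criticality scores are hypothetical.
unit_cost = {"T-38": 0.8e6, "E-6": 142e6}
fleet_size = {"T-38": 600, "E-6": 16}
criticality = {"T-38": 3, "E-6": 9}   # assumed 1-10 judgment scale

# Fraction of the fleet that a single aircraft represents.
fleet_share = {name: 1 / n for name, n in fleet_size.items()}

factors = [normalize(unit_cost), normalize(criticality), normalize(fleet_share)]
score = {name: 1.0 for name in unit_cost}
for factor in factors:
    for name in score:
        score[name] *= factor[name]

# Ascending order, as suggested in the text: lowest composite score first.
for name, s in sorted(score.items(), key=lambda kv: kv[1]):
    print(f"{name}: {s:.6f}")
```

Even with these rough inputs, the two platforms end up several orders of magnitude apart, which supports the argument that a single shared acceptable probability of failure is hard to defend.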
Management should frequently revisit risk factor priorities and determine whether there are new constraints or limitations on current missions. These in turn should be promulgated to each of the major support and tactical commands so they might better support the overall goals of the President and Congress. The metrics developed at the higher levels of management could then be used at local levels to better understand risks in the overall picture and thus manage local risks more accurately. For the T-38 specifically, and other platforms in general, it is recommended that a quantitative method be adopted, such as the one described in MIL-STD-1629A (DoD, 1980, task 102). A sample solution is shown in Table 5 below.
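For orientation, the criticality analysis in MIL-STD-1629A Task 102 computes a failure mode criticality number Cm = β · α · λp · t and sums the Cm values for an item's failure modes within a severity classification to obtain the item criticality number Cr. The sketch below applies that arithmetic to hypothetical failure modes; the class name, field names, and all values are illustrative assumptions and are not the contents of Table 5.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    beta: float   # conditional probability that the failure effect occurs
    alpha: float  # failure mode ratio: share of part failures in this mode
    lam: float    # part failure rate, failures per operating hour
    t: float      # operating time for the mission phase, hours

    def criticality(self) -> float:
        """Failure mode criticality number, Cm = beta * alpha * lambda_p * t."""
        return self.beta * self.alpha * self.lam * self.t

# Hypothetical failure modes for a single item, same severity classification.
modes = [
    FailureMode("actuator jams", beta=1.0, alpha=0.3, lam=2e-5, t=1.5),
    FailureMode("actuator leaks", beta=0.1, alpha=0.7, lam=2e-5, t=1.5),
]

item_criticality = sum(m.criticality() for m in modes)  # Cr = sum of Cm
for m in sorted(modes, key=FailureMode.criticality, reverse=True):
    print(f"{m.name}: Cm = {m.criticality():.2e}")
print(f"item criticality Cr = {item_criticality:.2e}")
```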
Once these numbers are calculated, the failure modes can be sorted and combined to identify the major drivers of risk for that system. The criticality number can also be calculated at the aircraft level to identify areas that need immediate attention to safely maintain the aircraft and mission. Management must determine acceptable criticality levels and decide whether the failure data used in the assessment should be limited to failures that actually occurred during operation or should include all corrective actions. The authors recommend the use of actual operational failures, which would provide a more accurate assessment of the risks involved. Priority failures should be considered first, with safety critical ‘single point failures’ highest on the list. In conjunction with the qualitative analysis, a Fault Tree Analysis (FTA) is also recommended to refine the probabilities of failure and impact for an entire system or platform. The FTA combines probabilities to determine the overall probability of failure. Reliability Block Diagrams (RBD) are also useful in calculating probabilities and availability. Additionally, the calculation of a Risk Priority Number (RPN) for each failure mode is another quantitative approach for management to consider adopting. This approach has the same effect as calculating the criticality number mentioned previously and is somewhat easier to manage and understand. First, categories and ranges are established, and then the critical factors are classified. In each category, a number is assigned and then multiplied by the other category rankings. An example of the criteria, effects, and rankings is shown in Table 6. The criteria listed are proposed by the authors for evaluation by management to provide a higher fidelity risk management tool.
Once the categories and ranges are deemed acceptable, the RPN is calculated for every failure mode of a system by multiplying the factors together (e.g., 10 x 6 x 4 x 2 = 480). The example shown in Table 7 is taken from an A-10 Environmental Cooling System (ECS) analysis using the values in Table 6.
By using this method, managers are able to rapidly rank priorities and focus on immediate and long-term needs. In this case, the managers decided to treat any RPN over a specified level as critical. The T-38 program would have to make a similar decision after viewing the results of the analysis. These critical failures can then be mitigated by design changes, preventative maintenance tasks, changes to processes, etc., or used to justify the funds needed for correction. Because these numbers are graduated in tens, each category adds another order of magnitude to the calculated number once multiplied. These results could also be compared on a log-scale matrix to show the overall numerical relationship, or normalized by simply taking the natural log of the RPN.
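The RPN calculation, the ranking, and the critical threshold described above can be sketched as follows. The failure mode names, the four rankings assigned to each, and the threshold of 200 are hypothetical; the authors' actual categories and values are those of Tables 6 and 7, which this sketch does not reproduce.

```python
import math

# Four hypothetical category rankings per failure mode, each scored 1-10.
failure_modes = {
    "valve sticks closed": (10, 6, 4, 2),
    "duct clamp loosens": (4, 8, 2, 2),
    "sensor drifts": (2, 6, 6, 1),
}

def rpn(rankings):
    """Risk priority number: the product of the category rankings."""
    return math.prod(rankings)

CRITICAL_RPN = 200  # assumed management threshold

# Highest RPN first, so the most urgent failure modes head the list.
ranked = sorted(failure_modes.items(), key=lambda kv: rpn(kv[1]), reverse=True)
for name, rankings in ranked:
    value = rpn(rankings)
    flag = "CRITICAL" if value > CRITICAL_RPN else ""
    print(f"{name:22s} RPN={value:4d} ln(RPN)={math.log(value):.2f} {flag}")
```

Because the RPN is a product, a single high ranking in one category can be masked by low rankings in the others, which is one reason the threshold and the category definitions deserve management review.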
For the T-38, if 5 categories are used and graduated in tens, it could approximate the values used in the current 1 to 20 hazard risk matrix with greater fidelity.
This latter method of calculating the RPN is the authors' primary recommendation, followed by the criticality number generation from MIL-STD-1629A as the second choice. The last choice is to continue the current semi-quantitative approach with new metrics assigned (e.g., changing a category 1 failure value from $1M to $10M).
Department of Defense. (2000). Standard Practice for System Safety. MIL-STD-882D. Air Force Materiel Command, Wright-Patterson Air Force Base, Ohio.
Department of Defense. (1980). Procedures for Conducting a Failure Mode Effects and Criticality Analysis. MIL-STD-1629A. Washington, D.C.
Los Angeles City Managers' hazard risk plan. (2004). Unidentified internet PDF source. Link lost.
Penrose, H. (2007). The Motor Diagnostics and Motor Health Newsletter. February 1, 2007. Retrieved 1 Feb 2007 from http://www.motordoc.net/toc.htm
United States Air Force. (1991). Military Handbook: Reliability Prediction of Electronic Equipment. MIL-HDBK-217F. Griffiss Air Force Base, Rome, New York.
United States Air Force. (2000). Air Force System Safety Handbook. Kirtland Air Force Base, New Mexico.
Vose, D. (2006). Risk Analysis: A Quantitative Guide (2nd ed.). West Sussex, England: John Wiley & Sons Ltd.
Lynnwood Yates is a Senior Reliability Engineer with Wyle Laboratories in Warner Robins, GA., responsible for overseeing all reliability projects with the Air Force customers. He conducts RCM, component and risk analysis and is a Certified Maintenance and Reliability Professional (CMRP). Contact information: 1-478-923-0500 or email
1 Approximately $4.6 million in terms of today's dollars, utilizing a conservative 4% rate of inflation
2 Pacc, the acceptable probability of failure, is defined as the amount of risk that management is willing to accept, in terms of probability, while still allowing use of the asset for the purpose for which it was designed. It is typically expressed as a negative exponential, such as 1×10⁻⁶ (one event per 1,000,000 flight hours), but may also be expressed as an MTBF.
by Lynnwood Yates and Daria Lee Sharman, Wyle Laboratories, Inc.