Tragically, a US sailor lost his life as a result of the accident. More lives could have been lost, however. The bow of the ship lost almost all ballast tanks, significantly degrading buoyancy and the ability to recover from 500 feet below the ocean's surface. However, due to the heroics of her crew and the ruggedness of the vessel, the boat did recover from the head-on collision. That this boat did return from the depths is something of a miracle, and it should reflect well on all those who support submarine design, construction and operations. One should also consider the Navy's maintenance and reliability strategies over the past years to preserve and enhance system integrity. No single factor saved the San Francisco, and as people in all industries study the risk associated with their plants or platforms, the USS San Francisco serves as an example of what can go wrong.
In 1995 the Submarine Maintenance Engineering, Planning and Procurement Activity (SUBMEPP) - a Kittery, Maine field activity for Naval Sea Systems Command (NAVSEA) - began classical RCM analysis for non-nuclear submarine systems and equipment. SUBMEPP, as NAVSEA's technical agent for submarine non-nuclear life cycle maintenance planning, provides maintenance products and engineering services to the fleet. In the decade that has followed, much has been learned and discovered through the application of Reliability-Centered Maintenance. This paper shall focus on some of the surprising results for the US Navy realized through the utilization of RCM, and shall also reflect on the importance and proper application of RCM for all industries. I will also make some observations based on my tenure as SUBMEPP's RCM program manager from 2001 to 2005.
The US Navy has had great success conducting RCM, and one testimony to that is that in an era of shrinking budgets and tight funds they are still doing it - more than ever. As you may already know, RCM is a technical discipline to preserve required system functions, and safe plant or platform operations, at minimum cost. In the case of the USS San Francisco, the systems necessary to allow recovery of the ship worked. Backup systems worked as well, not only blowing emergency high pressure air, but also lower pressure air continuously into the remaining ballast tanks to raise and keep afloat the severely damaged vessel.
Stanley Nowlan and Howard Heap, under tasking for the Department of Defense, wrote the landmark publication entitled "Reliability-Centered Maintenance" in 1978. One of their quotes resounds as we reflect on the San Francisco. "Not every critical failure results in an accident; some such failures, in fact have occurred fairly often with no serious consequences. However, the issue is not whether such consequences are inevitable, but whether they are possible".
Their quote was added as emphasis to SUBMEPP's RCM handbook this past year. Also, renewed emphasis was placed on emergency equipment by adding to the handbook: "For protective device, emergent use or safety equipment failures, the situation for which the protective device, emergent use or safety related equipment was intended to operate must be assumed. The effects of this worst case situation should be judged." So, if one were designing or maintaining a system that blows air into submarine ballast tanks to ascend the ship, one must consider such emergency situations.
Another emphasis was added to SUBMEPP's RCM handbook: "Traditional RCM methodologies generally do not consider the combination of double evident failures as having safety effects, because most platforms, systems etc. can be secured either immediately or within hours to prevent or mitigate second failure consequences. Naval ships, however, are not as fortunate. Operational ships cannot immediately obtain the security of port or dry dock. Moreover, ship's crew cannot repair all failures. Therefore analysts must consider whether the redundant, backup or complementary component(s) will be available during the period of failure." Simply put, the submarine process asks analysts whether they have a high degree of confidence that, in case of failure, the redundant or backup system intended to mitigate the effects will be available. Items to consider are whether the mitigating system or equipment is subject to hidden failure, is inherently unreliable, or has extended repair times.
SUBMEPP developed a criticality matrix to establish what level of risk is tolerable and what risk should be mitigated. Failure Modes and Effects Analysis is essentially a risk assessment. Specifically, those failure modes that are determined to be critical shall be analyzed for the development of planned maintenance. Criticality is the combined influence of the severity of failure and its probability of occurrence. For each failure mode, one of four severity levels is chosen: Catastrophic (severe personal injury or loss of ship), Mission (the loss or curtailment of ship's mission), Marginal (a failure that may impair system operation, but not cause loss of mission) and Minor (a failure that does not have a significant effect). It is noteworthy that realistic catastrophic or mission failures are deemed critical regardless of their probability. Minor failures are deemed critical if they are expected to occur once every three years or more often, because they are considered a maintenance burden. Non-critical failures are tolerated. Critical failures are evaluated to determine if maintenance can prevent them.
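The screening rules in this paragraph can be sketched in a few lines of Python. This is only an illustration of the logic as described here, not SUBMEPP's actual matrix: the names `Severity`, `is_critical` and `MINOR_BURDEN_YEARS` are hypothetical, and because the text gives no cutoff for marginal failures, the sketch requires the caller to supply one.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    CATASTROPHIC = "catastrophic"  # severe personal injury or loss of ship
    MISSION = "mission"            # loss or curtailment of ship's mission
    MARGINAL = "marginal"          # may impair operation, no loss of mission
    MINOR = "minor"                # no significant effect


# From the text: a minor failure expected once every three years or more
# often is treated as critical because it is a maintenance burden.
MINOR_BURDEN_YEARS = 3.0


def is_critical(severity: Severity,
                years_between_failures: float,
                marginal_threshold_years: Optional[float] = None) -> bool:
    """Return True if a realistic failure mode should go on to task analysis."""
    if severity in (Severity.CATASTROPHIC, Severity.MISSION):
        return True  # critical regardless of probability
    if severity is Severity.MINOR:
        return years_between_failures <= MINOR_BURDEN_YEARS
    # Marginal: the article states no cutoff, so this is an assumption
    # that the caller has to make explicit.
    if marginal_threshold_years is None:
        raise ValueError("a threshold for marginal failures must be supplied")
    return years_between_failures <= marginal_threshold_years
```

For example, `is_critical(Severity.MINOR, 2.0)` flags a minor failure expected every two years, while the same failure expected every ten years is tolerated.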
Each critical failure mode is evaluated to determine applicable and effective maintenance requirements. The criteria for maintenance depend on the severity of failure and on whether the failure is evident or hidden. If the failure is safety related, preventive maintenance is required and must reduce the risk of failure to an acceptable level (essentially zero chance of occurrence). If the failure is mission critical, preventive maintenance is desired if cost effective relative to the cost of mission loss. If the failure is marginal, then the evaluation is more of a business case - the cost of maintenance must be less than the costs associated with not accomplishing it. If the failure is hidden and safety related, then maintenance is required to reduce the risk of multiple failures to an acceptable level. Once these criteria are established, the analyst will evaluate whether servicing tasks, condition directed tasks, time directed tasks or failure finding tasks are applicable and effective in preventing failure.
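As a rough illustration, the effectiveness criteria above reduce to a small decision function. The function name, argument names and severity labels below are hypothetical, and the sketch assumes the analyst has already judged whether the candidate task brings risk to an acceptable level. For a hidden safety related failure, that judgment refers to the risk of multiple failures.

```python
def task_is_effective(severity: str,
                      risk_acceptable_with_task: bool,
                      task_cost: float,
                      cost_of_not_doing: float) -> bool:
    """Apply the effectiveness criterion that matches the failure severity.

    severity: "safety", "mission" or "marginal" (labels are illustrative).
    cost_of_not_doing: cost of mission loss for mission failures, or the
        repair and related costs of running to failure for marginal ones.
    """
    if severity == "safety":
        # Cost is not the deciding factor: the task must reduce the risk
        # of failure (or of multiple failures, if hidden) to an
        # acceptable level.
        return risk_acceptable_with_task
    if severity in ("mission", "marginal"):
        # Business case: maintenance must cost less than its alternative.
        return task_cost < cost_of_not_doing
    raise ValueError(f"no criterion sketched for severity {severity!r}")
```

A task that fails this test is not necessarily abandoned. The analyst would move on to the next candidate task type, or to redesign where a safety related failure cannot be managed by maintenance.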
Applicability and Effectiveness of Time Directed Tasks
In 1961 a joint task force consisting of FAA (Federal Aviation Administration) and US airline company representatives reported its findings on the effect of scheduled maintenance and aircraft reliability. They stated "In the past, a great deal of emphasis has been placed on the control of overhaul periods to provide a satisfactory level of reliability. After careful study, the Committee is convinced that reliability and overhaul time control are not necessarily directly associated topics." Further studies that also supported this precept and efforts to determine just what does maintain reliability, led to a new discipline which eventually became known as "Reliability-Centered Maintenance" - a set of principles and methodology to objectively determine the appropriate type and level of maintenance to maintain required asset functionality.
Inherent to most RCM seminars is the presentation of the Age and Reliability patterns displayed in Figure 1.
Figure 1. Age and Reliability Pattern Categories
The graphs depict equipment failure rates (y-axis) vs. service time (x-axis). These curves and the associated population percentile applicabilities have helped dispel the long-held notion that equipment reliability fits the so-called "bathtub curve". The bathtub curve theory, which postulates that equipment suffers higher than normal rates of failure early in its life (infant mortality), followed by lower and steady rates of failure for a time period, with an eventual wear out age at some defined time period, represents only 3-4% of sampled equipment populations according to three studies conducted by United Airlines, Broberg (1973) and the U.S. Navy (1982 MSP). While the majority of sampled equipment populations did experience infant mortality, in general, 90% of the population did not experience an identifiable wear out period. The Navy results are an exception to this generalization: 20% of the Navy population did experience an identifiable wear out period. This has been attributed in part to the corrosive marine environment that affected much of the sample population. Also noteworthy was the finding that the population majority in the Navy study did not suffer infant mortality. This has been attributed to the fact that navy vessels, systems and components are thoroughly tested and "run in" prior to being put into service. Infant mortality certainly exists, but many instances of it are not on the "radar screen". While no one should accept these findings at face value without reviewing them in the context of each individual study, these curves have been used to demonstrate the precept voiced back in 1961 - that random failure predominates.
In 1998, SUBMEPP developed the capability to generate Age and Reliability profiles utilizing maintenance data imported from the Navy's 3-M OARS (Maintenance and Material Management Open Architecture Retrieval System). This provided the organization a new means to objectively measure the effects of planned maintenance to engineer optimal maintenance plans. In 2001, after three years of generating Age and Reliability profiles, SUBMEPP reported that the 1961 finding holds true. In the majority of cases there was no relationship between overhaul time and reliability. Random failure predominated.
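To make the idea of an Age and Reliability profile concrete, the sketch below estimates a failure rate for each age interval as failures divided by unit-years of exposure in that interval. It is a minimal illustration only, not SUBMEPP's application: the function name and input layout are hypothetical, the record format of 3-M OARS is not assumed, and units are taken to keep aging after repair rather than being reset to zero.

```python
from typing import List, Sequence, Tuple


def age_reliability_profile(failure_ages: Sequence[float],
                            service_ages: Sequence[float],
                            bin_width: float) -> List[Tuple[float, float]]:
    """Return (bin start age, failures per unit-year of exposure) pairs.

    failure_ages: age of the component at each recorded failure.
    service_ages: age reached so far by every unit in the population.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    n_bins = int(max(service_ages) // bin_width) + 1
    failures = [0] * n_bins
    exposure = [0.0] * n_bins
    for age in failure_ages:
        index = int(age // bin_width)
        if index >= n_bins:
            raise ValueError("failure recorded beyond the oldest unit's age")
        failures[index] += 1
    for age in service_ages:
        for b in range(n_bins):
            start = b * bin_width
            # Time this unit has spent inside [start, start + bin_width).
            exposure[b] += min(max(age - start, 0.0), bin_width)
    # Bins that no unit has reached carry no information and are dropped.
    return [(b * bin_width, failures[b] / exposure[b])
            for b in range(n_bins) if exposure[b] > 0]
```

Plotting the resulting pairs gives the kind of curve shown in Figure 1. A flat profile suggests random failure, while a rising one suggests an age and reliability relationship worth examining.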
In SUBMEPP's study, Age and Reliability graphs were generated for fifty-two submarine component types. These components were as complex as communications equipment, refrigeration plants, turbine generators and towed array handling equipment. Simple, but vital components were analyzed as well such as hull and backup valves, gas regulating valves, steam isolation valves and ship's whistle. Air dehydrators, switchboards, circuit breakers, hatches, compressors, pumps, condensers, motor generators, torpedo tubes, atmosphere control equipment, and propulsion shaft bearings are all examples of the type of equipment that comprised the study's fifty-two component sample.
71% of the components profiled by SUBMEPP experienced a steady state of random failure after their early years of operation. Some of the components in this group did experience infant mortality or short-lived increases in their rates of failure. This compared generally well with the UAL (89%), Broberg (92%) and MSP (77%) studies. As mentioned previously, UAL and Broberg were based on aircraft. MSP and SUBMEPP were based on navy vessel components and so it is logical that SUBMEPP's results parallel MSP much more closely than UAL and Broberg.
SUBMEPP's age and reliability characteristic findings are categorized in Figure 2 based on sample population proportions. Only 12% of the sample supported the traditional belief that equipment operates at a steady state of reliability and then wears out at an identifiable time period. The remaining 17% that demonstrated age related wear out did so at an increasing but steady rate over their life span.
The differences between characteristics B and C may possibly be explained by the complexity of the component. The simpler the component and the fewer failure modes attributed to it, the more likely that sudden wear out occurs, if indeed there is an age and reliability relationship. Interestingly enough, all of the components in the sample that exhibited characteristic B were either valves or valve like in function. There was one component that matched characteristic A and, being an electro-mechanical device with numerous valves, it suffered predominately electrical type failures in its early years and predominately valve related failures in its later years.
Characteristic C components tended to be more complex than characteristic B components. Complex components have multiple modes of failure, and those individual modes may fit characteristic B when viewed in isolation. However, wear out patterns among these individual modes tend to occur at different times, and when viewed in the aggregate, the overall failure rate pattern matches characteristic C.
Ideally, life renewal tasks are prescribed when a characteristic B situation occurs - just prior to the upswing in the probability of failure. Life renewal tasks might still be applicable and effective in a characteristic C situation. If, for instance, it is demonstrated that a failure rate beyond a certain percentage is undesirable, a maintenance task at that point should return the failure rate to that found at the x-axis origin.
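For a characteristic C component, the point at which a renewal task becomes due can be estimated once a maximum acceptable failure rate is chosen. The sketch below assumes, purely for illustration, that the failure rate rises linearly with age. The function and parameter names are hypothetical, and the acceptable rate is a judgment the analyst must supply.

```python
from typing import Optional


def renewal_age(initial_rate: float,
                rate_growth_per_year: float,
                max_acceptable_rate: float) -> Optional[float]:
    """Age (years) at which a linearly rising failure rate hits the limit.

    Models rate(age) = initial_rate + rate_growth_per_year * age.
    Returns None when the rate is not rising, since no age based
    renewal point exists in that case.
    """
    if rate_growth_per_year <= 0:
        return None
    if max_acceptable_rate <= initial_rate:
        return 0.0  # already at or above the limit when new
    return (max_acceptable_rate - initial_rate) / rate_growth_per_year
```

With an initial rate of 0.25 failures per year, growth of 0.125 per year and a limit of 1.0, the sketch returns 6.0 years. Renewing at that age is taken to return the component to the rate found at the x-axis origin.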
8% of SUBMEPP's sample population exhibited infant mortality characteristics. This differs significantly from the earlier findings of UAL and Broberg. As mentioned previously, navy vessels go through a lengthy test period prior to entering service. Infant mortality likely exists; however, those failures are not captured in 3-M OARS during those test periods. SUBMEPP's infant mortality statistics differ from MSP as well. 32% of MSP's sample suffered from infant mortality. Differences may be caused by the type of equipment analyzed. The majority of SUBMEPP's components fitting characteristics A and F were more electrical than mechanical in nature. Electrical devices are more prone to sudden failure early in their life. The majority of components in SUBMEPP's sample were mechanical in nature, however, and that may differ from MSP and the other studies.
Platform differences may contribute as well. SUBMEPP's results are derived from a sample of submarine components and MSP's results were derived from a sample of surface ship components. Corrective maintenance accomplished during a submarine overhaul is not captured by 3-M OARS. Not until the boat is delivered to the fleet is corrective maintenance reported to 3-M OARS. Jack Nicholas, who pioneered RCM at NAVSEA back in the 1970's, has written about these differences as well and he believes that the submarine community's detailed overhaul procedures have mitigated the occurrences of infant mortality.
Maintenance Plan Changes
The majority of components analyzed by SUBMEPP did not demonstrate an age and reliability relationship and consequently, many existing time directed component overhauls have been deleted from class maintenance plans. These deletions have allowed the Navy a substantial cost avoidance for submarine depot availabilities. The term avoidance is used here because one cannot project beyond the age span of study to predict future probabilities of failure. Components may or may not experience failure rate increases, and that will be a future determination when maintenance strategies for these components are revisited. SUBMEPP's review of components does, however, shed light on the effectiveness of many overhaul periodicity extensions made in the early 1980s. The majority of components that fit non-wear out characteristics D, E and F once had overhaul periodicities half as long.
The RCM approach is to extend or eliminate overhaul periodicities in the absence of an age and reliability relationship. The decision whether to extend the periodicity or delete the action entirely often depends on the consequences of failure. Extensions are more appropriate for components with safety related failures for which no effective condition monitoring techniques have been devised. Deletions are more appropriate for non-safety related components. Maintenance plan strategies should not be based entirely on failure rates viewed at the equipment level. Individual failure modes should be viewed in isolation as well to determine if an age and reliability relationship exists. If so, a surgical maintenance approach may be appropriate where only a piece part or subassembly is replaced.
The portion of components analyzed by SUBMEPP that did demonstrate an age and reliability relationship was further analyzed to determine if a time directed maintenance task was appropriate. For non-severe failures, where there are no additional costs attributed to failure beyond material and labor to repair the component, a fix-when-fail strategy may still be more cost effective. Labor and overhead cost differences must be taken into account. And if there are mission or collateral damage costs associated with failure, condition monitoring can sometimes be substituted for a time directed task. Condition monitoring must detect potential failure conditions and allow a known and sufficient time period for adequate correction. A more surgical maintenance strategy may be appropriate as well. Pareto's rule, that 80% of the problems are generally caused by 20% of the actors, has been validated by RCM analyses. Maintenance professionals should concentrate on the few "bad actors" which degrade reliability.
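Finding the "bad actors" is a simple ranking exercise once failures are counted by component or failure mode. The sketch below is one hypothetical way to do it. The function name, the input dictionary and the 80% default are illustrative choices, not part of any Navy tool.

```python
from typing import Dict, List


def bad_actors(failure_counts: Dict[str, int], share: float = 0.8) -> List[str]:
    """Return the smallest top-ranked set of items that together account
    for at least `share` of all recorded failures."""
    total = sum(failure_counts.values())
    if total == 0:
        return []
    ranked = sorted(failure_counts.items(), key=lambda kv: kv[1], reverse=True)
    picked: List[str] = []
    running = 0
    for name, count in ranked:
        if running / total >= share:
            break  # the items picked so far already cover the share
        picked.append(name)
        running += count
    return picked
```

Given counts of 50, 30, 10, 5 and 5 failures across five items, the sketch returns only the first two, which is where a surgical maintenance strategy would concentrate.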
Soon after the development of the feedback data analysis application, a SUBMEPP combat systems engineer analyzed Trident class torpedo tubes. Torpedo tubes are comprised of barrels, breech and muzzle doors, latches, linkages, slide valves, rotary actuators, power cylinders, safety interlocks, indicators and numerous other sub-assemblies. The class maintenance plan for the torpedo tubes included a time based maintenance action to replace hydraulic power cylinders every 160 months. Each torpedo tube has five cylinders. Functional failures for these components are mission critical as they render a tube inoperable, or degrade performance to an unacceptable level. Even though there are multiple torpedo tubes, a full complement of operational torpedo tubes is deemed necessary for readiness. Two of the power cylinders operate the torpedo tube slide valve. Over half of the observed discrepant conditions associated with inoperability of the slide valve were attributed to the hydraulic power cylinders, and only one of those discrepancies was judged to be a functional failure. The predominant mode of failure was external leakage of hydraulic fluid and, as previously stated, these were judged to be non-functional failures. They were potential functional failures if left untreated. Figure 3 displays the Age and Reliability curve for the slide valve power cylinders. The failure pattern is random with no correlation with time. In fact, the regression line has a slightly negative slope (-0.0003x). There is no evidence indicating that the cylinders should be replaced at 160 months. Moreover, the engineer found that existing condition monitoring tasks were applicable in monitoring and maintaining system health. Periodic pressure and cycle time tests are able to detect degradation before performance is compromised, and allow sufficient time for repair or replacement of cylinders. Age and reliability findings for the remaining power cylinders were similar.
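The regression line mentioned above is an ordinary least squares fit of failure rate against age. A minimal sketch follows, with hypothetical names and no claim that SUBMEPP's application computed it this way. A slope near zero or below it is consistent with random failure, although a slope alone is not a statistical test.

```python
from typing import Sequence, Tuple


def trend_line(ages: Sequence[float],
               rates: Sequence[float]) -> Tuple[float, float]:
    """Least squares fit of rate = slope * age + intercept."""
    n = len(ages)
    if n < 2 or n != len(rates):
        raise ValueError("need two or more paired observations")
    mean_x = sum(ages) / n
    mean_y = sum(rates) / n
    sxx = sum((x - mean_x) ** 2 for x in ages)
    if sxx == 0:
        raise ValueError("all ages are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, rates))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Feeding the function the (age, failure rate) pairs behind an Age and Reliability curve gives the slope to inspect before deciding whether a time directed replacement is supported by the data.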
The engineer deleted the requirement to replace torpedo tube power cylinders at 160 months and this lifecycle cost avoidance for Trident class submarines was determined to be $2.3 million. If the current reliability trend holds consistent over the submarine lifecycle, that avoidance will be actual savings.
The US Navy has found condition monitoring with specific condition directed tasks much more cost effective and a better strategy for maintaining fleet readiness. Using the principles of RCM, the Navy has been able to extend overhaul periods, and significantly reduce the amount of planned work during overhaul availabilities. In 1994, US Navy attack submarines spent 22% of their lifecycle in depot, being repaired or overhauled. In 1993, that percentage was reduced to 17%. Today, attack submarines spend only 12% of their lifecycle constrained to depot, which has allowed the US Navy to maintain the same levels of presence and readiness with fewer ships. Optimization, however, requires periodic monitoring of maintenance plan effectiveness, and this is generally achieved through the review of material condition and equipment failure data in the context of an RCM analysis. Sometimes maintenance strategies must be altered when reviewed against the course of time.
RCM as a Knowledge System
Finally, RCM is much more than a means to develop maintenance plans. It is a knowledge system that allows continuous refinement of planned maintenance strategies. Why a maintenance task was eliminated, created, or its frequency extended will be documented in an RCM analysis. During the industrial revolution, great knowledge was gained through trial and error, and machines and mechanisms were developed, refined and improved as time passed. This process was for the most part continuously progressive, with minimal regression regarding performance, durability or reliability of those mechanisms. One of the chief reasons is that the knowledge was retained in the hardware itself. One could see it, touch it, and if someone was completely unfamiliar with it, given enough time, much of the knowledge that went into making it could be reverse engineered. Gains in productivity were essentially cast in stone with each design improvement. As highly developed nations move to less of a manufacturing base, and more to service and knowledge worker economies, there is great potential to regress as workers in such industries change jobs. Their work and knowledge are usually not captured and institutionalized as a machine maker's are. How an asset works and operates is preserved in the physics of the hardware. How it is produced and properly maintained is much more perishable. Therein lies a problem, and the necessity to build knowledge systems. RCM analysis - a top down logical progression, organized by plants, platforms, systems and components - provides an opportunity to capture critical facts, assumptions and reasoning. Further augmenting the process and producing repeatable maintenance procedures increases effectiveness even more. A healthy RCM program will yield a corporate technical library for the benefit of future workers, and assist in the retention of corporate knowledge to allow gains in productivity. Treat RCM as a long term living program.
True optimization requires a periodic look in the rearview mirror.
Article submitted by Timothy Allen, CMRP, Granite Reliability Group, LLC
Excerpts of this paper are an update of a paper written by the author while employed by the US Navy in 2001, "U.S. Navy Analysis of Submarine Maintenance Data and the Development of Age and Reliability Profiles" and published by the American Society of Naval Engineers.
American Management Systems, Inc., "Age Reliability Analysis Prototype Study", N00024-92-C-4160, November, 1993.
Drew, C., "Adrift 500 Feet Under the Sea, a Minute Was an Eternity", New York Times, May 18, 2005.
Michal, J., "Reliability Modeling and Estimation Using U.S. Navy 3M Maintenance Data", Naval Postgraduate School, Monterey, California, September, 1995.
Nowlan, F. and H. Heap, "Reliability Centered Maintenance", MDA 903-75-C-0349, December, 1978.
Tim Allen is the senior member of Granite Reliability Group, LLC. Tim is the former RCM Program Manager at Submarine Maintenance Engineering, Planning and Procurement (SUBMEPP), a Naval Sea Systems Command (NAVSEA) field activity located in Kittery, Maine. SUBMEPP is the engineering and planning authority for submarine lifecycle maintenance. Tim worked at SUBMEPP for twenty years and has been instrumental in Navy RCM since 1996. Tim represented NAVSEA's submarine RCM program, trained system engineers in the principles and methodologies of RCM and worked collaboratively with them to engineer cost effective submarine class maintenance plans for all Navy submarines. Tim received a Bachelor of Science in Mechanical Engineering Technology at the University of Maine in 1986. In 1997, he received a Master of Business Administration degree at New Hampshire College.