Reliabilityweb The Statistical Outliers are in Control of Asset Management

Categories of Outliers

The repairable component outliers fall into four general categories:

1. “Bad from Birth” – a flaw introduced during the manufacturing process, which usually manifests itself soon after the component first enters service.

2. “Chronic” – the classic “lemon”, whereby a component has one problem after another in short succession, generally early in its life cycle.

3. “Geriatric” – different failures occur one after another, toward the end of the component’s life cycle.

4. “Rogue” – a component that has performed reliably until it develops a failure mode that is never repaired during repair shop visits. This condition can occur early or late in its life cycle.

Categories 1-3 are relatively easy to identify and resolve. The last category (Rogue) is very difficult to identify, more of a challenge to control, and has the greatest negative effect on asset management.

The Rogue Outlier

The Rogue outlier is an individual repairable component that:

Repeatedly experiences short in-service periods
Creates the same system malfunction each time it is installed in a mechanical system
Clears that malfunction when it is removed from the mechanical system
Has a failure that cannot be identified and resolved by the shop’s repair or overhaul procedures

The primary reason a component becomes a Rogue is that shop bench tests do not address 100% of the component’s operating functions, characteristics or in-service environment. Additionally, the bench test is crafted to identify anticipated failures – things that are expected to fail. When a component experiences a failure that was either unaddressed or unanticipated, a Rogue is born.

There is a phenomenon that occurs, similar to a Darwinian “natural selection” process, which amplifies the negative effect of the Rogue. It is actually the opposite of the “survival of the fittest” process; it is a case of “survival of the worst”. Rather than the best being sorted out, the Rogues are sorted out to the most disadvantageous position in the asset management process.
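The sorting effect can be illustrated with a rough time-share calculation. The sketch below is not from the original analysis: it assumes every component simply alternates between an in-service period and a fixed shop-plus-shelf turnaround, and all of the figures used (43 Good parts, 3 Rogues, 1,000 and 10 in-service days, a 20-day turnaround) are hypothetical. The function names are illustrative only.

```python
from fractions import Fraction


def off_system_fraction(in_service_days, turnaround_days):
    """Long-run fraction of time one component spends off the system
    (in the shop or on the shelf), assuming it alternates between an
    in-service period and a fixed turnaround period."""
    return Fraction(turnaround_days, in_service_days + turnaround_days)


def rogue_share_of_spares(n_good, n_rogue, good_days, rogue_days, turnaround_days):
    """Expected share of off-system components that are Rogues."""
    good_off = n_good * off_system_fraction(good_days, turnaround_days)
    rogue_off = n_rogue * off_system_fraction(rogue_days, turnaround_days)
    return rogue_off / (good_off + rogue_off)


# Hypothetical population: 3 Rogues among 46 components (about 6.5%).
share = rogue_share_of_spares(
    n_good=43, n_rogue=3, good_days=1000, rogue_days=10, turnaround_days=20
)
print(float(share))  # roughly 0.70
```

Under these assumed numbers, Rogues make up about 6.5% of the population but roughly 70% of the parts sitting off the system at any moment, simply because they come back so quickly. If Rogues stayed in service as long as Good parts, their share of the spares would equal their share of the population.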

“Natural Selection” Phenomenon

The following depictions demonstrate the mechanics of this “natural selection” phenomenon. It starts with a pristine condition in which the component spare pool and the in-service population consist of serviceable (Good) components that function as expected (Figure 1).

Maintenance Effectiveness

As the chances of using Rogues from the spare inventory to resolve mechanical system faults increase, there is a good probability that the initial mechanical system problem will become chronic.

“Real Life” Case in Point:

Consider a system that vents air to the atmosphere, consisting of a control unit, sensing units A through C, a control feedback sensor and the vent valve. A system malfunction occurred which caused the valve to intermittently stop in mid-position during high operational demands. The maintenance technicians could not duplicate the fault, so they replaced the control unit as the most likely part to cause this problem.

The problem repeated – the valve stopped in mid-position again during high operational demands. Since replacing the control unit did not resolve the problem, the vent valve was replaced next, which required considerable system down time.

Now when the system operated during high demand periods, the valve intermittently oscillated open and closed, when it should have remained in a fixed position. Again, the root cause could not be determined. However, since this new problem surfaced immediately after the installation of the valve, the valve was replaced again on the assumption that it was defective from stock. The system was down again for a considerable amount of time during this second replacement, and yet the oscillation problem repeated.

Next, the control feedback sensor was replaced, to no avail. Maintenance then checked the interconnecting wiring for an intermittent malfunction, which may have been caused when the valve or sensor was replaced. Several maintenance technicians spent hours checking the wiring for faults, finding no problems.

As a desperate measure, the control unit was replaced again. From that point on, the system operated normally throughout all ranges and operational demands.

Fault analysis

The root cause of the initial system malfunction (when the valve would stop during operation) was a faulty valve. The control unit that was first installed was a Rogue that would cause the valve to intermittently oscillate during high operational demands. However, the Rogue failure could not show until a serviceable valve was installed, because the original faulty valve would stop during operation, which prevented the oscillation from occurring.

This chain of events caused extensive system down time and consumed a tremendous amount of maintenance resources and spare assets.

Spare Inventory

When a significant Rogue population develops, the asset management process must provide logistic support for chronic mechanical system outages and ineffective maintenance efforts. In addition, spare inventory will not be consumed in a linear or predictable manner. There will be sporadic spikes in usage as multiple parts are replaced in short order until a “good” component is obtained.

If the spares fall below critical levels often enough, inventory levels will be increased periodically to accommodate the high demand. The introduction of new serviceable parts into the spare pool lowers the ratio of Rogues to non-Rogues, so the chances are better that a “good” part will be selected the first time. The chronic problems will lessen and spare demand will subside for a while, until more Rogue components develop and again pollute the spare pool. As the ratio of Rogues to non-Rogues increases again, the spares will fall below critical levels and more assets will be acquired to satisfy the high and unpredictable usage. Over time, as this scenario repeats, abnormally high levels of spares will be needed to maintain the mechanical system operation.
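The effect of the Rogue ratio on spare usage can be put in rough numbers. The sketch below assumes that spares are picked at random from the pool, one after another and without being returned to it, until a “good” part is found. That is a simplification rather than a description of any particular stockroom, and the function names and pool sizes are hypothetical. Under that assumption, the expected number of parts consumed is (pool size + 1) / (good parts + 1).

```python
from fractions import Fraction


def p_first_spare_is_rogue(pool_size, rogues):
    """Chance that the first spare picked at random is a Rogue."""
    return Fraction(rogues, pool_size)


def expected_draws_until_good(pool_size, rogues):
    """Expected number of spares consumed until a good part is installed,
    assuming random picks without replacement from the pool."""
    good = pool_size - rogues
    if good <= 0:
        raise ValueError("the pool contains no good parts")
    return Fraction(pool_size + 1, good + 1)


# Hypothetical pool of 6 spares, 3 of them Rogues.
print(p_first_spare_is_rogue(6, 3))      # 1/2
print(expected_draws_until_good(6, 3))   # 7/4

# Adding 4 new serviceable spares dilutes the same 3 Rogues.
print(expected_draws_until_good(10, 3))  # 11/8
```

In this illustration, topping up the pool lowers the expected consumption per event from 1.75 parts to about 1.4, which matches the temporary relief described above. The Rogues are still in the pool, however, so the relief lasts only until the ratio climbs again.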

Case in Point

A mechanical system consisted of a computer, controller, actuator, and a number of sensors. There were 40 of these systems in operation. For the initial years of operation, a spare inventory level of 6 computers was all that was required in order to adequately maintain the systems in service.

Over time, Rogue computers developed and spare inventory levels were periodically increased until there were 28 spare computers to maintain the 40 in operation. When the computer performance was evaluated, it was discovered that there were 20 Rogues responsible for the inflated inventory level. By that time, there was no discernible asset management program; it was strictly reactive firefighting at great expense (each computer cost approximately $25,000 USD).
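The figures in this case can be worked through directly. The short calculation below uses only the numbers quoted above; the one assumption is that the original level of 6 spares would still have been adequate without the Rogues, so that everything above it counts as excess. The function and key names are illustrative.

```python
def spare_inflation(systems, initial_spares, peak_spares, unit_cost_usd):
    """Summarize how far the spare level drifted from its original baseline."""
    excess = peak_spares - initial_spares
    return {
        "initial_ratio": initial_spares / systems,   # spares per system in service
        "peak_ratio": peak_spares / systems,
        "excess_spares": excess,
        "excess_value_usd": excess * unit_cost_usd,
    }


case = spare_inflation(systems=40, initial_spares=6, peak_spares=28, unit_cost_usd=25_000)
print(case["excess_spares"])     # 22
print(case["excess_value_usd"])  # 550000
```

On these numbers the spare level grew from 15% to 70% of the in-service population, and the 22 additional computers represent roughly $550,000 USD of inventory at the quoted unit cost.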

Eventually, the Rogues were repaired and the excess inventory was sold on the surplus market.

Another variation of asset management is the “Just in Time” (JIT) supply program, the intent being to reduce the amount of spare inventory sitting idle on the end-user’s stockroom shelves. When a part is needed, it is shipped “Just in Time” to satisfy the demand.

The effect of receiving a Rogue “Just in Time” can be catastrophic. The mechanical system will experience extended down time while another spare is shipped from the supplier. If there are repeat occurrences of this situation, the JIT program will actually become a liability, creating more problems than it solves and costing more money than it saves.

Engineering

When Rogues have a significant impact on a system’s operational reliability, the component OEM Engineering department may notice a number of short installations, extensive system down time, and an inordinate amount of “No Fault Found” test results. Because no root cause can be determined, their inclination may be to theorize what the component’s weak link may be, and modify the design so it will be more robust or adjust the operating tolerances to be more forgiving.

Typically, these modifications or adjustments do not address the Rogue failure, and there is a risk of introducing a new problem, which will compound the negative effect on the system reliability. If a completely new version is developed, it may have the same problems as the altered version, or it could resolve the Rogue effect temporarily until new Rogues develop as before.

When a significant Rogue population has developed and a general modification is introduced to change the functionality of a part, it can appear that the change had a negative impact on the component reliability – when in fact it was completely harmless.

Case in Point

The decision was made to paint a component population a different color, so the spare pool was painted and introduced into service in order to extract the next lot to be modified. Immediately there were failures of the systems that had received the newly painted components. The first reaction was to stop the modification, and evaluate the painting process to determine how it could be affecting the component’s reliability.

It was discovered that the root cause of the reliability problem was that all the Rogues resting in the spare pool were placed into service at the same time. Fortunately, this was discovered before the Engineering group wasted a lot of time and resources trying to find a nonexistent problem in the paint or painting process.

Rogue Control Program

The essential part of a Rogue control program is the ability to identify the Rogues. In order to do this, a number of basic data elements are required:

1. Component tracking
   a. Each component part number and unique serial number. If no serial number exists, then create a unique serial number prior to its entry into service.
   b. The identifier of the end item and/or next higher assembly where the component is installed
   c. Dates the component is installed and removed
   d. Reason for each time the component is removed
2. End item and/or system maintenance history, with narratives of the system complaint and all maintenance that is performed, regardless of whether parts are replaced during the events.

Once these data elements are captured, a surveillance program can then be set up to identify individual serial numbers that exhibit the following characteristics:

1. Repeated short in-service installation periods
2. Repeated identical reasons for removal (system effect)
3. Removal of this serial number from service resolves the system fault
4. Shop records indicate that the failure cannot be detected through standard testing or overhaul procedures (it doesn’t fit the other outlier profiles)

It should be noted that most Rogue surveillance programs consist solely of the first criterion. However, a number of consecutive short installation periods does not, by itself, make a component a Rogue. A truly Rogue component meets all four criteria.
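As an illustration of how such a surveillance check might be set up, the sketch below stores one record per installation and flags only those serial numbers that meet all four criteria. It is a minimal example under stated assumptions, not a prescribed implementation: the field names, the 30-day definition of a “short” installation, the minimum of three repeats, and the use of an “NFF” (No Fault Found) shop finding to represent the fourth criterion are all hypothetical choices.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Installation:
    part_number: str
    serial_number: str
    end_item: str               # end item or next higher assembly
    installed: date
    removed: date
    removal_reason: str         # system effect reported at removal
    removal_fixed_system: bool  # taken from the system maintenance history
    shop_finding: str           # e.g. "NFF" when the shop detects nothing


def find_rogue_candidates(records, short_days=30, min_repeats=3):
    """Return (part_number, serial_number) pairs meeting all four criteria."""
    by_serial = defaultdict(list)
    for rec in records:
        by_serial[(rec.part_number, rec.serial_number)].append(rec)

    candidates = []
    for key, history in by_serial.items():
        short = [r for r in history if (r.removed - r.installed).days <= short_days]
        # 1. Repeated short in-service installation periods
        if len(short) < min_repeats:
            continue
        # 2. Repeated identical reasons for removal
        if len({r.removal_reason for r in short}) != 1:
            continue
        # 3. Removal of the serial number resolved the system fault each time
        if not all(r.removal_fixed_system for r in short):
            continue
        # 4. The shop could not detect the failure (assumed to be logged as "NFF")
        if not all(r.shop_finding == "NFF" for r in short):
            continue
        candidates.append(key)
    return candidates
```

A check that stopped after the first criterion would also flag Chronic or Bad from Birth units, whose failures the shop can find and fix. Requiring all four criteria keeps the list limited to components whose failure mode escapes the bench test.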

Conclusion

In any repairable component population, there are statistical outliers that are traditionally excluded from asset management models.

In order to control the statistical outliers:

Keep detailed and accurate mechanical system maintenance records
Focus on the statistical outliers
   - The general population will take care of itself.
Rogue components will develop
   - A Darwinian “natural selection” process will ensure they displace serviceable spares
   - They will have a compounded negative effect on the asset management process, operational reliability and maintenance effectiveness
   - In order to control the Rogue components
      - Develop a Rogue component identification program
      - Work with the component OEM to improve the repair facility testing so that it will identify and resolve that Rogue failure mode

The small population of statistical outliers needs to be controlled, like the rudder on a ship. If they are allowed to run unchecked, even the best asset management program will wander uncontrollably and may even become a liability, negatively impacting all aspects of the business operation.