Reliabilityweb Equipment Reliability Assessment by Sampling Standard Information from Operations

Basic maintenance information contained in orders for repair work and preventive maintenance is managed with the help of a computerized system. The operations information contained in several types of production reports, including detailed documentation of individual incidents resulting in train delays, is also managed by the same computerized system.

Until recently, tracking the number and duration of train delays attributed to a specific system had been Amtrak's only approach for routinely assessing equipment reliability. Figure 1 shows a typical representation of reliability performance, as expressed by the number of incidents attributed to each system of Amtrak's HHP-8 locomotive fleet of 16 units between December 11, 2013 and March 12, 2014. The graph shows that the automatic train control (ATC) system was the highest contributor to work orders generated by delay incidents, with 29.9 percent of the total.

Figure 1: Contribution of HHP-8 locomotive systems to incident work orders

The graph is intended to rank systems by reliability performance, with the poorest performers on top. It suggests that resources and attention should be focused on the ATC system as an area with opportunities for improvement.
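The ranking behind a chart such as Figure 1 can be sketched in a few lines of Python. This is only an illustration, not Amtrak's actual tooling: the record layout (a `system` field on each work order) and the sample records are assumptions made for the example.

```python
from collections import Counter

def rank_systems(work_orders):
    """Rank systems by their share of incident work orders, largest first.

    Each work order is assumed (hypothetically) to be a dict with a
    "system" key naming the system the incident was attributed to.
    Returns a list of (system, count, percent_of_total) tuples.
    """
    counts = Counter(order["system"] for order in work_orders)
    total = sum(counts.values())
    # most_common() already sorts by descending count.
    return [(system, n, 100.0 * n / total) for system, n in counts.most_common()]

# Made-up records, for illustration only.
example_orders = [
    {"system": "ATC"}, {"system": "ATC"}, {"system": "ATC"},
    {"system": "Brakes"},
]
for system, n, pct in rank_systems(example_orders):
    print(f"{system}: {n} work orders ({pct:.1f}%)")
```

A ranking like this counts attributions only. It says nothing about what level of performance is achievable or what actually caused each incident.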

However, it has been known for some time that this kind of representation lacks two basic notions of performance measurement: reference and causation.

First, without a performance reference or baseline, it is very difficult to qualify the actual performance level of any individual system. The inherent or achievable performance of each system is missing from the analysis. As a result, the potential improvement opportunity, represented by the gap between current and achievable performance, cannot be quantified, and neither can the goals of an improvement effort.

Second, without enough understanding of the causes behind both achievable and current performance, it is very difficult to identify the areas that hold opportunities for improvement. The examination of the records represented in Figure 1 reveals that incidents associated with the ATC system accounted for 29.9 percent of incident work orders. However, that number comprises not only what could reasonably be attributed to the system itself, but also the effects of other systems' failures, debris damage, human error and other events.

In general, the two shortcomings arise because the standard operational data normally collected and analyzed through computerized systems is not intended to readily yield performance reference and causation. This kind of data is intended to produce the transaction records needed to run operations in an organized way. Using such data as crude input to performance assessment has proven ineffective.

To overcome those two limitations of performance analyses, Amtrak considered five possible options:

• Judging the causes of all incidents;
• Benchmarking performance against a similar operation;
• Implementing a commercially available failure database system;
• Contracting the development of a formal reliability model of the operation;
• Judging the causes of a representative sample of all incidents.

The fifth option looked by far the most attractive in terms of cost effectiveness. To test this perception, an initial exercise was designed and implemented to categorize the causes of a sample of the incidents associated with the complete rolling stock fleet.

The target data set consisted of incident work orders written for all rolling stock during the first 338 days of fiscal year 2012. The sample size was defined by a number of randomly chosen dates.

To select the random dates for the target data sample, a control database containing date, responsibility code, delay duration and region was used. The randomly chosen set of dates in the sample from the control database was continuously increased until the percentage of delay minutes per responsibility code and region was practically the same in the sample and the complete set. Very good representation was obtained by a sample size of 57 dates.
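The date selection procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure as implemented at Amtrak: the record keys (`date`, `code`, `region`, `minutes`), the one-date-at-a-time growth of the sample, and the numeric `tolerance` standing in for "practically the same" are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict

def delay_shares(records):
    """Share of total delay minutes per (responsibility code, region) group."""
    totals = defaultdict(float)
    for rec in records:
        totals[(rec["code"], rec["region"])] += rec["minutes"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {key: minutes / grand_total for key, minutes in totals.items()}

def max_share_gap(sample_shares, full_shares):
    """Largest absolute difference in share across all groups."""
    keys = set(sample_shares) | set(full_shares)
    return max(
        (abs(sample_shares.get(k, 0.0) - full_shares.get(k, 0.0)) for k in keys),
        default=0.0,
    )

def sample_dates(records, tolerance=0.02, seed=0):
    """Add random dates until the sample's delay-minute shares match the full set.

    `tolerance` is an assumed threshold; the article does not give one.
    """
    rng = random.Random(seed)
    by_date = defaultdict(list)
    for rec in records:
        by_date[rec["date"]].append(rec)
    candidates = sorted(by_date)
    rng.shuffle(candidates)  # random order in which dates are added

    full = delay_shares(records)
    chosen, sample = [], []
    for date in candidates:
        chosen.append(date)
        sample.extend(by_date[date])
        if max_share_gap(delay_shares(sample), full) <= tolerance:
            break  # sample is representative enough; stop growing it
    return chosen
```

The work orders written on the returned dates then form the sample to be judged. With a tighter tolerance the loop simply keeps adding dates, in the limit returning every date in the control database.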

The causes of the 532 work orders in the target data set, written on the 57 chosen dates, were judged to fall within one of the following categories:

• Equipment failure;
• Nuisance trip;
• Debris damage;
• Operator error;
• Maintenance error;
• Inspection error;
• Other.

A fraction of the target data set is shown in Figure 2 as incident records, including the judgment of cause category. Information related to each work order in the sample was scrutinized until a judgment was reached or, after 10 minutes, the category was labeled "undetermined."

Figure 2: Examples of incident work orders and cause category judgment

As expected, the leading cause category was equipment failure, with 40 percent of the sample population, as shown in Figure 3. These failures occurred during train travel and no obvious indication of their imminence was observed during daily regulatory inspection before departure. Many incidents in this category are associated with failures of electric and electronic components and reflect the inherent reliability of the equipment. Very few opportunities for performance improvement are expected to be revealed by analyzing these incidents.

The second largest cause category was nuisance trip, with 11 percent of the sample. Typically, these incidents are very short shutdowns for no apparent reason that do not require any intervention beyond restarting to resume travel. Practically all of these were associated with electronic components. Eliminating some of these trips requires extensive investigation, while most of them are typically unavoidable.

Adding the seven percent of debris damage to the preceding two categories completes a set of 58 percent, representing the causes of delay that are very unlikely to be significantly reduced unless design changes are introduced. This is the kind of performance reference that was missing until now.
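The arithmetic behind this reference level is simple enough to spell out. The short sketch below uses only the percentages quoted in the text; the function and variable names are illustrative, not part of any Amtrak system.

```python
# Shares of the sample, in percent, as reported in the text.
INHERENT_CAUSE_SHARES = {
    "equipment failure": 40,
    "nuisance trip": 11,
    "debris damage": 7,
}

def reference_share(shares):
    """Sum the shares of causes unlikely to shrink without design changes."""
    return sum(shares.values())

reference = reference_share(INHERENT_CAUSE_SHARES)
remaining = 100 - reference  # every other cause category combined

print(f"Performance reference: {reference}% of sampled delays")
print(f"All other categories:  {remaining}% of sampled delays")
```

The remaining share covers all other cause categories combined, so it is an upper bound on where improvement opportunities may lie rather than an estimate of them.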

Nearly 26 percent of delays were judged to be caused by some kind of human error, which in principle represents an opportunity for improvement, typically through training. There were 35 delays, or seven percent, classified under the diagnosing issue subcategory of the maintenance error category, which represents a significant improvement opportunity for the maintenance organization.
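As a quick check on these figures, counts and percentages can be converted into each other using the sample size of 532 work orders. The sketch below is illustrative only; the names are hypothetical and the human error count is an approximation derived from the rounded "nearly 26 percent" quoted above.

```python
SAMPLE_SIZE = 532       # work orders judged in the sample (from the text)
DIAGNOSING_ISSUES = 35  # delays in the diagnosing issue subcategory (from the text)

def share_percent(count, total):
    """Return count as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total

diagnosing_share = share_percent(DIAGNOSING_ISSUES, SAMPLE_SIZE)
print(f"Diagnosing issues: {diagnosing_share:.1f}% of the sample")  # about 6.6%, reported as seven percent

# Approximate count implied by "nearly 26 percent" of the sample.
approx_human_error_count = round(26 * SAMPLE_SIZE / 100)
print(f"Human error: roughly {approx_human_error_count} work orders")
```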

The results of this sampling exercise are in line with currently perceived opportunities for improvement at Amtrak and numerically express the causation of train delays that was missing until now.

For the majority of incident causes judged as equipment failure, it took relatively little time to check that a part was replaced or repaired and that the same failure had not recurred by the time of the analysis, making most judgments straightforward. Other incidents required significant time checking information to ensure a reasonable level of confidence in the judgment. Only four percent of the sample delays were considered almost impossible to judge in 10 minutes, so these delays were left as undetermined without investing any significant time in them or compromising the validity of the assessment.

Figure 3: Cause categories of delays associated with rolling stock

This type of analysis has been adopted as the preferred option and is being applied to several questions and areas of performance. It allows for effective use of the high volume of standard operations data accumulated in Amtrak's computerized systems.

Sampling and judging performance records clearly show potential as a technique to fill voids of reference and causation. The exercise at Amtrak demonstrated the cost-effectiveness and the necessary accuracy of performance measurement achievable by expedited judgment of most of the incidents in the sample. This approach should work for any kind of equipment performance analysis where extensive electronic operations data is available. The method of sampling and judging standard operational data, which was not produced with data analysis in mind, extracts valuable conclusions from it and is both inexpensive and accessible.

Alex Gotera is currently one of the Reliability Engineers in the RCM Team at Amtrak. For over thirty years, Alex has worked as an analyst, consultant and plant engineer in several plants and projects in the power, process, manufacturing, and transportation industries. He has formulated and coached reliability improvement projects using a comprehensive and sustainable approach to process improvement for several industrial operations. Alex is an active member of ASME and has a Master's degree in Mechanical Engineering. www.amtrak.com/home