MCR!RCM - Looking at RCM in a Mirror

Note! If you don't agree with 7.5%, call it whatever you want; the exact percentage doesn't really matter.

The FMEA portion of RCM is the time-consuming part, so just utilizing FMEA saves little in effort. In FMEA you look at individual failure modes, cost justify solutions and apply the solutions that give a positive ROI. You look at failures line by line. At the end of the FMEA you will compromise on some things as you start to see logical groupings of work that, if considered as a whole, will alter the cost / benefit numbers. The output is then planned, scheduled, executed and followed up. You will miss some failure modes that will later manifest as breakdowns and be passed through the RCA process into the continuous improvement loop.

So let's look full circle at how maintenance programs were initially developed. Some skilled tradesman would look at an asset and decide, "If I lubricate and maintain this item I will have to fix it less often." Yes, things were done at the wrong frequency and many breakdowns occurred. This led to the PM concept, and things were replaced on time-based intervals. Along came Nowlan and Heap, who showed people that breakdowns were not linked to time and that by working on assets we induce infant failures, and so RCM was born. In a drive to be more cost effective CBM was adopted (it was developed much earlier than RCM). Over the next 25 years the CBM and Work ID world blossomed into an industry. There was money to be had by selling plants as much monitoring and consulting services as possible. There were a lot of technical and tactical solutions developed and marketed, some of which actually had value.

Since the new millennium the level of knowledge and understanding of maintenance has blossomed. In-house maintenance people are no longer mystified by the logic contained in a life cycle curve. Vibration spectra are not the voodoo of a select few who mystify others with graphs of squiggly lines. There is a high level of understanding of lubrication management; we no longer mix dirty oils and wonder why we have failures. We get the requirements for precision maintenance and QA / QC practices. We understand business processes, roles and responsibilities. We have advanced tools of all types to manage data to a level unheard of previously. We have a plethora of online data management options to utilize. The value of doing the right work at the right time with the right quality is ingrained in our day to day operations. Even the quality, consistency and durability of replacement parts have changed (with the exception of planned obsolescence).

The question I would like to pose is: do you really believe we have learned nothing post RCM? If Nowlan and Heap were to conduct the same study today, would they get the same output? I do acknowledge many individuals and companies make substantial income from the application of RCM, and yes, it is applicable to 7.5% (or whatever %) of assets. But current findings based on studies show that in organizations with good maintenance practices (good by 2010 standards) the number of random failures has diminished substantially. There are also substantially fewer infant mortality issues if good commissioning and QA / QC practices are utilized. The life cycle and failure profiles of assets are also understood to a deeper level and not correlated to human life intervals (if it lasts longer than me then its life must be forever). The concept that assets do not have a life is unacceptable; show me an asset that can be turned on today and will still be running in 1000 years. The key point in this is that things have changed, so what if we gave the technician of today another chance? Gasp! That was the problem that created all this! They did the wrong things because they didn't know what to do! So let's see if there could be a person qualified to develop a maintenance program.

The process would require a skill set. As one individual would not know everything, he would need a good overall understanding, plus some key points and facilitation skills. Not group facilitation (you're going back to RCM thinking, I can tell). So let's list the skills that would be required; think in house, so the requirements would be specific to your industry.

  • Mechanical aptitude
  • Electrical aptitude
  • Instrumentation aptitude
  • Lubrication management knowledge
  • Civil aptitude
  • Process knowledge
  • Operational understanding
  • Work management knowledge
  • Accounting (cost / benefit) understanding
  • Risk management understanding
  • Reliability process knowledge (RCM, TPM, RCA, criticality, process mapping, etc.)
  • CBM knowledge, Level 2 min. Vibration analysis, oil analysis, ultrasound (dB and thickness), thermography, reciprocation monitoring and anything else that applies to your industry.
  • Statistical analysis knowledge
  • Understanding of regulatory compliance requirements
  • CMMS, EAM, Enterprise system knowledge

Wow, that's a tall order; could anyone fit the bill? Wait, that is the reliability engineer, or if not, it should be. This would fit the mold of most people I have met who are CMRP certified (Certified Maintenance and Reliability Professional). The CRE (Certified Reliability Engineer) designation is also nice but is somewhat more focused on product reliability than asset reliability. So we have identified the person. What?! You do not have a reliability position! Make one. Let's move on.

So there is someone who has the skill set to develop maintenance programs. Now if you have the luxury of this person also having the people skills to enable the co-operation of others, you are good to go. So what are the options to develop a maintenance program? We currently seem to have few options despite the dramatic increase in understanding.

Option one: Engineered decision.
Just look at the assets and decide based on the knowledge of the individual.
Advantages - Fast and easy
Disadvantages - Inconsistent and inaccurate.

Option two: PMO (Preventive maintenance optimization)
A good compromise but still failure mode driven and somewhat time consuming. Process assumes work is identified to a certain level and just needs validation.
Advantages - Good ROI for effort required
Disadvantages - Leaves gaps in overall life cycle management.

Option three: FMEA
Facilitated sessions that list failure modes and identify solutions that are cost justified.
Advantages - High level of detail and includes cost benefit.
Disadvantages - High level of effort, tends to miss big picture alignment.

Option four: RCM
RCM is the highest level of diligence available, preservation of function driven and systematically applied. This process will manage risk and drive the correct level of intervention based on the identified risk.
Advantages - Highest risk management. Will reveal most critical failure modes.
Disadvantages - Can be misapplied (not the fault of the process), takes substantial time to implement, is cost prohibitive to implement on a large scale.

So if we look at the processes, one is an educated guess (engineered decision), and the other three are driven by looking at each failure mode one at a time. The results are then correlated into action plans, and the action plans are then optimized, which in some cases forces a review of the output. There can be thousands of failure modes, and many will be reviewed and parked as OTF (operate to failure).

There is no shoe that fits all scenarios, so let's assume you have completed an asset prioritization exercise. If you haven't, have the reliability guy you just hired do it. You have grouped your assets in five groups (I don't care how many, this is just an example). Your highly critical assets are there due to safety, environmental or throughput concerns. If your high level can kill people or seriously impact the environment, consider RCM. If your high level can only incur production losses, FMEA or PMO is fine. Your lowest level will have little or no impact, so just look at life extension (I may want to grease it) or conduct minor maintenance; engineered decision works fine here, and you should be able to do this at a quick rate. Many processes will state that all low level assets should be OTF. True OTF means you do nothing: no grease, no belt tightening, NOTHING. If the asset requires any attention at all then it is not OTF. If I really don't care about the asset, why is it even there? Get rid of it. Do not confuse my comments with OTF on higher levels; sometimes it is the most cost effective way to maintain something.
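
The triage above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Asset` class, its fields, the five-level ranking (1 = most critical) and the "Strategy-first review" label for the middle levels are hypothetical names chosen for the example, not part of any standard or tool.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    criticality: int          # 1 = most critical ... 5 = least (five groups is just the example)
    safety_or_env_risk: bool  # could a failure kill people or seriously impact the environment?


def select_method(asset: Asset, lowest_level: int = 5) -> str:
    """Return the program development approach suggested for one asset."""
    if asset.criticality == 1:
        # Top level: RCM where lives or the environment are at stake,
        # FMEA or PMO where only production losses are possible.
        return "RCM" if asset.safety_or_env_risk else "FMEA or PMO"
    if asset.criticality == lowest_level:
        # Lowest level: life extension or minor maintenance by engineered decision.
        return "Engineered decision"
    # Middle levels: the bulk of the assets (hypothetical label).
    return "Strategy-first review"


# Hypothetical assets, for illustration only.
for asset in (
    Asset("Recovery boiler", 1, True),
    Asset("Main line conveyor", 1, False),
    Asset("Stock pump", 3, False),
    Asset("Office exhaust fan", 5, False),
):
    print(f"{asset.name}: {select_method(asset)}")
```

The sketch assumes the prioritization exercise has already placed anything with a safety or environmental consequence in the top group, so the risk flag is only consulted there.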

We now have the high and low levels taken care of, but the majority of our assets are in the middle three ranges. How can we efficiently deal with the bulk of the assets? As a consultant (OK, I said consultant; I wasn't always on the dark side!) I have been called in by numerous companies that wanted a cost effective maintenance program developed for them. They wanted turn key solutions and they did not want to pay for an FMEA approach. In one case the goal was to develop a cost justified maintenance program for a mid-size pulp mill; this included development of maintenance strategies, routes in hand held computers with integration, and setup of an onsite oil analysis lab. The time line was three months. Sound impossible? As with any project, if the constraint is time, the quality of the output is limited. So how can "good quality" be created without the time to apply the full level of diligence? It is understood that even RCM will miss some failure modes, so we know some will be missed. There are also multiple failure modes that we can't do anything about, so they will be OTF.

So what if RCM was inverted? Rather than have the failure mode as the input, what if it was the output of the process? OK, now you just think I am crazy, but let me explain the concept. There are things I can do to manage the health of an asset, so let's list them somewhat in order of ease of execution.

Tasks
Operator dynamic inspections
Maintenance dynamic inspections
Operations static inspections
Maintenance static inspections
Lubrication management
On-condition monitoring (level switches, vib trips, etc.)
PDM
Regulatory inspections
Operations autonomous maintenance (Cleaning, adjusting)
Phased overhauls
General overhauls
Failure finding tasks

There are also conditions that affect the frequency and the health of the asset that drive strategy selection.

Operational envelope
Asset loading (overloaded, underloaded, shock loaded, normal loading)
Asset operational phase (startup, constant and shutdown)
Asset life cycle (new, mid life, end of life)

The third criterion to understand is the business goals that define the operational campaign.

Operational Campaign
Is the campaign flexible?
What is the impact of unreliability (lost revenue, lost sales, overtime for rework)?
What is the campaign: 24/7 with a yearly shutdown, or 8/5 with weekend maintenance?

As I stated earlier, the person applying the process should have good knowledge of your operation. If you're the guy who just hired the reliability engineer (RE), have him work as a helper for operations for the first few weeks and he will get it (might as well deal with that ego thing); if he is still there in week three you have the right guy. Now here is the job. Think of the first strategy, "Operator dynamic inspections". The operator in a certain area will have a job to do and a number of assets to look after. The first question would be: is there time for him to conduct a dynamic inspection? Playing cards in the control room should not be a limiting factor; remember, we want things as cost effective as possible, so we will only inspect at the correct intervals. Team up the most capable operator with the RE and have them define the boundaries of the operator's area of responsibility. Walk through the area and define safe dynamic inspection points for the route. Would any specialized equipment enhance the inspection? Strobe lights and temp guns are cheap now, so keep them in mind. Walk from asset to asset determining what failures can be detected from a dynamic walk-through inspection. List the failure modes and do a rough cost benefit on the route. Ouch, I just got hit by a rock with a note: "You will over inspect; some points don't need to be looked at for years." Yes, I get that you can over inspect, but the route itself has been cost justified, so if I have someone look at something too often, what would you have me do, put blinders on him so he doesn't look? He's there anyway. I would also like to point out that if the RE doing this fits the criteria mentioned before, he knows all that and will optimize the route as required. A good rule of thumb is that an operations inspection should be able to be conducted at a walk. If there is an operator dedicated to inspection, the level of detail can be increased.
At the end of the exercise you will have a list of all failure modes that an inspection can detect. As far as the frequency goes, do the route at ½ the shortest P-F interval (the time between a detectable potential failure and the functional failure). In some cases I have seen daily inspections; in all cases the daily inspections were process driven, not failure mode driven. If the process requires a walk-through, let's tie in the mechanical inspection. There, we now have a machine health inspection that hasn't cost us any additional resources and can be happily overdone.
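
As a minimal sketch of that frequency rule, the route interval is simply half of the shortest P-F interval among the failure modes on the route. The function name and the failure modes and day counts below are made up for illustration; real P-F intervals have to come from your own assets.

```python
from typing import Mapping


def route_interval_days(pf_intervals_days: Mapping[str, float]) -> float:
    """Inspection interval for a route: half of the shortest P-F interval on it."""
    if not pf_intervals_days:
        raise ValueError("at least one failure mode is needed to set an interval")
    return min(pf_intervals_days.values()) / 2


# Hypothetical failure modes and P-F intervals (in days) for one operator route.
pf_intervals = {
    "bearing overheating": 30.0,
    "belt wear": 90.0,
    "seal leak": 14.0,
}
print(route_interval_days(pf_intervals))  # 7.0, driven by the seal leak
```

Everything else on the route gets looked at more often than it strictly needs, which is the "happily overdone" case: the person is walking past anyway.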

Now how about a maintenance dynamic inspection? Again, what area is the tradesman responsible for? Is his job proactive work or reactive? In one plant I separated the crew into proactive and reactive groups and staffed them by preference. The proactive group's job was to put the reactive group out of reactive work and force them to work in the component rebuild shop. It worked. So now the approach would be to look at the assets with the best tradesman and the RE. What failure modes would be visible to the maintainer with a dynamic inspection? This would involve strobe lights, temp guns or whatever is applicable; you know tradesmen, they like shiny things on their belts. The point of focusing on dynamic inspections first is that the assets are running at the time. The more work I can accomplish while the assets are running, the better. It is normally easy to cost justify work that does not require downtime. It is important at this point to get the correct frequency of inspection, as the maintainer's job is to inspect. If you get carried away with inspections you will need more maintainers. I will again note that the RE should be qualified and capable of taking the failure modes that can be identified with the dynamic inspection and correlating them to the correct frequency and route.
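
A rough cost benefit on a route, of the kind mentioned above, might look like the sketch below. It is one possible back-of-the-envelope form, not a prescribed method: the function names, the labor rate, the failure rates and the savings per event are all invented numbers, and a real justification would use figures from your own plant.

```python
def annual_route_cost(minutes_per_round, rounds_per_year, labor_rate_per_hour):
    """Labor cost of walking the route for a year."""
    return minutes_per_round / 60 * rounds_per_year * labor_rate_per_hour


def annual_route_benefit(failure_modes):
    """Expected yearly saving from catching the listed failure modes early."""
    return sum(fm["events_per_year"] * fm["saving_per_event"] for fm in failure_modes)


# Hypothetical weekly maintenance route: 45 minutes at 80 per hour.
cost = annual_route_cost(minutes_per_round=45, rounds_per_year=52, labor_rate_per_hour=80)

# Hypothetical failure modes the route can detect.
benefit = annual_route_benefit([
    {"name": "coupling wear", "events_per_year": 0.5, "saving_per_event": 12000},
    {"name": "belt failure", "events_per_year": 0.25, "saving_per_event": 20000},
])

print(f"cost {cost:.0f}, benefit {benefit:.0f}, ratio {benefit / cost:.1f}")
```

The route is justified as a group: no single failure mode has to pay for the walk on its own, and the labor hours show directly how many maintainers the inspection load will tie up.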

By now you should be getting my point: start with the strategy, then look at what failure modes you can detect by applying the strategy. Cost justify the work based on routes matched to groups of failure modes. Do not double dip. If more than one strategy detects or manages a failure mode, only use one; the exception is overlap like vibration and oil analysis, but they also have individual failure mode detectability. So does this sound familiar? Yes, it is the old school way of developing maintenance programs; we are just a lot more qualified to do it now.
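
Here is a minimal sketch of the inverted idea: strategies go in, in order of ease of execution, and the failure mode list comes out. Each failure mode is given to the first strategy that can detect it, so nothing is double dipped. The function names and the failure modes are hypothetical, and the sketch deliberately ignores intentional overlap such as vibration and oil analysis.

```python
def assign_failure_modes(strategies):
    """Map each failure mode to the first (easiest) strategy that detects it.

    `strategies` is a list of (strategy name, detectable failure modes),
    ordered by ease of execution.
    """
    owner = {}
    for strategy, modes in strategies:
        for mode in modes:
            # setdefault keeps the earlier, easier strategy: no double dipping.
            owner.setdefault(mode, strategy)
    return owner


def group_by_strategy(owner):
    """Regroup the assignments so each strategy lists the modes it covers."""
    routes = {}
    for mode, strategy in owner.items():
        routes.setdefault(strategy, []).append(mode)
    return routes


# Hypothetical walk-down results for one area.
strategies = [
    ("Operator dynamic inspections", ["seal leak", "abnormal noise", "belt wear"]),
    ("Maintenance dynamic inspections", ["belt wear", "coupling wear"]),
    ("Lubrication management", ["bearing lubricant starvation"]),
    ("PDM", ["bearing defect", "abnormal noise"]),
]

owner = assign_failure_modes(strategies)
for strategy, modes in group_by_strategy(owner).items():
    print(f"{strategy}: {', '.join(modes)}")
```

In this example belt wear stays with the operator route even though the maintainer could also see it, and the keys of `owner` are the failure mode list that an FMEA would have started from.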

So we get the process for inspection work; let's look at it from a general overhaul (GO) perspective. Again, team up the RE with a good maintenance guy and, if brown field, look at all historical GOs; if green field, just apply experience from similar assets. The first concern would be: is this a repairable or replaceable asset? What part would trigger a general overhaul? Can I monitor the part that would trigger the GO, and if so, at what cost? How would the failure influence production, and what is the ideal configuration of the replacement? How many failure modes can be addressed if a GO is conducted at the correct interval, and triggered by what: time, cycles, condition or throughput?

By the end of the exercise you should have a substantial list of failure modes that are addressed by solutions. The solutions will already be grouped by logical sequences and cost justified by groups. Review the list of failure modes created and look for things that may not have been addressed; by this point the only option should be a new solution you haven't considered implementing yet. Yes, you will miss some, but they will mostly be the ones that are undetectable (mostly identified as OTF in RCM). The catch is that this process requires someone with very strong knowledge to apply it; you are taking a shortcut, so you should know the detailed way well before you embark on this trip. As stated earlier, the level of ability currently found in reliability geeks is high enough to understand the risks involved. This process must be supported by QA / QC and follow up. It is a fast and easy way to develop a turn key program that will be a stable foundation for continuous improvement. Now if only I could think up an acronym so I could sell it!

Turn The Key and CYA later

by Jeff Smith
Reliability Laboratory
www.reliabilitylaboratory.com
smith@rlab.biz