Picture this: Like many Americans, you own a motor vehicle of some sort. That vehicle gets you from Point A to Point B, day in and day out. But, if you’re like most vehicle owners, you don’t give much thought to basic routine maintenance, even for the machine you rely on most. It is more of a break/fix relationship. Unfortunately, this same type of relationship is not uncommon in today’s data center industry. Various types of equipment run 24 hours a day, 365 days a year, which comes out to 8,760 hours a year. Data centers cannot afford for any of that equipment to fail. Therefore, preventive maintenance (PM) is a must.
Preventive maintenance is routine maintenance performed to ensure asset reliability and to head off equipment failures and downtime before they occur. It should be viewed as a proactive approach that establishes scheduled inspections of assets to verify dependability and prolong asset life.
Today, data center operators spend mind-boggling amounts of money to get the newest data hall completed for an incoming tenant, but they may not give much thought up front to having a PM plan in place for when the original equipment manufacturer’s (OEM’s) warranty runs out. Granted, most issues with new equipment are found during start-up and commissioning, but what happens three to five years down the road when an incident occurs?
Reactive maintenance is a common practice at some facilities. The opposite of preventive maintenance, reactive maintenance essentially means waiting for an incident to occur. This practice may seem like a cost-saving strategy, but when unplanned downtime occurs, you spend more time fixing the issue than you would have with a PM plan in place. The resulting outage could bring negative publicity for your facility and, in turn, compromise your customers’ trust. A PM program is meant to prevent these unforeseen outages and help save facilities time and money.
Equipment that is not regularly serviced can create a hazardous workplace. Having a PM program in place helps protect employees in the facility by reducing the risk of injuries and accidents.
More importantly, factory-trained technicians, in collaboration with data center facility managers, should perform the PMs to ensure service level agreements (SLAs) are not breached. For example, if an SLA requires the colocation provider to perform routine maintenance annually to uphold the agreement and ensure the customer’s data isn’t compromised, this requirement must be met.
Another aspect to consider as part of any good PM program is an equipment lifecycle plan, where IT managers need to:
- Rotate out equipment for routine overhauls;
- Replace devices before they fail;
- Modify or update devices;
- Set lifecycle dates and replace when necessary.
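The lifecycle steps above can be tracked with something as simple as a dated asset list. The following is a minimal sketch, not a prescribed tool: the `Asset` class, the `service_life_years` field, the 180-day planning window and the sample equipment are all hypothetical and would need to reflect a facility’s own records and OEM guidance.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Asset:
    """One piece of equipment in a (hypothetical) lifecycle register."""
    name: str
    installed: date
    service_life_years: int  # assumed planned service life, not an OEM figure

    def replace_by(self) -> date:
        """Lifecycle date: the install date plus the planned service life."""
        target_year = self.installed.year + self.service_life_years
        try:
            return self.installed.replace(year=target_year)
        except ValueError:
            # Installed on Feb 29 and the target year is not a leap year.
            return self.installed.replace(year=target_year, day=28)


def due_for_replacement(assets, today, lead_time_days=180):
    """Return assets whose lifecycle date has passed or falls within the window.

    lead_time_days is an assumed planning window so that devices are
    replaced before they fail rather than after.
    """
    return [a for a in assets if (a.replace_by() - today).days <= lead_time_days]


# Illustrative data only.
assets = [
    Asset("UPS-A", date(2015, 3, 1), 10),
    Asset("STS-1", date(2022, 6, 15), 15),
]

for asset in due_for_replacement(assets, today=date(2025, 1, 10)):
    print(f"{asset.name}: plan replacement by {asset.replace_by()}")
```

Running the check on a schedule, rather than when something breaks, is what turns the lifecycle list into part of the PM program.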
Along with a good PM plan, it is also important to have a power protection plan (PPP) in place to eliminate or minimize downtime in a data center. Every PPP needs to include the following elements:
- Comprehensive preventive maintenance visit annually by a factory-trained customer service engineer;
- A thorough visual inspection of all parts (e.g., bulbs, displays, missing hardware, cleanliness), with corrections made as needed;
- Verification and calibration of all monitoring components;
- Verification of proper operation and condition of the system;
- System check for proper load balance, kVA usage and building alarm status;
- Complete reporting on all services rendered;
- Infrared scanning of internal connections to find hot spots before they lead to component deterioration and failure;
- Four-hour response time during downtime;
- 24x7 access to telephone technical assistance from OEMs;
- Costs for parts and labor required to correct any problems or keep the system in good operating condition;
- Guaranteed parts availability;
- On-site spare parts kit.
Figure 1: Field service technician verifies static transfer switch voltages and performs visual inspection
Figure 2: Field service technician confirming calibration and functionality on a static transfer switch during a routine PM visit
Needless to say, during downtime, on-site factory spare parts are essential for reducing the mean time to repair (MTTR). While factory technicians may be able to respond quickly, not having parts on-site means waiting longer while the parts are shipped from the OEM.
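To make the effect of spare parts on MTTR concrete, here is a rough sketch using the standard steady-state availability relationship, availability = MTBF / (MTBF + MTTR). Every number in it is an assumption chosen for illustration (the repair durations, the 50,000-hour mean time between failures), not data from any facility or OEM.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def mttr(repair_hours):
    """Mean time to repair: the average duration of past repairs, in hours."""
    if not repair_hours:
        raise ValueError("need at least one repair duration")
    return sum(repair_hours) / len(repair_hours)


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_per_year(mtbf_hours, mttr_hours):
    """Expected hours of downtime in a year at the given MTBF and MTTR."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical repair histories, in hours from failure to restored service.
with_spares = mttr([6.0, 5.0, 7.0])         # parts already on-site
without_spares = mttr([30.0, 28.0, 32.0])   # includes waiting on shipping

ASSUMED_MTBF = 50_000  # hours; illustrative only

for label, hours in (("on-site spares", with_spares),
                     ("parts shipped from OEM", without_spares)):
    downtime = expected_downtime_per_year(ASSUMED_MTBF, hours)
    print(f"{label}: MTTR {hours:.1f} h, about {downtime:.2f} h of downtime per year")
```

Under these assumed figures the failure rate is identical in both cases; only the repair time changes, which is the part of the equation an on-site spare parts kit addresses.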
Another approach to ensuring maximum uptime is to make sure your facility’s team is factory trained by the OEM. This reduces nuisance service calls to the OEM, which often cause customer anxiety.
With the vast expansion of the Internet of Things (IoT) and semiautonomous cars, reliability of data centers and cloud storage is a necessity, not just a nice-to-have feature. Preventive maintenance should always play a major role in data center operations. But, without proper foresight and a well-thought-out maintenance plan, data center managers are inadvertently steering the business toward more problems than it needs to have.
Mitigating risk is possible by charting a preventive maintenance course and preparing for the possible risk factors. In this manner, if a power outage does occur, the impact is reduced and the organization does not become the next data center failure headline.