The Business Impact of Downtime at Data Centers

Downtime in these facilities is not an option. Infrared thermography is used for regular electrical switchgear surveys, for optimizing cooling systems and servers, and for commissioning all electrical equipment, including UPS modules, PDU (power distribution unit) equipment and computer servers. Many construction project specifications require infrared surveys before the building is turned over to the owner.

Data center infrared thermography must have total accountability for all infrared data in the commissioning process, whether or not problems are found. This accountability can be achieved by documenting every piece of equipment inspected with time, date, location and equipment condition. The thermographer must create a data log and record the infrared video onto a digital storage device.

New technologies in data acquisition and report preparation will make historical data (previously taken images) available for comparison. This will enable the thermographer to compare circuit boards and other UPS equipment more closely with previously acquired images. If something fails or causes downtime in the system, an IR image of that component can be referenced to document that the equipment was operational, at thermal steady state and in acceptable condition when the survey was made.
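The data log described above can be pictured as one record per inspected item. The following is only a minimal sketch: the field names, the CSV layout and the `image_file` path are illustrative assumptions, not part of any particular thermography software.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
import csv
import io


@dataclass
class SurveyRecord:
    """One inspected item; every item is logged, with or without a problem."""
    timestamp: datetime   # time and date of the inspection
    location: str         # room or panel designation
    equipment: str        # what was inspected
    condition: str        # "acceptable" or a short problem description
    image_file: str       # hypothetical path to the stored IR image or video


def write_log(records):
    """Serialize records to CSV text, oldest first."""
    fields = ["timestamp", "location", "equipment", "condition", "image_file"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    for rec in sorted(records, key=lambda r: r.timestamp):
        row = asdict(rec)
        row["timestamp"] = rec.timestamp.isoformat()
        writer.writerow(row)
    return buf.getvalue()


def history(records, location, equipment):
    """Return all records for one item, oldest first, for image comparison."""
    matches = [
        r for r in records
        if r.location == location and r.equipment == equipment
    ]
    return sorted(matches, key=lambda r: r.timestamp)
```

Because items in acceptable condition are logged the same way as problem items, `history()` can later retrieve the earlier image of a component that has since failed.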

Table 1: Uptime and Maximum Downtime

Table 2: Data Center Downtime Losses by Industry Sector

Estimates for other industries provide a cross-check. A 2004 survey, for instance, put losses for brokerage operations at $4,500,000/hour, banking operations at $2,100,000/hour, media operations at $1,150,000/hour and e-commerce operations at $113,000/hour. Retail operations trailed at $90,000/hour. Share value can also be affected: eBay's outages in 1999 saw its shares temporarily drop by over 26 percent, while E*Trade's similar problems saw a temporary drop of 22 percent.

(2. Hiles, Andrew 2004)
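To show how an uptime percentage and an hourly loss figure combine, here is a rough sketch. The hourly losses are the survey figures quoted above; the availability levels passed in and the function names are assumptions chosen for illustration, and the 8,760-hour year ignores leap years.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

# Hourly losses quoted in the text (2004 survey, US dollars per hour).
LOSS_PER_HOUR = {
    "brokerage": 4_500_000,
    "banking": 2_100_000,
    "media": 1_150_000,
    "e-commerce": 113_000,
    "retail": 90_000,
}


def max_downtime_hours(availability_percent):
    """Hours per year a site may be down at the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


def annual_loss(sector, availability_percent):
    """Estimated yearly loss if all of the allowed downtime is used."""
    return max_downtime_hours(availability_percent) * LOSS_PER_HOUR[sector]


if __name__ == "__main__":
    for availability in (99.9, 99.99, 99.999):  # assumed example levels
        hours = max_downtime_hours(availability)
        loss = annual_loss("brokerage", availability)
        print(f"{availability}% uptime: {hours:.3f} h/yr, ${loss:,.0f}")
```

At 99.9 percent availability the allowed downtime is about 8.76 hours per year, which at the brokerage figure corresponds to roughly $39 million.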

IR Commissioning of Data Center Equipment

The commissioning process should include these types of equipment and considerations. The following infrastructure support equipment should be tested:

  • Cooling systems, including chillers and all HVAC equipment.
  • CRAC (computer room air conditioning) units.
  • All associated switchgear.
  • Emergency diesel generator systems.
  • ATS (automatic transfer switch) equipment.
  • UPS modules.
  • Resistive load banks and associated cables/connectors.
  • Static transfer switches.
  • Rotary UPS system, if applicable.
  • Battery banks, breakers and charging systems.
  • Transformers (utility and site).
  • PDU (power distribution unit) equipment.
  • All distribution electrical panels.

Loading Considerations:

  • All normal switchgear and electrical panels need to be checked under load.
  • Test generator leads and emergency source for the automatic transfer switches under load.
  • Resistive load banks must be attached to the PDUs and tested with increasing load percentages.
  • Each UPS module must be tested independently including a full load battery test.
  • UPS battery connections and individual battery cells should be checked during and after the discharge.
  • Rotary UPS systems must be checked during operation. (Rotary systems use the same rectifier technology as static topologies on the front end to create DC from AC, but use spinning motor-generators to re-create the sine wave on the output.)
  • Each PDU must be tested on both the preferred and alternate sources as well as in each respective bypass.
  • All normal transfers should be verified operable.
  • PDU distribution breakers must be checked after they are put into service on the panel boards.
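The load bank item above calls for testing at increasing load percentages. A small sketch of such a step plan follows; the 25/50/75/100 percent steps and the `rated_kw` parameter are assumptions for illustration, since the text does not specify the increments.

```python
def load_steps(rated_kw, percentages=(25, 50, 75, 100)):
    """Return (percent, kW) pairs in increasing order of load.

    rated_kw is the rated capacity of the PDU under test; the default
    percentages are an assumed example, not a prescribed sequence.
    """
    if rated_kw <= 0:
        raise ValueError("rated_kw must be positive")
    steps = sorted(set(percentages))
    if any(p <= 0 or p > 100 for p in steps):
        raise ValueError("percentages must be in the range (0, 100]")
    return [(p, rated_kw * p / 100) for p in steps]


if __name__ == "__main__":
    # Hypothetical 150 kW PDU; an infrared scan would follow each step.
    for percent, kw in load_steps(150):
        print(f"{percent:>3}% load = {kw:.1f} kW")
```

Sorting the steps keeps the load increasing even if the percentages are supplied out of order.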

Figure 1) Resistive load bank is shown with a bad cable connection.


Causes for Electrical Failure and Downtime in Data Centers

The critical power distribution system takes conditioned power from the UPS and distributes it throughout the facility to individual loads. Most site failures occur in areas where hot electrical work is required and physical maintenance is difficult to perform.

Typical causes for failures include:

· cover slipped while accessing load panel,

· overheated breakers tripped unexpectedly,

· wires were not physically secured under screws,

· screws were not torqued adequately,

· wires or circuit breaker handles were dislodged while adjacent work was being performed,

· screws were stripped,

· insulation was skinned causing faulted wires,

· rotations were reversed.

(3. UpTime Institute, 2006)

Infrared Applications for Servers and Server Racks

Ten percent of all server racks currently in service are too hot to meet industry standards for maximum IT reliability and performance.

"Institute research into computer room cooling indicates 1/3 all perforated tiles are incorrectly located and 60% of all available cooling capacity is being wasted by bypass airflow. Increasing under-floor static pressure to get air where it needs to go requires permanently blocking all unnecessary air escape routes. This includes sealing cable cutouts behind and underneath products or racks (this unmanaged airflow is what is really cooling most computer rooms) as well as the penetrations in the floor or walls or ceiling and any other openings in the raised floor. Perforated floor tiles with 25% openings can be replaced with 40% and 60% grates to permit a much higher airflow. For sites with unused raised floor space deliberately spreading equipment out to create white space and reduce the averaged gross watts per square foot power consumption will be a viable option."

(4. Brill, Kenneth 2006)

Figure 2) Server cooling fans are shown. The top fan is operating normally, while the bottom fan has failed.


Figure 3) High density server being tested at increasing CPU utilizations.


Server infrared applications include:

· Thermally mapping complete data center from sub-floor to ceiling.

· Verifying proper hot aisle/cold aisle operation to prevent short-circuiting and bypassing of airflow.

· Verifying high density server farm cooling capabilities.

· Monitoring server rack temperature distribution patterns.

· Finding internal server fans which are inoperable or damaged.

Figure 4) Verifying proper hot aisle/cold aisle operation.


Safety Considerations

Of course, the thermographer must comply with all OSHA and NFPA 70E regulations. The good news is that, unlike most industrial sites, switchgear rooms and data centers have controlled temperatures and low humidity, which makes wearing arc flash suits and associated safety equipment much less onerous for the thermographer.

Figure 5) Thermographer inspecting a battery bank during full battery discharge testing.


How does a thermographer become "qualified" and obtain contracts to do data center thermal survey work?

First, the thermographer must understand the critical nature of the equipment being tested as well as the surrounding equipment. Furthermore, he/she should understand that the work he/she is performing is critical and vital to the operation. A thermographer wanting to do this type of work should get general training and certification on electrical switchgear and also get specific training on data center equipment. He/she should contact UPS vendors and their clients and cultivate relationships with them.

Since this work carries high accountability, the methodology for performing the surveys and creating the reports must be "upgraded" from that used for the typical office building or factory. This means the thermographer must use a high-resolution, sensitive, radiometric thermal imager and learn how to record all thermal, visual and textual data with a detailed data logging system. Also, data center work schedules often include nighttime maintenance windows from Saturday midnight until Sunday morning, so the thermographer must get used to working during off-peak times.

We know that large companies commission all data center equipment, but do smaller companies have UPS and server systems? Absolutely! To complete the commissioning process successfully and maintain the systems, large and small companies must find thermographers who are close by and have experience in critical facility activities. A thermographer interested in providing these services must be commercially available to UPS, electrical and facilities maintenance contractors. Having a great professional reputation with no accidents or system failures is essential to being the preferred thermographer for data center infrared work. What these infrared service clients want are the most professional, experienced and qualified thermographers in the electrical infrared industry.

References:

1. Hiles, Andrew, (2004) Five Nines: Chasing the Dream? <continuitycentral.com> Continuity Central (12/18/06)

2. Hiles, Andrew, (2004) Five Nines: Chasing the Dream? <continuitycentral.com> Continuity Central (12/18/06)

3. UpTime Institute, (2006) Procedures and Guidelines for Safely Working in an Active Data Center pg 9., <uptimeinstitute.org> UpTime Institute (12/18/06).

4. Brill, Kenneth, (2006) 2005-2010 Heat Density Trends in Data Processing, Computer Systems, and Telecommunications Equipment: Perspectives, Implications and the Current Reality in Many Data Centers. P. 13 <uptimeinstitute.org> UpTime Institute (12/18/06).

Author Bio:

Eric R. Stockton received a BA in Zoology from the University of North Carolina at Chapel Hill in 1982. He was an environmental consultant for Carolina Power and Light's Shearon Harris Nuclear Power Plant for 14 years before becoming Vice President of Stockton Infrared Thermographic Services, Inc. He now manages the CompuScanIR™, ElectriScanIR™ and ConnectIR™ divisions.

Copyright January 2007

Published at IR/Info 2007 Conference in Orlando, FL. by Infraspection Institute
