Disaster Recovery Planning
It's All About Power
By Dian Schaffhauser
In the strategic scheme of DRP, straightforward issues of power redundancy and backup often are afterthoughts. But here’s your chance to assess and plan now, before that next power interruption costs your campus dearly.
DELIVERY TRUCKS TAKE DOWN POWER POLES. Squirrels sneak into transformers. Summer heat spawns brownouts and blackouts. Power grids seize. Blizzards, earthquakes, tornadoes, and hurricanes proliferate yearly. And then, of course, there’s always construction, scheduled upgrades, and the unscheduled human error "events" that lurk everywhere. No matter the cause, when the power goes out, your data and operations are at risk.
In fact, data loss from life’s little power calamities may be the most common form of IT disaster any campus can face. According to a 2007 industry association survey, among higher education institutions reporting a disruptive occurrence in the five years before the study, 82 percent said the most common event was simple electrical failure. Even a momentary outage can knock out phone service, emergency notification devices, websites, e-mail, specialty equipment, security systems, research projects, legacy hardware, and critical networks.
UPS Plus Communication Hub Generator
Theresa Rowe, CIO at 18,000-student Oakland University near Detroit, was enjoying Christmas Eve 2006 at home when she got the call. As she describes in an industry association post, that evening a systems administrator had noticed that e-mail was no longer available. When he went to the data center to figure out what the problem was, he found the temperature over 115 degrees in the room; both air conditioning units had failed. Rowe and the facilities manager arrived to discover that some of the servers had gone into automated shutdown due to the heat. The systems administrator proceeded to shut down everything except the most essential systems. Then the small team threw open the doors to the Michigan winter, turned on the fans, and called in the contracted HVAC service provider. When the technician finally arrived, he found that the main electrical feed to the roof chillers wasn’t working. So the next call went to the university electricians, who found the circuit breaker had tripped. The breaker was reset, cooling was restored, and everybody went home around 2 am.
Still, Rowe recounts, she and the facilities manager "didn’t have a good feeling." When they returned to the data center on Christmas Day, they found that the situation had repeated itself. Eventually, the electricians figured out that the power feed to the distribution panel from the 480-volt substation had developed a fault to ground and caused the breakers to trip. A temporary fix was created by rewiring the air conditioning units to another panel.
The good news? Through it all, says Rowe, "We had our drill down." Based on the school’s experience in the great Northeast blackout of 2003, "We knew what an electrical outage would do to our space and how to manage through it." There were problems, however, admits Rowe: "Until a few years ago, whenever power went out, our phone systems went out 15 minutes later." Finally, during the 2003 outage, she says, "Our campus community agreed that kind of standard was no longer tolerable."
Rowe was given budget dollars to restructure the communications hub, including phones and the internet, to include an Eaton Powerware UPS and power generator (Generac Power Systems); both models are now out of production. Now, she’s hoping the data center itself will get a generator, but the school is trying to blend that into a broader campuswide generator plan. "Funding is always an issue, particularly in Michigan," sighs Rowe.
Until then, when the power goes out, she says, "We have about 15 minutes before we have to make shutdown decisions. Then we start with non-critical services and turn those off to give people access to critical systems as long as possible."
Monitoring for Single Points of Failure
Darrin Zeller is operations manager for The University of Tennessee-Knoxville (enrollment: 26,400), where he manages central IT operations in the areas of environment and monitoring. Although most of the outages on his campus are brief (usually caused by electrical storms), more lengthy outages can occur when animals get into the substations, causing shorts, or when crews need to bring power down on campus to do electrical work during construction projects.
Of course, outages can occur for all sorts of reasons. Recently, a breaker that feeds power to the Eaton Powerware Plus UPS from an emergency panel in an electrical room tripped during a switchover from the power generator back to the utility company. Zeller had no way of monitoring that specific power feed, because it was part of what the facilities department monitors. So even though service should have been restored to normal, the IT operations were running off of batteries. Within 15 minutes, the batteries went dead and everything crashed. The culprit, discovered a month or two after the outage: aged wiring that had gone bad.
From that incident and others like it, Zeller has learned an important lesson: "Any place there’s the potential for a single point of failure, eventually there will probably be a failure."
UT-Knoxville’s data center, which contains hundreds of servers, is housed in a seven-story building. Only Zeller’s floor is hooked up to the generator, a Caterpillar diesel generator set comprising a 3412 diesel engine and an SR4B generator, which provides 750kVA/600kW of coverage. The only service the generator maintains on the other floors is emergency lighting, a small power load compared to the requirements of the data center. If the power stays off for longer than about 10 seconds, says Zeller, the generator starts up and makes the switch. Staff inside the data center notice the switchover only because the lights (which aren’t on a UPS) will go off for a few seconds, then come back on. That’s it. "Everything stays up and running," he says.
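The generator's two ratings describe different things: kVA is apparent power, kW is real power, and their ratio is the power factor. A minimal Python sketch of that arithmetic, using only the two figures quoted above (the function name is illustrative, not drawn from any vendor tool):

```python
def power_factor(real_kw: float, apparent_kva: float) -> float:
    """Return the ratio of real power (kW) to apparent power (kVA)."""
    if apparent_kva <= 0:
        raise ValueError("apparent power must be positive")
    return real_kw / apparent_kva


# Ratings quoted for the UT-Knoxville generator set: 750kVA/600kW
pf = power_factor(600, 750)
print(f"power factor: {pf:.2f}")
```

A ratio of 0.8 means the real load the generator can carry is 80 percent of its kVA rating, which is the number that matters when sizing it against data center equipment.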
Zeller estimates that the generator can sustain an outage of 24 to 36 hours at full load without refueling. He’s never had the chance to time it, because the campus has never had an outage of that magnitude. But the generator is sized to cover services over a holiday, he says: enough time to allow a truck to get to the campus to refuel it.
That doesn’t mean the data center power management setup is foolproof. "There are always those single points of failure," says Zeller, adding practically that "it’s a space-and-money issue to go fully redundant on a lot of these things [UPS, generator, electrical box]." But, he says, "We’re doing much better than we did the prior 10 years when we didn’t have a single UPS."
Still, when the generator finally came in (after several years of delay, due to campus budgetary limits), Zeller made a decision to cut back on the number of batteries in the UPS, from three cabinets’ worth to two. (The data center could run on only half of that, but he prefers to have a level of redundancy.) That saved the school roughly $6,000. (The generator itself cost about $100,000; related hardware and installation costs pushed the entire project to about $500,000.)
Zeller considers cooling and humidity control in the data center a vital aspect of keeping services running. The optimal temperature is 70 degrees, he says; the optimal humidity level 45 percent, plus or minus 5 percent. "If the humidity gets too low, we have a static electricity problem, and if it gets too high, we have a condensation problem," he explains. The data center uses Liebert Deluxe System/3 chilled water environmental control systems, which can humidify and dehumidify, based on sensors in the units.
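The humidity band Zeller describes is simple enough to express as a check. The Python sketch below is an illustration only: the function name and return strings are made up for this example, and it is not how the Liebert units' own sensors work. The 45 percent target and 5-point tolerance come from the figures above; no tolerance is given for temperature, so it is left out.

```python
def humidity_status(rh_percent: float,
                    target: float = 45.0,
                    tolerance: float = 5.0) -> str:
    """Classify a relative-humidity reading against the target band."""
    if rh_percent < target - tolerance:
        return "too dry: static electricity risk"
    if rh_percent > target + tolerance:
        return "too humid: condensation risk"
    return "ok"


# Sample readings (hypothetical values)
for reading in (38.0, 45.0, 52.5):
    print(reading, humidity_status(reading))
```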
Right now, the UPS in the data center runs at about 80 percent capacity. Zeller maintains a graph of growth and power usage in the center, and based on the campus’s current rate of growth, he says, "We don’t have too much longer before we’re going to run out of power." By then he hopes to have moved some of the data center equipment that’s part of a research computer cluster, as well as test and development systems, into a building that the university recently purchased, which is already outfitted with a generator. To that end, he’s worked with the facilities people to design what the site will need in terms of wiring and a new UPS.
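A growth graph like Zeller's boils down to a projection of when load reaches capacity. The Python sketch below shows one way to estimate that, assuming steady compound growth. Only the 80 percent starting point comes from the article; the 1 percent monthly growth rate and the function name are hypothetical, chosen purely for illustration.

```python
import math


def months_until_full(current_fraction: float, monthly_growth: float) -> float:
    """Months until load reaches 100% of capacity under compound growth."""
    if not 0 < current_fraction < 1:
        raise ValueError("current_fraction must be between 0 and 1")
    if monthly_growth <= 0:
        return math.inf  # load is flat or shrinking: capacity never runs out
    return math.log(1.0 / current_fraction) / math.log(1.0 + monthly_growth)


# 80 percent utilization today, growing at an assumed 1 percent per month
months = months_until_full(0.80, 0.01)
print(f"roughly {months:.0f} months of headroom")
```

Under those assumptions the headroom is a little under two years, which is why moving the research cluster and test systems to another building buys meaningful time.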
Emergency Power 101
If your emergency power plan doesn't include these three components, go to the back of the class.
- An uninterruptible power supply (UPS) kicks into action the moment electrical power from a utility stops flowing. In a well-outfitted data center, the idea is to have battery backup sufficient to keep all critical equipment running until the power generator turns on, warms up, stabilizes its flow of electricity, and takes over.
- A power generator, fueled by natural gas, diesel, or some other source, provides a flow of power for long-term outages, whether 15 minutes or 15 hours: whatever exceeds the capabilities of the UPS units. (Diesel requires on-site fuel storage, but there's less concern about spontaneous fuel combustion, there are ample suppliers, and it has a reputation for lower maintenance and longer life. With natural gas, there's no need for a fuel tank, and some people believe it's "cleaner," but if an earthquake or similar disaster disrupts the gas lines, the campus would have no fuel supply. Still, few agree about which approach is more affordable.) The best models, say the pros, automatically sense when utility power is unavailable and begin their warm-up process. Once a UPS detects that a stable source of power is coming in from a generator, it stops drawing on its batteries in favor of the generator's supply.
- Air conditioning is the oft-forgotten aspect of emergency power. Without proper cooling, equipment in a data center will eventually fail due to overheating. Any emergency power scheme has to ensure that AC units and other forms of coolers are hooked into the fallback generator system so that cooling doesn't go down in the event of an outage.
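The handoff among these components can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's control logic: the function name and parameters are hypothetical, and the 15-minute battery runtime is borrowed from the campus accounts in this article as a default assumption.

```python
from typing import Optional


def power_source(seconds_since_outage: float,
                 generator_ready_after: Optional[float] = None,
                 ups_runtime: float = 15 * 60) -> str:
    """Say which source carries the load at a given moment after utility power fails.

    generator_ready_after: seconds until the generator has started, warmed up,
    and stabilized; None means the site has no generator.
    ups_runtime: seconds the UPS batteries can carry the load (assumed value).
    """
    if generator_ready_after is not None and seconds_since_outage >= generator_ready_after:
        return "generator"
    if seconds_since_outage < ups_runtime:
        return "ups_battery"
    return "down"


# A site with a generator that is stable 40 seconds in (hypothetical figure)
print(power_source(5, generator_ready_after=40))
print(power_source(60, generator_ready_after=40))
# A site with a UPS only, 16 minutes into the outage
print(power_source(16 * 60))
```

The sketch makes the sidebar's point visible: without a generator, the UPS only postpones the shutdown, and if the generator takes longer to stabilize than the batteries last, the load drops in between.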
Other campus locations Zeller is monitoring have varying degrees of backup. The site where the campus telephone switch is located, for instance, has UPS and generator coverage. But the site where the tape robot system resides lacks a generator. "We’d like to have one, but it’s not as critical," says the operations manager. If the power goes out, "we may lose a day on our backup. That’s a level that people are willing to accept, based upon the cost to install."
New Backup Concern: Power Over Ethernet
North Carolina’s Central Piedmont Community College, with six campuses and enrollment of about 35,000, is outfitted with 88 redundant UPS units and 16 diesel (including some biodiesel) generators, all in place to keep services running during the occasional outage.
CIO Malik Rahman believes the good working relationship his IT organization has with the campus facilities organization is key to its power management success. "For any new [building] design, IT people work with end users to make sure technology needs are met and, at the same time, we work with Facilities [to define] the standards that need to be followed for wiring closets for any technologies we have on campus."
The wiring closets are a particular concern for Piedmont technologists, because the school is deep into a data convergence project with Nortel to upgrade the network within and among its six campus sites. That project includes the upgrade of the infrastructure from 1 to 10 gigabit Ethernet, as well as the implementation of voice over IP (VoIP). Currently, about a third of handsets are VoIP, with the remainder migrating as the school installs new switches with power over Ethernet (PoE).
"The key selling point of traditional telephony was the dial tone available every time the handset was picked up," says Rahman. "When we move to VoIP, we have to provide the same level of availability. Otherwise, it’s like a step backward." That requirement necessitated a level of reliability in every closet housing switches for PoE, so the campus brought in an outside engineering firm to evaluate all of the IT power requirements and study the campus emergency power generation capability. Rahman and his team quickly concluded that there were certain areas where emergency generators did need to be bumped up in size, although 90 percent of all "new" construction (completed in the previous decade) had sufficient emergency power available.
Happily, the growing pervasiveness of Ethernet-powered communications has simplified Rahman’s job of justifying funding requests for emergency power. "The network carries network data and VoIP but also carries our surveillance system video and building control systems," he explains. "Because of that, [the power backup upgrades] become a lifesafety issue. For that reason, the college has invested in emergency power availability in these closets."
Although the campus hasn’t experienced an unplanned outage of its data equipment in a "long, long time," Rahman’s team still has a shutdown process in place as part of its disaster recovery plan, which gets tested annually. The development of that process, he says, has required IT to prepare a catalog of services for tech staff, in the rudimentary stages at this time. That catalog records details of service: "Which server is this running on? Which server is the backup server? How critical is it? What is the population that will be impacted if the server is down? Which other services are either dependent on this service, or is this service dependent on?"
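A catalog built around those questions maps naturally onto a small data structure, and the dependency answers can drive the order of a shutdown. The Python sketch below is one possible shape for it, not Piedmont's actual catalog: every service name, server name, and number is hypothetical, and the criticality scale is an assumption.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Optional


@dataclass
class Service:
    name: str
    server: str                    # which server is this running on?
    backup_server: Optional[str]   # which server is the backup?
    criticality: int               # assumed scale: 1 = most critical
    users_impacted: int            # population affected if it is down
    depends_on: list[str] = field(default_factory=list)


def shutdown_order(catalog: list[Service]) -> list[str]:
    """Order services so each one stops before the services it depends on."""
    must_stop_first: dict[str, set[str]] = {s.name: set() for s in catalog}
    for s in catalog:
        for dep in s.depends_on:
            # s relies on dep, so s has to be stopped before dep
            must_stop_first.setdefault(dep, set()).add(s.name)
    return list(TopologicalSorter(must_stop_first).static_order())


# Hypothetical entries for illustration only
catalog = [
    Service("directory", "srv-01", "srv-02", 1, 35000),
    Service("email", "srv-03", "srv-04", 2, 35000, ["directory"]),
    Service("webmail", "srv-05", None, 3, 20000, ["email"]),
]
print(shutdown_order(catalog))
```

Reversing the list gives a sensible startup order, and the criticality field could be used to break ties among services that do not depend on one another.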
Still, even redundant systems can expose weak areas. In the case of Piedmont, says Rahman, its points of weakness are mainly human. A couple of years ago, he recalls, an electrician from the facilities operation walked into one of the data centers and pushed a red button, "which brought down all of our servers simultaneously." Since then, he says, IT has imposed stringent controls on entry into the data centers.
New Products Accommodate Changes in the Data Center
FIVE YEARS AGO, when Chris Turner, sales manager in the public sector division of American Power Conversion, was selling backup power devices (such as UPS units) into data centers, racks drew a maximum of 2 to 4 kilowatts of power. Since that time, with the proliferation of blade and other minimalist server hardware, more and more equipment is being squeezed into the same amount of space. Now, says Turner, some racks draw up to 20 kilowatts, which leads to more heat being generated in a very small space. To accommodate the power ramp-ups, vendors in this market now offer cooling solutions to replace traditional perimeter air conditioners that encompass entire rooms. The thinking here: Place the cooled air as close as possible to the source of the heat.
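Treating the rack figures as electrical load in kilowatts, the heat problem is easy to quantify, because nearly all the power a rack draws leaves it as heat. The Python sketch below converts rack load to cooling load using the standard conversions of about 3,412 BTU per hour per kilowatt and 12,000 BTU per hour per ton of refrigeration; the function name is illustrative, and the calculation ignores real-world factors such as airflow and cooling-unit efficiency.

```python
BTU_PER_HR_PER_KW = 3412.14    # heat given off by 1 kW of electrical load
BTU_PER_HR_PER_TON = 12000.0   # one ton of refrigeration


def cooling_load(rack_kw: float) -> tuple[float, float]:
    """Return (BTU per hour, tons of cooling) needed to remove a rack's heat."""
    btu_hr = rack_kw * BTU_PER_HR_PER_KW
    return btu_hr, btu_hr / BTU_PER_HR_PER_TON


# Compare an older 4 kW rack with a dense 20 kW rack
for kw in (4, 20):
    btu_hr, tons = cooling_load(kw)
    print(f"{kw} kW rack: {btu_hr:,.0f} BTU/hr, about {tons:.1f} tons of cooling")
```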
Gary Forbes, power and cooling specialist for CDW-G, points out that every vendor in this end of the business has a different technique and strategy for delivering the optimal cooling. "Some cooling devices reside in the rack itself; some next to it; some on top of it." What determines the best approach, he says, is "how much space you have."
In the case of APC, the InfraStruXure InRow RC cooling solution itself resembles a very narrow rack. It captures heat directly from the rack aisle, and distributes cool air, maintaining equipment temperatures preset to optimal levels.
Forbes says companies in his market space, including his own, also are pushing green solutions which, like hybrid automobiles, can go into energy conservation mode and run even more efficiently. They may take a form as simple as "0U" (zero U, meaning they take up no height on the rack) surge strips that can be managed remotely, fit directly on the rack, and measure how much power is being drawn from each device. Or the greener form may be as complex as an entire data center infrastructure that arrives in modular form and can be expanded as data center needs grow.
As APC's Turner explains, to date it has been common for campus technologists to design a data center according to what they thought their power needs would be in five or 10 years, and then size their backup solutions for that escalating scenario. Yet often, he says, "Those power needs never grew to where [campus IT and facilities] thought they'd be, which is very inefficient." He adds that the approach ties up capital and operating funds needlessly. The new methodology: Start with a minimal setup, and then add modules only as your needs grow.
As the Data Center Goes, So Goes the School
Dick Bednar, senior IT director at California State University, Fullerton, says that even though his campus has a generator that will keep the data center and PBX in operation ("as long as we keep putting diesel in"), nobody will be able to work in any of the other buildings, since they won’t have power.
But the fact that all the services delivered from that data center will remain up is the reason why backup emergency power is now so vital to a college or university. Geographic or physical location of workers accessing that data has become of lesser importance. After all, even if a school’s offices can’t stay open due to a power outage on the property, many people still can work from home via the internet. As long as the electricity keeps flowing to those other locations, a university’s work can keep humming along.