Purdue Develops Technique to Keep Computers Running in Overheated Data Center

To stay safe in hot conditions, experts recommend that workers move more slowly. The same advice fits data centers, apparently. To prevent outages and keep work moving, Purdue University has successfully tested a technique that keeps its computing clusters running in overheating conditions by slowing down their nodes.

That's proving a boon to the computing operations that researchers rely on at the Rosen Center for Advanced Computing at the West Lafayette, IN institution. The center provides computing infrastructure services to researchers on campus and around the country. Frequently, those research projects require months of continuous computing time on thousands of processors. If something shuts down computing operations while a massive multi-month calculation is being performed, the job usually has to start again from the beginning. In other words, an outage is "guaranteed" to affect many groups on campus, according to Patrick Finnegan, Unix systems administrator in Rosen's IT Systems and Operations group.

Power outages are actually infrequent at the data center, Finnegan said. But he added that this summer, "due to some planned cooling system maintenance, coupled with the unusually hot summer, we have had some brief cooling outages." When the temperature in the data center exceeds a certain point, the racks of computers have to be shut down. If that's not done intentionally, they'll shut themselves down. And that, said Mike Shuey, high-performance computing systems manager, "has ripple effects on the research efforts of the university for weeks afterward."

Shutdowns have been called for twice so far this summer, according to Finnegan. "In both instances, the cause was a temporary capacity reduction in the campus chilled water supply." That 50-degree water supply cools the entire facility, which includes about 15,000 processors, along with other computing systems in use in the space. When the cooling system is turned off, temperatures in the room can reach the high 80s and 90s.

To address the planned outages, Finnegan developed a technique that allows the center to continue operating the computers--though at a reduced performance level. "Basically, I use the power saving features present in almost all modern systems to slow down the system, while at the same time reducing power and cooling usage," he explained. "This is similar to how your laptop saves power to extend battery life. Then when things are back to normal, we just turn the systems back up to full speed, and everything takes off like normal."
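The article does not say which mechanism Finnegan's software uses. As a rough illustration only, one common way to slow a Linux node is to cap each CPU's maximum clock through the kernel's cpufreq sysfs files. The sketch below assumes that interface is present on the node; the function name `throttle_cpus` and its `fraction` parameter are hypothetical and are not taken from Purdue's tool.

```python
from pathlib import Path

# Standard location of the Linux cpufreq sysfs tree (assumed to be present).
CPU_ROOT = Path("/sys/devices/system/cpu")


def throttle_cpus(fraction, cpu_root=CPU_ROOT):
    """Cap each CPU's maximum frequency at `fraction` of its hardware range.

    fraction=0.0 pins CPUs to their lowest frequency; 1.0 restores full speed.
    Returns a dict mapping CPU name to the cap written, in kHz.
    Hypothetical helper for illustration, not Purdue's implementation.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    caps = {}
    for cpufreq in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq")):
        # Hardware limits reported by the kernel, in kHz.
        lo = int((cpufreq / "cpuinfo_min_freq").read_text())
        hi = int((cpufreq / "cpuinfo_max_freq").read_text())
        cap = lo + int((hi - lo) * fraction)
        # Writing scaling_max_freq normally requires root privileges.
        (cpufreq / "scaling_max_freq").write_text(str(cap))
        caps[cpufreq.parent.name] = cap
    return caps
```

Under these assumptions, an administrator might call `throttle_cpus(0.0)` on every node when cooling is lost and `throttle_cpus(1.0)` once temperatures recover, which mirrors the "turn the systems back up to full speed" step Finnegan describes.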

After Finnegan's system was implemented, he said, temperature sensors showed the IT crew that temperatures were falling again. In one instance, on an AMD-based cluster, power usage dropped from an average of about 290 kilowatts to about 205 kW, with a performance decrease of between 50 percent and 70 percent.
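To put the cluster figures in proportion, the short calculation below works out the power saving from the two averages quoted above. The function name `power_reduction` is made up for this example, and the inputs are the article's approximate values.

```python
def power_reduction(before_kw, after_kw):
    """Return (kilowatts saved, percent saved) for a drop in power draw."""
    saved_kw = before_kw - after_kw
    saved_pct = 100.0 * saved_kw / before_kw
    return saved_kw, saved_pct


# Approximate averages reported for the AMD-based cluster.
saved_kw, saved_pct = power_reduction(290.0, 205.0)
print(f"Saved about {saved_kw:.0f} kW ({saved_pct:.1f}% of the original draw)")
# prints: Saved about 85 kW (29.3% of the original draw)
```

By these numbers, power draw fell by a little under a third while performance fell by half or more, so the saving was smaller in proportion than the slowdown.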

"The program worked, and the datacenter didn't overheat, so the process was a success. We actually were a bit surprised it worked so seamlessly," said Shuey. "It's much better to have jobs run slowly for an hour than to throw away everyone's work in progress and mobilize staff to try to fix things."

With the successful use of his scheme, Finnegan became a data center hero. "I was a bit overwhelmed by all of the positive responses I got," he said.

The Purdue crew has written up its procedures and is making them available through a storefront on foliodirect, a Web site that sells licensable university technology. "High Performance Computing Power Saving Device" is priced at $250.

According to an abstract for the report, the software runs on most Linux distributions on 64-bit x86 processors from AMD and Intel.

About the Author

Dian Schaffhauser is a former senior contributing editor for 1105 Media's education publications THE Journal, Campus Technology and Spaces4Learning.
