Purdue Develops Technique to Keep Computers Running in Overheated Data Center

To stay safe in hot conditions, experts recommend that workers move more slowly. The same advice fits data centers, apparently. To prevent outages and keep work moving, Purdue University has successfully tested a technique that keeps its computing clusters running when the data center overheats by slowing down their nodes.

That's proving a boon to the computing operations that researchers rely on at the Rosen Center for Advanced Computing at the West Lafayette, IN institution. The center provides computing infrastructure services to researchers on campus and around the country. Frequently, those research projects require months of continuous computing time on thousands of processors. If something shuts down the computers while a massive multi-month calculation is running, the job usually has to start again from the beginning. In other words, an outage is "guaranteed" to affect many groups on campus, according to Patrick Finnegan, Unix systems administrator in Rosen's IT Systems and Operations group.

Power outages are actually infrequent at the data center, Finnegan said. But he added that this summer, "due to some planned cooling system maintenance, coupled with the unusually hot summer, we have had some brief cooling outages." When the temperature in the data center exceeds a certain point, the racks of computers have to be shut down. If that's not done intentionally, they'll shut themselves down. And that, said Mike Shuey, high-performance computing systems manager, "has ripple effects on the research efforts of the university for weeks afterward."

Shutdowns have been called for twice so far this summer, according to Finnegan. "In both instances, the cause was a temporary capacity reduction in the campus chilled water supply." That 50-degree water supply cools the entire facility, which includes about 15,000 processors, along with other computing systems in use in the space. When the cooling system is turned off, temperatures in the room can reach the high 80s and 90s.

To address the planned outages, Finnegan developed a technique that allows the center to continue operating the computers--though at a reduced performance level. "Basically, I use the power saving features present in almost all modern systems to slow down the system, while at the same time reducing power and cooling usage," he explained. "This is similar to how your laptop saves power to extend battery life. Then when things are back to normal, we just turn the systems back up to full speed, and everything takes off like normal."
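The article does not say which mechanism Finnegan's software uses. As a rough illustration of the general idea only, the sketch below caps CPU clock speed through the standard Linux cpufreq sysfs files (`cpuinfo_min_freq`, `cpuinfo_max_freq` and `scaling_max_freq`). The function names `throttle` and `restore` are hypothetical, and the assumption that frequency capping is the power-saving feature in question is ours, not Purdue's.

```python
from pathlib import Path

# Standard Linux sysfs location for per-CPU controls. Writing here needs root.
CPU_ROOT = Path("/sys/devices/system/cpu")


def read_khz(path: Path) -> int:
    """Read a cpufreq file, which holds one frequency value in kHz."""
    return int(path.read_text().strip())


def _set_cap_from(limit_file: str, root: Path) -> int:
    """Copy each CPU's hardware limit into scaling_max_freq; return CPUs changed."""
    changed = 0
    # Match cpu0, cpu1, ... but not sibling directories such as cpufreq or cpuidle.
    for cpufreq_dir in sorted(root.glob("cpu[0-9]*/cpufreq")):
        limit = read_khz(cpufreq_dir / limit_file)
        (cpufreq_dir / "scaling_max_freq").write_text(f"{limit}\n")
        changed += 1
    return changed


def throttle(root: Path = CPU_ROOT) -> int:
    """Hypothetical helper: cap every CPU at its lowest supported frequency."""
    return _set_cap_from("cpuinfo_min_freq", root)


def restore(root: Path = CPU_ROOT) -> int:
    """Hypothetical helper: lift the cap back to the highest supported frequency."""
    return _set_cap_from("cpuinfo_max_freq", root)
```

On a real cluster an administrator would presumably run something like `throttle()` on every node when the room starts to heat up, then `restore()` once cooling returns. The `root` parameter exists only so the functions can be tried against a scratch directory instead of the live system.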

Once Finnegan's system was running, he said, temperature sensors showed the IT crew that the room was cooling down again. In one instance, on an AMD-based cluster, power usage dropped from an average of about 290 kilowatts to about 205 kW, with a performance decrease of between 50 percent and 70 percent.
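To put those figures in proportion, the short calculation below works out the size of the power drop. It uses only the two approximate averages quoted above; the variable and function names are illustrative.

```python
def percent_drop(before: float, after: float) -> float:
    """Return the reduction from before to after as a percentage of before."""
    return (before - after) / before * 100.0


power_before_kw = 290.0  # approximate average before throttling, per the article
power_after_kw = 205.0   # approximate average while throttled, per the article

saving_kw = power_before_kw - power_after_kw                 # 85.0 kW
saving_pct = percent_drop(power_before_kw, power_after_kw)   # about 29.3

print(f"Power saved: {saving_kw:.0f} kW ({saving_pct:.1f}% of the original draw)")
```

By this arithmetic the cluster drew roughly 29 percent less power, a smaller proportional change than the reported 50 to 70 percent loss in performance.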

"The program worked, and the datacenter didn't overheat, so the process was a success. We actually were a bit surprised it worked so seamlessly," said Shuey. "It's much better to have jobs run slowly for an hour than to throw away everyone's work in progress and mobilize staff to try to fix things."

With the successful use of his scheme, Finnegan became a data center hero. "I was a bit overwhelmed by all of the positive responses I got," he said.

The Purdue crew has written up its procedures and is making them available through a storefront on foliodirect, a Web site that sells licensable university technology. "High Performance Computing Power Saving Device" is priced at $250.

According to an abstract for the report, the software runs on most Linux distributions on 64-bit x86 AMD and Intel processors.

About the Author

Dian Schaffhauser is a former senior contributing editor for 1105 Media's education publications THE Journal, Campus Technology and Spaces4Learning.
