Purdue Develops Technique to Keep Computers Running in Overheated Data Center

To stay safe in hot conditions, experts recommend that workers move more slowly. The same advice fits data centers, apparently. To prevent outages and keep work moving, Purdue University has successfully tested a technique that keeps its computing clusters running in overheating conditions by slowing down their nodes.

That's proving a boon to the computing operations that researchers rely on at the Rosen Center for Advanced Computing at the West Lafayette, IN, institution. The center provides computing infrastructure services to researchers on campus and around the country. Frequently, those research projects require months of continuous computing time on thousands of processors. If something shuts down the computers while a massive multi-month calculation is running, the job usually has to start again from the beginning. In other words, an outage is "guaranteed" to affect many groups on campus, according to Patrick Finnegan, Unix systems administrator in Rosen's IT Systems and Operations group.

Power outages are actually infrequent at the data center, Finnegan said. But he added that this summer, "due to some planned cooling system maintenance, coupled with the unusually hot summer, we have had some brief cooling outages." When the temperature in the data center exceeds a certain point, the racks of computers have to be shut down. If that's not done intentionally, they'll shut themselves down. And that, said Mike Shuey, high-performance computing systems manager, "has ripple effects on the research efforts of the university for weeks afterward."

Twice so far this summer shutdowns have been called for, according to Finnegan. "In both instances, the cause was a temporary capacity reduction in the campus chilled water supply." That 50-degree water supply cools the entire facility, which includes about 15,000 processors, along with other computing systems in use in the space. When the cooling system is turned off, temperatures in the room can reach the high 80s and 90s.

To address the planned outages, Finnegan developed a technique that allows the center to continue operating the computers, though at a reduced performance level. "Basically, I use the power saving features present in almost all modern systems to slow down the system, while at the same time reducing power and cooling usage," he explained. "This is similar to how your laptop saves power to extend battery life. Then when things are back to normal, we just turn the systems back up to full speed, and everything takes off like normal."
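The article does not say how Finnegan's software is implemented. As a rough illustration of the general idea only, the sketch below assumes a Linux node that exposes the standard cpufreq files under /sys/devices/system/cpu and caps each CPU at its lowest supported frequency, then restores the maximum later. The function name set_max_freq and its parameters are hypothetical and are not taken from the Purdue tool.

```python
from pathlib import Path

# Standard Linux cpufreq location (assumption: the node exposes cpufreq here).
CPU_ROOT = Path("/sys/devices/system/cpu")


def set_max_freq(cpu_root: Path, throttle: bool) -> int:
    """Cap every CPU at its minimum frequency, or restore the hardware maximum.

    Hypothetical helper for illustration; not the Purdue implementation.
    Returns the number of CPUs whose frequency cap was written.
    """
    # cpuinfo_min_freq / cpuinfo_max_freq hold the hardware limits in kHz.
    source = "cpuinfo_min_freq" if throttle else "cpuinfo_max_freq"
    changed = 0
    for policy in sorted(cpu_root.glob("cpu[0-9]*/cpufreq")):
        target_khz = (policy / source).read_text().strip()
        # scaling_max_freq is the upper bound the frequency governor may use.
        (policy / "scaling_max_freq").write_text(target_khz)
        changed += 1
    return changed
```

On a real node, a call such as set_max_freq(CPU_ROOT, throttle=True) would need root privileges, and a later call with throttle=False would bring the CPUs back to full speed once cooling returns.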

After Finnegan's system was implemented, he said, temperature sensors showed the IT crew that temperatures were falling again. In one instance, on an AMD-based cluster specifically, power usage dropped from an average of about 290 kilowatts to about 205 kW, with a performance decrease of between 50 percent and 70 percent.
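To put the reported figures in proportion, the short calculation below works out the size of the power reduction on that cluster. It uses only the two approximate averages quoted above; the variable names are illustrative.

```python
# Approximate averages reported for the AMD-based cluster.
baseline_kw = 290.0   # before throttling
throttled_kw = 205.0  # after throttling

saved_kw = baseline_kw - throttled_kw
saved_fraction = saved_kw / baseline_kw

print(f"Power saved: {saved_kw:.0f} kW ({saved_fraction:.1%})")
# prints "Power saved: 85 kW (29.3%)"
```

In other words, a cut of roughly 29 percent in power draw came with the 50 to 70 percent slowdown the article reports.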

"The program worked, and the datacenter didn't overheat, so the process was a success. We actually were a bit surprised it worked so seamlessly," said Shuey. "It's much better to have jobs run slowly for an hour than to throw away everyone's work in progress and mobilize staff to try to fix things."

With the successful use of his scheme, Finnegan became a data center hero. "I was a bit overwhelmed by all of the positive responses I got," he said.

The Purdue crew has written up its procedures and is making them available through a storefront on foliodirect, a Web site that sells licensable university technology. "High Performance Computing Power Saving Device" is priced at $250.

According to an abstract for the report, the software runs on most distributions of Linux running x86 64-bit AMD and Intel processors.

About the Author

Dian Schaffhauser is a former senior contributing editor for 1105 Media's education publications THE Journal, Campus Technology and Spaces4Learning.
