Learning To Share

With a centrally managed data center, researchers can have their cake—supercomputing power at lower cost—and eat it too. Yet convincing them to pool resources can be a tricky political process

As 2010 began, the University of Washington was preparing to launch its first shared high-performance computing cluster, a 1,500-node system called Hyak, dedicated to research activities.

Like other research universities, UW had learned the hard way that having faculty members purchase and run their own high-performance computing nodes in multiple small data centers scattered across campus is inefficient and expensive. In many of those settings, the CPUs are active only 10 to 15 percent of the time, and, without coordination, the data centers become less effective over time. Such inefficiencies can even prompt research faculty to leave the university, with the result that data center capacity goes unused.
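
A rough illustration of why such low utilization is expensive: the sketch below compares the effective cost of a useful CPU-hour on a lightly used departmental node with the same node kept busy in a shared cluster. Every figure in it (per-node cost, core count, utilization rates) is a hypothetical assumption, not a number from UW.

```python
# Back-of-the-envelope sketch: effective cost of a *useful* CPU-hour at low
# versus high utilization. All figures here are hypothetical assumptions
# for illustration only.

def cost_per_useful_cpu_hour(annual_cost_per_node, cores_per_node, utilization):
    """Annualized node cost divided by the CPU-hours actually put to work."""
    total_cpu_hours = cores_per_node * 24 * 365
    return annual_cost_per_node / (total_cpu_hours * utilization)

# Assume roughly $6,000 per node per year for hardware amortization, power,
# cooling, and administration (a made-up figure).
scattered = cost_per_useful_cpu_hour(6_000, 8, 0.12)  # departmental node, ~12% busy
shared = cost_per_useful_cpu_hour(6_000, 8, 0.80)     # shared cluster kept busy

print(f"Scattered node: ${scattered:.2f} per useful CPU-hour")
print(f"Shared cluster: ${shared:.2f} per useful CPU-hour")
```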

“This was a problem that was costing the university many tens of millions of dollars,” says Chance Reschke, technical director at UW’s eScience Institute, which serves as a matchmaker between research faculty and resources, including computing platforms. Clustered nodes, he notes, enable researchers to take advantage of much greater computing power. Clusters also allow central IT organizations to offer better support, standardization, and coordination than individual colleges.

“Deans of research-heavy colleges, such as engineering, struggled to provide the facilities that faculty expected, such as power and cooling,” explains Reschke. “It became a faculty recruitment and retention issue for them. New faculty come to the university expecting to have access to this type of infrastructure. Colleges have no equipment available to satisfy that demand.”

But IT leaders at UW and other research universities say that leading the transition from scattered small research nodes to larger, centrally managed, shared resources is a tricky political process. It often requires CIOs to invest their own discretionary funds, to understand the needs and working styles of researchers, and to sell the concept to others on campus. It involves studying different business models and governance structures to determine which one would work best on a particular campus. Finally, it may require calling for some shared sacrifice.

“We had no fairy godmother sprinkling funding on this,” Reschke states. “Instead, upper-level management had to be willing to make an investment in infrastructure and management, and individual faculty members had to invest their grant funds in order to get their own personal supercomputer in a university-operated cloud—at half the expense of what it would cost them if they rolled their own.”

Building Trust

As planning for the Purdue University (IN) Steele community cluster began in the fall of 2007, the biggest challenge was to establish trust between IT and researchers, according to Gerard McCartney, Purdue’s CIO and vice president for information technology.

McCartney says that researchers often are inherently skeptical that a central IT organization can give them the service that they want, adding that “those misgivings aren’t unfounded. They are based on real experiences where they have been messed around with—[for example] machines sitting on loading docks while they wait for weeks for something to happen.” Some central IT organizations just don’t have strong project-management skills in support of research, he notes.

McCartney, who invested $1 million from his discretionary funds to support the Steele effort, cites a few decisions as key to the successful project management of the 893-node cluster, built with technology from Dell, Foundry Networks (now Brocade), and Tyco.

First, he involved researchers in the decision-making process every step of the way, including vendor decisions. After getting researchers to agree to buy from only one server vendor, McCartney’s team got bids from five vendors. He recalls holding a lunch meeting with pizza for the researchers and letting them choose which bid they liked. There were no vendor names attached. Researchers chose based on benchmarking data, prices, and warranty details. “I think the secret sauce,” he says, the pizza notwithstanding, “was letting them pick the vendor.”

Another key factor in building trust, according to McCartney, was to make cluster participation optional. Researchers could choose to buy off the cluster price menu and take charge of their own nodes, but then they would be responsible for service and repair. Only a few researchers did that the first year with the Steele project, and fewer did it the following year when another cluster was built.

“That’s really what we are looking for—two researchers at a cocktail party and one says to the other, ‘Are you really still running your own cluster?’” McCartney emphasizes. “That’s when it kicks in, when one colleague says something like that to another, not when an administrator tells him he should do something.”

Faculty Behind the Wheel

The worst-case scenario for a computing cluster is to have a build-it-and-they-will-come mentality—and then nobody comes. To ensure that wouldn't happen at Emory University (GA), three years ago CIO Rich Mendola set up a joint task force between IT and the research community to develop a business case to spend $2 million in startup funds on a high-performance computing cluster.

After pushing through a difficult IT governance process, the project team won approval for the 256-node, 1,024-CPU Sun Microsystems computing cluster. As the project moved forward, Mendola saw that involving faculty in a steering committee that developed a memorandum of understanding (MOU) about costs and service levels paid off.

“This has to be faculty-driven,” he advises. “I would never have someone on my team lead the steering committee. It wouldn’t work. You have to make sure researchers’ voices are heard.”

The single most active discussion at Emory was about making sure the charge-back model worked. The steering committee studied a range of options from giving nodes away to fully recovering all the cluster’s costs, and decided on something in between. “To get buy-in, we needed cost to be below market, so no one could procure this for themselves as cheaply,” Mendola explains, “but we didn’t want to give it away. So we settled on tiered-usage subscription charges, and we have plugged some of the funding from that back into upgrading the cluster.”
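
As an illustration only, a tiered-usage subscription charge of the kind Mendola describes might be computed as in the sketch below; the tier boundaries and hourly rates are invented for the example and are not Emory's actual pricing.

```python
# Hypothetical tiered-usage subscription charge-back. The tiers and rates
# below are invented for illustration; they are not Emory's actual prices.

TIERS = [
    (50_000, 0.08),   # first 50,000 CPU-hours per year at $0.08/hour
    (200_000, 0.05),  # CPU-hours 50,001-200,000 at $0.05/hour
    (None, 0.03),     # everything above 200,000 at $0.03/hour
]

def annual_charge(cpu_hours):
    """Compute a research group's yearly bill from its CPU-hour usage."""
    charge, lower = 0.0, 0
    for upper, rate in TIERS:
        if upper is None or cpu_hours <= upper:
            charge += (cpu_hours - lower) * rate
            break
        charge += (upper - lower) * rate
        lower = upper
    return charge

print(f"${annual_charge(120_000):,.2f}")  # a mid-sized group's hypothetical bill
```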

The steering committee continues to study usage, billing models, and ideas for future clusters.

A Condo-Commons Model

Many universities have adopted a “condo” model to structure their cluster project, whereby the central IT organization houses and maintains computing clusters for researchers on campus. In return, spare cycles on these computer nodes are available for use by the research community (in condo parlance, a common area). But each university applies a slightly different business model to its condo structure.

For example, at Purdue the Steele cluster is a condo that is “owned” by the community of researchers. Each research group or department is allocated computing time commensurate with its up-front funding contribution. This is analogous to condominium assessments calculated on a square-footage basis.
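
A minimal sketch of that assessment logic, using hypothetical research groups and contribution amounts (only the 893-node count comes from the article), might look like this:

```python
# "Condo assessment" allocation: each group's share of the cluster's
# node-hours is proportional to its up-front contribution. The group names
# and dollar amounts are hypothetical.

contributions = {"mech_eng": 300_000, "physics": 150_000, "biology": 50_000}
total_node_hours = 893 * 24 * 365  # Steele's 893 nodes over one year

total_funding = sum(contributions.values())
allocations = {
    group: total_node_hours * amount / total_funding
    for group, amount in contributions.items()
}

for group, hours in allocations.items():
    print(f"{group}: {hours:,.0f} node-hours per year")
```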

Rice University (TX) uses a different condo structure for its cluster, SUG@R (Shared University Grid at Rice), a 9-teraflop cluster of dual quad-core Sun Fire 4150s from Sun Microsystems. Unlike clusters purchased by aggregating funding from multiple researchers, SUG@R started with a commons, a shared pool providing high-performance computing to the entire campus, and then added nodes as dedicated condos for particular research needs. The price to faculty for having central IT manage a condo dedicated to their research is a 20 percent “tax” on its CPU cycles, the revenues from which provide additional computing cycles in the commons. The researchers can also take advantage of the commons’ increased computing power at scheduled times.
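
Expressed concretely, the condo “tax” is just a fixed split of each dedicated condo’s cycles; the sketch below applies the 20 percent figure described above to a hypothetical condo size and time window.

```python
# Split a dedicated condo's CPU cycles between its owner and the shared
# commons, using the 20 percent "tax" described in the article. The core
# count and time window are hypothetical.

CONDO_TAX = 0.20  # fraction of condo cycles returned to the commons

def split_cycles(condo_cores, hours):
    """Return (core-hours kept by the condo owner, core-hours for the commons)."""
    total = condo_cores * hours
    to_commons = total * CONDO_TAX
    return total - to_commons, to_commons

kept, commons = split_cycles(condo_cores=64, hours=24 * 30)  # one month
print(f"Owner keeps {kept:,.0f} core-hours; commons gains {commons:,.0f}")
```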

“In 2007, we started looking at a better way to approach research computing, but we had to come up with multiple incentives for IT, for the researchers, and for the university in order for it to be sustainable,” says Kamran Khan, vice provost for IT.

Reaching Out To Smaller Institutions

Most shared high-performance computing clusters involve IT organizations supporting researchers within the same university. But as Brown University in Providence, RI, worked to develop “Oscar,” a 166-node Linux cluster it purchased from IBM, it sought to include other institutions in the region.

Researchers at the University of Rhode Island and the Marine Biological Laboratory in Woods Hole, MA, have been among the first users of the cluster launched in November 2009.

“From the beginning of our planning, the emphasis was on trying to build relationships beyond the Brown campus,” says Jan Hesthaven, professor of applied mathematics and director of Brown’s Center for Computation and Visualization. “Ultimately, we hope to reach out to smaller colleges in Rhode Island for both research and educational purposes. It’s just a matter of getting the word out.”

Researchers at URI now have access to far greater computing power than is available on their own campus. Engineering researchers are using Oscar to help understand why and how cracks develop in different materials, and the cluster helps URI biologists study genetic coding.

Oscar is already at 75 percent capacity after only a few months in operation, and Hesthaven thinks it will be at full capacity soon. “Computers like these are like ice cream shops,” he notes. “They are always busy.”

Currently, external users of Oscar pay the same amount for access as Brown researchers do, plus associated overhead. But Brown also offers its own faculty members condo agreements in which researchers co-locate their equipment in its centralized facilities. Eventually, Hesthaven wants to extend that offer to other schools. “Obviously, charging external researchers’ schools for their share of the cost of space, power, and cooling gets tricky,” he says, “but that is our long-term goal.”

A steering committee of 19 people, both researchers and IT staff, hammered out a business model and a memorandum of understanding for each researcher to sign. The deal offers Rice researchers more computing power at lower cost. If the total cost of four nodes, including shipping, cabling, switches, maintenance, and so forth, is $18,000, the Rice researcher might pay $13,500, with central IT picking up the rest of the cost, Khan says. Because the first SUG@R condo project was successful, Rice has launched a second, and it is studying a third. It has almost 400 condo accounts.
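
That example implies a roughly 25 percent subsidy from central IT; the sketch below generalizes the split, with the subsidy rate inferred from the $18,000/$13,500 figures rather than quoted as Rice policy.

```python
# Split the fully loaded cost of a condo purchase between the researcher and
# central IT. The 25 percent subsidy rate is inferred from the $18,000 /
# $13,500 example in the article, not a stated Rice policy.

def condo_cost_split(total_cost, it_subsidy_rate=0.25):
    """Return (researcher's share, central IT's share) of a condo purchase."""
    it_share = total_cost * it_subsidy_rate
    return total_cost - it_share, it_share

researcher, central_it = condo_cost_split(18_000)
print(f"Researcher pays ${researcher:,.0f}; central IT covers ${central_it:,.0f}")
```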

“We definitely see efficiencies in keeping systems up and running, in costs, and in offering more CPU cycles,” Khan says.

Competing For Resources

University CIOs can make a good case for investing in shared high-performance computing clusters. The systems make computing cycles less expensive, and researchers can run jobs beyond their allotment when other researchers’ computing resources are idle.

The cluster approach also gives universities such as Purdue (IN) an advantage when recruiting top faculty, says CIO Gerard McCartney, and access to high-end equipment looks good on researchers’ grant applications. The faculty investors in the school’s Steele cluster generated more than $73 million in external funding last year.

Nevertheless, devoting scarce university IT funding to researchers who already have federal grant money—and some of whom have access to national supercomputing facilities—can be a tough sell to IT governance boards and faculty senates. Traditionally, these researchers get federal funding to buy and run their own computers with no money from the university system. Shared-cluster proposals are asking the university to put up millions of dollars on top of what the researchers already get from the feds, to develop more infrastructure, increase IT support, and help researchers get more bang for their buck. Supporting high-end research may mean there is less money available for other campus IT projects. But as McCartney says, “We have to make choices. Two things are crucial: research support and teaching support.”

The University of California system went ahead with a shared research-computing pilot project last year, but not before hearing objections from its University Committee on Computing and Communications. A committee letter dated March 31, 2009, expressed concern that the $5.6 million expenditure would benefit a relatively small number of researchers. “Considering the severe cutbacks the University is facing,” the committee wrote, “this funding could be better used elsewhere. For example, it could be applied to implementing minimum connectivity standards for all UC faculty and staff.”

At Emory University (GA), despite commitments from many researchers and departments, CIO Rich Mendola faced faculty opposition to the centralized high-performance computing cluster project during the regular IT governance process. “They see it as a zero-sum game,” Mendola says, “so if the money is being spent here, it’s not being spent somewhere else.” Even after hearing about the advantages of shared clusters, some still don’t understand why the researchers just don’t buy their own high-performance computers, he laments.

At Rice, an ongoing steering committee oversees such issues as queuing policies, minimum hardware requirements, day-to-day operational review, and recommended changes to the MOU. “The steering committee gives researchers a place to work with us and look at requirements as technology changes,” Khan explains. “It has created a collaborative environment between IT and the research community.”

Among other benefits, Khan notes that many researchers have shut down their own dedicated clusters in favor of SUG@R’s condos, thereby decreasing cooling and power expense and achieving economies of scale in system administration.

In addition, researchers who once felt parallel computing was beyond their means now enjoy tenfold performance gains in the commons.

There’s no doubt that a high-performance cluster model saves money, drives efficiencies, and forges stronger bonds between IT and research communities. From a long-term investment point of view, shared clusters also help retain and attract new faculty. Emory University’s Mendola, for instance, says he has spoken to new faculty members who have come from universities where they’ve had to set up and run their own small cluster. “They are thrilled to come to Emory and within two weeks run their computing jobs on our shared cluster,” he says. “That has definitely become a recruiting tool.”
