Learning To Share
With a centrally managed data center, researchers can have their cake—supercomputing power at lower cost—and eat it too. Yet convincing them to pool resources can be a tricky political process
As 2010 began, the University of Washington was preparing to launch its
first shared high-performance computing cluster, a 1,500-node system called Hyak,
dedicated to research activities.
Like other research universities, UW had learned the hard way that having faculty
members purchase and run their own high-performance computing nodes in
multiple small data centers scattered across campus is inefficient and expensive.
In many of those settings, the CPUs are active only 10 to 15 percent of the
time, and, in the absence of coordination, the effectiveness of the data centers
declines over time. These inefficiencies can even prompt research faculty to leave
the university, at which point data center capacity sits unused.
“This was a problem that was costing the university many tens of millions of dollars,”
says Chance Reschke, technical director at UW’s eScience Institute, which serves as a
matchmaker between research faculty and resources, including computing platforms.
Clustered nodes, he notes, enable researchers to take advantage of much greater computing
power. Clusters also allow central IT organizations to offer better support, standardization,
and coordination than individual colleges.
“Deans of research-heavy colleges, such as engineering, struggled to provide the facilities
that faculty expected, such as power and cooling,” explains Reschke. “It became a
faculty recruitment and retention issue for them. New faculty come to the university
expecting to have access to this type of infrastructure. Colleges have no equipment available
to satisfy that demand.”
But IT leaders at UW and other
research universities say that leading the
transition from scattered small research
nodes to larger, centrally managed,
shared resources is a tricky political
process. It often requires CIOs to invest
their own discretionary funds, to understand
the needs and working styles of
researchers, and to sell the concept to
others on campus. It involves studying
different business models and governance
structures to determine which one
would work best on a particular campus.
Finally, it may require calling for some
shared sacrifice.
“We had no fairy godmother sprinkling
funding on this,” Reschke states.
“Instead, upper-level management had
to be willing to make an investment in
infrastructure and management, and
individual faculty members had to
invest their grant funds in order to get
their own personal supercomputer in a
university-operated cloud—at half the
expense of what it would cost them if
they rolled their own.”
Building Trust
As planning for the Purdue University
(IN) Steele community cluster began in
the fall of 2007, the biggest challenge
was to establish trust between IT and
researchers, according to Gerard McCartney,
Purdue’s CIO and vice president for
information technology.
McCartney says that researchers
often are inherently skeptical that a central
IT organization can give them the
service that they want, adding that
“those misgivings aren’t unfounded.
They are based on real experiences
where they have been messed around
with—[for example] machines sitting
on loading docks while they wait for
weeks for something to happen.” Some
central IT organizations just don’t have
strong project-management skills in
support of research, he notes.
McCartney, who invested $1 million
from his discretionary funds to support
the Steele effort, cites a few decisions as
key to the successful project management
of the 893-node cluster, composed
of technology from Dell, Foundry Networks
(now Brocade), and Tyco.
First, he involved researchers in the
decision-making process every step of
the way, including vendor decisions.
After getting researchers to agree to
buy from only one server vendor,
McCartney’s team got bids from five
vendors. He recalls holding a lunch
meeting with pizza for the researchers
and letting them choose which bid they
liked. There were no vendor names
attached. Researchers chose based on
benchmarking data, prices, and warranty
details. “I think the secret sauce,” he
says, the pizza notwithstanding, “was
letting them pick the vendor.”
Another key factor in building trust,
according to McCartney, was to make
cluster participation optional. Researchers
could choose to buy off the cluster
price menu and take charge of their
own nodes, but then they would be
responsible for service and repair. Only
a few researchers did that the first year
with the Steele project, and fewer did it
the following year when another cluster
was built.
“That’s really what we are looking
for—two researchers at a cocktail party
and one says to the other, ‘Are you really
still running your own cluster?’”
McCartney emphasizes. “That’s when it
kicks in, when one colleague says something
like that to another, not when an
administrator tells him he should do
something.”
Faculty Behind the Wheel
The worst-case scenario for a computing
cluster is to have a build-it-and-they-
will-come mentality—and then
nobody comes. To ensure that wouldn’t
happen at Emory University (GA),
three years ago CIO Rich Mendola set
up a joint task force between IT and the
research community to develop a business
case to spend $2 million in startup
funds on a high-performance computing
cluster.
After pushing through a difficult IT
governance process, the project team
won approval for the 256-node, 1,024-
CPU Sun Microsystems computing
cluster. As the project moved forward,
Mendola saw that involving faculty in a
steering committee that developed a
memorandum of understanding (MOU)
about costs and service levels paid off.
“This has to be faculty-driven,” he
advises. “I would never have someone
on my team lead the steering committee.
It wouldn’t work. You have to make sure
researchers’ voices are heard.”
The single most active discussion at
Emory was about making sure the
charge-back model worked. The steering
committee studied a range of
options from giving nodes away to fully
recovering all the cluster’s costs, and
decided on something in between. “To
get buy-in, we needed cost to be below
market, so no one could procure this
for themselves as cheaply,” Mendola
explains, “but we didn’t want to give it
away. So we settled on tiered-usage subscription
charges, and we have plugged
some of the funding from that back into
upgrading the cluster.”
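Emory has not published its rate card, but a tiered-usage subscription of the kind Mendola describes might look something like the sketch below. The tier boundaries and per-CPU-hour rates here are invented purely for illustration.

```python
# Hypothetical tiered-usage charge-back: each block of usage is billed at its
# tier's rate. Tier boundaries and rates are invented for illustration only.
TIERS = [(10_000, 0.05), (100_000, 0.03), (float("inf"), 0.02)]  # (upper bound in CPU-hours, $/CPU-hour)

def monthly_charge(cpu_hours):
    """Bill each block of usage at its tier's marginal rate."""
    charge, lower = 0.0, 0
    for upper, rate in TIERS:
        if cpu_hours > lower:
            charge += (min(cpu_hours, upper) - lower) * rate
        lower = upper
    return charge

print(f"${monthly_charge(50_000):,.2f}")  # a lab using 50,000 CPU-hours -> $1,700.00
```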
The steering committee continues to
study usage, billing models, and ideas
for future clusters.
A Condo-Commons Model
Many universities have adopted a
“condo” model to structure their cluster
project, whereby the central IT organization
houses and maintains computing
clusters for researchers on campus. In
return, spare cycles on these computer
nodes are available for use by the
research community (in condo parlance,
a common area). But each university
applies a slightly different business
model to its condo structure.
For example, at Purdue the Steele
cluster is a condo that is “owned” by the
community of researchers. Each research
group or department is allocated
computing time commensurate with its
upfront funding contribution. This is
analogous to condominium assessments
calculated on a square-footage basis.
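Purdue has not published the formula, but a funding-proportional split like the one described here reduces to a few lines of arithmetic. In the Python sketch below, the group names, dollar amounts, and node-hour pool are hypothetical, chosen only to illustrate the assessment-style math.

```python
# Sketch of a funding-proportional ("condo assessment") allocation.
# Group names, contribution amounts, and the node-hour pool are hypothetical.

def allocate_node_hours(contributions, total_node_hours):
    """Split the cluster's node-hours in proportion to each group's upfront funding."""
    total_funding = sum(contributions.values())
    return {group: total_node_hours * amount / total_funding
            for group, amount in contributions.items()}

contributions = {"physics": 300_000, "biology": 150_000, "engineering": 50_000}
for group, hours in allocate_node_hours(contributions, 1_000_000).items():
    print(f"{group}: {hours:,.0f} node-hours")  # e.g., physics: 600,000 node-hours
```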
Rice University (TX) uses a different
condo structure for its cluster, SUG@R
(Shared University Grid at Rice), a 9-
teraflop cluster of dual quad-core Sun
Fire 4150s from Sun Microsystems.
Unlike clusters purchased by aggregating
funding from multiple researchers,
SUG@R started with a commons, a
shared pool providing high-performance
computing to the entire campus, and
then added nodes as dedicated condos
for particular research needs. The price
to faculty for having central IT manage
their cluster’s dedicated condo is a 20
percent “tax” on its CPU cycles, the proceeds of which provide additional
computing cycles in the commons. The
researchers can also take advantage of
the commons’ increased computing
power at scheduled times.
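The mechanics of that 20 percent tax are simple enough to sketch. In the example below, the tax rate comes from Rice's description; the node count and hours are hypothetical.

```python
# Condo-commons split as described for SUG@R: dedicated condo nodes contribute
# a fixed share of their CPU cycles to the shared commons pool.
COMMONS_TAX = 0.20  # rate cited in the article; the figures below are hypothetical

def split_cycles(condo_node_hours):
    """Return (owner_hours, commons_hours) for a condo's node-hours."""
    commons_hours = condo_node_hours * COMMONS_TAX
    return condo_node_hours - commons_hours, commons_hours

# e.g., 16 hypothetical condo nodes running around the clock for a 30-day month
owner, commons = split_cycles(16 * 24 * 30)
print(f"owner keeps {owner:,.0f} node-hours; commons gains {commons:,.0f}")
# -> owner keeps 9,216 node-hours; commons gains 2,304
```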
“In 2007, we started looking at a better
way to approach research computing,
but we had to come up with multiple
incentives for IT, for the researchers,
and for the university in order for it to
be sustainable,” says Kamran Khan, vice provost for IT.
Reaching Out To Smaller Institutions
Most shared high-performance computing clusters involve IT organizations supporting
researchers within the same university. But as Brown University in Providence, RI, worked to
develop “Oscar,” a 166-node Linux cluster it purchased from IBM, it sought to include other
institutions in the region.
Researchers at the University of Rhode Island and the Marine
Biological Laboratory in Woods Hole, MA, have been among the
first users of the cluster launched in November 2009.
“From the beginning of our planning, the emphasis was on trying
to build relationships beyond the Brown campus,” says Jan Hesthaven,
professor of applied mathematics and director of Brown’s
Center for Computation and Visualization. “Ultimately, we hope to
reach out to smaller colleges in Rhode Island for both research and
educational purposes. It’s just a matter of getting the word out.”
Researchers at URI now have access to far greater computing power than is available on
their own campus. Engineering researchers are using Oscar to help understand why and how
cracks develop in different materials, and the cluster helps URI biologists study genetic coding.
Oscar is already at 75 percent capacity after only a few months in operation, and Hesthaven
thinks it will be at full capacity soon. “Computers like these are like ice cream shops,” he notes.
“They are always busy.”
Currently, external users of Oscar pay the same amount for access as Brown researchers
do, plus associated overhead. But Brown also offers its own faculty members condo agreements
in which researchers co-locate their equipment in its centralized facilities. Eventually,
Hesthaven wants to extend that offer to other schools. “Obviously, charging external
researchers’ schools for their share of the cost of space, power, and cooling gets tricky,” he
says, “but that is our long-term goal.”
At Rice, a steering committee of 19 people,
both researchers and IT staff, hammered
out a business model and a memorandum
of understanding for each researcher
to sign. The deal offers Rice researchers
more computing power at lower cost. If
the total cost of four nodes, including
shipping, cabling, switches, maintenance,
and so forth, is $18,000, the Rice
researcher might pay $13,500, with central
IT picking up the rest of the cost,
Khan says. Because the first SUG@R
condo project was successful, Rice has
launched a second, and it is studying a
third. It has almost 400 condo accounts.
“We definitely see efficiencies in
keeping systems up and running, in
costs, and in offering more CPU
cycles,” Khan says.
Competing For Resources
University CIOs can make a good case for
investing in shared high-performance computing
clusters. The systems make computing cycles
less expensive, and researchers have the ability
to run jobs beyond their allotment when other
researchers’ computing resources are idle.
The cluster approach also gives universities
such as Purdue (IN) an advantage when recruiting
top faculty, says CIO Gerard McCartney, and access
to high-end equipment looks good on researchers’
grant applications. The faculty investors in the
school’s Steele cluster generated more than $73
million in external funding last year.
Nevertheless, devoting scarce university IT funding
to researchers who already have federal grant
money—and some of whom have access to national
supercomputing facilities—can be a tough sell
to IT governance boards and faculty senates. Traditionally,
these researchers get federal funding to
buy and run their own computers with no money
from the university system. Shared-cluster proposals
are asking the university to put up millions of
dollars on top of what the researchers already get
from the feds, to develop more infrastructure,
increase IT support, and help researchers get more
bang for their buck. Supporting high-end research
may mean there is less money available for other
campus IT projects. But as McCartney says, “We
have to make choices. Two things are crucial:
research support and teaching support.”
The University of California system went ahead
with a shared research-computing pilot project last
year, but not before hearing objections from its University
Committee on Computing and Communications.
A committee letter dated March 31, 2009,
expressed concern that the $5.6 million expenditure
would benefit a relatively small number of
researchers. “Considering the severe cutbacks the
University is facing,” the committee wrote, “this
funding could be better used elsewhere. For example,
it could be applied to implementing minimum
connectivity standards for all UC faculty and staff.”
At Emory University (GA), despite commitments
from many researchers and departments, CIO Rich
Mendola faced faculty opposition to the centralized
high-performance computing cluster project during
the regular IT governance process. “They see
it as a zero-sum game,” Mendola says, “so if the
money is being spent here, it’s not being spent
somewhere else.” Even after hearing about the
advantages of shared clusters, some still don’t
understand why the researchers don't just buy their
own high-performance computers, he laments.
At Rice, an ongoing steering committee
oversees such issues as queuing policies,
minimum hardware requirements,
day-to-day operational
review, and recommendations for
changes to the MOU. “The steering
committee gives researchers a place
to work with us and look at requirements
as technology changes,” Khan
explains. “It has created a collaborative
environment between IT and the
research community.”
Among other benefits, Khan notes
that many researchers have shut down
their own dedicated clusters in favor of
SUG@R’s condos, thereby decreasing
cooling and power expense and achieving
economies of scale in system administration.
In addition, researchers who
felt parallel computing was beyond their
means enjoy tenfold performance
gains in the commons.
There’s no doubt that a high-performance
cluster model saves money, drives
efficiencies, and forges stronger
bonds between IT and research communities.
From a long-term investment
point of view, shared clusters also help
retain and attract new faculty. Emory
University’s Mendola, for instance, says
he has spoken to new faculty members
who have come from universities where
they’ve had to set up and run their own
small cluster. “They are thrilled to come
to Emory and within two weeks run
their computing jobs on our shared
cluster,” he says. “That has definitely
become a recruiting tool.”