Computer Years
Interestingly, both types of large systems have a fairly short life of about three
to five years, usually on the lower end, according to Towns. After that, the
maintenance and operational costs for the hardware become too high and it's
time to "simply buy a new system." The NCSA anticipates a doubling of
computational resources every one to two years, and that's matched by demand.
That means the center is in some stage of acquiring new equipment every year.
NCSA expects to submit a proposal looking for either a $15 million or $30 million
system, with responses due in February. The winning bidder (Towns says there
are typically only three or four companies bidding) will be required to have
its solution in "substantive production state" by about March 2007.
From there, he says, "there's the whole procurement process, deployment,
testing, and putting it into production."
The installation and testing process has many stages and is time-consuming.
In the case of Tungsten, the Dell cluster at NCSA, the center received the hardware
over the course of two months. During that time, it arrived on large trucks and was
unpacked, set up, configured, and tested. But that was preceded
by several months of software and hardware testing in-house at Dell.
Once the installation at the client site took place, NCSA did a lot of applications
testing to verify that the system actually worked. Then, before it went to production
state, the center opened up the equipment to what Towns calls the "friendly
user period." This lasts about two months and allows the general user community
to compile and test their applications. There's no charge to their time allocations
for this, but users also need to understand that the system might be unstable until all
the issues are worked out. "It's a good way for us to shake down the system
before we go to production," says Towns.
For the cluster installation at CCR, Dell was the prime contractor, and it
subcontracted aspects of the project to other vendors, including Myricom
(www.myri.com) for Myrinet network communications, EMC (www.emc.com) for
storage arrays, Force 10 Networks (www.force10networks.com) for switch/routers,
and IBRIX (www.ibrix.com) for file storage. When it was delivered, the prime
orchestrated installation,
"making sure the right people from the right vendor showed up at the right
time," says Miller. Why Dell? When CCR went out to bid, Miller recalls,
"We met with all the vendors and had discussions, and Dell was clearly
head and shoulders above anything we were looking for at that point in time."
Managing clusters is mostly automated through job schedulers, batch schedulers,
and other resource managers and monitors. At CCR, for example, "Larry"
and "Adam," the aforementioned administrative nodes, monitor all of
the cluster nodes and "continually ask them, 'Are you still up? Are you
still running? Are you healthy?'" says Miller. When problems arise, such as a file
system filling up or a network link going down, the human system administrators
get notified. If a node ceases to work or a power supply "explodes,"
he says, the job scheduler will continue scheduling, but not on that node.
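To make that idea concrete, here is a minimal Python sketch of this style of health-check loop, assuming a simple ping-based "are you still up?" probe. The node names, polling interval, and the notify_admins() helper are hypothetical stand-ins for illustration only, not CCR's actual administrative tooling or batch scheduler.

#!/usr/bin/env python3
"""Illustrative sketch of an administrative node's health-check loop.

Hypothetical example: node names, thresholds, and notification behavior
are placeholders, not a description of any production system.
"""

import subprocess
import time

# Hypothetical compute nodes the administrative host watches.
NODES = ["node001", "node002", "node003"]

# Nodes the scheduler should skip until an administrator intervenes.
offline = set()


def node_is_up(host):
    """Ask a node 'Are you still up?' with a single ping."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def notify_admins(message):
    """Stand-in for paging or e-mailing the human system administrators."""
    print(f"[ALERT] {message}")


def monitor_once():
    """One monitoring pass: check every node and update scheduling state."""
    for host in NODES:
        if node_is_up(host):
            # A recovered node can rejoin the scheduling pool.
            offline.discard(host)
        elif host not in offline:
            # Failed node: tell the humans, and stop scheduling jobs on it.
            notify_admins(f"{host} is not responding; removing from pool")
            offline.add(host)


def schedulable_nodes():
    """The scheduler keeps scheduling, but not on nodes marked offline."""
    return [h for h in NODES if h not in offline]


if __name__ == "__main__":
    while True:
        monitor_once()
        print("Schedulable nodes:", schedulable_nodes())
        time.sleep(60)  # polling interval; real monitors are often event-driven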
The Edge Keeps Moving in High-Performance Computing
Clusters are by no means the final word in high-performance computing. As Towns
points out, each research community "has a set of applications that have
different system architecture requirements in order to execute and perform well."
Recently, the NSF issued a solicitation for systems totaling $30 million. In
performance terms, according to Miller, that's "roughly 10 times U2's speed."
Then four years from now they want it to be "100 to a thousand times what
U2's speed is." As he points out, "There's simply no shortcut in terms
of solving some of these big (what they used to call 'grand challenge') problems
without big machines. If you're looking at whatever it may be, the physics of
the universe, or biochemical processes in the brain, or analyzing the spread
of infections... they just require massive amounts of computing power."
From humble beginnings as commodity devices, equipment that once existed only
on the desktop will continue proving its mettle in dazzling high-performance
computing clusters.