Currently, only about two-thirds of U2 is powered up at any one time, since
the system is still going through testing. Once the Center gets the older system,
Joplin, updated with the same version of Red Hat that U2 is running, says Miller,
it'll be integrated into U2's queuing system. "The [capacity] computing
jobs will be automatically routed to what was called Joplin, those nodes; the
[capability] computing jobs will be automatically routed to U2; and it'll be
transparent to the users. It'll be one queue that's essentially an Intel (www.intel.com)
Pentium line of chips." The only change for the users, explains Miller,
is that they'll need to answer a couple of extra questions about their projects
up front, such as "whether the code they need to run is 32-bit or 64-bit
compatible, which means it can run on both places, or if it's just 32-bit or
just 64-bit. And we'll be able to route things to the appropriate nodes."
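The routing Miller describes can be pictured as a simple dispatch on the two questions users answer up front. This is an illustrative sketch only; the function and pool names are invented here, not CCR's actual scheduler.

```python
# Hypothetical sketch of the routing CCR describes: jobs tagged as 32-bit,
# 64-bit, or both are steered to the matching pool of nodes. The pool
# names follow the article; the API itself is invented for illustration.

def route_job(arch, job_type):
    """Pick a node pool for a job.

    arch: "32", "64", or "both" -- the answer users give up front.
    job_type: "capacity" or "capability".
    """
    if arch == "32":
        return "joplin"       # 32-bit-only code runs on the older nodes
    if arch == "64":
        return "u2"           # 64-bit-only code needs the newer nodes
    # Code that runs on both is placed by workload type instead.
    return "joplin" if job_type == "capacity" else "u2"

print(route_job("both", "capability"))  # -> u2
```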
Getting Access to the Power
To gain access, CCR has "basic minimal requirements," says Miller.
The user describes the project and required resources in terms of storage and
compute power. "We support research, discovery, and scholarship that requires
high-end computational resources. That's obviously a moving target." In
fact, he says, every six months or so, the center redefines what it means by
"high-end" in terms of data, networking, visualization, or computing
requirements. If a project requires fewer than, say, 16 processors running concurrently,
Miller and his team will probably kick the request back to the individual to
take back to his or her lab, department, or school for handling. An advisory
committee evaluates troublesome proposals, but most of the time, the decisions
are "obvious."
At NCSA, the process for gaining access is more formal. The proposal process
is modeled after the NSF process, says Towns, which involves, for large requests, a
peer review performed by a national review committee that meets quarterly. These
are proposals requiring in excess of 200,000 CPU hours per year. Smaller requests, from
10,000 to 20,000 hours of time, are considered "development accounts"
for start-up projects. Reviewed and awarded continuously, the smaller accounts
allow researchers to try out their applications on the system and understand
performance characteristics in preparation for submitting larger proposals.
What's a CPU hour in cluster terms? According to Towns, it's equivalent to
one hour of time on a single processor. Since these are dual-processor
nodes, there's a total of 2,480 processors on Tungsten. If a project is running
on 64 nodes, which is 128 processors, and it runs for one hour, the user has
accumulated 128 service units or CPU hours.
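Towns's arithmetic is simple enough to check directly. A minimal sketch (the function name is ours, not NCSA's accounting code):

```python
def service_units(nodes, procs_per_node, hours):
    """Service units (CPU hours) = processors used x wall-clock hours."""
    return nodes * procs_per_node * hours

# The example from the article: 64 dual-processor nodes running for one hour.
print(service_units(64, 2, 1))  # -> 128
```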
Neither organization charges its academic users for the time they use on the
clusters. In the case of NCSA, Towns says, they're granted allocations of time
as part of grant awards, and their usage is deducted from those allocations.
CCR considers its clusters part of the "university infrastructure,"
says Miller, "to support leading-edge science research."
Both centers also take in funds from academic users who have budgeted compute
time in their grant proposals to cover expanded demands on staff time; compensation
also comes from commercial users of the resources.
The NCSA has about 100 staff members and another "15 or 20" graduate
students working in its center to provide 24/7 support for its community of
users. CCR has a technical team of 13, consisting of operations people and computational
scientists. The former, system administrators, keep the systems running, and
the latter work closely with users to figure out, for example, what applications
are needed for a particular project or to help optimize code.
Miller estimates that about half of the applications running on the CCR computers
are off the shelf: code that has been paid for, or is freeware or shareware. The
other half is "home-grown." In the case of NCSA, Towns says, "By
and large [the majority of our users are faculty researchers] using applications
they've developed to solve the problems that they're attacking."
What It Takes To Work in Academic Computing
Ever wondered what it takes to work in a high-performance computing environment?
According to John Towns, senior associate director, Persistent Infrastructure
Directorate at NCSA, there's no other job that can prepare you; you arrive with
particular qualifications, then get on-the-job training.
At NCSA, to obtain a position on the academic professional staff, you're required
to have a Bachelor of Science degree. Frequently, staff members have worked in a
research environment, possibly as a graduate student. Their backgrounds may
include the hard sciences, engineering, or computer science. All have "an
affinity for computers" and an interest in the high end, whether that is
"compute systems, networks, visualization, or data storage," Towns
says. Frequently, they don't work normal hours, which is an advantage in an
environment that runs 24 hours a day.
Shared-Memory Machines vs. Clusters
One misunderstanding that can crop up about clusters is that they replace the
old-style mainframe-type or mass-storage computers. In reality, each setup is
advantageous to a specific type of computing work. "Gene sequencing is
fairly trivially parallelized. You can spread it across a cluster and use the
resources well," Towns says. In other words, every processor is, practically
speaking, running a separate copy of the application, and no processor needs
to talk much to the others.
CCR's Miller refers to this as "capacity computing." "In order for
a scientist to solve a certain problem, they may need to run a thousand or ten
thousand simulations, each of which is best run on a single CPU... It's the
aggregate of all those results that will solve their scientific problems."
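A toy picture of capacity computing, assuming nothing about CCR's actual codes: many independent single-CPU runs, each parameterized differently, whose aggregate answers the question. The "simulation" here is a stand-in for a real scientific code.

```python
# Capacity computing in miniature: independent runs spread across CPUs,
# with only the aggregate of the results mattering. The simulation is a
# placeholder; a real job would be a full application run.
import random
from multiprocessing import Pool

def simulate(seed):
    """One independent run, parameterized only by its seed."""
    rng = random.Random(seed)
    return rng.random()  # placeholder for a real simulation result

if __name__ == "__main__":
    with Pool() as pool:  # one worker process per available CPU
        results = pool.map(simulate, range(1000))  # 1,000 independent runs
    # The scientific answer comes from the aggregate, not any single run.
    print(sum(results) / len(results))
```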
Another class of programs runs as a single application. As Towns explains it,
"Imagine that you're running simulations, and... you create a grid that
represents a space and something happens in that space. Often where something
interesting is happening, you need to redefine the spacing on the grid to accurately
represent what's [going on]. A class of applications has been developed that
in a dynamic way redefines the spacing in the grid where it needs to... If you
try to represent the entire grid at the finest resolution, you don't have a
big enough memory machine to do it. What you do is refine it where it's necessary...
You have some nodes that have a lot of work to do and some that don't. In a
shared memory system, you can easily redistribute that work among the processors,
so you can keep them all busy together, and move the application along much
more quickly."
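What Towns describes is essentially adaptive mesh refinement. A toy one-dimensional version, assuming nothing about NCSA's actual applications, shows the idea: refine the grid only where the function of interest changes quickly, rather than using the finest spacing everywhere.

```python
# A toy version of the adaptive refinement Towns describes: split a cell
# only where the quantity being modeled changes by more than a threshold.
# Purely illustrative; real AMR codes work in 2D/3D and rebalance work
# across nodes, which is where shared memory helps.

def refine(cells, f, threshold):
    """Split any cell (a, b) where f changes by more than threshold."""
    out = []
    for a, b in cells:
        if abs(f(b) - f(a)) > threshold:
            mid = (a + b) / 2
            out.extend([(a, mid), (mid, b)])  # refine into two finer cells
        else:
            out.append((a, b))                # keep the coarse cell
    return out

# Start with 4 coarse cells on [0, 1]; refine where x*x varies fastest.
cells = [(i / 4, (i + 1) / 4) for i in range(4)]
cells = refine(cells, lambda x: x * x, 0.2)
print(len(cells))  # -> 6: extra cells appear near x = 1, where x*x is steepest
```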
TeraGrid
A lot of research in grid computing is currently taking place. The TeraGrid
is an effort by the National Science Foundation (www.nsf.gov)
to build and deploy the world's largest distributed infrastructure for open scientific
research. In practical terms that means developing a common environment and interface
for the user community to a rather diverse set of high-performance computing resources, including
clusters. In some cases it also involves hooking up the physical systems to be
used in conjunction with each other and providing environments that are similar
so that researchers can move between systems more easily.
As John Towns, senior associate director, Persistent Infrastructure Directorate
at NCSA, explains, many researchers have multi-stage applications that require
different kinds of computing architectures to solve. "So the TeraGrid is
attempting to facilitate the use of these multiple architectures, often sited
at different locations, to support their research efforts."
Learn more about the TeraGrid at www.teragrid.org. Learn more about grid computing
at www.ggf.org.