Supercomputing Is Here!
For some colleges and universities, computing is now
soaring into the stratosphere.
Mike Hickey at Embry-Riddle Aeronautical University
is using a brand-new supercomputer to learn
more about acoustic-gravity waves in the upper
portions of Earth’s atmosphere, waves that
ultimately impact flying conditions.
When most academic technologists
tackle computing, their time is occupied by laptops and
servers—relatively small-scale stuff. A couple of blades
here, a couple of blades there. Generally, even for network
managers, the processing power rarely stacks up to anything
awe-inspiring. Sometimes, however, computing can be bigger
and broader than many of us can imagine, requiring
more juice than some small nations use in a year. Then, of
course, we find ourselves in the 21st-century realm of high-performance
computing. High-performance computing
efforts at four schools—
Indiana University, the
University
of Florida, the
University of Utah, and
Embry-Riddle
Aeronautical University (FL)—demonstrate that the latest
and greatest in supercomputing on the academic level
far exceeds the computing power that most of us can conceive.
As computing power continues to grow, however,
these tales undoubtedly are only the beginning.
Forecasting Improvements
In the wake of the devastation caused by Hurricane Katrina,
a number of today’s most elaborate high-performance
computing endeavors revolve around finding better ways to
predict everything from everyday showers and snowstorms
to devastating hurricanes and tornadoes. At Indiana University,
researchers in the School of Informatics are using
their high-speed computing and network infrastructure to
help meteorologists make more timely and accurate forecasts
of dangerous weather conditions. The project, which
kicked off with an $11 million grant from the National Science
Foundation in 2003, is dubbed Linked
Environments for Atmospheric Discovery (LEAD).
The LEAD system runs on a series of remote, distributed
supercomputers—a method known as grid computing.
Co-principal Investigator Dennis Gannon, who also serves
as a professor of Computer Science at the university, says
the project is designed to build a “faster than real time”
system that could save lives and help governments better
prepare themselves for looming natural disasters. Today,
the project is in its infancy—as yet in the planning and testing
stages. Ultimately, however, Gannon sees the model
outpacing the current strategy for weather prediction—a
system that, despite constant improvements, still runs
largely on simulations.
“Our goal is nothing short of building an adaptive, on-demand
computer and network infrastructure that responds
to complex weather-driven events,” he says. “We hope to use
this technology to make sure storms never cripple us again.”
The LEAD system pools and analyzes data received from
other sources such as satellites, visual reports from commercial
pilots, and NEXRAD, a network of 130 national
radars that detect and process changing
weather conditions. Down the road, an
armada of newer and smaller ground
sensors dispatched to detect humidity,
wind, and lightning strikes will be part
of the network, too. As weather information
comes in, it is interpreted by special
software agents that are monitoring
the data for certain dangerous patterns.
Once these patterns are identified, the
agents will dispatch the data to a variety
of high-performance computers across
private networks for real-time processing
and evaluation.
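To make the pattern concrete, here is a minimal Python sketch (ours, not LEAD’s actual software; every class name, function, and threshold in it is hypothetical) of an agent that watches an incoming feed, flags a dangerous signature, and hands the matching data off for remote processing:

from dataclasses import dataclass

@dataclass
class RadarScan:
    station_id: str                 # e.g., a NEXRAD station identifier
    max_reflectivity_dbz: float     # crude proxy for storm intensity
    payload: bytes                  # raw scan data to forward

SEVERE_THRESHOLD_DBZ = 55.0         # illustrative threshold, not LEAD's

def looks_dangerous(scan: RadarScan) -> bool:
    """Stand-in for the pattern matching the article attributes to LEAD's agents."""
    return scan.max_reflectivity_dbz >= SEVERE_THRESHOLD_DBZ

def dispatch_to_grid(scan: RadarScan) -> None:
    """Placeholder for handing data to a TeraGrid-style compute resource."""
    print(f"dispatching scan from {scan.station_id} for real-time modeling")

def monitor(feed):
    """Consume scans as they arrive; forward only the worrying ones."""
    for scan in feed:
        if looks_dangerous(scan):
            dispatch_to_grid(scan)

if __name__ == "__main__":
    demo_feed = [
        RadarScan("KIND", 42.0, b""),   # ordinary shower: ignored
        RadarScan("KLIX", 61.5, b""),   # severe signature: dispatched
    ]
    monitor(demo_feed)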
In most cases, these collection devices
will send weather data out for processing on computers in the IU network. Sometimes,
however, spurred by an additional
$2 million grant from the National Science
Foundation (NSF),
the software agents will dispatch data to
computers on a broader distributed computing
network known as TeraGrid. The grid is a national
network that allows scientists across
the country to share data and collaborate.
Under this system, huge computers in San
Diego, Indiana, and Pittsburgh are linked
via a 20GB connection rate to facilitate
cooperation. The result: several thousand
processors at a participating school’s fingertips,
on demand.
“My files may come from San Diego
and my computing facility may be in
Pittsburgh, but the network is such that
the facility in Pittsburgh doesn’t care
where the data’s from,” explains Gannon.
“As you can see, this kind of leverage
opens up a host of new doors in
terms of what kind of weather data we
can process with supercomputers, and
how we can do it.”
The University of Florida’s new cluster
boasts tightly coupled nodes that act
like a single computer; no more
disparate machines struggling to handle
huge computations.
Clusters: Loose or Tightly Coupled?
The science of supercomputer processing
is far from easy. Generally, the
“engines” of supercomputers are gaggles
of processing power called “nodes.”
Each node is made up of several
processors, and the processors themselves
vary with how
sophisticated the supercomputer is. By
and large, these nodes usually boast two
to four central processing units (CPUs)
with up to 4GB of RAM. To put this into
perspective, the best nodes basically are
the same as four really expensive personal
computers. And most supercomputers
have at least 100 of these
nodes—the equivalent of 400 of the
fastest and most efficient computers
money can buy.
Still, not all supercomputers are created
equal. Technologists at the High
Performance Center (HPC) at the University
of Florida recently unveiled a
brand-new cluster: a 200-node supercomputer
that’s bigger than anything the
school has had before. With the speed of
this machine, HPC Director Erik Deumens
says UF researchers will embark
on new projects to investigate the properties
of molecular dynamics, the ins
and outs of aerodynamic engineering,
and climate modeling projects of their
own. Deumens calls the bulk of these
projects “multi-scale”—an approach
that takes into account mathematical
descriptions of a problem at more than one level of detail.
“You try to describe a certain piece of
a problem with a particular methodology,
but you know that in some part,
something more interesting is happening,
so you use the magnifying glass of
advanced calculation,” he says, pointing
to one researcher who is studying the
molecular interaction of heated silicon
when engineers etch microchips. “The
only way to make sure you’re not overwhelmed
with numbers is to look at
problems with enough computing power
to answer multiple questions at once.”
Deumens notes that with a 200-node
supercomputer, connections between
nodes are critical. To make sure the
machine functions properly, HPC
turned to Cisco Systems for all of the networking connections
between nodes. According to Marc
Hoit, interim associate provost for Information
Technology, the vendor also is
helping UF connect all of its clusters on
campus so HPC can perform more grid-based
computations. All told, Hoit estimates
that soon, more than 3,000 CPUs
will be part of this grid. UF also will
contribute to the Open Science Grid, an international
infrastructure in the vein of TeraGrid, though considerably larger.
One key differentiator between the
expanded grid and UF’s new cluster is
the way in which the nodes are coupled.
In the UF cluster, the nodes are more
tightly coupled, meaning that they act
more like one single computer, and can
be harnessed to perform large-scale
mathematical computations quickly, as
one unit. In the grid, however, the nodes
are loosely coupled, meaning they all
have other computing responsibilities,
and likely are not ready to handle huge
calculations at one time. The “loosely
coupled” strategy is perfect for small
equations with small sets of data, such
as genetic calculations. For weather and
aerodynamic simulations, however,
Hoit notes the tightly coupled approach
is a must.
“Many people often make simple
statements about CPUs, and say they
can apply all the power to solving one
large problem,” he says. “But a tightly
coupled approach can help achieve efficiency
for large problems, too.”
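The distinction matters because tightly coupled work forces every node to synchronize constantly. The toy Python sketch below (our illustration, not UF’s code; it assumes the mpi4py library and an MPI launcher such as mpirun) shows why: at every timestep each rank must wait on a global reduction before it can proceed, so the interconnect, not the processors, sets the pace.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

local_energy = float(rank + 1)      # stand-in for each node's partial result

for step in range(5):
    # Global reduction: no rank can begin the next step until all have reported.
    total = comm.allreduce(local_energy, op=MPI.SUM)
    local_energy = total / size     # fold the global value back into local work

if rank == 0:
    print(f"finished 5 coupled steps across {size} ranks")

Launched with something like mpirun -np 4 python coupled.py, the loop advances only as fast as the slowest exchange; a loosely coupled job would simply drop the allreduce and let each node churn away independently.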
WHAT IS MYRINET?
Myrinet is a networking system designed
by Myricom; it carries less protocol
overhead than Ethernet, and so is faster
and more efficient. Physically, Myrinet
consists of two fiber optic cables,
upstream and downstream, connected
to the host computers with a single
connector. Machines are connected via
low-overhead routers and switches, as
opposed to connecting one machine
directly to another. Myrinet includes a
number of fault-tolerance features,
mostly backed by the switches. These
include flow control, error control, and
“heartbeat” monitoring on every link.
The first generation provided 512
Megabit data rates in both directions,
and later versions supported 1.28 and
2 Gigabits. The newest “fourth-generation
Myrinet” supports a 10 Gigabit data
rate, and is interoperable with 10
Gigabit Ethernet. These products started
shipping in September 2005.
Myrinet’s throughput is close to the
theoretical maximum of the physical
layer. On the latest 2.0 Gigabit links,
Myrinet often runs at 1.98 Gigabits of
sustained throughput—considerably
better than what Ethernet offers, which
varies from 0.6 to 1.9 Gigabits,
depending on load. For supercomputing,
however, the low latency of Myrinet
is even more important than its throughput.
According to Amdahl’s Law, a high-performance
parallel system tends to be bottlenecked
by its slowest sequential process; in
all but the most embarrassingly parallel
workloads, that process is often the latency of
messages crossing the network.
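Written out, Amdahl’s Law says that if a fraction p of a program parallelizes perfectly across N processors, the best possible speedup is

S(N) = \frac{1}{(1 - p) + p/N}

As a back-of-the-envelope example, if latency and other serial steps leave even 5 percent of a run sequential (p = 0.95), then 400 processors can deliver a speedup of no more than about 19, regardless of how much raw bandwidth the links provide.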
And Now for the Metacluster
Running a giant cluster like the one at
UF is remarkable, but imagine operating
something five times that big. Such
is life for Julio Facelli, director of the
Center for High-Performance Computing
(CHPC) at the University of Utah.
The Center is responsible for providing
high-end computer services to advanced
programs in computational sciences and
simulations. Recently, via a grant from
the National Institutes of Health (NIH), CHPC purchased a
metacluster to tackle the new generation
of bioinformatics applications, which
involve the nitty-gritty study of genetic
code and similarly complex computations.
Facelli says the machine is one
of the largest of its kind in the academic
world.
The new metacluster boasts more
than 1,000 64-bit processors in dense
blades from Angstrom Microsystems. The metacluster
has been configured into five subsystems,
including a parallel cluster with
256 dual nodes, a “cycle farm” cluster
with 184 dual nodes, a data-mining
cluster with 48 dual nodes, a long-term
file system with 15 terabytes of storage,
and a visualization system driven by a
10-node cluster. Most of these clusters
are connected by Gigabit Ethernet. The
parallel cluster runs on Myrinet, a networking
system designed by Myricom that carries less protocol
overhead than Ethernet, and therefore is
faster and more efficient. (See box above, “What is Myrinet?”)
“When we started this program eight
or nine years ago, it was possible only to
do simulations in one dimension,” says
Facelli. “Today, we are computing in
three dimensions and running calculations
that we never dreamed of being
able to run.”
In addition to these resources, the
metacluster boasts a “condominium”-
style sub-cluster in which additional
capacity can be added for specific
research projects. With this feature,
Facelli says CHPC uses highly advanced
scheduling techniques to provide seamless
access to heterogeneous computer
resources necessary for an integrated
approach to scientific research in the
areas of fire and meteorology simulations,
spectrometry, engineering, and
more. Additional specialized servers are
available for specific applications such
as large-scale statistics, molecular modeling,
and searches in GenBank, a database
of genetic data. CHPC is developing
several cluster test beds to implement
grid computing, too.
Down the road, CHPC plans to add
two dual nodes to its “condominium”-
style cluster for a proposed study of
patient adherence to poison control
referral recommendations. As Facelli
explains it, the school seeks to use
machine learning methods for feature
selection and predictive modeling—an
enterprising approach, considering that
prior to this, no researcher or research
institution had ever implemented high-performance
computing to accomplish
such a challenge. CHPC will support
these nodes for the duration of the project
and will collaborate with the project investigators in the computational aspects
of the research. Afterward, Facelli says,
the nodes will be subsumed back into
the system.
“You can never have too many nodes
in your metacluster,” he quips. “We’re
excited about the possibilities of what
these will bring.”
WHAT IS BEOWULF?
Aside from being a classic work of
early literature, Beowulf is a type of
computing cluster. The label refers to a
design for high-performance parallel
computing clusters built on inexpensive
personal computer hardware. Originally
developed by Thomas Sterling and Donald Becker at NASA,
Beowulf systems are now deployed
worldwide, chiefly in support of
scientific computing. There is no particular
piece of software that defines a
cluster as a Beowulf. Commonly used
parallel processing libraries include
Message Passing Interface (MPI) and
Parallel Virtual Machine (PVM), a software
tool developed by the University
of Tennessee, Emory University (GA),
and Oak Ridge National Laboratory. Both of these permit
the programmer to divide a task among
a group of networked computers, and
collect the results of the processing.
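A minimal example of that divide-and-collect pattern, written here in Python with the mpi4py bindings (an assumption on our part; any MPI implementation on a Beowulf cluster would serve), scatters a list from one node, sums the pieces in parallel, and reduces the answer back to the root:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    data = list(range(1000))
    chunks = [data[i::size] for i in range(size)]   # divide the task
else:
    chunks = None

my_chunk = comm.scatter(chunks, root=0)             # each node gets its share
partial = sum(my_chunk)                             # work proceeds in parallel
total = comm.reduce(partial, op=MPI.SUM, root=0)    # collect the results

if rank == 0:
    print(f"sum computed across {size} nodes: {total}")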
Flying High
Another school that has researchers
excited about the future is Embry-Riddle
Aeronautical University (FL). There, in
the school’s four-year-old Computational
Atmospheric Dynamics (CAD) laboratory,
Mike Hickey, associate dean of
the College of Arts and Sciences, is
using a brand-new supercomputer to
learn more about acoustic-gravity waves
in the upper portions of Earth’s atmosphere.
Driving Hickey’s research is a
131-node, 262-processor Beowulf cluster
(see box above), which runs simulations
of waves propagating through the
atmosphere. These waves ultimately
impact flying conditions, which is precisely
why the research is of such value
to a school like Embry-Riddle.
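For a flavor of what such a run involves, here is a deliberately stripped-down Python sketch (ours, not Hickey’s model): a one-dimensional wave equation advanced with explicit finite differences. The production simulations are three-dimensional, include gravity and realistic atmospheric structure, and are spread across the cluster’s processors.

import numpy as np

c = 300.0                    # wave speed in m/s, illustrative only
nx, nt = 400, 500            # grid points and timesteps
dx = 100.0                   # grid spacing in meters
dt = 0.8 * dx / c            # timestep chosen to satisfy the CFL stability limit

x = np.arange(nx) * dx
u_prev = np.exp(-((x - x.mean()) / (5 * dx)) ** 2)   # initial Gaussian pulse
u_curr = u_prev.copy()
r2 = (c * dt / dx) ** 2

for _ in range(nt):
    u_next = np.empty_like(u_curr)
    # interior update: standard second-order centered differences
    u_next[1:-1] = (2 * u_curr[1:-1] - u_prev[1:-1]
                    + r2 * (u_curr[2:] - 2 * u_curr[1:-1] + u_curr[:-2]))
    u_next[0] = u_next[-1] = 0.0                     # fixed boundaries
    u_prev, u_curr = u_curr, u_next

print("final peak amplitude:", float(np.abs(u_curr).max()))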
Such simulations used to take three or
four days to run; with the power of the
new machine, however, Hickey can run
them in a matter of hours. Elsewhere at
the school, other researchers are turning
to Beowulf to speed up projects of their
own. Hickey points to a number of plasma
physicists who are calling upon the
computer to simulate plasma flow and
interactions, and engineers who are running
simulations of the flow of gases
through turbine engines. One professor
even uses the system to analyze information
from a database of all the commercial
airline flights in the US for the
last decade; from this information, the
professor is trying to predict flight
delays down the road.
“Especially at an engineering school
like ours, there’s a lot of numerically
intensive simulation work on campus,”
says Hickey, who explains that in general,
Beowulf clusters are groups of similar
computers running a Unix-like
operating system such as GNU/Linux
or BSD. “The best way around [the
demand for so much simultaneous simulation
work on one campus] was to try
and get a computer that serves everybody’s
needs.”
Still, the emergence of Embry-Riddle’s
supercomputer has not been without
hiccups. The first challenge revolved
around program code: Researchers can’t
just take the code that runs on a single
processor, move it over to the new
machine, and run it; instead, code for
the supercomputer needs to be heavily
modified in order to take advantage of
the machine’s many processors. Consequently,
last summer, Embry-Riddle
ran a workshop to educate some professors
and graduate students about how to
manipulate data for the new machine.
Another subject of the workshop: Message
Passing Interface (MPI), the programming
interface a user must learn in
order to coordinate how data and work
are passed among the machine’s many
processors.
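In miniature, that restructuring looks like the Python fragment below (our sketch, not one of the Embry-Riddle codes): a serial loop is split so that each MPI rank computes only its own stride of the iterations, and the partial results are combined at the end.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

N = 1_000_000

# Serial version:  total = sum(i * i for i in range(N))
# Parallel version: stride the loop by rank so the work is split evenly.
partial = sum(i * i for i in range(rank, N, size))
total = comm.reduce(partial, op=MPI.SUM, root=0)

if rank == 0:
    print("sum of squares:", total)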
Embry-Riddle officials also have
tackled challenges of a more logistical
nature. CIO Cindy Bixler says that
when Hickey came to her and requested
the supercomputer, the school’s facilities
department didn’t understand that
investing in a machine of that magnitude
would require the school to rethink
its server room completely. Computer
clusters run hot, so the school had to
invest in additional air-conditioning
units. Bixler then brought in an engineer
to study the server room’s air flow and
figure out where to put the new cooling
device. Finally, of course, was the issue
of electricity—with the new machine,
Embry-Riddle’s energy bills went
through the roof.
“You can’t just flip the switch on a
high-performance computer and expect
everything else to work itself out,”
Bixler says. “This kind of effort takes
considerable planning, and in order to
avoid surprises, schools need to be
ready before they buy in.”