University of Utah: C-SAFE Uses Linux HPCC in Fire Research
A high performance computing cluster (HPCC) is an increasingly popular method
for delivering computational power to CPU-hungry scientific applications. A
cluster consists of several commodity personal computer systems (desktop or
rackmount) connected with a commodity or special-purpose network. The low cost
of these systems makes them an attractive option for a wide range of scientific applications.
In 1997, the University of Utah created an alliance with the Department of
Energy (DOE) Accelerated Simulation and Computing Program (ASCP), formerly ASCI,
to form the Center for the Simulation of Accidental Fires and Explosions (C-SAFE,
www.csafe.utah.edu).
C-SAFE focuses specifically on providing state-of-the-art, science-based tools
for the numerical simulation of accidental fires and explosions, especially
within the context of the handling and storage of highly flammable materials.
The primary objective of C-SAFE is to provide a software system comprising a
Problem Solving Environment in which fundamental chemistry and engineering physics
are fully coupled with non-linear solvers, optimization, computational steering,
visualization, and experimental data verification. The availability of simulations
using this system will help to better evaluate the risks and safety issues associated
with fires and explosions. Our goal is to integrate and deliver a system that
is validated and documented for practical application to accidents involving
both hydrocarbon and energetic materials. Efforts of this nature require expertise
from a wide variety of academic disciplines.
Additional Resources
Simulation of an explosive in a fire involves computing thermodynamics, fluid
mechanics, chemical reactions, and other physical processes at each of millions
to billions of points in the simulation domain. These scenarios require significant
computing resources to simulate accurately. Simulations performed to date employed
hundreds to thousands of processors for days at a time to simulate portions
of this problem. Fortunately, we were also granted time on the DOE ASCP computers
at Los Alamos, Lawrence Livermore and Sandia National Laboratories. These machines
are some of the fastest in the world, and we received considerable allocations
of CPU time as a part of the C-SAFE grant. However, sometimes these resources
are not enough.
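To give a rough sense of that scale, the short Python sketch below estimates how much memory the field data alone would occupy on a structured grid. The cell counts and the assumption of roughly 20 double-precision variables per cell are illustrative guesses, not figures from the C-SAFE codes.

# Back-of-envelope memory estimate for a structured-grid simulation.
# The variable count and cell counts are illustrative assumptions only,
# not numbers taken from the actual C-SAFE simulations.

BYTES_PER_DOUBLE = 8
VARS_PER_CELL = 20   # e.g., velocity, pressure, temperature, species mass fractions

def field_memory_gb(num_cells, vars_per_cell=VARS_PER_CELL):
    """Approximate gigabytes needed to hold the field data for num_cells cells."""
    return num_cells * vars_per_cell * BYTES_PER_DOUBLE / 2**30

for cells in (1e6, 1e8, 1e9):
    print(f"{cells:.0e} cells -> roughly {field_memory_gb(cells):,.1f} GB of field data")

Under these assumptions, a billion-cell domain needs well over a hundred gigabytes just for the field data, which is one reason such runs must be spread across hundreds to thousands of processors.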
Recently, the University of Utah purchased a Linux cluster that will be used
to augment the resources provided to us through the DOE. The cluster will be
used for development and debugging of the simulation, parameter studies, and
small- to medium-scale simulations. Large-scale computations will still require
use of the larger DOE ASCP computing resources. Unlike the shared DOE resources,
the cluster will be largely devoted to C-SAFE research 365 days a year.
This cluster consists of 128 Dell PowerEdge 2550 servers, each containing two
Pentium 4 Xeon processors. The entire cluster contains 256 Gigabytes of main
memory and over nine Terabytes of disk space. Each server (node) runs the Linux
operating system and is networked with Gigabit Ethernet (currently being
installed). This is a class of system often referred to as a Beowulf cluster,
named after one of the first systems to demonstrate this concept. The cluster
is listed on the November 2002 Top 500 list (www.top500.org)
as the 89th fastest computer in the world.
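For readers who want the per-node breakdown, it follows directly from the totals quoted above; the snippet below simply does the arithmetic (treating "over nine Terabytes" as exactly nine for the estimate).

# Per-node breakdown derived from the cluster totals quoted above.
nodes = 128
cpus_per_node = 2
total_memory_gb = 256
total_disk_tb = 9   # "over nine Terabytes", taken as exactly 9 here

print(f"Processors:      {nodes * cpus_per_node}")                # 256 CPUs total
print(f"Memory per node: {total_memory_gb / nodes:.0f} GB")       # 2 GB per node
print(f"Disk per node:   {total_disk_tb * 1024 / nodes:.0f} GB")  # about 72 GB per node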
Linux Advantage
C-SAFE chose to implement a Linux cluster for one primary reason: We could not
afford a traditional supercomputer of the class and scale of those purchased
by the DOE for the ASCP program. The cluster solution is based on commodity
PC technology; each node is very similar to a personal computer. This leverages
the mass-production of personal computers to provide a price/performance ratio
unattainable by most other technologies.
Other advantages of this approach include leveraging a growing familiarity
with the Linux operating system, and of course the fact that it is considered
an “in vogue” thing to do in Computer Science circles. We also intend to utilize
the cluster for real-time interaction with these long-running simulations, something
that would be difficult to do on a larger shared resource.
However, this power does not come without caveats. First, the application must
be able to perform well in this environment. A Linux cluster has communication
latencies (both hardware and software) that are quite large relative to a traditional
supercomputer. The C-SAFE software employs a novel approach to parallelism that
helps mitigate the effects of these latencies, and we performed preliminary
benchmarks that gave us confidence that we could achieve good performance on
a machine of this size. Second, the installation and setup of a cluster can
be quite complex and time-consuming.
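We do not describe the C-SAFE parallelization scheme in detail here, but a common way to cope with high network latency on a cluster is to overlap communication with computation using non-blocking messages. The mpi4py sketch below is a minimal illustration of that general technique; the array sizes and neighbor pattern are arbitrary, and it is not the scheme used in the C-SAFE software itself.

# Minimal sketch of hiding communication latency by overlapping a halo
# exchange with interior computation, using non-blocking MPI calls via
# mpi4py. This illustrates the general technique only; it is not the
# parallel scheme used by the C-SAFE software.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

halo_out = np.full(1000, rank, dtype='d')   # boundary values to send to a neighbor
halo_in = np.empty(1000, dtype='d')         # buffer for the neighbor's boundary values

# Start the exchange with neighboring ranks, but do not wait for it yet.
right, left = (rank + 1) % size, (rank - 1) % size
requests = [comm.Isend(halo_out, dest=right), comm.Irecv(halo_in, source=left)]

# Work on the interior of the local subdomain while the messages are in flight.
interior = np.random.rand(500, 500)
interior_sum = interior.sum()

# Block only when the boundary data is actually needed.
MPI.Request.Waitall(requests)
print(f"rank {rank}: interior sum {interior_sum:.2f}, halo received from rank {left}")

Run under mpirun with a few processes, each rank computes on its interior while its boundary data is still traveling over the network, so the latency is largely hidden behind useful work.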
Fortunately, Dell provided the physical installation of our cluster, which
saved considerable time. Installation of operating systems on each of the 128
nodes, as well as installation and setup of additional HPC software, consumed
considerable additional time.
Numerous companies, both large and small, offer computational clusters of this
type. The University of Utah chose Dell to implement our cluster because of
a competitive price/performance ratio, company reputation, and the features of
the PowerEdge 2550. As noted above, Dell's physical installation and setup of
the cluster immensely reduced the work required to deploy the machine. The cluster
fills seven standard racks and stands about six feet tall, four feet deep and
14 feet long. It produces a considerable amount of heat and fan noise.
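That heat can be estimated roughly from the node count alone; the sketch below assumes a ballpark draw of 300 watts per dual-processor server, which is our guess rather than a measured figure.

# Rough power and cooling estimate for a 128-node cluster. The per-node
# wattage is an assumed ballpark figure, not a measurement of these servers.
nodes = 128
watts_per_node = 300   # assumed draw for a dual-CPU server under load

total_kw = nodes * watts_per_node / 1000
btu_per_hour = total_kw * 1000 * 3.412   # 1 watt dissipates about 3.412 BTU/hr

print(f"Estimated power draw: {total_kw:.1f} kW")         # ~38 kW
print(f"Estimated heat load:  {btu_per_hour:,.0f} BTU/hr")

Under those assumptions the machine room must absorb on the order of 130,000 BTU per hour, which is one reason the planning discussed below matters.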
Lessons Learned
Those who are considering a large compute cluster should keep in mind several
lessons that we have learned along the way (either from others or from the school
of hard knocks).
First, start small; we leveraged expertise from building smaller clusters before
we considered a project of this size. Second, do not underestimate the time
involved in physical installation and software setup of a machine of this nature.
Depending on the complexity of the required environment, this can take weeks
to months. Third, planning is critical. Allowances must be made for the
power consumption, heat production and physical footprint of the machine.
Finally, the biggest lesson is that it can actually work; we are performing
simulations that, just five years ago, required computers costing nearly 100
times more. This gives us hope that our largest simulations may be achievable
on a relatively common platform in the not-too-distant future. Early access
to this magnitude of resources from the DOE was critical to the development
of these simulations. Once the hurdles are cleared, a high-performance cluster
can provide a dramatic quantity of compute cycles for complex scientific computations
such as those performed for C-SAFE research.
For information contact Steven Parker, Research Assistant Professor, University
of Utah, at [email protected].