University of Utah: C-SAFE Uses Linux HPCC in Fire Research

A high performance computing cluster (HPCC) is an increasingly popular method for delivering computational power to CPU-hungry scientific applications. A cluster consists of several commodity personal computer systems (desktop or rackmount) connected with a commodity or special-purpose network. The low cost of these systems makes them an attractive option for a wide range of scientific applications.

In 1997, the University of Utah created an alliance with the Department of Energy (DOE) Accelerated Simulation and Computing Program (ASCP), formerly ASCI, to form the Center for the Simulation of Accidental Fires and Explosions (C-SAFE, www.csafe.utah.edu). C-SAFE focuses specifically on providing state-of-the-art, science-based tools for the numerical simulation of accidental fires and explosions, especially within the context of the handling and storage of highly flammable materials. The primary objective of C-SAFE is to provide a software system comprising a Problem Solving Environment in which fundamental chemistry and engineering physics are fully coupled with non-linear solvers, optimization, computational steering, visualization, and experimental data verification. The availability of simulations using this system will help to better evaluate the risks and safety issues associated with fires and explosions. Our goal is to integrate and deliver a system that is validated and documented for practical application to accidents involving both hydrocarbon and energetic materials. Efforts of this nature require expertise from a wide variety of academic disciplines.

Additional Resources
Simulation of an explosive in a fire involves computing thermodynamics, fluid mechanics, chemical reactions, and other physical processes at each of millions to billions of points in the simulation domain. These scenarios require significant computing resources to perform accurately. Simulations performed to date employed hundreds to thousands of processors for days at a time to simulate portions of this problem. Fortunately, we were also granted time on the DOE ASCP computers at Los Alamos, Lawrence Livermore and Sandia National Laboratories. These machines are some of the fastest in the world, and we received considerable allocations of CPU time as a part of the C-SAFE grant. However, sometimes these resources are not enough.
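
To get a feel for the scale involved, consider the rough estimate below. The cell count and the number of variables per cell are illustrative assumptions, not C-SAFE's actual problem sizes; the point is simply how quickly the memory requirement grows.

/* A rough scale estimate with assumed figures (not C-SAFE's actual
 * problem sizes): how much memory does it take just to hold the state
 * of a uniform grid with a billion cells? */
#include <stdio.h>

int main(void)
{
    const double cells         = 1.0e9; /* one billion grid points (assumed) */
    const double vars_per_cell = 20.0;  /* state variables per cell (assumed) */
    const double bytes_per_var = 8.0;   /* double precision */

    double total_gb = cells * vars_per_cell * bytes_per_var / 1.0e9;
    printf("state alone: %.0f GB of memory\n", total_gb);
    /* Roughly 160GB just to hold one copy of the state -- far beyond any
     * single workstation, before work arrays or ghost copies are counted. */
    return 0;
}

Even this simplistic estimate lands far beyond what one machine can hold, which is why the domain must be split across hundreds or thousands of processors.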

Recently, the University of Utah purchased a Linux cluster that will be used to augment the resources provided to us through the DOE. The cluster will be used for development and debugging of the simulation, parameter studies, and small- to medium-scale simulations. Large-scale computations will still require use of the larger DOE ASCP computing resources. Unlike the shared DOE resources, the cluster will be largely devoted to C-SAFE research 365 days a year.

This cluster consists of 128 Dell PowerEdge 2550 servers, each containing two Pentium 4 Xeon processors. The entire cluster contains 256 gigabytes of main memory and over nine terabytes of disk space. Each server (node) runs the Linux operating system and is networked via Gigabit Ethernet (currently being installed). This is a class of system often referred to as a Beowulf cluster, named after one of the first systems to demonstrate this concept. The cluster is listed on the November 2002 Top 500 list (www.top500.org) as the 89th fastest computer in the world.
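
A Beowulf cluster of this kind typically appears to applications as a set of independent nodes tied together by a message-passing library such as MPI; the specific communication layer on this machine is not detailed here, so the following is a generic sketch. A minimal first test after bringing the nodes up is to have every process report its rank and host name:

/* Minimal MPI sketch: each process reports where it is running.
 * Assumes an MPI implementation is installed; this is a generic
 * illustration, not part of the C-SAFE software. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("rank %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with something like mpirun -np 256, it should print one line per processor across the cluster.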

Linux Advantage
C-SAFE chose to implement a Linux cluster for one primary reason: We could not afford a traditional supercomputer of the class and scale of those purchased by the DOE for the ASCP program. The cluster solution is based on commodity PC technology; each node is very similar to a personal computer. This leverages the mass-production of personal computers to provide a price/performance ratio unattainable by most other technologies.

Other advantages of this approach include leveraging a growing familiarity with the Linux operating system, and of course the fact that it is considered an “in vogue” thing to do in Computer Science circles. We also intend to utilize the cluster for real-time interaction with these long-running simulations, something that would be difficult to do on a larger shared resource.

However, this power does not come without caveats. First, the application must be able to perform well in this environment. A Linux cluster has communication latencies (both hardware and software) that are quite large relative to a traditional supercomputer. The C-SAFE software employs a novel approach to parallelism that helps mitigate the effects of these latencies, and preliminary benchmarks gave us confidence that we could achieve good performance on a machine of this size. Second, the installation and setup of a cluster can be quite complex and time-consuming.
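
The details of the C-SAFE parallelism approach are beyond this article, but one common way to soften high communication latencies is to overlap communication with computation. The hedged sketch below uses non-blocking MPI calls on a toy one-dimensional grid: each process posts its boundary ("ghost") exchange first, updates interior cells while the messages are in flight, then finishes the two boundary cells once the data arrives. It is a generic illustration, not the C-SAFE scheduler.

/* Latency-hiding sketch: overlap a halo exchange with interior work. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000   /* cells per process; illustrative only */

int main(int argc, char **argv)
{
    static double u[N], unew[N];
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;
    double ghost_left = 0.0, ghost_right = 0.0;

    for (int i = 0; i < N; i++)
        u[i] = (double)rank;        /* arbitrary initial data */

    MPI_Request req[4];

    /* Post the boundary exchange first... */
    MPI_Irecv(&ghost_left,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&ghost_right, 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&u[0],        1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[N - 1],    1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

    /* ...then update interior cells, which need no remote data,
       while the messages are in flight. */
    for (int i = 1; i < N - 1; i++)
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

    /* Wait for the ghost values, then finish the two boundary cells. */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    unew[0]     = 0.5 * (ghost_left + u[1]);
    unew[N - 1] = 0.5 * (u[N - 2] + ghost_right);

    if (rank == 0)
        printf("step complete on %d ranks\n", size);
    MPI_Finalize();
    return 0;
}

The larger the interior is relative to the boundary, the more of the communication latency this idiom can hide.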

Fortunately, Dell provided the physical installation of our cluster, which saved considerable time. Even so, installing the operating system on each of the 128 nodes and setting up the additional HPC software consumed a great deal of additional time.

Numerous companies, both large and small, offer computational clusters of this type. The University of Utah chose Dell to implement our cluster because of a competitive price/performance ratio, company reputation, and the features of the PowerEdge 2550. As noted above, Dell's physical installation and setup of the cluster immensely reduced the work required to deploy the machine. The cluster fills seven standard racks; the full installation stands about six feet tall, four feet deep and 14 feet long, and it produces a considerable amount of heat and fan noise.

Lessons Learned
Those who are considering a large compute cluster should keep in mind several lessons that we have learned along the way (either from others or from the school of hard knocks).

First, start small; we leveraged expertise from building smaller clusters before we considered a project of this size. Second, do not underestimate the time involved in the physical installation and software setup of a machine of this nature; depending on the complexity of the required environment, this can take weeks to months. Third, planning is critical: the power consumption, heat production and physical footprint of the machine must all be accounted for before it arrives, as the sketch below illustrates.
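
For a sense of the planning arithmetic, the rough sketch below estimates the electrical load and cooling requirement for a 128-node cluster; the per-node wattage and networking overhead are assumed figures, since no actual power ratings are given here.

/* Rough machine-room planning estimate with assumed figures. */
#include <stdio.h>

int main(void)
{
    const int    nodes        = 128;
    const double watts_node   = 300.0; /* assumed average draw per 2U server */
    const double switch_watts = 500.0; /* assumed network and console gear */

    double total_kw = (nodes * watts_node + switch_watts) / 1000.0;
    double btu_hr   = total_kw * 3412.0; /* 1kW of load is about 3,412 BTU/hr of heat */

    printf("electrical load: %.1f kW\n", total_kw);
    printf("cooling needed : %.0f BTU/hr\n", btu_hr);
    return 0;
}

Under these assumptions, roughly 39kW of electrical load becomes over 130,000 BTU per hour of heat that the machine room's cooling must carry away.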

Finally, the biggest lesson is that it can actually work: we are performing simulations that, just five years ago, required computers costing nearly 100 times more. This gives us hope that our largest simulations may be achievable on a relatively common platform in the not-too-distant future. Early access to this magnitude of resources from the DOE was critical to the development of these simulations. Once the hurdles are cleared, a high-performance cluster can provide a dramatic quantity of compute cycles for complex scientific computations such as those performed for C-SAFE research.

For information contact Steven Parker, Research Assistant Professor, University of Utah, at [email protected].
