Computing Clusters: Sometimes You Can't Make It On Your Own

11/18/2005

Ever wondered what it takes to work in a high-performance computing environment? According to John Towns, senior associate director of the Persistent Infrastructure Directorate at NCSA, no other job can prepare you: you arrive with particular qualifications, then get on-the-job training.

At NCSA, to obtain a position on the academic professional staff, you're required to have a Bachelor's in Science. Frequently, staff members have worked in a research environment, possibly as graduate students. Their backgrounds may include the hard sciences, engineering, or computer science. All have 'an affinity for computers' and an interest in the high end, whether that is 'compute systems, networks, visualization, or data storage,' Towns says. Frequently, they don't work normal hours, which is an advantage in an environment that runs 24 hours a day.

Shared-Memory Machines vs. Clusters
One misunderstanding that can crop up about clusters is that they replace the old-style mainframe-type or mass-storage computers. In reality, each setup suits a specific type of computing work. 'Gene sequencing is fairly trivially parallelized. You can spread it across a cluster and use the resources well,' Towns says. In other words, every processor is, practically speaking, running a separate copy of the application, and no processor needs to talk much to the others.

CCR's Miller refers to this as 'capacity computing.' 'In order for a scientist to solve a certain problem, they may need to run a thousand or ten thousand simulations, each of which is best run on a single CPU... It's the aggregate of all those results that will solve their scientific problems.'
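To make the idea of capacity computing concrete, here is a minimal Python sketch. It is an illustration only, not anything used at NCSA or CCR: the function names (run_simulation, run_batch, aggregate) and the toy Monte Carlo estimate of pi are assumptions chosen for the example. Each run is independent, fits on one CPU, and never talks to the others; only the aggregate matters.

```python
import random
from concurrent.futures import ProcessPoolExecutor
from statistics import mean


def run_simulation(seed, samples=10_000):
    """One independent single-CPU job (hypothetical): estimate pi by sampling."""
    rng = random.Random(seed)  # private generator, so runs share no state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / samples


def aggregate(results):
    """Combine the independent results; here, a plain average."""
    return mean(results)


def run_batch(seeds, workers=None):
    """Farm the runs out to worker processes, one simulation per task."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize only batches task hand-off; the runs stay independent.
        return list(pool.map(run_simulation, seeds, chunksize=16))
```

From a script, something like aggregate(run_batch(range(1000))) would run a thousand simulations and average them; on platforms that start worker processes by re-importing the script, that call belongs under an `if __name__ == "__main__":` guard. On a real cluster the process pool would typically be replaced by a batch scheduler, but the shape of the work is the same.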

Another class of programs runs as a single application. As Towns explains it, 'Imagine that you're running simulations, and... you create a grid that represents a space and something happens in that space. Often where something interesting is happening, you need to redefine the spacing on the grid to accurately represent what's [going on]. A class of applications has been developed that in a dynamic way redefines the spacing in the grid where it needs to... If you try to represent the entire grid at the finest resolution, you don't have a big enough memory machine to do it. What you do is refine it where it's necessary... You have some nodes that have a lot of work to do and some that don't. In a shared memory system, you can easily redistribute that work among the processors, so you can keep them all busy together, and move the application along much more quickly.'
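The refinement Towns describes can be sketched in one dimension. The Python below is a toy under stated assumptions, not the applications he is referring to: the steep front at x = 0.3, the tolerance, the eight coarse starting cells, and the function names are all invented for illustration. It splits a cell only where the function changes quickly, then counts how many cells land in each of four equal-width slices of the domain, as if each slice belonged to one node.

```python
import math


def front(x):
    """Hypothetical field: flat almost everywhere, with a steep front at 0.3."""
    return math.tanh((x - 0.3) / 0.01)


def refine(f, left, right, tol=0.1, max_depth=10):
    """Split [left, right] in half wherever f varies by more than tol."""
    mid = 0.5 * (left + right)
    variation = max(abs(f(mid) - f(left)), abs(f(right) - f(mid)))
    if max_depth == 0 or variation <= tol:
        return [(left, right)]
    return (refine(f, left, mid, tol, max_depth - 1)
            + refine(f, mid, right, tol, max_depth - 1))


def adaptive_grid(f, base=8):
    """Start from `base` equal coarse cells on [0, 1] and refine each one."""
    cells = []
    for i in range(base):
        cells += refine(f, i / base, (i + 1) / base)
    return cells


def cells_per_slice(cells, slices=4):
    """Count cells whose midpoint lies in each equal-width slice of [0, 1]."""
    counts = [0] * slices
    for left, right in cells:
        mid = 0.5 * (left + right)
        counts[min(int(mid * slices), slices - 1)] += 1
    return counts


cells = adaptive_grid(front)
finest = min(right - left for left, right in cells)
print(len(cells), "adaptive cells vs", round(1.0 / finest), "uniform cells")
print("cells per equal-width slice:", cells_per_slice(cells))
# Dealing cells out round-robin evens the load, which is cheap to do
# when every processor can reach every cell in shared memory.
print("cells per worker, shared pool:", [len(cells[i::4]) for i in range(4)])
```

The first line of output shows the memory argument: the adaptive grid needs far fewer cells than a uniform grid at the finest spacing. The second shows the imbalance, with the slice containing the front holding most of the cells. The last line only counts cells; it stands in for the redistribution that a shared-memory system makes easy and does not itself run anything in parallel.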

TeraGrid

Grid computing is currently an active area of research. The TeraGrid is an effort by the National Science Foundation (www.nsf.gov) to build and deploy the world's largest distributed infrastructure for open scientific research. In practical terms, that means giving the user community a common environment and interface to a rather diverse set of high-performance computing resources, including clusters. In some cases it also involves connecting the physical systems so they can be used in conjunction with each other, and providing similar environments so that researchers can move between systems more easily.