Computing Clusters continued, page 2 of 3

Currently, only about two-thirds of U2 is powered up at any one time, since the system is still undergoing testing. Once the Center gets the older system, Joplin, updated to the same version of Red Hat that U2 is running, says Miller, it will be integrated into U2's queuing system. "The [capacity] computing jobs will be automatically routed to what was called Joplin, those nodes; the [capability] computing jobs will be automatically routed to U2; and it'll be transparent to the users. It'll be one queue that's essentially an Intel (www.intel.com) Pentium line of chips." The only change for users, explains Miller, is that they'll need to answer a couple of extra questions about their projects up front, such as "whether the code they need to run is 32-bit or 64-bit compatible, which means it can run in both places, or if it's just 32-bit or just 64-bit. And we'll be able to route things to the appropriate nodes."
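The routing rule Miller describes can be sketched in a few lines. This is an illustrative stand-in, not CCR's actual scheduler; the function and pool names are invented for the example, under the assumption that jobs declaring compatibility with both architectures go to the capacity nodes.

```python
def route_job(arch: str) -> str:
    """Pick a node pool for a job based on its declared architecture.

    arch: "32" (32-bit only), "64" (64-bit only), or "both".
    Pool names are hypothetical labels for the two sets of nodes.
    """
    if arch == "32":
        return "joplin"   # older nodes running 32-bit code
    if arch == "64":
        return "u2"       # newer 64-bit nodes
    if arch == "both":
        # Runs anywhere; send it to the capacity nodes so the
        # capability nodes stay free for the large parallel jobs.
        return "joplin"
    raise ValueError(f"unknown architecture: {arch}")

print(route_job("64"))  # u2
```

The point of the extra up-front questions is exactly this dispatch: once a job's binary compatibility is known, the single queue can place it without the user ever naming a machine.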

Getting Access to the Power
To gain access, CCR has "basic minimal requirements," says Miller. The user describes the project and required resources in terms of storage and compute power. "We support research, discovery, and scholarship that requires high-end computational resources. That's obviously a moving target." In fact, he says, every six months or so, the center redefines what it means by "high-end" in terms of data, networking, visualization, or computing requirements. If a project requires fewer than, say, 16 processors running concurrently, Miller and his team will probably kick the request back to the individual to take back to his or her lab, department, or school for handling. An advisory committee evaluates troublesome proposals, but most of the time, the decisions are "obvious."

At NCSA, the process for gaining access is more formal. The proposal process is modeled after the NSF process, says Towns, which involves, for large requests, a peer review performed by a national review committee that meets quarterly. These are proposals requiring in excess of 200,000 CPU hours per year. Smaller requests, from 10,000 to 20,000 hours of time, are considered "development accounts" for start-up projects. Reviewed and awarded continuously, the smaller accounts allow researchers to try out their applications on the system and understand performance characteristics in preparation for submitting larger proposals.

What's a CPU hour in cluster terms? According to Towns, it's equivalent to one hour of time on a single processor of a node. Since these are dual-processor nodes, there are a total of 2,480 processors on Tungsten. If a project is running on 64 nodes, which is 128 processors, and it runs for one hour, the user has accumulated 128 service units, or CPU hours.
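The arithmetic Towns describes is simple enough to write down directly. A minimal sketch (the function name is ours, not NCSA's accounting code):

```python
def cpu_hours(nodes: int, procs_per_node: int, hours: float) -> float:
    """Service units consumed = processors used x wall-clock hours."""
    return nodes * procs_per_node * hours

# Tungsten's dual-processor nodes: 64 nodes = 128 processors,
# so one hour of wall-clock time costs 128 service units.
print(cpu_hours(nodes=64, procs_per_node=2, hours=1))  # 128
```

Note that a job is charged for every processor it holds for the full wall-clock duration, whether or not each processor stays busy, which is why efficient parallel code matters to a user's allocation.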

Neither organization charges its academic users for the time they use on the clusters. In the case of NCSA, Towns says, users are granted allocations of time as part of grant awards, and their usage is billed against those allocations.

CCR considers its clusters part of the "university infrastructure," says Miller, "to support leading-edge science research."

Both centers also attract funds from academic users who have budgeted compute time in their grant proposals to cover expanded demands on staff time; compensation also comes from commercial users of the resources.

The NCSA has about 100 staff members and another "15 or 20" graduate students working in its center to provide 24/7 support for its community of users. CCR has a technical team of 13, consisting of operations people and computational scientists. The former, system administrators, keep the systems running, and the latter work closely with users to figure out, for example, what applications are needed for a particular project or to help optimize code.

Miller estimates that about half of the applications running on the CCR computers are off the shelf: code that has been paid for, or is freeware or shareware. The other half is "home-grown." In the case of NCSA, Towns says, "By and large [the majority of our users are faculty researchers] using applications they've developed to solve the problems that they're attacking."

What It Takes To Work in Academic Computing

Ever wondered what it takes to work in a high-performance computing environment? According to John Towns, senior associate director, Persistent Infrastructure Directorate at NCSA, there's no other job that can prepare you; you arrive with particular qualifications, then get on-the-job training.

At NCSA, to obtain a position on the academic professional staff, you're required to have a Bachelor of Science degree. Frequently, staff members have worked in a research environment, possibly as a graduate student. Their backgrounds may include hard sciences, engineering, or computer science. All have "an affinity for computers" and an interest in the high end, whether that is "compute systems, networks, visualization, or data storage," Towns says. Frequently, they don't work normal hours, which is an advantage in an environment that runs 24 hours a day.

Shared-Memory Machines vs. Clusters
One misunderstanding that can crop up about clusters is that they replace the old-style mainframe-type or mass-storage computers. In reality, each setup is advantageous for a specific type of computing work. "Gene sequencing is fairly trivially parallelized. You can spread it across a cluster and use the resources well," Towns says. In other words, every processor is, practically speaking, running a separate copy of the application, and no processor needs to talk much to the others.

CCR's Miller refers to this as "capacity computing." "In order for a scientist to solve a certain problem, they may need to run a thousand or ten thousand simulations, each of which is best run on a single CPU... It's the aggregate of all those results that will solve their scientific problems."
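The capacity-computing pattern Miller describes can be sketched with a toy workload: many independent trials, none of which communicates with the others, aggregated only at the end. The simulation here is a placeholder for a real scientific code, and all names are illustrative.

```python
import random
from multiprocessing import Pool

def run_simulation(seed: int) -> float:
    """One independent trial; it never needs to talk to other trials."""
    rng = random.Random(seed)
    # toy Monte Carlo: estimate the mean of a noisy measurement
    return sum(rng.gauss(1.0, 0.1) for _ in range(1000)) / 1000

if __name__ == "__main__":
    # Each trial could just as well run on a separate cluster node;
    # a local process pool stands in for the scheduler here.
    with Pool() as pool:
        results = pool.map(run_simulation, range(100))
    # the aggregate of all the results answers the scientific question
    print(f"mean over {len(results)} trials: {sum(results) / len(results):.3f}")
```

Because the trials share nothing, a scheduler can scatter them across whatever nodes happen to be free, which is exactly why this class of work uses a cluster's resources so well.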

Another class of programs runs as a single application. As Towns explains it, "Imagine that you're running simulations, and... you create a grid that represents a space and something happens in that space. Often where something interesting is happening, you need to redefine the spacing on the grid to accurately represent what's [going on]. A class of applications has been developed that in a dynamic way redefines the spacing in the grid where it needs to... If you try to represent the entire grid at the finest resolution, you don't have a big enough memory machine to do it. What you do is refine it where it's necessary... You have some nodes that have a lot of work to do and some that don't. In a shared memory system, you can easily redistribute that work among the processors, so you can keep them all busy together, and move the application along much more quickly."
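The adaptive refinement Towns describes can be illustrated with a one-dimensional toy: subdivide only the intervals where the solution changes rapidly, instead of representing the whole grid at the finest resolution. This is a minimal sketch of the idea, not NCSA code; the refinement test (linear-interpolation error at the midpoint) is our own simplification.

```python
import math

def refine(f, a: float, b: float, tol: float,
           depth: int = 0, max_depth: int = 10) -> list:
    """Return grid points on [a, b], splitting where f varies sharply."""
    mid = (a + b) / 2
    # crude "something interesting is happening here" test:
    # how badly does a straight line between the endpoints miss f(mid)?
    error = abs(f(mid) - (f(a) + f(b)) / 2)
    if error > tol and depth < max_depth:
        left = refine(f, a, mid, tol, depth + 1, max_depth)
        right = refine(f, mid, b, tol, depth + 1, max_depth)
        return left[:-1] + right   # drop duplicated midpoint
    return [a, b]

# a sharp feature near x = 0.5 attracts most of the grid points,
# while the flat regions keep a coarse spacing
grid = refine(lambda x: math.exp(-((x - 0.5) ** 2) * 200), 0.0, 1.0, tol=0.01)
print(f"{len(grid)} points; most cluster near x = 0.5")
```

The load-balancing problem Towns raises follows directly: the nodes owning the refined region suddenly have far more cells to compute than the rest, and a shared-memory machine can shuffle that work among processors far more cheaply than a cluster can.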

TeraGrid

A lot of research in grid computing is currently taking place. The TeraGrid is an effort by the National Science Foundation (www.nsf.gov) to build and deploy the world's largest distributed infrastructure for open scientific research. In practical terms, that means developing a common environment and interface that gives the user community access to a rather diverse set of high-performance computing resources, including clusters. In some cases it also involves hooking up the physical systems to be used in conjunction with each other, and providing similar environments so that researchers can move between systems more easily.

As John Towns, senior associate director, Persistent Infrastructure Directorate at NCSA, explains, many researchers have multi-stage applications that require different kinds of computing architectures to solve. "So the TeraGrid is attempting to facilitate the use of these multiple architectures-often sited at different locations-to support their research efforts."

Learn more about the TeraGrid at www.teragrid.org. Learn more about grid computing at www.ggf.org.
