Computer Years
Interestingly, both types of large systems have a fairly short life of about three
to five years, usually on the lower end, according to Towns. After that, the
maintenance and operational costs for the hardware become too high and it's
time to "simply buy a new system." The NCSA anticipates a doubling of
computational resources every one to two years, and that's matched by demand.
That means the center is in some stage of acquiring new equipment every year.
NCSA expects to submit a proposal looking for either a $15 million or $30 million
system, with responses due in February. The winning bidder (Towns says there
are typically only three or four companies bidding) will be required to have
its solution in "substantive production state" by about March 2007.
From there, he says, "there's the whole procurement process, deployment,
testing, and putting it into production."
The installation and testing process has many stages and is time-consuming.
In the case of Tungsten, the Dell cluster at NCSA, the center received the hardware
over the course of two months. During that time, it arrived on large trucks and was
unpacked, set up, configured, and tested. But that was preceded
by several months of software and hardware testing in-house at Dell.
Once the installation at the client site took place, NCSA did a lot of applications
testing to verify that the system actually worked. Then, before it went to production
state, the center opened up the equipment to what Towns calls the "friendly
user period." This lasts about two months and allows the general user community
to compile and test their applications. There's no charge to their time allocations
for this, but users also need to understand that the system might be unstable until all
the issues are worked out. "It's a good way for us to shake down the system
before we go to production," says Towns.
For the cluster installation at CCR, Dell was the prime contractor, and it
subcontracted aspects of the project to other vendors, including Myricom
(www.myri.com) for Myrinet network communications, EMC (www.emc.com) for
storage arrays, Force 10 Networks (www.force10networks.com) for switch/routers,
and IBRIX (www.ibrix.com) for file storage. When it was delivered, the prime
orchestrated installation,
"making sure the right people from the right vendor showed up at the right
time," says Miller. Why Dell? When CCR went out to bid, Miller recalls,
"We met with all the vendors and had discussions, and Dell was clearly
head and shoulders above anything we were looking for at that point in time."
Managing clusters is mostly automated through job schedulers, batch schedulers,
and other resource managers and monitors. At CCR, for example, "Larry"
and "Adam," the aforementioned administrative nodes, monitor all of
the cluster nodes and "continually ask them, 'Are you still up? Are you
still running? Are you healthy?'" says Miller. When problems arise, such as a file
system filling up or a network link going down, the human system administrators
get notified. If a node ceases to work or a power supply "explodes,"
he says, the job scheduler will continue scheduling, but not on that node.
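To make that idea concrete, here is a minimal Python sketch of this style of health-check loop, assuming a simple ping-based "are you still up?" probe. The node names, polling interval, and the notify_admins() helper are hypothetical stand-ins for illustration only, not CCR's actual administrative tooling or batch scheduler.

#!/usr/bin/env python3
"""Illustrative sketch of an administrative node's health-check loop.

Hypothetical example: node names, thresholds, and notification behavior
are placeholders, not a description of any production system.
"""

import subprocess
import time

# Hypothetical compute nodes the administrative host watches.
NODES = ["node001", "node002", "node003"]

# Nodes the scheduler should skip until an administrator intervenes.
offline = set()


def node_is_up(host):
    """Ask a node 'Are you still up?' with a single ping."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def notify_admins(message):
    """Stand-in for paging or e-mailing the human system administrators."""
    print(f"[ALERT] {message}")


def monitor_once():
    """One monitoring pass: check every node and update scheduling state."""
    for host in NODES:
        if node_is_up(host):
            # A recovered node can rejoin the scheduling pool.
            offline.discard(host)
        elif host not in offline:
            # Failed node: tell the humans, and stop scheduling jobs on it.
            notify_admins(f"{host} is not responding; removing from pool")
            offline.add(host)


def schedulable_nodes():
    """The scheduler keeps scheduling, but not on nodes marked offline."""
    return [h for h in NODES if h not in offline]


if __name__ == "__main__":
    while True:
        monitor_once()
        print("Schedulable nodes:", schedulable_nodes())
        time.sleep(60)  # polling interval; real monitors are often event-driven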
The Edge Keeps Moving in High-Performance Computing
Clusters are by no means the final word in high-performance computing. As Towns
points out, each research community "has a set of applications that have
different system architecture requirements in order to execute and perform well."
Recently, the NSF issued a solicitation for systems totaling $30 million. In
performance terms, according to Miller, that's "roughly 10 times U2's speed."
Then four years from now they want it to be "100 to a thousand times what
U2's speed is." As he points out, "There's simply no shortcut in terms
of solving some of these big (what they used to call 'grand challenge') problems
without big machines. If you're looking at whatever it may be, the physics of
the universe, or biochemical processes in the brain, or analyzing the spread
of infections... they just require massive amounts of computing power."
From humble beginnings as commodity devices, equipment that once existed only
on the desktop will continue proving its mettle in dazzling high-performance
computing clusters.