Computing Clusters continued, page 3 of 3

11/21/2005

Computer Years
Interestingly, both types of large systems have a fairly short life: about three to five years, usually on the lower end, according to Towns. After that, the maintenance and operational costs for the hardware become too high and it's time to "simply buy a new system." The NCSA looks at a doubling of computational resources every one to two years, and that's matched by demand. That means the center is in some stage of acquiring new equipment every year.
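To put those figures side by side, here is a rough back-of-the-envelope sketch. It assumes smooth exponential growth, which Towns does not claim, and the function name growth_factor is ours, chosen only for illustration. It estimates how much computational resources would multiply over one system's lifetime at the doubling rates quoted above.

```python
def growth_factor(years, doubling_period_years):
    """Multiple by which resources grow over `years`, assuming they
    double every `doubling_period_years` (smooth exponential growth)."""
    return 2 ** (years / doubling_period_years)


# System lifetimes and doubling periods quoted in the article.
for life in (3, 5):
    for period in (1, 2):
        factor = growth_factor(life, period)
        print(f"{life}-year life, doubling every {period} yr: about {factor:.1f}x")
```

Under these assumptions, resources grow somewhere between roughly 3x and 32x over the life of a single system, which is consistent with a center that is always in some stage of acquiring new equipment.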

NCSA expects to submit a proposal looking for either a $15 million or $30 million system, with responses due in February. The winning bidder (Towns says there are typically only three or four companies bidding) will be required to have its solution in "substantive production state" by about March 2007. From there, he says, "there's the whole procurement process, deployment, testing, and putting it into production."

The installation and testing process has many stages and is time-consuming. In the case of Tungsten, the Dell cluster at NCSA, the center received the hardware over the course of two months. During that time, it arrived on large trucks and was unpacked, set up, configured, and tested. But that was preceded by several months of software and hardware testing in-house at Dell.

Once the installation at the client site took place, NCSA did a lot of applications testing to verify that the system actually worked. Then, before it went to production state, the center opened up the equipment to what Towns calls the "friendly user period." This lasts about two months and allows the general user community to compile and test their applications. There's no charge to their time allocations for this, but they also need to understand that the system might be unstable until all the issues are worked out. "It's a good way for us to shake down the system before we go to production," says Towns.

For the cluster installation at CCR, Dell was the prime contractor, and it subcontracted aspects of the project to other vendors, including Myricom (www.myri.com) for Myrinet network communications, EMC (www.emc.com) for storage arrays, Force 10 Networks (www.force10networks.com) for switch/routers, and IBRIX (www.ibrix.com) for file storage. When the system was delivered, the prime contractor orchestrated the installation, "making sure the right people from the right vendor showed up at the right time," says Miller. Why Dell? When CCR went out to bid, Miller recalls, "We met with all the vendors and had discussions, and Dell was clearly head and shoulders above anything we were looking for at that point in time."

Managing clusters is mostly automated through job schedulers, batch schedulers, and other resource managers and monitors. At CCR, for example, "Larry" and "Adam," the aforementioned administrative nodes, monitor all of the cluster nodes and "continually ask them, 'Are you still up? Are you still running? Are you healthy?'" says Miller. When problems arise (a file system gets full or a network link goes down), the human system administrators get notified. If a node ceases to work or a power supply "explodes," he says, the job scheduler will continue scheduling, but not on that node.
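The pattern Miller describes, in which monitors poll the nodes and the scheduler routes around failures, can be sketched in a few lines. This is a minimal illustration only, not CCR's actual software: the names Node, poll, and schedule, the probe callback, and the least-loaded placement rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True
    jobs: list = field(default_factory=list)


def poll(nodes, probe):
    """Ask every node "are you healthy?" via the hypothetical `probe` callback.

    Updates each node's health flag and returns the names of nodes that
    were healthy before this poll but failed it, so admins can be notified.
    """
    newly_failed = []
    for node in nodes:
        ok = probe(node.name)
        if node.healthy and not ok:
            newly_failed.append(node.name)
        node.healthy = ok
    return newly_failed


def schedule(job, nodes):
    """Place `job` on the healthy node with the fewest jobs.

    Returns the chosen node's name, or None if no node is healthy.
    Unhealthy nodes are skipped, so scheduling continues around failures.
    """
    candidates = [n for n in nodes if n.healthy]
    if not candidates:
        return None
    target = min(candidates, key=lambda n: len(n.jobs))
    target.jobs.append(job)
    return target.name
```

For example, with three nodes where the probe reports the second one as down, poll returns that node's name for notification, and later calls to schedule place work only on the other two. A production resource manager would add timeouts, retries, and requeueing of jobs that were running on the failed node, none of which is shown here.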