What's Your HPC Game Plan?
- By Charlene O’Hanlon
In our high-performance computing roundtable, IT pros frankly discuss funding, centralization, silo cluster migration, and other pressing campus HPC issues. Fine-tune your supercomputing strategy here!
The University of Tennessee's Ragghianti (left), Jennings (right), and high-performance computing cluster
A university's supercomputing infrastructure certainly can put it on the map in the research and development arena. Yet developing a cutting-edge high-performance computing facility requires an intricate mix of grants, faculty startup dollars, external funding contracts, IT budget allocation, and internal marketing. Campus Technology recently conducted a "virtual roundtable" discussion with four IT leaders, each grappling with HPC challenges: Henry Neeman, director, OU Supercomputing Center for Education & Research (OSCER), University of Oklahoma; Jim Bottum, vice provost and CIO for computing and information technology, Clemson University (SC); Gerald Ragghianti, IT administrator for high-performance computing, The University of Tennessee-Knoxville; and Larry Jennings, IT manager, also from The University of Tennessee. Their challenges may be yours, too.
Campus Technology: Let's get a quick overview of how supercomputing efforts at each of your institutions are progressing, and how these initiatives affect you and your departments.
Henry Neeman: As director of the OU Supercomputing Center for Education & Research, a large part of my job is taken up by supercomputing. I also teach a course for the Computer Science department most semesters, but 90 percent of my work time is spent on supercomputing. Our department is a division of OUIT-University of Oklahoma Information Technology. We're a fairly small staff: There are four of us, three of whom are operations folks, so it's up to me to handle things like fundraising and whatnot. The overwhelming majority of our funding comes from OU's core IT budget.
Jim Bottum: I'm the CIO responsible for all of the traditional activities you would expect from that position, but Clemson has made high-performance computing a priority, as part of its academic roadmap. So, supercomputing is a significant emphasis area within my job.
Gerald Ragghianti: At The University of Tennessee, our centralized HPC effort is relatively new; it started about one-and-a-half years ago. I'm actually the first full-time staff member 100 percent dedicated to our centralized cluster; for [my colleague] Larry, it's not quite that much. I perform centralized administration of the cluster, and take care of speaking with researchers to customize our clusters to their needs.
CT: How does your institution find the majority of its funding for supercomputing?
Neeman: Prior to fall of 2001, before OSCER was founded, we had purchased several large HPC machines over a 15- to 20-year period. But each was funded individually, with one-shot external funding for purchase, and cobbled-together people time for maintenance. There was no mechanism for ongoing support (funds or staffing) for high-performance systems. Nor was there any mechanism for training faculty and staff to use such high-end resources. OSCER changed all that.
The founding of our supercomputing center also coincided very closely with the hiring of our first CIO, Dennis Aebersold. Dennis had a keen interest in pursuing a research computing component within IT, and his engagement and vision were crucial to the success of OSCER. Other key factors: our VP for Research T.H. Lee Williams' interest in providing support for computing-intensive research, along with strong interest on the part of the faculty and staff who wanted access to HPC resources. And finally, my own interest in teaching HPC helped make those resources more straightforward to use than would normally be the case.
"We are beginning to see departments wanting to buy in to the cluster that we're maintaining centrally. That helps faculty focus more on their own research, and it also allows us to provide more HPC resources for the rest of the academic computing on campus." -Larry Jennings, The University of Tennessee-Knoxville
When we first got started, some of the funding for OSCER came out of a contract with a private company involved in a weather project; weather forecasting involves a tremendous amount of supercomputing. A little bit of the funding, something on the order of 10 to 15 percent, came out of the research VP's office. The bulk of the funding came out of the core IT budget: the CIO's budget.
Now, the overwhelming majority of our funding is simply core IT budget. We look at research computing in general, and HPC in particular, as a driver for increasing the overall external funding base of the university. For example, because of the budget dollars invested in OSCER, we've seen external funding come in at a ratio of about 5 to 1. That is, for every dollar spent on OSCER, about five dollars comes in from external funding agencies for projects that make use of OSCER: organizations such as the National Science Foundation, the Department of Energy, Department of Defense, NASA, National Institutes of Health, the National Oceanic and Atmospheric Administration, state agencies, and private companies.
Less than 10 percent of our funding comes from grants. In late 2003, for instance, we received a modest-sized Major Research Instrumentation grant, on the order of $500,000, of which about $300,000 went to buying a small Itanium cluster. That's the last time we got a grant to pay for equipment. Currently, I have an NSF Cyberinfrastructure TEAM [Training, Education, Advancement, and Mentoring] grant, which focuses on finding better ways to teach supercomputing, rather than on funding infrastructure. In fact, OU is highly committed to teaching high-performance computing, as well as providing HPC resources.
Bottum: Clemson has invested in supercomputing as part of its academic plan, and there is a central contribution of funding dollars. For about a year, we have had a few co-location-type supercomputers at the data center. And three or four weeks ago, we put our first partnership computer on the floor and opened it up to friendly users. We call it the condo cluster. Eight or nine faculty have bought into it, contributing monies just toward the hardware, not for all the stuff behind it, such as the network fabric and the storage, which are funded by the university. It's about a 60/40 funding scenario: 60 percent is what came in through faculty grants and startup packages, and 40 percent represents the central investment.
CT: Do any of the colleges and departments within your institutions pursue their own HPC grants for individual projects, and then build out environments that are essentially silos, so that the rest of the university can't take advantage of all supercomputing resources? If so, is that a problem?
Neeman: That certainly does occur here at OU. I don't know that I would characterize it as a problem, though, because the HPC resources that an individual faculty member can put together are substantially less than the resources we can provide centrally. We would rather faculty participate in condominium computing and, if they do, we pay for everything except the nodes; they buy the nodes and we install them and deploy them within our cluster infrastructure. Of course, they have to purchase hardware that's compatible with what we have.
Currently, though, faculty are more accustomed to having their own systems, and while some are participating in condo computing, it's still probably less than 5 percent. When faculty members choose to go off on their own, then they're on their own. We are not in a position to provide the labor to maintain third-party systems. We also don't have space in the machine room for those systems, so that becomes the faculty's problem.
CT: When you are putting together your annual IT budget, are you looking at it from the point of view that supercomputing is a shared resource and you're going to support it financially, but the departments and colleges have the option of utilizing HPC resources or not?
Neeman: We don't force anybody to participate, but it's no trivial matter for faculty to find all the space, power, and cooling that they need for HPC systems. Perhaps more importantly, it's no small matter for them to secure labor qualified to do the work. If faculty have a mechanism for doing all that (and there are indeed some small third-party clusters on campus), we don't prevent them from using their own systems, and there are no hard feelings involved. In fact, some of the people who have their own third-party clusters are also active users on our system; but we don't help them with their own clusters, because we don't have the resources to do so.
Jennings: Much of what Henry is saying applies to us at UT. We are beginning to see researchers wanting to get away from handling their own clusters; it takes time away from their research. Now, departments want to buy in to the cluster that we're maintaining centrally. It helps them out, and it also allows us to provide more resources for the rest of the academic computing on campus. I think this movement away from silo clusters is something that's going to catch on here: Our Engineering department has been a strong proponent of this for the past couple of years, and I believe that we will see acceptance across the departments as we demonstrate that the centralized model works.
"I've heard from researchers, 'We don't know if we can entrust our research to you.' What they need is confidence that the central investment in HPC and the quality of service operations are going to be there." -Jim Bottum, Clemson University
CT: If faculty researchers are "buying in to" the nodes in the condo clusters, are they actually using grant money to do so?
Neeman: Sometimes I see grant money, but more often the money researchers use to buy in to the condo cluster comes from startup funds provided to a new faculty member by the university. So, for example, when a department is hiring a new faculty member, it will negotiate a dollar amount of startup funds ($15,000, $50,000, $500,000) that a candidate will receive for research when he or she starts at the university. Some new hires choose to spend some of that money on HPC hardware.
Jennings: For the most part, that's the way it is here at UT, as well. And when our faculty members are writing grant proposals, we try to encourage them to consider computing resources as part of the grant. But that doesn't necessarily mean the resources will get funded. Sometimes, the grant providers think that's just part of the overall infrastructure that should already be in place at the university. Then, of course, it's tougher to get the funding for those resources.
CT: Do you find that faculty members resist committing their funds to a centralized HPC resource; that they don't want to give up their siloed projects?
Neeman: I think most institutions have one or more faculty members saying, "It's my money, I'm going to spend it on what I want, and I don't want anything to do with the centralized resource." The question is the extent to which that goes on. At OU, there are fewer than a half-dozen third-party clusters on campus, all quite small. And in practice, it would be very labor-intensive to port each of those four-hundred-some users to the centralized resource. There's a G5 cluster on campus, a Power5 resource, and some AMD resources, altogether running two hundred different codes. We'd have to migrate folks from resource to resource by hand. Typically, the people using those smaller resources aren't paying for a commercial scheduler like LSF, which, in principle, allows two disparate machines to operate together the way condo pools do. It's not that migrating those users can't be done in theory; it's just fairly impractical.
Bottum: A lot of this is sociological. Researchers must have confidence both in the centralized resource and in the operation and administration. I've heard from researchers, "We don't know if we can entrust our research to you." That "you" could be anybody: an administration; a central computing operation; or a national center, NSF or otherwise. What researchers need is confidence that the central investment in HPC and the quality of service operations are going to be there.
Jennings: At UT, we've got a department willing to put some trust in our new HPC resources, so we're trying to use that to springboard into a much more widespread offering for the rest of the campus. It boils down to making sure that you're doing things right, so that the word gets out and researchers realize, "Hey, we can trust these guys to do a good job providing a fair share of resources to everybody." So far, we've been reasonably successful, and I'm looking forward to getting some other departments on board.
Bottum: There can be many, many incentives for researchers to embrace centralized HPC resources: security, 24-hour operations, and support, not to mention getting a better deal on hardware. Last January, one of our faculty members bought an HPC system for his own research, and whichever way you measure it, he got a much better deal when later on he bought into the condo cluster. That's the carrot, but then some institutions also end up using the stick: Last summer, the CFO at one university struggling with financial issues shut off the air conditioning in a bunch of campus buildings. That flushed a lot of siloed projects out of the woodwork: All of a sudden you had individual supercomputers and servers overheating, machines the administration didn't know about beforehand. If the researchers involved had utilized the off-campus data center, their projects would not have been affected.
CT: Do you know of any universities that have mandated all supercomputing must go through a central HPC resource?
Neeman: I imagine there are one or two out there, but I haven't actually heard of any. At most institutions, it would be very difficult to make that mandate stick, because faculty don't react well to having things dictated to them. A model that works pretty well is to say, "If you're going to go your own way, then you're going to do it on your own." That is, the central administration does not support or provide resources to someone who chooses to use a third-party system, but that individual is free to do so.
Ragghianti: It's really quite a slippery slope to try to mandate use of the centralized resource. In my experience, faculty are determined to be as flexible as possible with their research-and for a lot of people, that means they have to hold on to their own machines and be able to touch them whenever they like. A lot of these small third-party clusters actually started as desktop machines, and then sort of morphed and grew into a few racks in a makeshift machine room.
CT: Do you have any final tips for institutions looking to get centralized HPC initiatives funded and off the ground?
Neeman: An administrator may be the right person to make a decision about how to spend HPC funding, but the advocacy to drive the search for the money in the first place has to come from the people who are going to use the supercomputing resources and, in practice, that means the faculty and their deans. If you have a great many faculty coming together to say that having a centralized supercomputing resource is crucial to their success, that substantially raises the probability of getting more dollars for the central HPC initiative, particularly if, among that group of supporters, there are a few heroes known to the administration as big money-getters. Frankly, if you don't have the faculty advocating for it, it's not going to happen.
Ragghianti: In my eyes, the big benefit of our efforts here at UT has been that when researchers prepare their grant proposals, they are able to point to a tested and well-used centralized system, and this lends an air of legitimacy and lets the granting organization know that the funds will be used efficiently.
CT: So, it's the notion of "spending money to make money"? In other words, if a university has centralized high-performance resources in place, that helps bring in more grant money for researchers?
Neeman: I think that's an excellent point: HPC is an investment, not a cost. But that's a tough argument to sell, so we're always collecting data to support it. In the end, HPC will pay off in terms of return on investment; that is, more external dollars will come in for our researchers-far more external dollars than would have come in without the supercomputing resources, and far more than what's initially been invested in the centralized resource.
High-Performance Happy
More and more universities are now centralizing their high-performance computing resources, benefiting not only IT departments but the researchers, too.