High-Performance Computing | Feature
HPC: Rent or Buy?
Researchers who need to process vast amounts of data can buy an HPC cluster or rent a cloud-based solution. Increasingly, though, scholars are opting for a third, hybrid option.
Rent or buy? It's a question we ask about everything from housing to textbooks. And it's a question universities must consider when it comes to high-performance computing (HPC). With the advent of Amazon's Elastic Compute Cloud (EC2), Microsoft Windows HPC Server, Rackspace's OpenStack, and other cloud-based services, researchers now have the ability to quickly rent space and time on an HPC cluster, a collection of linked nodes that run as if they were one computer. As with buying a house or car, though, plenty of factors bear on the selection process: the scope and duration of the project; the number of people using the system; the type of work to be done; whether an individual department or central IT will buy and administer the equipment; and who's footing the bill for electricity and cooling.
With budgets under pressure across higher education, cost is always going to be a major factor. Weighing in favor of the buy option is the fact that HPC equipment has become far more affordable: Systems that cost hundreds of thousands of dollars 10 years ago are now priced in the tens of thousands. As HPC prices have decreased, though, the amount of research that requires such high computing power has dramatically increased. An in-house HPC cluster that might have sufficed a decade ago may now be swamped by the demand.
According to Dennis Gannon, director of cloud research strategy for Microsoft Research Connections, a lot of universities are trying to analyze the cost benefits of each option. To provide some answers, he attempted to compare the costs of buying a little cluster of servers--120 processors--with the expense of renting comparable space.
"I concluded that if you're running this cluster of yours 24 hours a day, seven days a week for the lifetime of the resource, it's probably cost effective compared to doing the same thing in a commercial cloud," explains Gannon. The catch is that almost nobody runs a cluster like that. A more typical scenario is that researchers use the cluster in bursts, as the research requires, and it sits idle the rest of the time. And if the cluster does run at full capacity, some work inevitably goes undone, because the system is fully booked.
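As a back-of-envelope illustration of the kind of comparison Gannon describes -- all dollar figures below are hypothetical placeholders, not numbers from his analysis or any vendor's price list -- the break-even point comes down to utilization:

```python
# Rough break-even sketch for owning vs. renting a small cluster.
# Every figure here is an illustrative assumption.

PURCHASE_PRICE = 60_000.0          # hardware cost (USD), amortized below
LIFETIME_YEARS = 3                 # assumed useful life of the cluster
POWER_COOLING_PER_YEAR = 8_000.0   # electricity + cooling (USD/year)
ADMIN_PER_YEAR = 15_000.0          # staff time to run the cluster (USD/year)

CLOUD_RATE_PER_HOUR = 6.0          # renting equivalent capacity (USD/hour)

hours_per_year = 365 * 24
owned_cost_per_year = (PURCHASE_PRICE / LIFETIME_YEARS
                       + POWER_COOLING_PER_YEAR
                       + ADMIN_PER_YEAR)

# Utilization at which renting costs the same as owning:
break_even_hours = owned_cost_per_year / CLOUD_RATE_PER_HOUR
break_even_utilization = break_even_hours / hours_per_year

print(f"Owning costs ${owned_cost_per_year:,.0f}/year")
print(f"Break-even at {break_even_hours:,.0f} cluster-hours/year "
      f"({break_even_utilization:.0%} utilization)")
```

With these made-up numbers the cluster pays for itself only above roughly 80 percent utilization -- which matches Gannon's point that near-constant use favors buying, while bursty use favors renting.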
That's exactly what happened to a team in the Department of Engineering Science at the University of Oxford (UK). Projects across the research group were becoming more and more computationally intensive. One project, by Dan Lussier, a Ph.D. candidate at the time, was particularly intensive: It used a molecular-modeling approach to study the behavior of liquid droplets. In this simulation, each atom in the system was modeled explicitly, producing a vast number of atom-to-atom interactions. Calculating these interactions would have bogged down a regular desktop machine for months, maybe a year. "That's just not feasible when you want an answer back," explains Lussier. "And getting an answer back is necessary for making adjustments."
The team considered renting HPC space from a central university resource before deciding to fund the purchase of a dedicated HPC cluster with part of its grant money. Lussier used the new cluster extensively until he reached a "pinch point" in his research when he needed to run some simulations but couldn't get time on the HPC cluster that he shared with his colleagues.
Despite such drawbacks, buying HPC clusters for in-house research remains a common scenario, because it also has some obvious benefits. Lussier's experience notwithstanding, researchers on projects with their own HPC cluster generally don't have to wait in line for time and space on the system, which means the research keeps moving forward. And they can configure the cluster to suit their needs and load the applications they choose. Plus, with the equipment right there on campus, teams don't have to worry about transporting massive chunks of data, and they can troubleshoot issues as they arise with local, dedicated IT support.
Understanding the Flavors of HPC
It's important to recognize that not all high-performance computing systems work the same way. Indeed, choosing between a cloud-based or in-house HPC solution may well depend on the kind of processing work that needs to be done. Dennis Gannon, director of cloud research strategy for the Microsoft Research Connections team, analyzed the work performed by about 90 research teams that were given access to Microsoft Azure cloud resources over the last two years. He concluded that four major architectural differences between cloud clusters and supercomputers--machines running thousands, even tens of thousands of processors--determine which types of high-performance computing should be done where:
- Data centers (or cloud networks) are made up of racks of commodity servers. They don't currently offer the GPUs (graphics processing units) that supercomputers use to accelerate simulations, or other specialized accelerators.
- Data centers communicate via internet protocols, while supercomputers communicate over high-speed "physical and data link layers" and have minimal interoperation with the internet.
- Each server in a data center hosts virtual machines, and the cloud runs a fabric scheduler, which manages sets of VMs across the servers. This means that if a VM fails, it can be started up again elsewhere. But it can be inefficient and time-consuming to deploy VMs for each server when setting up batch applications common to HPC.
- Data in data centers is stored locally, distributed across many, many disks. Supercomputers take the opposite approach: Data lives on network storage rather than on the nodes' local disks.
Given these differences, Gannon and his coresearcher Geoffrey Fox believe that clouds are good for large-data collaboration and data analytics like MapReduce (a strategy for dividing a problem into hundreds or thousands of smaller problems that are processed in parallel and then gathered, or reduced, into one answer to the original question). In contrast, large-scale simulations or computations that require individual processors to communicate with each other at a very high rate are better suited for supercomputers. However, as the data center cloud architecture is rapidly evolving, commercial clouds are beginning to offer more supercomputing capabilities.
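The MapReduce pattern described above can be sketched in a few lines of Python. This toy word count stands in for a real analytics workload; in an actual cloud deployment the map calls would run in parallel across many nodes, whereas here they run sequentially for illustration:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document: str):
    # Map: emit a (key, value) pair for every word in one document.
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    # Reduce: gather all values for each key and combine them
    # into one answer per key.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

documents = [
    "the cloud is good for data analytics",
    "the supercomputer is good for simulation",
]

# On a cluster, each map_phase call could run on a different node;
# the reduce step then gathers the scattered partial results.
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
word_counts = reduce_phase(mapped)

print(word_counts["the"])   # 2
print(word_counts["good"])  # 2
```

Because each map call touches only its own document, the problem splits cleanly across ordinary servers -- exactly the shape of work Gannon and Fox say clouds handle well.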
The Case for the Cloud
So what of the alternative? Does the cloud offer research teams a more reliable and cost-effective way to do their work? Unfortunately, it can be very difficult to compare apples to apples in this space. Amazon looks like a bargain at first glance, especially in a side-by-side comparison with two other cloud-based solutions, Penguin and SGI. But Penguin and SGI offer HPC with support, which means a dedicated staff is available to optimize performance. This kind of optimization can add up to big--but hard to quantify--savings over the lifespan of a study.
Lussier experienced the fog of cloud solutions firsthand. When he was unable to complete his research on the in-house HPC cluster, he went looking for a pay-as-you-go HPC service, but found it hard to comparison shop because each service offers slightly different options. Ultimately, he decided to go with Penguin's On Demand HPC Cloud Service.
"They spoke my language," recollects Lussier. "They were offering all of the things I wanted in terms of high-performance computing rather than a web service; they built their cluster as an HPC cluster, so between the nodes--between the boxes on a rack--they had high speed interconnect, which lowers the cost of communication between the separate processors."
In addition, Penguin set up Lussier's software, Large-Scale Atomic/Molecular Massively Parallel Simulator (LAMMPS), for him. "It's open source. It can be a bear to build. You have to make sure it's configured properly for the system it's running on," he says. "So instead of me having to learn their system, they did it so it was properly done the first time."
Lussier says he was able to start working on the Penguin cluster within days of selecting its HPC service. He felt his data was in good hands, because Penguin had a strong login and encrypted transmission between himself and the cluster. And, as an added bonus, he wasn't charged for temporary data storage or for moving data on and off the Penguin cluster.
A significant benefit of a cloud-based HPC solution like this is flexibility. If researchers have a huge job that needs to be done quickly, they can request more processors. A job that might take two weeks on an in-house system of 120 processors, for example, could be done in about a week on 250 processors instead.
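The arithmetic behind that example can be sketched directly, assuming near-linear speedup -- an idealization that real HPC jobs, with their communication overhead, rarely achieve, which is why an efficiency factor is included:

```python
def scaled_runtime(base_days: float, base_procs: int, new_procs: int,
                   efficiency: float = 1.0) -> float:
    """Estimate runtime after changing the processor count.

    Assumes the work parallelizes with the given efficiency
    (1.0 = perfectly linear speedup; real codes fall short of this).
    """
    speedup = (new_procs / base_procs) * efficiency
    return base_days / speedup

# The article's example: two weeks on 120 processors, rerun on 250.
base = 14.0  # days
print(scaled_runtime(base, 120, 250))        # ~6.7 days with perfect scaling
print(scaled_runtime(base, 120, 250, 0.85))  # ~7.9 days at 85% efficiency
```

Even at 85 percent parallel efficiency, the job finishes in roughly half the time -- the kind of burst capacity that is hard to get from a fixed in-house cluster.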
The unsung hero in high-performance computing is the network backbone that spans the US, connecting university researchers around the country and the world. Without this souped-up network, capable of delivering data at 100 gigabits per second, there could be no high-performance cloud computing and no efficient data or HPC resource sharing. In fact, without it, scientific research would be seriously hobbled.
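To put 100 gigabits per second in perspective, a rough transfer-time calculation helps; the dataset size and the slower comparison link below are illustrative assumptions:

```python
def transfer_time_seconds(size_bytes: float, link_gbps: float,
                          utilization: float = 1.0) -> float:
    # Convert the link rate from gigabits/s to bytes/s and divide.
    # Real transfers rarely saturate a link, hence the utilization factor.
    bytes_per_second = link_gbps * 1e9 / 8 * utilization
    return size_bytes / bytes_per_second

one_terabyte = 1e12  # bytes

# Moving 1 TB over a 100 Gbps backbone vs. a typical 1 Gbps campus link:
print(transfer_time_seconds(one_terabyte, 100))       # 80 seconds
print(transfer_time_seconds(one_terabyte, 1) / 3600)  # ~2.2 hours
```

A terabyte that takes hours on an ordinary campus link moves in under two minutes on the backbone -- the difference between shipping data and streaming it.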
Internet2, a consortium of more than 200 US research universities, numerous industrial firms, and national labs, is responsible for this network backbone of nearly 15,000 miles of owned fiber, with a total optical capacity of 8.8 terabits per second. According to Steve Wolff, CTO of Internet2, the consortium finished deploying a 100 gigabit network to its member institutions early this fall. Internet2 also offers services such as file-sharing facilities and end-to-end network support.
A recent Internet2 innovation is InCommon, a federated identity-management service that enables members to log on to other university networks using their home-institution credentials. This process enhances collaboration and resource sharing, which are fundamental to today's research.
Internet2 and its partners are currently launching a project to foster network innovation by providing the right campus environment. To this end, participating schools must commit to bring the 100 gigabit network to campus; install a Science DMZ, an architectural change to the campus network that improves network flows, programmability, and performance; and deploy software-defined networking, a new networking technology. The idea, says Wolff, is to foster the kind of innovation that led to Yahoo, Facebook, and Google, by providing the right tools.
An exciting new networking innovation is OpenFlow, which will automate the provisioning of network resources and help scientists and network operators obtain bandwidth on demand.
Innovations in networking support the kind of collaborative HPC and data sharing that's become the norm in scientific research. "The reality is that researchers work on a global scale," explains Rodney Wilson, senior director for external research at Ciena, which makes the fiber optics used in the Internet2 backbone. "They're collaborating with other researchers all over the world. That's why these high-performance networks are so necessary."
Follow a Hybrid Approach
After his experience, Lussier feels that the best solution is not an either-or proposition. The best value, he believes, often lies in a hybrid of in-house and external solutions. An in-house HPC cluster running at full capacity provides the best return on investment, while cloud solutions give research teams a flexible conduit for additional work or analysis that must be completed quickly.
According to Gannon, the hybrid model is already popular among commercial customers, who use cloud-based options when they need to get something done but their on-premise computing is running at capacity. The cloud HPC system "gives them as much as they need, for as long as they need it," he explains. "And they can stop paying for it when they're done."
Large-Scale Computing Through Collaboration
Pete Siegel, CIO at the University of California, Davis, is thinking about data. In particular, he's thinking about the exponentially increasing amounts of data generated by genomic sequencing, satellite surveys, seismic sensors, viticulture sensors--you name it. By around 2015, the amount of data from just one discipline, such as genetics, will equal the amount of data produced in the entire world in 2010. And he's worried about how universities are going to contend with all of it.
"No campus can build the resources that it needs in this space," explains Siegel. "The data requirements are so large that it has to be done through collaborations and aggregation."
Siegel envisions enormous regional data centers, perhaps run by big institutions like Google, Amazon, Microsoft, or the National Security Agency, which have experience managing big data, or perhaps created by consortia of universities. To sort through and analyze all this data, high-performance computing clusters would be situated near these data centers to reduce the distance that the data would have to travel.
"You would still have HPC on campus, but it would be for visualizing matter and for specialized high-performance computing and experiments on the data," posits Siegel. In fact, Siegel expects the lower cost of HPC hardware to result in multiple clusters in every lab on campus, and throughout hospitals as well.
Siegel hopes to complete a feasibility study of this model by the end of the year. "Universities that try to solve this problem on their own will fail," he warns. "It's not just about putting our heads together. It's about the efficiencies of putting all the resources in one place."
Siegel's vision of organized colonies housing exascale-level data paired with HPC clusters for analyzing the data seems almost utopian compared to the current state of affairs. At the moment, researchers anticipate losing a major data set every nine or 10 months, either because the disks on which the data is stored are lost or because data is jettisoned as researchers run out of storage space. As for infrastructure, at many schools it's currently quicker to FedEx large data sets to colleagues than to upload them via the school network.
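A back-of-envelope sketch shows why the courier can beat the campus network; the link speed, data-set size, and transit time below are illustrative assumptions, not measurements from any particular school:

```python
def network_transfer_hours(size_tb: float, link_mbps: float) -> float:
    # Time to upload a data set at a sustained rate (megabits/s).
    bits = size_tb * 1e12 * 8
    return bits / (link_mbps * 1e6) / 3600

def shipping_hours(transit_hours: float = 24.0) -> float:
    # An overnight courier delivers the payload in roughly the same
    # time regardless of how many terabytes are in the box.
    return transit_hours

size_tb = 10.0  # a modest modern research data set (assumed)

upload = network_transfer_hours(size_tb, 100)  # 100 Mbps sustained link
ship = shipping_hours()

print(f"Upload: {upload:.0f} hours, courier: {ship:.0f} hours")
```

At a sustained 100 Mbps, 10 TB takes more than nine days to upload, while the box of disks arrives the next morning -- which is exactly why fat backbones like Internet2's matter so much.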