Cornell to Lead NSF-Funded Cloud Federation for Big Data Analysis
The National Science Foundation continues to take steps to stimulate the nation's big data infrastructure and capabilities in partnership with research universities. The same week it announced the award of grants totaling more than $5 million to establish four regional hubs for data science innovation, NSF also sponsored a five-year, $5 million project led by Cornell University (NY) to design a federated cloud to support scientists and engineers requiring flexible workflows and analysis tools for large-scale data sets.
Known as the Aristotle Cloud Federation, the federated cloud will comprise data infrastructure building blocks (DIBBs) deployed at Cornell, the University at Buffalo and the University of California, Santa Barbara and shared by seven science teams with more than 40 global collaborators. (The project name was chosen because Aristotle's concept "the whole is greater than the sum of its parts" reflects the multi-institutional synergy and collaborations that the federation aspires to create.)
David Lifka, director of the Cornell Center for Advanced Computing, said Aristotle aims to develop a federated cloud model that encourages and rewards institutions for sharing large-scale data analysis resources that can be expanded internally with common, incremental building blocks and externally through collaborations with other institutions, commercial clouds and NSF cloud resources.
Aristotle's initial uses — earth and atmospheric sciences, finance, chemistry, astronomy, civil engineering, genomics and food science — were chosen to demonstrate the value of sharing resources and data across institutional boundaries. One goal is to reduce the time it takes researchers to obtain scientific results.
As an example, geospatial data such as earth observations and climate simulations are scattered around the world within the data archives of researchers, government and the private sector. Varun Chandola, a computer science and engineering researcher at the University at Buffalo, is working with colleagues at NASA Ames, Oak Ridge National Laboratory and several universities on streamlining the integrated visualization and analysis of geo-data. He said they plan to use Aristotle to develop a cloud-based solution that allows researchers to seamlessly integrate heterogeneous geo-data from a variety of sources into a cloud-based analysis engine.
The elasticity provided by sharing resources means researchers don't have to wait for local resources to become available to get their projects started. The Aristotle Cloud Federation plans to gather metrics provided by UB's XDMoD (XD Metrics on Demand) and UCSB's QBETS (Queue Bounds Estimation Time Series) to inform the decisions of researchers and administrators about when to use federated resources outside their own institutions.
"Efficient use of federated clouds requires the ability to make predictions about where a workload will run best," explained Rich Wolski, professor of Computer Science at UCSB, in a statement. "Using XDMoD data and cloud-embedded performance monitors, QBETS will make it possible to predict the effects of federated work-sharing policies on user experience, both in the DIBBs cloud and in the Amazon Web Services Cloud."
Using a new allocation and accounting model, administrators will be able to track utilization across federated sites and use this data as an exchange mechanism between partner sites.
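The idea of using cross-site utilization data as an exchange mechanism can be illustrated with a minimal sketch. All class and method names below are hypothetical assumptions for illustration; the article does not describe Aristotle's actual allocation and accounting model:

```python
from collections import defaultdict

class FederationLedger:
    """Hypothetical ledger tracking cross-site usage as exchange credits.

    Sketch only: names and units (core-hours) are assumptions, not
    Aristotle's actual accounting system.
    """

    def __init__(self):
        # usage[(consumer, provider)] = core-hours consumer ran on provider
        self.usage = defaultdict(float)

    def record(self, consumer_site, provider_site, core_hours):
        """Record that consumer_site consumed core_hours on provider_site."""
        self.usage[(consumer_site, provider_site)] += core_hours

    def balance(self, site_a, site_b):
        """Net core-hours site_a owes site_b (negative if site_b owes site_a)."""
        return self.usage[(site_a, site_b)] - self.usage[(site_b, site_a)]

ledger = FederationLedger()
ledger.record("Cornell", "UB", 120.0)   # Cornell jobs run on Buffalo nodes
ledger.record("UB", "Cornell", 80.0)    # Buffalo jobs run on Cornell nodes
print(ledger.balance("Cornell", "UB"))  # prints 40.0
```

Under this kind of scheme, a persistent imbalance between two sites would signal that one partner is a net provider, giving administrators a concrete basis for rebalancing or compensation decisions.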
Regional Big Data Hubs
The NSF-funded regional hubs for data science innovation are being coordinated by data scientists at Columbia University (Northeast Hub); Georgia Institute of Technology and the University of North Carolina (South Hub); the University of Illinois at Urbana-Champaign (Midwest Hub); and the University of California, San Diego, UC Berkeley and the University of Washington (West Hub).
Building upon the National Big Data Research and Development Initiative announced in 2012, the awards are made through the Big Data Regional Innovation Hubs program, which creates a new framework for multi-sector collaborations among academia, industry and government. The "big data brain trust" assembled by the hubs will conceive, plan and support regional big data partnerships and activities to address regional challenges.
Other Cloud Consortia
Other organizations are working on developing big data cloud consortia. The nonprofit Open Cloud Consortium (OCC) operates a shared cloud computing infrastructure for medium-size, multi-institution big data projects. The OCC has grown to include 10 universities, 15 companies and five government agencies and national laboratories.
"We started before the current interest at NSF and other funding agencies in big data and data science," OCC Director Robert Grossman told Campus Technology in April 2015. Grossman, who is a professor in the division of biological sciences at the University of Chicago (IL), said, "There just wasn't an interest in data-intensive science or big data or supporting data repositories at scale. Rather than wait for NSF to become interested in this, we decided to do it on our own." The initial participating universities were Northwestern (IL), the University of Illinois at Chicago, Johns Hopkins (MD) and the University of California, San Diego, he said. "We set up a distributed cloud with a number of scientific data sets, which was the first version of the Open Science Data Cloud."
Another example is the Massachusetts Open Cloud (MOC), the result of a collaboration between the commonwealth, local universities and industry. According to its Web site, the idea behind the MOC is to enable multiple entities to provide (rather than just consume) computing resources and services on a level playing field. Companies, researchers and innovators will be able to make hardware or software resources available to a large community of users through an Open Cloud Exchange.