Cornell to Lead NSF-Funded Cloud Federation for Big Data Analysis

The National Science Foundation continues to take steps to stimulate the nation's big data infrastructure and capabilities in partnership with research universities. The same week it announced the award of grants totaling more than $5 million to establish four regional hubs for data science innovation, NSF also sponsored a five-year, $5 million project led by Cornell University (NY) to design a federated cloud to support scientists and engineers requiring flexible workflows and analysis tools for large-scale data sets.

Known as the Aristotle Cloud Federation, the federated cloud will comprise data infrastructure building blocks (DIBBs) deployed at Cornell, the University at Buffalo and the University of California, Santa Barbara, shared by seven science teams with more than 40 global collaborators. (The project name was chosen because Aristotle's concept "the whole is greater than the sum of its parts" reflects the multi-institutional synergy and collaborations that the federation aspires to create.)

David Lifka, director of the Cornell Center for Advanced Computing, said Aristotle aims to develop a federated cloud model that encourages and rewards institutions for sharing large-scale data analysis resources that can be expanded internally with common, incremental building blocks and externally through collaborations with other institutions, commercial clouds and NSF cloud resources.

Aristotle's initial uses — earth and atmospheric sciences, finance, chemistry, astronomy, civil engineering, genomics and food science — were chosen to demonstrate the value of sharing resources and data across institutional boundaries. One goal is to accelerate the time it takes researchers to obtain scientific results.

As an example, geospatial data such as earth observations and climate simulations are scattered around the world within the data archives of researchers, government and the private sector. Varun Chandola, a computer science and engineering researcher at the University at Buffalo, is working with colleagues at NASA Ames, Oak Ridge National Laboratory and several universities on streamlining the integrated visualization and analysis of geo-data. He said they plan to use Aristotle to develop a cloud-based solution that allows researchers to seamlessly integrate heterogeneous geo-data from a variety of sources into a cloud-based analysis engine.

The elasticity provided by sharing resources means researchers don't have to wait for local resources to become available to get their project started. The Aristotle Cloud Federation plans to gather metrics provided by UB's XDMoD (XD Metrics on Demand) and UCSB's QBETS (Queue Bounds Estimation Time Series) to inform the decisions of researchers and administrators about when to use federated resources outside their own institutions.

"Efficient use of federated clouds requires the ability to make predictions about where a workload will run best," explained Rich Wolski, professor of Computer Science at UCSB, in a statement. "Using XDMoD data and cloud-embedded performance monitors, QBETS will make it possible to predict the effects of federated work-sharing policies on user experience, both in the DIBBs cloud and in the Amazon Web Services Cloud."

Using a new allocation and accounting model, administrators will be able to track utilization across federated sites and use this data as an exchange mechanism between partner sites.
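One way to picture such an exchange mechanism is as a simple ledger of core-hours consumed across sites, with the net balance between any two partners serving as the "currency." The sketch below is purely illustrative — the article does not describe the federation's actual accounting model, and all names and figures here are invented.

```python
# Hypothetical sketch of cross-site utilization accounting: each site
# records core-hours consumed by users from partner sites, and the net
# balance acts as a simple exchange mechanism between partners.

from collections import defaultdict

class FederationLedger:
    def __init__(self):
        # (provider_site, consumer_site) -> total core-hours consumed
        self.usage = defaultdict(float)

    def record(self, provider, consumer, core_hours):
        """Log that `consumer`'s users ran `core_hours` at `provider`."""
        self.usage[(provider, consumer)] += core_hours

    def balance(self, site_a, site_b):
        """Net core-hours site_b owes site_a (negative if reversed)."""
        return self.usage[(site_a, site_b)] - self.usage[(site_b, site_a)]

ledger = FederationLedger()
ledger.record("Cornell", "UB", 500)     # UB users ran 500 core-hours at Cornell
ledger.record("UB", "Cornell", 320)     # Cornell users ran 320 at UB
print(ledger.balance("Cornell", "UB"))  # 180.0 -- UB's net debt to Cornell
```

A real model would also need to weight heterogeneous resources (GPU vs. CPU hours, storage) into a common unit before balances become comparable.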

Regional Big Data Hubs

The NSF-funded regional hubs for data science innovation are being coordinated by data scientists at Columbia University (Northeast Hub); Georgia Institute of Technology and the University of North Carolina (South Hub); the University of Illinois at Urbana-Champaign (Midwest Hub); and the University of California, San Diego, UC Berkeley and the University of Washington (West Hub).
Building upon the National Big Data Research and Development Initiative announced in 2012, the awards are made through the Big Data Regional Innovation Hubs program, which creates a new framework for multi-sector collaborations among academia, industry and government. The "big data brain trust" assembled by the hubs will conceive, plan and support regional big data partnerships and activities to address regional challenges.

Other Cloud Consortia

Other organizations are working on developing big data cloud consortia. The nonprofit Open Cloud Consortium (OCC) operates a shared cloud computing infrastructure for medium-size, multi-institution big-data projects. The OCC has grown to include 10 universities, 15 companies and five government agencies and national laboratories.

"We started before the current interest at NSF and other funding agencies in big data and data science," OCC Director Robert Grossman told Campus Technology in April 2015. Grossman, who is a professor in the division of biological sciences at the University of Chicago (IL), said, "There just wasn't an interest in data-intensive science or big data or supporting data repositories at scale. Rather than wait for NSF to become interested in this, we decided to do it on our own." The initial participating universities were Northwestern (IL), the University of Illinois at Chicago, Johns Hopkins (MD) and the University of California, San Diego, he said. "We set up a distributed cloud with a number of scientific data sets, which was the first version of the Open Science Data Cloud."

Another example is the Massachusetts Open Cloud (MOC), the result of a collaboration between the commonwealth, local universities and industry. According to its Web site, the idea behind the MOC is to enable multiple entities to provide (rather than just consume) computing resources and services on a level playing field. Companies, researchers and innovators will be able to make hardware or software resources available to a large community of users through an Open Cloud Exchange.
