A Data Commons for Scientific Discovery

The Open Cloud Consortium is working to meet the collaboration and data-management needs of multi-institution big data projects.

Research teams with big-data needs who want to collaborate often have difficulty finding venues to host their work. While the very largest projects, such as the Large Hadron Collider and Large Synoptic Survey Telescope, have the resources to build their own infrastructure, and individual researchers with smaller projects can outsource to providers like Amazon Web Services, those in the middle range have fewer options.

So in 2008, a group of researchers came together to form the nonprofit Open Cloud Consortium (OCC), a shared cloud-computing infrastructure for medium-size, multi-institution big-data projects. The OCC has grown to include 10 universities, 15 companies and five government agencies and national laboratories. In a recent interview with Campus Technology, OCC Director Robert Grossman discussed the organization's relationship to research universities' IT departments, as well as its business model and sustainability challenges.

Sharing Data

"We started before the current interest at NSF and other funding agencies in big data and data science," said Grossman, who is a professor in the division of biological sciences at the University of Chicago (IL). "There just wasn't an interest in data-intensive science or big data or supporting data repositories at scale. Rather than wait for NSF to become interested in this, we decided to do it on our own." The initial participating universities were Northwestern (IL), the University of Illinois at Chicago, Johns Hopkins (MD) and the University of California, San Diego, he said. "We set up a distributed cloud with a number of scientific data sets, which was the first version of the Open Science Data Cloud."

Even with the increased focus on big data and data science these days, there are still not many large-scale science clouds, he noted. "There are not many that are independent and multi-institutional like we are. It is kind of interesting that even after all this time we are still unique, which I find kind of surprising."

OCC is based on the idea of a "data commons," which Grossman described as a collection of scientific data either within a discipline or across disciplines. "The idea is that co-locating compute over that data allows for discovery that might not be possible if you were just looking at your own data set," he said. One of the motivations for creating OCC was to make it easier to create commons of data that would support discovery.

"Part of what we are trying to do with the Open Science Data Cloud is make it easier for people to publish their data sets," Grossman said, and by and large most universities aren't able to do that. "Only a handful of them are set up for that, because there are real costs involved," he explained. In addition, Grossman said, it became clear at the outset that if the OCC organization were owned by any one university it would be difficult for the other universities to participate, "so we set it up as an independent 501(c)(3) to make it easier for us to be a neutral player across organizations."

How It Works

In line with the concept of a commons, each OCC project is managed and governed by a working group that sets up the rules for that project in a collaborative way.

  • The OCC Open Science Data Cloud Working Group manages and operates the Open Science Data Cloud (OSDC), a petabyte-scale science cloud for researchers to manage, analyze and share their large data sets. It is one of the largest general-purpose science clouds in the world.
  • The OCC Biomedical Commons Cloud Working Group is developing an open cloud-based infrastructure for sharing medical and healthcare data in a secure and compliant fashion to support biomedical research.
  • Project Matsu is a collaboration between the NASA Goddard Space Flight Center and the Open Cloud Consortium to develop open source technology for cloud-based processing of satellite imagery to support the earth science research community as well as human-assisted disaster relief. This working group develops and operates the OCC Matsu Cloud.

The basic business model has three parts: Some of the larger projects earmark a portion of their grant funding for computing infrastructure services; OCC sometimes receives block grants or donations of equipment and awards core hours to meritorious projects through an allocation committee; and other projects simply pay as they go.

OCC in Action

Maria Patterson, scientific data lead for the OSDC at the University of Chicago, also works on Project Matsu, for which the OCC hosts earth satellite imagery data in its cloud. Project Matsu provides a platform for NASA scientists to process the satellite data, and OCC also makes the data available to the public. "We get the data 24 hours after it is observed by the satellite," Patterson explained. "We do in-house processing on it to come up with new algorithms. We can slot in new analytics we want to apply and run over all the data. For instance, we could look at every pixel and classify it as vegetation, water, cloud or dry land." One practical use is helping Namibia identify flood-prone regions to help with flood and drought risk-management efforts.
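
The per-pixel classification Patterson describes can be sketched in a few lines. The following is a minimal illustration, not Project Matsu's actual analytics: it assumes the scene's red band, near-infrared band and overall brightness are already loaded as NumPy arrays, and the classify_pixels function and its thresholds are hypothetical.

    # Minimal sketch of per-pixel classification of a satellite scene
    # (illustrative only; not the actual Project Matsu pipeline).
    import numpy as np

    def classify_pixels(red, nir, brightness):
        """Label each pixel as vegetation, water, cloud, or dry land."""
        # Normalized difference vegetation index from red/near-infrared bands
        ndvi = (nir - red) / np.clip(nir + red, 1e-6, None)
        labels = np.full(red.shape, "dry land", dtype=object)
        labels[ndvi > 0.3] = "vegetation"    # strong NDVI suggests plants
        labels[ndvi < 0.0] = "water"         # negative NDVI suggests water
        labels[brightness > 0.8] = "cloud"   # very bright pixels read as cloud
        return labels

    # Example with random reflectance values standing in for a real scene:
    rng = np.random.default_rng(0)
    red, nir, bright = (rng.random((512, 512)) for _ in range(3))
    print(np.unique(classify_pixels(red, nir, bright), return_counts=True))

In a real workflow the thresholds would be tuned per sensor and scene, but the pattern is the same: a new analytic is just another function run over every pixel of every image in the cloud.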

OCC also runs the Bionimbus Protected Data Cloud, which hosts human genomic data in a secure, regulatory-compliant way and gives researchers access to work with that data.

The OCC Web site gives other examples of researchers using its resources. For example, OSDC supports the "Bookworm" project of Harvard University's (MA) Cultural Observatory, offering a way to interact with digitized book content and full text search. Ben Schmidt, an assistant professor of history at Northeastern University and former graduate fellow at the Cultural Observatory, is quoted as saying that Harvard uses the OSDC to process and provide fast, structured access to the data from huge digital libraries: "Currently that means a public-facing visualization at arxiv.culturomics.org that scientists can use to explore that set, and a few databases under development including millions of newspaper pages, historical journal articles, as well as alternate routes for exploring the multi-dimensional data made possible by indexing the data at a fine-grained level and constructing queries after the fact."

Sustainability

Grossman said that although OCC is filling a valuable niche for science researchers and is growing, the sustainability model going forward is still an open question.

"We are operating at quite a large scale and there is still not the appetite in the community to pay for data and pay for usage," he said. "Especially for the larger projects, it can be quite expensive to do this. I tend to think of OSDC as an instrument. You have microscopes for small objects and telescopes for far objects; I think of the OSDC as a ‘datascope' for working with large data and making discoveries. The question for the community is how is that funded over time? Eventually we will probably move to a model in which grants have components of their funding that they can spend at science clouds. But we are just beginning to move to that model."

OCC's evolving role may be of keen interest to campus IT leadership, Grossman said. A campus that does not want to stand up and maintain its own dedicated science cloud for providing access to data can instead buy in to a third-party infrastructure for a period of time. "And as a not-for-profit, we are one of the options out there for universities that want to participate but that don't necessarily have their own infrastructure for sharing data. This is something universities are thinking about and it is an option we provide."

He said one of the most important services OCC can provide is a way to easily export data from the OSDC to another cloud. It provides a metadata service, and an upload and download service that works over the high-performance networks that most research institutions have. "You have to be able to get your data out and into other science clouds," Grossman said. "A lot of our focus going forward is going to be making it easier for large-scale clouds to work together."
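
As a rough illustration of what such an export might look like from a user's side, here is a hypothetical sketch: it asks a metadata service for a data set's file listing and then downloads each file. The URL, JSON fields and export_dataset function are assumptions for illustration, not the OSDC's actual API.

    # Hypothetical export workflow: query a metadata service for a data set's
    # file listing, then download each file. Endpoint and fields are placeholders.
    import pathlib
    import requests

    METADATA_URL = "https://example.org/api/datasets/{dataset_id}/files"  # placeholder

    def export_dataset(dataset_id, dest="export"):
        # Fetch the file listing; each entry is assumed to look like
        # {"name": "...", "url": "..."}.
        files = requests.get(METADATA_URL.format(dataset_id=dataset_id), timeout=30).json()
        out = pathlib.Path(dest)
        out.mkdir(exist_ok=True)
        for f in files:
            data = requests.get(f["url"], timeout=300).content
            (out / f["name"]).write_bytes(data)

    # export_dataset("my-dataset-id")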

Other Cloud Consortia

Other organizations are working on developing big-data cloud consortia. For instance, the Massachusetts Open Cloud (MOC) is the result of a collaboration between the state, local universities and industry. According to its Web site, the idea behind the MOC is to enable multiple entities to provide (rather than just consume) computing resources and services on a level playing field. Companies, researchers and innovators will be able to make hardware or software resources available to a large community of users through an Open Cloud Exchange. When the MOC was introduced in 2014, Gloria Waters, vice president and associate provost for research at Boston University, said it would be the first public open cloud platform designed to spur collaboration, drive innovation and economic development in Massachusetts. She said it would "catalyze major research projects, and serve as a new model in cloud computing and big data. Ultimately, the Massachusetts Open Cloud aims to become an invaluable, self-sustaining R&D resource for the Commonwealth."
