The Data-Storage Crisis
With the amount of data predicted to grow 800 percent by 2016, higher ed faces a desperate race to develop strategies to store and manage the tidal wave of information.
By Barbara Ravage
Illustration by Shaw Nielsen
Colleges and universities are running out of closet space. In 2011, Gartner predicted that the volume of higher ed data would grow 800 percent over the next five years, making auditing, archiving, and recovery increasingly complex. Unfortunately, many IT departments, particularly those in the public sector, have flatlining budgets--and no money to build additional closets.
In the eyes of many, it's a crisis that has the potential to become a strategic disaster. As far as Gartner is concerned, time is already running out: The research company identified the critical time frame for action as 2010-2013. Regina Kunkle, VP of state and local education for NetApp, shares Gartner's sense of urgency. "Data storage is 40-50 percent of your budget," she warns, "and growing anywhere from 50-100 percent a year."
What is all of this data and why is it so important? That's exactly what Lehigh University (PA) decided to find out in 2011. "We developed a charge to better understand storage requirements, campus practices, and customer needs in order to make strategic storage investments and improve storage service to the campus community," explains James Young, director for administration and planning in the Office of Library and Technology Services.
Young's team identified four factors driving the growing demand for data storage:
- A push to enhance campus research
- A shift in library needs from physical storage to storage-intensive digital projects and a new institutional repository
- New federal grant requirements for longer retention of data
- Increasingly complex storage and backup needs on the part of users, including media content, transactional data, social media, and research data
The factors identified at Lehigh are echoed at institutions across the country. The challenge lies in how to meet those needs efficiently and cost-effectively. While institutions now have a wide array of storage options from which to choose--including cloud-based operators--each solution comes with its own pros and cons. Indeed, many schools may find that no single storage solution meets all their needs. One thing is clear, though: IT departments and their constituents will have to find the money somewhere to tackle the challenge.
For many researchers, cost is definitely a driving factor. Typically, the fees needed for data storage must come out of their grant funds, a resource that is not only finite but comes with a definitive end date. The Oklahoma PetaStore has been structured to accommodate this reality. Established by the University of Oklahoma's Supercomputing Center for Education & Research (OSCER) with a grant from the National Science Foundation, the PetaStore employs a novel business model for archival storage. University-affiliated researchers are charged only for storage media--disk drives or tape cartridges--while the university covers the cost of space, power, cooling, maintenance, hardware, and software.
Because users are not charged for the length of time their data is stored, the PetaStore avoids a problem inherent in cloud and other fee-based storage solutions: Time-based charges keep accruing for as long as the data must be kept, which is especially problematic now that federal grant regulations have extended the time period for data retention.
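The difference between the two billing models can be sketched with a little arithmetic. The Python snippet below compares a recurring per-gigabyte fee with a one-time media purchase when the retention period outlasts the grant. Every price, size, and duration in it is a hypothetical placeholder chosen for illustration, not a figure from the PetaStore or any cloud provider, and the function names are invented for this sketch.

```python
import math


def recurring_cost(size_gb, fee_per_gb_month, months):
    """Total paid when storage is billed per gigabyte per month."""
    return size_gb * fee_per_gb_month * months


def cost_after_grant(size_gb, fee_per_gb_month, grant_months, retention_months):
    """Part of the recurring bill that comes due after grant funding ends."""
    unfunded_months = max(0, retention_months - grant_months)
    return recurring_cost(size_gb, fee_per_gb_month, unfunded_months)


def media_cost(size_gb, media_capacity_gb, price_per_unit, copies=1):
    """One-time cost of enough cartridges or drives to hold every copy."""
    units_per_copy = math.ceil(size_gb / media_capacity_gb)
    return units_per_copy * copies * price_per_unit


# Hypothetical scenario: a 50 TB data set, a 3-year grant, 10-year retention.
SIZE_GB = 50_000
unfunded = cost_after_grant(SIZE_GB, fee_per_gb_month=0.02,
                            grant_months=36, retention_months=120)
one_time = media_cost(SIZE_GB, media_capacity_gb=1_500,
                      price_per_unit=40, copies=2)

print(f"Recurring fees due after the grant ends: ${unfunded:,.2f}")
print(f"One-time media purchase (two copies):    ${one_time:,.2f}")
```

The point of the sketch is the shape of the two costs rather than the specific totals: the recurring bill scales with how long the data must be kept, while the media purchase is paid once, while grant money is still available.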
"If you have a mechanism of storing data that requires a monthly or annual payment--and you're paying by the gigabyte--at the end of the project you don't have any more money," says Henry Neeman, executive director of research computing and services as well as director of OSCER. "You've either spent it, or it went back to the funding agency, but you've got an obligation to retain the data on your own dime. For large-scale data sets, which are increasingly common, that's simply not practical."
But there's another reason why schools might want to invest in a homegrown solution for research storage: Increasingly, both public and private funders are taking data-storage infrastructure into consideration when awarding grants--the better the infrastructure, the greater the likelihood a grant will be awarded. And major institutions in higher education are taking notice. "Thought-leading universities are building storage infrastructure as a service for their researchers in anticipation of grant requirements," notes Kunkle.
Archived and Library Materials
The Oklahoma PetaStore is designed for large-scale research storage, where data sets are parked for the long term and not accessed frequently. A different approach is needed for library materials that are used constantly or, in the case of rare collections, require specialized storage considerations.
It's an issue that the University of Virginia has wrestled with during the development of an institution-wide data-management plan. For its institutional repository and rare collections, the school currently employs both archival storage (for long-term preservation of documents and library holdings) and enterprise-class storage (a network-based system used for overall storage), with data centers at two separate sites.
"We have a repository sitting on top of enterprise-class storage," explains Robin Ruggaber, director of the online library environment at the University of Virginia Libraries. "We have people accessing data very quickly, and that works very well for us." Although the setup serves the day-to-day needs of students and researchers, Ruggaber finds it wanting in other ways. "It doesn't give us the level of preservation and the diversity of geographic location we need," she notes, "and it costs more than cloud storage."
As a result, UVa has recently begun working with DuraCloud to extend its preservation strategy into the cloud. DuraCloud is an open source software-as-a-service (SaaS) solution developed by DuraSpace to facilitate storage of archival material. In the academic sphere, its primary users are libraries and other repositories of cultural-heritage material.
Institutions using DuraCloud may choose to store data with one or more of its cloud storage partners--Amazon Web Services, Rackspace, or the San Diego Supercomputer Center at the University of California, San Diego. According to Michele Kimpton, CEO of DuraSpace, DuraCloud provides a layer of preservation-based services, including replication and synchronization of content across all storage options.
"It gives institutions an easy pathway to using a blend of commercial and academic cloud storage without taking on all the risk," she notes. "Because we're not-for-profit in the academic space, we're working on their behalf to make sure that their content is safe, that we provide health checks for the content, and that they don't get locked in to a single player."
Although DuraCloud's fee structure varies depending on service levels and storage amounts, UVa is willing to shoulder the cost for peace of mind. "We think some types of content--both within our institutional repository for scholarly work and also for our rarest collections--require having copies also stored in a different manner to get diversity in terms of architecture," notes Ruggaber.
In considering cloud-based storage solutions, institutions must weigh two other factors: security and latency. From UVa's perspective, security is not a major concern, says Ruggaber, because DuraCloud uses Shibboleth authentication, "which allows institutions to get their content and only their content." More important, the university is careful about the types of materials it entrusts to the cloud. "We are not putting sensitive data into the cloud at this time," asserts Ruggaber, "nor do we intend to."
Latency is an issue, however. "With local storage and a dedicated network path between server and storage, you can move data very quickly," explains Ruggaber. "When you're going to cloud storage, which we're testing on Amazon now, you have to go through the commodity network, and that slows down things drastically."
A potential solution that addresses both security and latency concerns is a private cloud. It's an approach that works well for the Virginia Community College System, which maintains a SaaS cloud for its 23 colleges and 40 campuses, including a Blackboard e-Education component and Oracle's PeopleSoft Campus Solutions.
But not all private clouds are created equal. Matt Lawson, director of enterprise services for VCCS, cautions schools to avoid what he calls "dumb storage"--the tech equivalent of that overstuffed closet at home where you can never find anything. This was a problem faced by VCCS, Lawson says, compounded by the fact that there was no interoperability between the system's two data centers. "There was no added-value technology on top," he explains, "so there was no good management interface or management software."
That all changed six years ago when VCCS implemented data-storage solutions from NetApp, replacing tape backups and Fibre Channel SAN-based protocols. The main data center in Richmond now runs a FAS6000 series enterprise-class storage solution, while the disaster-recovery center in Roanoke uses a FAS3070 midtier storage system (NetApp's latest midtier systems are the FAS3100 and FAS3200). Interoperability is no longer a problem.
What particularly appeals to Lawson are the space-saving technologies offered by NetApp, including data compression, de-duplication, virtualization, and thin provisioning. "We're managing probably 100 times more storage with less IT staff than before we went to NetApp," he notes. Although three engineers are trained to handle the system, Lawson estimates that storage management takes up less than one FTE.
The Appeal of Consortia
Private clouds are not necessarily limited to single institutions. Given the budget constraints afflicting higher education, many schools are looking to consortia that can provide economies of scale, as well as the promise of secure, accessible, and expandable storage. That's what drives the Digital Preservation Network (DPN), a federation of 55 institutions that was "created by research-intensive universities to ensure long-term preservation of the complete digital scholarly record," according to its website. Project Hydra is another multi-institutional collaboration that seeks a collective solution to storage issues that will eclipse the capability of any single university working on its own.
UVa is active in both Project Hydra and DPN, as well as a third solution--the Academic Preservation Trust--in partnership with DuraSpace. A consortium of 13 institutions, the APTrust will play a dual role: At the local level, it will be an archival environment for participating members, including disaster-recovery services; at the national level, it will function as a replicating node of DPN.
In Ruggaber's view, the involvement of DuraSpace in the APTrust initiative has several benefits, not least the fact that its software is open source, flexible, and built to serve the needs of higher ed. "[DuraSpace] has already solved the multi-tenancy problem: giving each of our partner institutions individualized space where their data is segmented," she notes. Furthermore, the bandwidth advantages of Internet2 should go far toward addressing the latency issues that have worried Ruggaber.
Groundwork for DPN and the APTrust started in 2011. "We're testing right now with UVa and the University of North Carolina," says Ruggaber. "We expect to be in full production by December 2013." Just in time to meet the Gartner deadline for data-storage readiness.
How many schools can say the same?
Tape or Disk?
The choice of storage medium often elicits strong responses from IT folks. But both tape and disks have a role to play, depending on your school's needs. In many cases, a combination may be the best strategy.
Without doubt, tape is slower than disks, and a tape library comes with significant fixed costs. But, depending on your circumstances, these may not be factors. Consider the Oklahoma PetaStore at the University of Oklahoma, for example. The PetaStore is intended as a long-term parking lot for large-scale research data, which is an ideal use for tape. "The general rule is, 'Write once, read seldom,'" says Henry Neeman, executive director of research computing and services as well as director of the Supercomputing Center for Education & Research. "The files are not constantly being moved on and off, as with a disk system."
Plus, because individual researchers pay only for the media, tape is a much cheaper option than disks. "One of the key advantages of tape in our business model is that the fixed costs--hardware, software, and maintenance--are handled by the grant and by the university," explains Neeman.
Finally, there's an energy consideration. "A rack of disk drives that's reasonably active draws as much power as a rack of supercomputer servers," says Neeman. "But a rack of tape draws almost no power at all, so there's a green value as well."
Make no mistake, the PetaStore is not wedded to tape. Indeed, it goes out of its way to give researchers as many options for data storage and retrieval as possible. Users can choose storage media and duplication options on a file-by-file basis: one copy on disk only; one copy on disk and one on tape; one copy on tape only; or two copies on tape.
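The four per-file choices described above map naturally onto a small data structure. The following Python sketch is purely illustrative: the names `CopyPolicy` and `media_needed_gb` are invented here and do not come from the PetaStore's actual software, and the file sizes are made up. It simply shows how the disk and tape capacity a researcher would need to buy follows from the policy chosen for each file.

```python
from enum import Enum


class CopyPolicy(Enum):
    """Per-file duplication choices, written as (disk copies, tape copies)."""
    DISK_ONLY = (1, 0)
    DISK_AND_TAPE = (1, 1)
    TAPE_ONLY = (0, 1)
    TWO_ON_TAPE = (0, 2)

    def __init__(self, disk_copies, tape_copies):
        self.disk_copies = disk_copies
        self.tape_copies = tape_copies


def media_needed_gb(files):
    """Sum the disk and tape capacity required for (size_gb, policy) pairs."""
    totals = {"disk": 0, "tape": 0}
    for size_gb, policy in files:
        totals["disk"] += size_gb * policy.disk_copies
        totals["tape"] += size_gb * policy.tape_copies
    return totals


# Hypothetical file list: sizes in gigabytes, one policy per file.
example_files = [
    (100, CopyPolicy.DISK_AND_TAPE),
    (500, CopyPolicy.TWO_ON_TAPE),
    (20, CopyPolicy.DISK_ONLY),
]
needs = media_needed_gb(example_files)
print(f"Disk needed: {needs['disk']} GB, tape needed: {needs['tape']} GB")
```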
On the other hand, Matt Lawson, director of enterprise services for the Virginia Community College System, considers tape something of a four-letter word. Back in the bad old days before VCCS adopted NetApp to power its private cloud, the college system relied on tape backups. Lawson recalls one particular tape restore that took about two weeks. "For a while the engineers thought we weren't going to be able to do the recovery at all," he says. "Now we can do a multi-terabyte recovery in minutes using NetApp's Snapshot."
The ability to recover data quickly is critical, and schools should make disaster recovery a central part of their storage strategy. Unfortunately, that is not the case today. "Surprisingly, higher ed is not as far along that path as you would think," says Regina Kunkle, VP of state and local education for NetApp, who believes disaster recovery is the number one challenge facing higher education in the storage arena. It's not simply a matter of having redundant data centers or backup in the cloud, she notes. Meeting recovery-point and recovery-time objectives often depends on the media--tape or disk--and the methods chosen for storage.
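To see why media and method matter, it helps to spell out what the two objectives measure. A recovery-point objective (RPO) caps how much recent work can be lost, which in practice is bounded by the interval between backups or snapshots; a recovery-time objective (RTO) caps how long a restore may take. The Python sketch below checks a plan against both. It is a simplified model under stated assumptions: the class and function names are invented for illustration, restore time is treated as a fixed overhead plus data size divided by throughput, and all of the numbers are hypothetical rather than measurements from any vendor or campus.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    backup_interval_hours: float         # time between backups or snapshots
    restore_rate_gb_per_hour: float      # sustained restore throughput
    restore_overhead_hours: float = 0.0  # e.g. locating and mounting media


def worst_case_data_loss_hours(plan):
    """Work lost if failure strikes just before the next backup runs."""
    return plan.backup_interval_hours


def estimated_restore_hours(plan, data_gb):
    """Fixed overhead plus transfer time at the plan's restore rate."""
    return plan.restore_overhead_hours + data_gb / plan.restore_rate_gb_per_hour


def meets_objectives(plan, data_gb, rpo_hours, rto_hours):
    """True only if both the recovery-point and recovery-time goals are met."""
    return (worst_case_data_loss_hours(plan) <= rpo_hours
            and estimated_restore_hours(plan, data_gb) <= rto_hours)


# Hypothetical plans: nightly tape backups versus hourly disk snapshots.
nightly_tape = RecoveryPlan(backup_interval_hours=24,
                            restore_rate_gb_per_hour=100,
                            restore_overhead_hours=4)
hourly_snapshots = RecoveryPlan(backup_interval_hours=1,
                                restore_rate_gb_per_hour=10_000,
                                restore_overhead_hours=0.1)

for name, plan in [("tape", nightly_tape), ("snapshots", hourly_snapshots)]:
    ok = meets_objectives(plan, data_gb=2_000, rpo_hours=4, rto_hours=1)
    print(f"{name}: meets a 4-hour RPO and 1-hour RTO? {ok}")
```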