Repositories

Sun, Stanford Working To Archive History

In May in San Francisco, experts from leading universities, libraries, and research institutions around the world met as part of an ongoing effort to address a pressing issue: archiving the world's history, right up to today.

The 180-plus participants, including librarians, academics, and technical storage experts, were part of the Preservation and Archiving Special Interest Group, or PASIG. It's an effort started a year ago by Sun Microsystems and Stanford University to address the immense and ongoing challenges of preserving and storing human knowledge.

It's critical to do this now, according to Art Pasquinelli, Sun's education market strategist and Sun PASIG organizer. First, Pasquinelli said, "we're finding that a lot of people are concerned about losing cultural heritage material.... Things are decaying, and we don't want to move these things any more than we have to." At the same time, he said, huge amounts of new digital content are created daily, raising urgent questions about how to manage and share massive data sets.

According to a 2007 report by research firm IDC, the amount of new digital information created, captured, or replicated will grow sixfold in just four years, and the majority of the data will be created not by businesses, but by individuals or other "non-enterprises."
Enterprises and organizations, however, will be responsible for storing, securing, and protecting 85 percent of this new digital data, IDC predicted.

The sheer volume of information this implies means that experts will have to address technical issues around very large digital repositories, including data management and architecture, along with deciding where to locate the repositories and what formats and storage media to use.

It's a huge challenge that only gets bigger as the Internet distributes torrents of new content daily. Meanwhile, old paper documents, such as books and maps, get older and more fragile and shouldn't be transported any more than necessary.

What to save, who should have access, and what digital formats should be used--since computer file formats change over time--are some of the biggest issues that PASIG is addressing in ongoing discussions and sharing of best practices around archiving and preservation techniques.

Along with Sun and Stanford, participants in Sun PASIG include The Alberta Library, Bibliothèque nationale de France, The British Library, The California Digital Library, Getty Research, The Johns Hopkins University, Oregon State University, University of Oxford, Swedish National Library, and Texas Digital Library.

The effort to share information and best practices on archiving and digital preservation is being spearheaded by Sun, which has a vested interest in storage technologies for huge amounts of data. Higher education institutions are also heavy participants, since they often have large library efforts already and the requisite expertise regarding what to save.

Technical issues around storage and preservation of such large volumes of data are hardly trivial. Questions about long-term storage of huge amounts of irreplaceable information can quickly devolve into highly technical questions about repository design, tiered storage, data management and digital asset management, open storage, data curation, immersive technology, repositories and federated archives, and Web 2.0 services.

University librarians and other experts come into the picture in helping to address issues around what to keep, Pasquinelli said. How large the audience is for a particular item is one criterion, but there are many others. "[Experts] have to make decisions about what content is really valuable and has a big audience, along with what to keep [that] is not going to be [used] all that much. Not all the data and not all the collections are equal in interest," Pasquinelli said.

Librarians work with architecture experts to decide what is most valuable, and what users will want, then tie that into the design of the storage architecture. Disks, for example, might be used for information that will be accessed most often. Tapes, which are more economical and more power-friendly, since they needn't constantly spin, might store data that will be accessed less often. Disaster recovery also enters into the picture: Software might send information to three locations, so that backup is built into the architecture.
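The policy described above can be sketched in a few lines of code. This is a hypothetical illustration only, not any real PASIG or Sun system: the site names, access threshold, and function names are all invented for the example, which simply routes frequently accessed items to disk, cold items to tape, and replicates everything to three locations.

```python
from dataclasses import dataclass

# Illustrative tiering-and-replication sketch. All names and thresholds
# here are assumptions for the example, not a real archive's policy.

SITES = ["site-a", "site-b", "site-c"]  # three locations: backup built in


@dataclass
class Item:
    name: str
    accesses_per_year: int


def choose_tier(item: Item, hot_threshold: int = 12) -> str:
    """Disk for frequently accessed data; tape (cheaper, no constantly
    spinning platters) for data accessed less often."""
    return "disk" if item.accesses_per_year >= hot_threshold else "tape"


def placement(item: Item) -> dict:
    """Combine the tier decision with replication to all three sites."""
    return {"item": item.name, "tier": choose_tier(item), "replicas": SITES}


if __name__ == "__main__":
    for it in [Item("popular-map-scans", 500), Item("rare-ledgers", 2)]:
        print(placement(it))
```

The point of separating `choose_tier` from `placement` is that the access-frequency judgment (the librarians' call) stays independent of the disaster-recovery rule (the architects' call), mirroring the division of labor described above.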

Since storage media fade, everything also must be revamped on a planned basis, just as in a major data center. "Everything constantly has to be updated," Pasquinelli said. "The people, the media, the equipment it's running on, the applications, and the data itself: They all get old."

One conclusion of participants at the May meetings: Much remains to be done to address the many issues in this evolving field.