Digital Libraries >> 'Opening' A Digital Library

Digital libraries are not new, but open source, video, and collaborative digital repositories are changing the face of library science.

If only Melvil Dui (né Melville Dewey) could see what’s become of library science today. How would the father of the Dewey Decimal system categorize Web pages, which grow at an alarming rate of nearly seven million per day? How would he organize the hundreds of thousands of video files, and hundreds of thousands of audio files or podcasts, that now supplement written words as content? Most importantly, how on Earth would Dewey, perhaps the most famous librarian of all time, manage to represent the body of any one university’s research notes, white papers, dissertations, and other assorted scholarly content in one card catalog?

Methinks the poor guy probably would go insane.

For modern-day librarians whose mission is to build collections and transmit today’s intellectual, cultural, and historical output to the future, the challenge is equally daunting. While many schools have responded with efforts to digitally scan their physical collections, a new wave of digital repositories designed to save only certain types of content is changing the face of library science everywhere. In particular, efforts at Stanford University (CA), Harvard Business School (MA), the University of California system, and the Massachusetts Institute of Technology stand out as innovations that could forever revolutionize the way we think about storing content. Good ol’ Dewey, spelling-reform eccentricities and all, would be proud.

Establishing “Cache” Flow

Unless you’ve been living in a cave, you no doubt know that librarians and technologists at Stanford made headlines in December 2004 when they unveiled a controversial plan to work with Internet search engine Google to digitize an entire collection of more than eight million volumes (the plan has since been scaled back). Yet, while the news media focused on the Stanford librarians involved in the deal, the university’s Victoria Reich was hard at work on a digital library project of her own, a revolutionary repository known as LOCKSS. The project, whose name is an acronym for “Lots of Copies Keep Stuff Safe,” revolves around open source software that provides institutions with an easy and inexpensive way to collect, store, preserve, and provide access to local copies of authorized content they have purchased from publishers. According to Reich, the project director, it just might revolutionize the way schools store electronic journals for generations.

In fact, the LOCKSS project began in 1999, when it became clear to Reich and others that scholarly journals were moving from paper to the ’Net, and libraries were quickly finding that they were not prepared to grow online content collections accordingly. The result was a tool that would allow them to build electronic journal collections easily and affordably. Today, instead of focusing on the construction of a single centralized repository as in other digital library initiatives, the LOCKSS project takes a more decentralized approach. Member schools can download the open source software and get free upgrades by following links on lockss.stanford.edu. Currently, more than 90 colleges and universities are running “LOCKSS boxes”—computers with the open source software running live. On each campus, the computers can store 3,000 years of journal articles apiece.

“In the physical world, libraries have succeeded until now because they are so loosely coupled,” says Reich, pointing to the advantages of decentralization and autonomy. “We looked at LOCKSS and realized that it made perfect sense to embrace the same model that’s worked wonderfully for hundreds of years.”

How does it work? Essentially, the LOCKSS approach facilitates a cache of data that simply never gets flushed. The technology employs a three-step process for preserving data that requires minimal human intervention beyond the initial setup. In the first step, which Reich calls “ingest,” the system crawls publisher Web sites to collect new content as it appears, and performs audit and quality control on what it finds. In the second step, known as “preservation,” the system preserves content by saving a file in its original format (as required by archivists). If migration to a more accessible format is required, LOCKSS engineers that migration, too. Finally, in a step dubbed “dissemination,” the system acts as a Web proxy, supplying content for every URL at which a particular file originally was published. In other words, the system preserves the link to the original material.
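For readers who think in code, the three steps can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the LOCKSS software: the class name PreservationCache, its method names, and the example.org URL are all invented here for illustration, and the "quality control" is reduced to rejecting empty content.

```python
class PreservationCache:
    """Toy model of a cache that is never flushed (not the LOCKSS codebase)."""

    def __init__(self):
        self._originals = {}  # URL -> bytes exactly as first collected
        self._migrated = {}   # URL -> bytes converted to a newer format

    def ingest(self, url, content):
        """Collect content from a publisher URL after a basic quality check."""
        if not content:
            raise ValueError(f"empty response for {url}")
        # setdefault keeps the first copy: nothing is overwritten or evicted.
        self._originals.setdefault(url, bytes(content))

    def migrate(self, url, converter):
        """Store a converted copy while leaving the original untouched."""
        self._migrated[url] = converter(self._originals[url])

    def disseminate(self, url):
        """Serve the preserved copy for the URL where it was first published."""
        return self._migrated.get(url, self._originals.get(url))


# Hypothetical usage: ingest once, then answer requests for the same URL.
cache = PreservationCache()
cache.ingest("http://pub.example.org/journal/article1", b"original article")
copy = cache.disseminate("http://pub.example.org/journal/article1")
```

The point of the design is in what is missing: there is no method for deleting or expiring an entry, and a second ingest of the same URL does not replace the first copy.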

Because the LOCKSS system is based on free, open source software, Reich claims it is one of the cheapest ways for institutions to tackle the issue of preserving electronic journals. Aside from standard journal subscription fees, the only additional requisite costs are the money for a computer to run the software and the fee to join the LOCKSS Alliance. Currently, membership fees vary by school size, and range from $1,080 to $10,800 (for the largest institutions). Reich says that as the program weans itself off grant funding from the National Science Foundation (www.nsf.gov) and the Andrew W. Mellon Foundation (www.mellon.org), these fees may rise slightly to cover costs of operation. Ultimately, she says, the fees are secondary: So long as the project has enough money to sustain itself, its goal is simply to make preservation cheap and reliable. As it stands now, schools are putting tens of thousands of dollars toward non-LOCKSS processes annually.

“Although libraries are one of our society’s few memory collections, it’s no secret that we have no money,” she says. “If we are building something worthwhile that we expect to help [libraries] fulfill their missions over time, it darn well better be inexpensive.”

Like Netflix, Only Better

While the LOCKSS system catalogs content from electronic journals, a new digital repository effort at the Harvard Business School focuses on another valuable medium of information: videos. The Harvard project revolves around a new system called VideoTools, an elaborate media portal through which the school’s extensive collection of video assets is automatically coded, managed, shared, and published. The VideoTools system provides video content for individual classroom delivery, special events, or course-specific compilations. What’s more, according to Larry Bouthillier, director of Educational Technology and Multimedia Development, because every video in the database is tied to a distinct URL, faculty members can easily link to videos in e-mail, lectures, and standard Web pages.

Still, the VideoTools system didn’t happen overnight. Because the business school teaches almost entirely via case studies, over the years the institution has amassed quite a library of videos to extend and amplify most lessons. In 1998, Bouthillier oversaw a strategic initiative to digitize more than 1,200 videos; by January 2004, with thousands of new videos waiting for digitization, the school was in dire need of another plan. With a mix of homegrown J2EE programming and new technologies, including Helix DNA servers from RealNetworks (www.real.com), a back-end digital asset management solution from ClearStory Systems (www.clearstorysystems.com), and encoding automation software from Virage (www.virage.com), the school set out to perform the Herculean task of rearchitecting its video library from the ground up.

“There were so many interdependencies that when we set out to upgrade the system, we realized that we had to rebuild everything all at once,” Bouthillier says, looking back on the six-figure project. “We knew this would be a big deal, but I don’t think any of us realized just how much work we’d have to put into making our video library what we wanted it to be.”

Today, once Harvard Business School technicians create an MPEG digital video file from a physical video tape, they drop it into a folder, and VideoTools does the rest: everything from populating the database to assigning each video its own locator URL. Whenever faculty members request a video, the system enforces access-control rules to confirm that the user should be allowed to view the content. Once this access is granted, the system’s Helix DNA servers stream video into classrooms at 1.2 Mbps, a quality that is virtually indistinguishable from physical DVDs. VideoTools even enables business school faculty and staff members to create personal collections of video and multimedia content online, then share those collections with others as part of a mini-portal, or portlet, off the main VideoTools site.
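The drop-folder workflow and the access check can be pictured with a short Python sketch. Everything named here is hypothetical (the VideoCatalog class, the video.example.edu prefix, the user names); it shows the general pattern of "register a file, assign a URL, check permission before serving," not how VideoTools itself is built.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

BASE_URL = "https://video.example.edu/v/"  # hypothetical locator prefix


@dataclass
class VideoCatalog:
    records: dict = field(default_factory=dict)  # video id -> locator URL
    allowed: dict = field(default_factory=dict)  # video id -> set of user names

    def register(self, dropped_file, viewers):
        """Catalog a file found in the drop folder and assign its locator URL."""
        video_id = PurePosixPath(dropped_file).stem
        self.records[video_id] = BASE_URL + video_id
        self.allowed[video_id] = set(viewers)
        return self.records[video_id]

    def request(self, user, video_id):
        """Return the locator URL only if the access rule admits this user."""
        if user not in self.allowed.get(video_id, set()):
            raise PermissionError(f"{user} may not view {video_id}")
        return self.records[video_id]


# Hypothetical usage: one dropped file, one authorized faculty member.
catalog = VideoCatalog()
locator = catalog.register("/drop/case_1042.mpg", ["prof_a"])
```

Because the URL is derived once at registration and stored, it can be pasted into e-mail or a Web page and will keep pointing at the same video.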

All of the content in the VideoTools system is searchable by common criteria such as title, event location, and date. Every video also is cross-referenced by the faculty members, courses, and cases with which it is associated most frequently. Videos in the system have searchable transcripts, making the library fertile ground for advanced searches and what Bouthillier refers to as “serendipitous discoveries.” Even the text that appears in PowerPoint presentations and other digital demonstrations becomes searchable in VideoTools. Bouthillier says that at last check, the system contained more than 7,000 video files overall—literally terabytes of content in various formats including MPEG-1, MPEG-2, and RealVideo. Much like the knowledge of the Harvard Business School students, he adds, the library is designed to grow every day.
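Searchable transcripts usually rest on some form of inverted index: a lookup table from each word to the videos in which it is spoken. The Python sketch below is one simple way to build such a table, offered as an assumption about the general technique rather than a description of VideoTools; the function names and sample transcripts are made up.

```python
from collections import defaultdict
import re


def build_index(transcripts):
    """Map each lowercase word to the set of video ids whose transcript has it."""
    index = defaultdict(set)
    for video_id, text in transcripts.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(video_id)
    return index


def search(index, query):
    """Return ids of videos whose transcripts contain every word in the query."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical transcripts keyed by video id.
index = build_index({
    "v1": "Pricing strategy for airlines",
    "v2": "Airlines and labor strategy",
    "v3": "Retail pricing",
})
matches = search(index, "pricing airlines")
```

A query narrows the result set word by word, so a search for two terms returns only the videos that mention both.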

“We’ve laid the groundwork for a system that can scale with us over time,” he says. “When you’re talking about repositories for data of any kind, that’s the ultimate goal.”

Preserving the American West

At some schools, new digital repository efforts are as much about collaboration as they are about preservation. Such is the case in the University of California system, where a number of initiatives under the auspices of the California Digital Library (CDL; www.cdlib.org) highlight a shared-services approach to preserving digital content of various kinds. Perhaps the most interesting of these initiatives is an effort dubbed the American West, a digital repository of physical and electronic information about the history of the Western US. In 2003, the CDL was awarded a three-year grant from the William and Flora Hewlett Foundation (www.hewlett.org) to assemble the collection. The project, which organizes material into 400 topics, draws upon resources from a variety of major research institutions, including Indiana University and the University of Washington, to name a few.

The repository contains information on everything from railroads to the Bonneville Power Administration and Japanese internment camps. Spearheading the endeavor is Robin Chandler, the CDL’s director of Built Content. Chandler says that instead of lifting actual files from participating institutions, the American West project harvests metadata (or relevant data about data) from other schools, and then presents links to the content files on their native servers. Because different schools create metadata in different formats, the project includes software designed to normalize these differences and present information on all material in a uniform format. Down the road, says Chandler, CDL hopes to package this information for libraries at other institutions to use to launch themed repositories of their own. “The whole idea is to extend our commitment to shared services by creating something that works for everyone,” she says.
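Normalizing metadata mostly means renaming each school's fields to one shared vocabulary while leaving the files themselves where they are. The following Python sketch shows that idea under invented assumptions: the source names school_a and school_b, their field names, and the three-field target schema are hypothetical and do not come from the CDL.

```python
# Hypothetical per-source field names mapped onto one shared schema.
FIELD_MAPS = {
    "school_a": {"dc:title": "title", "dc:date": "date", "dc:identifier": "url"},
    "school_b": {"Name": "title", "Year": "date", "Link": "url"},
}


def normalize(source, record):
    """Rename a harvested record's fields to the shared schema, dropping others."""
    mapping = FIELD_MAPS[source]
    return {mapping[key]: str(value).strip()
            for key, value in record.items() if key in mapping}


def harvest(batches):
    """Normalize (source, record) pairs; content stays on the native servers."""
    return [normalize(source, record) for source, record in batches]


# Hypothetical usage: two schools describe their items differently.
uniform = harvest([
    ("school_a", {"dc:title": " Rail maps ", "dc:date": "1880",
                  "dc:identifier": "http://a.example.edu/1"}),
    ("school_b", {"Name": "Dam photos", "Year": 1937,
                  "Link": "http://b.example.edu/9"}),
])
```

Only the descriptions are copied; the url field in each normalized record still points back to the contributing school's own server.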

Another CDL repository, the California Recall Project, aims to record all electronic content that was generated during California’s historic election to recall former Gov. Gray Davis in 2003. Working in cooperation with the Stanford University Computer Science Department and the federal San Diego Supercomputer Center (www.sdsc.edu), the CDL has crawled and saved thousands of Web sites associated with the election. According to Patricia Cruse, director of Digital Preservation for the CDL, the next step for the project will be to explore possibilities for presentation of these materials, and make them available to any school inside or outside of California that wants access.

Beyond these collaborative efforts, CDL embarked on another, more system-oriented repository effort in July: the Digital Preservation Repository (DPR). Based on a Java client library and a Web services interface that uses the Simple Object Access Protocol (SOAP), the new system will establish a set of services for the long-term care of a variety of digital content types, including doctoral theses, lecture notes, research photographs, and more. Anyone in the UC system can submit an object for digital preservation, as long as the submitter owns the copyright to the item. According to Cruse, once the items have been submitted, CDL staff is responsible for checking submission errors, controlling user access, and retaining deposited objects and versions in perpetuity.

“Libraries need to be able to collect this information,” says Cruse. “With [the DPR] and some of our other efforts, we hope to help them do it a little more easily.”

The Mother Ship

Not surprisingly, the biggest innovation in digital repositories today is underway at MIT. There, the DSpace project (dspace.mit.edu) is one of the largest digital asset management projects in history, an effort that enables academics to place their content in a free and trusted archive, get it indexed, get it on the Web, and easily find associated metadata. DSpace manages that content over the long term, just as physical libraries would. The DSpace system makes two identical copies of all data, catalogues metadata about the data, and gives each file a unique URL or Web address, much like the VideoTools system at the Harvard Business School. The address sticks for life, even if the archivist later wants to migrate a given file into a newer file format.
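The idea of an address that "sticks for life" is easier to see in code. The Python sketch below is a toy archive, not DSpace: the Archive class, its methods, and the archive.example.edu prefix are assumptions made for illustration. It keeps two copies of every deposit and lets the stored bytes change format without changing the address.

```python
import itertools


class Archive:
    """Toy archive: two copies per item and an address that never changes."""

    def __init__(self, base="https://archive.example.edu/item/"):  # hypothetical
        self._base = base
        self._ids = itertools.count(1)
        self._items = {}  # persistent URL -> {"format", "copies", "metadata"}

    def deposit(self, data, fmt, metadata):
        """Store two identical copies plus metadata; return the permanent URL."""
        url = f"{self._base}{next(self._ids)}"
        self._items[url] = {"format": fmt,
                            "copies": [bytes(data), bytes(data)],
                            "metadata": dict(metadata)}
        return url

    def migrate(self, url, new_data, new_fmt):
        """Replace the stored bytes with a newer format; the URL is unchanged."""
        item = self._items[url]
        item["format"] = new_fmt
        item["copies"] = [bytes(new_data), bytes(new_data)]

    def fetch(self, url):
        """Return the current format and content behind a permanent URL."""
        item = self._items[url]
        return item["format"], item["copies"][0]


# Hypothetical usage: deposit a thesis, later migrate it to a newer format.
archive = Archive()
address = archive.deposit(b"old bytes", "doc", {"title": "Thesis"})
archive.migrate(address, b"new bytes", "pdf")
```

Anyone who bookmarked the address before the migration still reaches the item afterward; only what comes back has changed format.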

DSpace began in 2000 as a partnership between MIT and high-tech vendor Hewlett-Packard. At first, the effort was intended as a “breadth-first” approach to the problem of digital preservation, a program designed to leave no stone unturned. The first version was an end-to-end, out-of-the-box system containing functionality for the capture of digital content and associated metadata, storage management, indexing, and capabilities that enabled end users to search, browse, and retrieve. Gradually, the two entities morphed the system to the open source community development model, allowing that community to add depth of its own. MacKenzie Smith, associate director for Technology at MIT Libraries, hails DSpace as an effort to reverse the trend of researchers losing their hold on physical research data, a growing problem in government and academia alike.

“Our faculty members are keeping their research under their desks, on lots of disks, and praying that nothing happens to it,” she says, noting that earlier this decade, some MIT researchers actually may have misplaced some of their early studies and communications that led to the creation of the Internet itself. “We have a long way to go,” she admits.

Today, the key to DSpace is flexibility. Because the software behind the system is open source, it is available for other institutions to adapt at their leisure, and nearly 100 colleges and universities have already done so. Another benefit to the effort’s open source base is that users constantly are writing improvements to the code. In the last year, for instance, researchers at MIT and other schools have crafted new aspects of the software, such as an auditing feature that can verify whether a file has been corrupted or tampered with, and a system that checks accuracy when a file is migrated into a new format. These programmers continually test the DSpace code for weaknesses, making sure that the system is as “hardened” and secure as it possibly can be as it continues to grow.
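A common way to detect corruption or tampering is to record a cryptographic fingerprint when a file is deposited and recompute it later. The Python sketch below uses SHA-256 from the standard library to show that general technique; it is an assumption about how such an audit can work, not the DSpace feature itself, and the function and file names are invented.

```python
import hashlib


def fingerprint(data):
    """Return the SHA-256 hex digest recorded when a file is deposited."""
    return hashlib.sha256(data).hexdigest()


def audit(files, recorded):
    """Return names of files whose current bytes no longer match the record."""
    return sorted(name for name, data in files.items()
                  if fingerprint(data) != recorded.get(name))


# Hypothetical usage: record fingerprints at deposit, then audit later.
files = {"a.pdf": b"alpha", "b.pdf": b"beta"}
recorded = {name: fingerprint(data) for name, data in files.items()}
files["b.pdf"] = b"bet4"  # simulate silent corruption of one file
damaged = audit(files, recorded)
```

Even a one-character change produces a different digest, so the audit flags the altered file without needing a second copy to compare against.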

Down the line, Smith predicts that the biggest challenge to the DSpace effort will be a legal one: convincing (and subsequently reminding) faculty to retain their rights to archive material when the rights are up, so that schools don’t have to shell out additional money to utilize work that quite rightfully should be theirs to access free of charge. Understandably, researchers who publish seek the best deals to publish their work, and these deals frequently require them to fork over rights to a publisher. But change could be imminent: The National Institutes of Health (www.nih.gov) have changed their public access guidelines to include free electronic access to articles that come out of research funds. Smith says that if researchers followed this lead and changed the terms of their copyright agreements to allow for a copy in DSpace, the system could grow exponentially.

“We understand that if we can’t capture content, we won’t have anything to preserve and we’ll lose the scholarly record,” she says. “The next step for us is to come up with language [that] researchers can use when they go to publishers and say, ‘This is what we need to protect our work for the future.’”
