'Opening' A Digital Library
Digital libraries are not new, but open source, video, and collaborative
digital repositories are changing the face of library science.
If only Melvil Dui (né Melville Dewey) could see what’s become
of library science today. How would the father of the Dewey Decimal system
categorize Web pages, which grow at an alarming rate of nearly seven million
per day? How would he organize the hundreds of thousands of video files, audio files, and podcasts that now supplement written
words as content? Most importantly, how on Earth would Dewey—perhaps the most
famous librarian of all time—manage to represent the body of any one university’s
research notes, white papers, dissertations, and other assorted scholarly content
in one card catalog?
Methinks the poor guy probably would go insane.
For modern-day librarians whose mission is to build collections and transmit
today’s intellectual, cultural, and historical output to the future, the challenge
is equally daunting. While many schools have responded with efforts to digitally
scan their physical collections, a new wave of digital repositories designed
to save only certain types of content is changing the face of library science
everywhere. In particular, efforts at Stanford University (CA),
Harvard Business School (MA), the University of California
system, and the Massachusetts Institute of Technology stand
out as innovations that could forever revolutionize the way we think about storing
content. Good ol’ Dewey, spelling-reform eccentricities and all, would be proud.
Establishing “Cache” Flow
Unless you’ve been living in a cave, you no doubt know that librarians and
technologists at Stanford made headlines in December 2004 when they unveiled
a controversial plan to work with Internet search engine Google to digitize
an entire collection of more than eight million volumes (the plan has since
been scaled back). Yet, while the news media focused on the Stanford librarians
involved in the deal, the university’s Victoria Reich was hard at work on a
digital library project of her own, a revolutionary repository known as LOCKSS.
The project, whose name is an acronym for “Lots of Copies Keep Stuff Safe,” revolves around
open source software that provides institutions with an easy and inexpensive
way to collect, store, preserve, and provide access to local copies of authorized
content they have purchased from publishers. According to Reich, the project
director, it just might revolutionize the way schools store electronic journals
for generations.
LOCKSS / Stanford University: Because the LOCKSS repository system is based on free, open source software, it’s an inexpensive way for institutions to preserve electronic journals. Aside from the standard journal subscription fees, the only additional requisite costs are the money for a computer to run the software, and the relatively modest fee to join the LOCKSS Alliance.
In fact, the LOCKSS project began in 1999, when it became clear to Reich and
others that scholarly journals were moving from paper to the ’Net, and libraries
were quickly finding that they were not prepared to grow online content collections
accordingly. The result was a tool that would allow them to build electronic
journal collections easily and affordably. Today, instead of focusing on the
construction of a single centralized repository as in other digital library
initiatives, the LOCKSS project takes a more decentralized approach. Member
schools can download the open source software and get free upgrades by following
links on lockss.stanford.edu.
Currently, more than 90 colleges and universities are running “LOCKSS boxes” (computers with the open source software running live). Each box can store 3,000 years’ worth of journal articles.
“In the physical world, libraries have succeeded until now because they are
so loosely coupled,” says Reich, pointing to the advantages of decentralization
and autonomy. “We looked at LOCKSS and realized that it made perfect sense to
embrace the same model that’s worked wonderfully for hundreds of years.”
How does it work? Essentially, the LOCKSS approach maintains a cache of data
that simply never gets flushed. The technology employs a three-step process
for preserving data that requires minimal human intervention beyond the initial
setup. Via the first step—which Reich calls “ingest”—the system crawls publisher
Web sites to collect new content as it appears, and performs audit and quality
control on what it finds. In the second step, known as “preservation,” the system
preserves content by saving a file in its original format (as required by archivists).
If migration to a more accessible format is required, LOCKSS engineers that
migration, too. Finally, in a step dubbed “dissemination,” the system acts as
a Web proxy, supplying content for every URL at which a particular file originally
was published. In other words, the system preserves the link to the original material.
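In rough terms, the cycle can be pictured as a small crawl-store-serve loop. The Python sketch below is illustrative only: the function names, storage layout, and hashing scheme are assumptions made for the sake of the example, not the actual LOCKSS software, which is its own full-fledged system.

```python
# Illustrative sketch of the ingest/preserve/disseminate cycle described
# above. All names and the storage scheme are invented for explanation;
# this is not the actual LOCKSS implementation.
import hashlib
import urllib.request
from pathlib import Path

ARCHIVE = Path("lockss-box")

def ingest(url: str) -> Path:
    """Step 1 ("ingest"): crawl a publisher URL and keep the raw bytes."""
    data = urllib.request.urlopen(url).read()
    ARCHIVE.mkdir(exist_ok=True)
    path = ARCHIVE / hashlib.sha256(url.encode()).hexdigest()
    path.write_bytes(data)  # saved in its original format, as archivists require
    return path

def preserve(path: Path) -> str:
    """Step 2 ("preservation"): record a checksum for later quality control."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def disseminate(url: str) -> bytes:
    """Step 3 ("dissemination"): act as a Web proxy, answering a request for
    the original URL with the locally preserved copy."""
    return (ARCHIVE / hashlib.sha256(url.encode()).hexdigest()).read_bytes()
```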
THANKS TO THE CALIFORNIA DIGITAL LIBRARY INITIATIVE, anyone in the UC system can submit a copyrighted object and its versions for preservation, and they will be retained in perpetuity.
Because the LOCKSS system is based on free, open source software, Reich claims
it is one of the cheapest ways for institutions to tackle the issue of preserving
electronic journals. Aside from standard journal subscription fees, the only
additional requisite costs involved are the money for a computer to run the
software, and the fee to join the LOCKSS Alliance. Currently, membership fees
vary by school size, and range from $1,080 to $10,800 (for the largest institutions).
Reich says that as the program weans itself off grant funding from the National
Science Foundation (www.nsf.gov) and the Andrew W. Mellon Foundation (www.mellon.org),
these fees may rise slightly to cover costs of operation. Ultimately, she says,
the fees are secondary: so long as the project has enough money to sustain itself,
its goal is simply to make preservation cheap and reliable. As it stands now,
schools are putting tens of thousands of dollars annually toward non-LOCKSS
preservation processes.
“Although libraries are one of our society’s few memory collections, it’s no
secret that we have no money,” she says. “If we are building something worthwhile
that we expect to help [libraries] fulfill their missions over time, it darn
well better be inexpensive.”
Like Netflix, Only Better
While the LOCKSS system catalogs content from electronic journals, a new digital
repository effort at the Harvard Business School focuses on another valuable
medium of information: videos. The Harvard project revolves around a new system
called VideoTools—an elaborate media portal through which the school’s extensive
collection of video assets is automatically coded, managed, shared, and published.
The VideoTools system provides video content for individual class sessions,
special events, or course-specific compilations. What’s more, according to Larry
Bouthillier, director of Educational Technology and Multimedia Development,
because every video in the database is tied to a distinct URL, faculty members
can easily link to videos in e-mail, lectures, and standard Web pages.
Still, the VideoTools system didn’t happen overnight. Because the business
school teaches almost entirely via case studies, over the years the institution
has amassed quite a library of videos to extend and amplify most lessons. In
1998, Bouthillier oversaw a strategic initiative to digitize more than 1,200
videos; by January 2004, with thousands of new videos waiting for digitization,
the school was in dire need of another plan. With a mix of homegrown J2EE programming
and new technologies, including Helix DNA servers from RealNetworks (www.real.com),
a back-end digital asset management solution from ClearStory Systems (www.clearstorysystems.com),
and encoding automation software from Virage (www.virage.com),
the school set out to perform the Herculean task of rearchitecting its video
library from the ground up.
“There were so many interdependencies that when we set out to upgrade the system,
we realized that we had to rebuild everything all at once,” Bouthillier says,
looking back on the six-figure project. “We knew this would be a big deal, but
I don’t think any of us realized just how much work we’d have to put into making
our video library what we wanted it to be.”
Today, once Harvard Business School technicians create an MPEG digital video
file from a physical video tape, they drop it into a folder, and VideoTools
does the rest: everything from populating the database to assigning each video
its own locator URL.
Whenever faculty members request a video, the system enforces
access-control rules to confirm that the user should be allowed to view the
content. Once this access is granted, the system’s Helix DNA servers stream
video into classrooms at 1.2 Mbps, a quality that is virtually indistinguishable
from that of a physical DVD. VideoTools even enables business school faculty and staff
members to create personal collections of video and multimedia content online,
then share that folder with others as part of a mini-portal, or portlet, off
the main VideoTools site.
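A bare-bones sketch of that hand-off might look like the following Python. The folder name, URL scheme, and access-control structure here are hypothetical stand-ins; the actual VideoTools stack is built on J2EE and the commercial back-end products named above.

```python
# Hypothetical watch-folder ingest, loosely modeled on the workflow just
# described: a technician drops an MPEG file into a folder, and the system
# catalogs it, assigns a stable URL, and enforces access control on playback.
import uuid
from pathlib import Path

INBOX = Path("inbox")            # hypothetical drop folder
CATALOG = {}                     # video id -> metadata record
ACL = {}                         # video id -> set of users allowed to view

def ingest_inbox():
    INBOX.mkdir(exist_ok=True)
    for f in INBOX.glob("*.mpg"):
        vid = uuid.uuid4().hex
        CATALOG[vid] = {
            "file": str(f),
            "url": f"https://video.example.edu/v/{vid}",  # distinct locator URL
        }
        ACL[vid] = set()         # populated per course or faculty request

def request_stream(user: str, vid: str) -> str:
    """Check access rules before handing the stream URL to the viewer."""
    if user not in ACL.get(vid, set()):
        raise PermissionError("user is not cleared to view this content")
    return CATALOG[vid]["url"]
```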
All of the content in the VideoTools system is searchable by common criteria
such as title, event location, and date. Every video also is cross-referenced
by the faculty members, courses, and cases with which it is associated most
frequently. Videos in the system have searchable transcripts, making the library
fertile ground for advanced searches and what Bouthillier refers to as “serendipitous
discoveries.” Even the text that appears in PowerPoint presentations and other
digital demonstrations becomes searchable in VideoTools. Bouthillier says that
at last check, the system contained more than 7,000 video files overall—literally
terabytes of content in various formats including MPEG-1, MPEG-2, and RealVideo.
Much like the knowledge of the Harvard Business School students, he adds, the
library is designed to grow every day.
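The transcript and slide-text search Bouthillier describes amounts to an inverted index over every word the system can extract. As a simple sketch (the data structures below are assumptions, not the VideoTools schema):

```python
# Toy inverted index over transcripts and slide text; each word maps to the
# set of videos that mention it, enabling the kind of "serendipitous
# discoveries" described above. Invented structure, for illustration only.
from collections import defaultdict

INDEX = defaultdict(set)  # word -> {video ids}

def index_text(video_id: str, text: str):
    """Index a transcript or extracted PowerPoint text for one video."""
    for word in text.lower().split():
        INDEX[word].add(video_id)

def search(term: str) -> set:
    """Return every video whose transcript or slides mention the term."""
    return INDEX.get(term.lower(), set())
```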
“We’ve laid the groundwork for a system that can scale with us over time,”
he says. “When you’re talking about repositories for data of any kind, that’s
the ultimate goal.”
AT HARVARD BUSINESS SCHOOL, the VideoTools video portal has remarkable management capability.
Preserving the American West
At some schools, new digital repository efforts are as much about collaboration
as they are about preservation. Such is the case in the University of California
system, where a number of initiatives under the auspices of the California Digital
Library (CDL; www.cdlib.org)
highlight a shared-services approach to preserving digital content of various
kinds. Perhaps the most interesting of these initiatives is an effort dubbed
the American West—a digital repository of physical and electronic information
about the history of the Western US. In 2003, the CDL was awarded a three-year
grant from the William and Flora Hewlett Foundation (www.hewlett.org) to assemble
the collection. The project, which organizes material into 400 topics,
draws upon resources from a variety of major research institutions, including
Indiana University and the University of Washington.
The repository contains information on everything from railroads to the Bonneville
Power Administration and Japanese internment camps. Spearheading the endeavor is
Robin Chandler, the CDL’s director of Built Content.
Chandler says that instead
of lifting actual files from participating institutions, the American West project
harvests metadata (or relevant data about data) from other schools, and then
presents links to the content files on their native servers. Because different
schools create metadata in different formats, the project includes software
designed to normalize these differences and present information on all material
in a uniform format. Down the road, says Chandler, CDL hopes to package this
information for libraries at other institutions to use to launch themed repositories
of their own. “The whole idea is to extend our commitment to shared services
by creating something that works for everyone,” she says.
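In spirit, the harvesting step resembles the sketch below: each partner's records arrive in a local schema and are mapped into a common one, with a link back to the file on its native server. The field names and mappings are invented for illustration; the project's real normalization software is more involved.

```python
# Sketch of cross-institution metadata normalization. The partner schemas
# and field mappings are invented for illustration.
FIELD_MAPS = {
    "partner_a": {"ttl": "title", "dt": "date", "loc": "location"},
    "partner_b": {"Title": "title", "Created": "date", "Place": "location"},
}

def normalize(partner: str, record: dict) -> dict:
    """Translate one harvested record into the shared schema, keeping a
    link to the content file on the partner's own server."""
    mapping = FIELD_MAPS[partner]
    common = {shared: record[local]
              for local, shared in mapping.items() if local in record}
    common["link"] = record["url"]  # content is linked to, not copied
    return common

# e.g. normalize("partner_a", {"ttl": "Railroad survey, 1869",
#                              "dt": "1869", "url": "https://partner.edu/item/42"})
```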
Another CDL repository, the California Recall Project, aims to record all electronic
content that was generated during California’s historic election to recall former
Gov. Gray Davis in 2003. Working in cooperation with the Stanford University
Computer Science Department and the San Diego Supercomputer Center (www.sdsc.edu),
the CDL has crawled and saved thousands of Web sites associated with the election.
According to Patricia Cruse, director of Digital Preservation for the CDL, the
next step for the project will be to explore possibilities for presentation
of these materials, and make them available to any school inside or outside
of California that wants access.
Beyond these collaborative efforts, CDL embarked on another, more system-oriented
repository effort in July: the Digital Preservation Repository (DPR). Based
on a Java client library and a Web services interface that uses the Simple Object
Access Protocol (SOAP), the new system will establish a set of services for
the long-term care of a variety of digital content types, including doctoral
theses, lecture notes, research photographs, and more. Anyone in the UC system
can submit an object for digital preservation, as long as the submitter owns
the copyright to the item. According to Cruse, once the items have been submitted,
CDL staff is responsible for checking submissions for errors, controlling user access,
and retaining deposited objects and versions in perpetuity.
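For the curious, a SOAP-based deposit call is shaped roughly like the Python below. The endpoint URL, namespace, and element names are all invented for illustration; the DPR's actual WSDL-defined operations aren't detailed here.

```python
# What a SOAP deposit request to a service like the DPR might look like.
# The endpoint URL, namespace, and element names are invented; the real
# DPR defines its own operations.
import urllib.request

ENVELOPE = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <submitObject xmlns="urn:example:dpr">
      <title>Doctoral thesis, chapter one</title>
      <depositor>copyright-holder@example.edu</depositor>
      <payloadRef>upload-0001</payloadRef>
    </submitObject>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    "https://dpr.example.edu/services/deposit",   # hypothetical endpoint
    data=ENVELOPE.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8",
             "SOAPAction": "urn:example:dpr#submitObject"},
)
# urllib.request.urlopen(request) would return a deposit receipt on success.
```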
“Libraries need to be able to collect this information,” says Cruse. “With
[the DPR] and some of our other efforts, we hope to help them do it a little
more easily.”
DSpace / MIT: The DSpace initiative is one of the largest digital asset management projects in history. It enables academics to place their content in a free and trusted archive, get it indexed, get it on the Web, and make it easy to find associated metadata. It also manages that content over the long term, just as physical libraries would.
The Mother Ship
Not surprisingly, the biggest innovation in digital repositories today is underway
at MIT. There, the DSpace project (dspace.mit.edu)
is one of the largest digital asset management projects in history: an effort
that enables academics to place their content in a free and trusted archive,
get it indexed, get it on the Web, and easily find associated metadata.
DSpace
manages that content over the long term, just as physical libraries would. The
DSpace system makes two identical copies of all data, catalogues metadata about
the data, and gives each file a unique URL, much like the VideoTools
system at the Harvard Business School. The address sticks for life, even if
the archivist later wants to migrate a given file into a newer file format.
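Those two ideas (duplicate every deposit, and give it an address that outlives any one file format) reduce to something like this sketch; the paths, URL scheme, and locator table are assumptions, not DSpace's real storage code.

```python
# Sketch of the two behaviors described above: each deposit is stored
# twice, and its permanent address survives format migration.
import shutil
import uuid
from pathlib import Path

STORES = [Path("store-a"), Path("store-b")]   # two identical copies
LOCATOR = {}                                  # permanent id -> current file

def deposit(src: Path) -> str:
    item_id = uuid.uuid4().hex
    for store in STORES:
        store.mkdir(exist_ok=True)
        shutil.copy2(src, store / src.name)   # replicate into both stores
    LOCATOR[item_id] = STORES[0] / src.name
    return f"https://repository.example.edu/item/{item_id}"  # sticks for life

def migrate(item_id: str, new_file: Path):
    """Point the same permanent address at a newer-format copy."""
    LOCATOR[item_id] = new_file
```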
DSpace began in 2000 as a partnership between MIT and high-tech vendor Hewlett-Packard.
At first, the effort was intended as a “breadth-first” approach to the problem
of digital preservation: a program designed to leave no stone unturned. The first
version was an end-to-end, out-of-the-box system containing functionality for
the capture of digital content and associated metadata, storage management,
indexing, and capabilities that enabled end users to search, browse, and retrieve.
Gradually, the two partners moved the system to an open source community
development model, allowing that community to add depth of its own. MacKenzie
Smith, associate director for Technology at MIT Libraries, hails DSpace as an
effort to reverse the trend of researchers losing their hold on physical research
data, a growing problem in government and academia alike.
“Our faculty members are keeping their research under their desks, on lots
of disks, and praying that nothing happens to it,” she says, noting that earlier
this decade, some MIT researchers actually may have misplaced some of their
early studies and communications that led to the creation of the Internet itself.
“We have a long way to go,” she admits.
Today, the key to DSpace is flexibility. Because the software behind the system
is open source, it is available for other institutions to adapt at their
leisure, and nearly 100 colleges and universities have already done so. Another
benefit to the effort’s open source base is that users constantly are writing
improvements to the code. In the last year, for instance, researchers at MIT
and other schools have crafted new aspects of the software, such as an auditing
feature that can verify whether a file has been corrupted or tampered with,
and a system that checks accuracy when a file is migrated into a new format.
These programmers continually test the DSpace code for weaknesses, making sure
that the system is as “hardened” and secure as it possibly can be as it continues
to grow.
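The auditing feature boils down to fixity checking: hash a file at ingest, then re-hash it later and compare. A minimal sketch follows; DSpace's actual checker is richer than this.

```python
# Minimal fixity check in the spirit of the auditing feature described
# above: record a checksum at ingest, re-hash later, and flag any file
# that has been corrupted or tampered with. Illustrative only.
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def audit(path: Path, recorded: str) -> bool:
    """True if the stored file still matches its ingest-time checksum."""
    return checksum(path) == recorded

# Usage: store checksum(p) alongside the file at ingest; run audit(p, stored)
# on a schedule, and investigate any file for which it returns False.
```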
Down the line, Smith predicts that the biggest challenge to the DSpace effort
will be a legal one: convincing (and subsequently reminding) faculty to retain
the right to archive their material when publication rights come up for negotiation, so that schools don’t
have to shell out additional money to utilize work that quite rightfully should
be theirs to access free of charge. Understandably, researchers who publish
seek the best deals to publish their work, and these deals frequently require
them to fork over rights to a publisher. But change could be imminent: The National
Institutes of Health (www.nih.gov) has changed its public access guidelines
to include free electronic access to articles that come out of the research
it funds. Smith says that if researchers
followed this lead and changed the terms of their copyright agreements to allow
for a copy in DSpace, the system could grow exponentially.
“We understand that if we can’t capture content, we won’t have anything to
preserve and we’ll lose the scholarly record,” she says. “The next step for
us is to come up with language [that] researchers can use when they go to publishers
and say, ‘This is what we need to protect our work for the future.’”