'Opening' A Digital Library
        Digital libraries are not new, but open source, video, and collaborative 
  digital repositories are changing the face of library science.
If only Melvil Dui (né Melville Dewey) could see what’s become 
  of library science today. How would the father of the Dewey Decimal system 
  categorize Web pages, which grow at an alarming rate of nearly seven million 
  per day? How would he organize the hundreds of thousands of video files, and 
  hundreds of thousands of audio files or podcasts that now supplement written 
  words as content? Most importantly, how on Earth would Dewey—perhaps the most 
  famous librarian of all time—manage to represent the body of any one university’s 
  research notes, white papers, dissertations, and other assorted scholarly content 
  in one card catalog? 
Methinks the poor guy probably would go insane. 
For modern-day librarians whose mission is to build collections and transmit 
  today’s intellectual, cultural, and historical output to the future, the challenge 
  is equally daunting. While many schools have responded with efforts to digitally 
  scan their physical collections, a new wave of digital repositories designed 
  to save only certain types of content is changing the face of library science 
  everywhere. In particular, efforts at Stanford University (CA), 
  Harvard Business School (MA), the University of California 
  system, and the Massachusetts Institute of Technology stand 
  out as innovations that could forever revolutionize the way we think about storing 
  content. Good ol’ Dewey, spelling-reform eccentricities and all, would be proud. 
Establishing “Cache” Flow
 
Unless you’ve been living in a cave, you no doubt know that librarians and 
  technologists at Stanford made headlines in December 2004 when they unveiled 
  a controversial plan to work with Internet search engine Google to digitize 
  an entire collection of more than eight million volumes (the plan has since 
  been scaled back). Yet, while the news media focused on the Stanford librarians 
  involved in the deal, the university’s Victoria Reich was hard at work on a 
  digital library project of her own, a revolutionary repository known as LOCKSS, 
  an acronym for “Lots of Copies Keep Stuff Safe.” The project revolves around 
  open source software that provides institutions with an easy and inexpensive 
  way to collect, store, preserve, and provide access to local copies of authorized 
  content they have purchased from publishers. According to Reich, the project 
  director, it just might revolutionize the way schools store electronic journals 
  for generations. 
   | LOCKSS/Stanford University |  
  | Because the LOCKSS repository system is based on free, open source software, it’s an inexpensive way for institutions to preserve electronic journals. Aside from the standard journal subscription fees, the only additional requisite costs are the money for a computer to run the software, and the relatively modest fee to join the LOCKSS Alliance. |  
 
In fact, the LOCKSS project began in 1999, when it became clear to Reich and 
  others that scholarly journals were moving from paper to the ’Net, and libraries 
  were quickly finding that they were not prepared to grow online content collections 
  accordingly. The result was a tool that would allow them to build electronic 
  journal collections easily and affordably. Today, instead of focusing on the 
  construction of a single centralized repository as in other digital library 
  initiatives, the LOCKSS project takes a more decentralized approach. Member 
  schools can download the open source software and get free upgrades by following 
  links on lockss.stanford.edu. 
  Currently, more than 90 colleges and universities are running “LOCKSS boxes”—computers 
  with the open source software running live. On each campus, these computers can 
  store 3,000 years’ worth of journal articles apiece. 
“In the physical world, libraries have succeeded until now because they are 
  so loosely coupled,” says Reich, pointing to the advantages of decentralization 
  and autonomy. “We looked at LOCKSS and realized that it made perfect sense to 
  embrace the same model that’s worked wonderfully for hundreds of years.” 
How does it work? Essentially, the LOCKSS approach facilitates a cache of data 
  that simply never gets flushed. The technology employs a three-step process 
  for preserving data that requires minimal human intervention beyond the initial 
  setup. Via the first step—which Reich calls “ingest”—the system crawls publisher 
  Web sites to collect new content as it appears, and performs audit and quality 
  control on what it finds. In the second step, known as “preservation,” the system 
  preserves content by saving a file in its original format (as required by archivists). 
  If migration to a more accessible format is required, LOCKSS engineers that 
  migration, too. Finally, in a step dubbed “dissemination,” the system acts as 
  a Web proxy, supplying content for every URL at which a particular file originally 
  was published. In other words, the system preserves the link to the original material. 
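For the technically curious, the three-step cycle reduces to something like the following minimal Python sketch. Everything here is invented for illustration (the function names, the journal URL, the storage layout), and the production LOCKSS software is far more elaborate:

```python
# Illustrative sketch of the ingest/preserve/disseminate cycle. All names
# and the journal URL are invented; the real LOCKSS software is far more
# elaborate (and is freely downloadable via lockss.stanford.edu).
import hashlib
import pathlib
import urllib.request

ARCHIVE = pathlib.Path("lockss_box")            # local "LOCKSS box" storage
ARCHIVE.mkdir(exist_ok=True)

def ingest(url: str) -> bytes:
    """Step 1: crawl a publisher URL and collect content as it appears."""
    with urllib.request.urlopen(url) as response:
        return response.read()

def preserve(url: str, content: bytes) -> pathlib.Path:
    """Step 2: save the file in its original format, keyed by its URL."""
    path = ARCHIVE / hashlib.sha256(url.encode()).hexdigest()
    path.write_bytes(content)                   # bit-level copy, no conversion
    return path

def disseminate(url: str) -> bytes:
    """Step 3: act as a Web proxy, answering for the original URL."""
    return (ARCHIVE / hashlib.sha256(url.encode()).hexdigest()).read_bytes()

# One pass of the cycle for a single, hypothetical journal URL:
journal_url = "https://publisher.example.com/journal/vol1/issue1"
preserve(journal_url, ingest(journal_url))
print(disseminate(journal_url)[:80])            # serve the local copy
```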

THANKS TO THE CALIFORNIA DIGITAL LIBRARY INITIATIVE, anyone in the UC system can submit a copyrighted object for preservation, and deposited objects and versions will be retained in perpetuity.
 
Because the LOCKSS system is based on free, open source software, Reich claims 
  it is one of the cheapest ways for institutions to tackle the issue of preserving 
  electronic journals. Aside from standard journal subscription fees, the only 
  additional requisite costs involved are the money for a computer to run the 
  software, and the fee to join the LOCKSS Alliance. Currently, membership fees 
  vary by school size, and range from $1,080 to $10,800 (for the largest institutions). 
  Reich says that as the program weans itself off grant funding from the National 
  Science Foundation (www.nsf.gov) and the Andrew W. Mellon Foundation (www.mellon.org), 
  these fees may rise slightly to cover costs of operation. Ultimately, she says, 
  the fees are secondary; so long as the project has enough money to sustain itself, 
  its goal simply is to make preservation cheap and reliable. As it 
  stands now, schools are putting tens of thousands of dollars annually toward 
  non-LOCKSS preservation processes. 
“Although libraries are one of our society’s few memory collections, it’s no 
  secret that we have no money,” she says. “If we are building something worthwhile 
  that we expect to help [libraries] fulfill their missions over time, it darn 
  well better be inexpensive.” 
Like Netflix, Only Better
While the LOCKSS system catalogs content from electronic journals, a new digital 
  repository effort at the Harvard Business School focuses on another valuable 
  medium of information: videos. The Harvard project revolves around a new system 
  called VideoTools—an elaborate media portal through which the school’s extensive 
  collection of video assets is automatically coded, managed, shared, and published. 
  The VideoTools system provides video content for individual classroom delivery, 
  special events, or course-specific compilations. What’s more, according to Larry 
  Bouthillier, director of Educational Technology and Multimedia Development, 
  because every video in the database is tied to a distinct URL, faculty members 
  can easily link to videos in e-mail, lectures, and standard Web pages. 
Still, the VideoTools system didn’t happen overnight. Because the business 
  school teaches almost entirely via case studies, over the years the institution 
  has amassed quite a library of videos to extend and amplify most lessons. In 
  1998, Bouthillier oversaw a strategic initiative to digitize more than 1,200 
  videos; by January 2004, with thousands of new videos waiting for digitization, 
  the school was in dire need of another plan. With a mix of homegrown J2EE programming 
  and new technologies, including Helix DNA servers from RealNetworks (www.real.com), 
  a back-end digital asset management solution from ClearStory Systems (www.clearstorysystems.com), 
  and encoding automation software from Virage (www.virage.com), 
  the school set out to perform the Herculean task of rearchitecting its video 
  library from the ground up. 
“There were so many interdependencies that when we set out to upgrade the system, 
  we realized that we had to rebuild everything all at once,” Bouthillier says, 
  looking back on the six-figure project. “We knew this would be a big deal, but 
  I don’t think any of us realized just how much work we’d have to put into making 
  our video library what we wanted it to be.” 
Today, once Harvard Business School technicians create an MPEG digital video 
  file from a physical video tape, they drop it into a folder, and VideoTools 
  does the rest: everything from populating the database to assigning each video 
  its own locator URL.
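A minimal sketch of that drop-folder step, with an invented database schema and a hypothetical URL scheme (not HBS’s actual code), might look like this:

```python
# Minimal sketch of a "drop folder" ingest step like the one described
# above: every MPEG placed in the folder gets a database record and a
# permanent locator URL. Schema and URL scheme are hypothetical.
import pathlib
import sqlite3
import uuid

DROP_FOLDER = pathlib.Path("incoming")
DROP_FOLDER.mkdir(exist_ok=True)
BASE_URL = "https://video.example.edu/watch/"   # invented locator prefix

db = sqlite3.connect("videotools.db")
db.execute("""CREATE TABLE IF NOT EXISTS videos (
                  id       TEXT PRIMARY KEY,  -- stable identifier
                  filename TEXT,              -- original MPEG file name
                  url      TEXT               -- the video's locator URL
              )""")

def ingest_folder() -> None:
    """Register each MPEG in the drop folder and mint its locator URL."""
    for path in DROP_FOLDER.glob("*.mpg"):
        video_id = uuid.uuid4().hex
        db.execute("INSERT INTO videos VALUES (?, ?, ?)",
                   (video_id, path.name, BASE_URL + video_id))
    db.commit()

ingest_folder()
```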
Whenever faculty members request a video, the system enforces 
  access-control rules to confirm that the user should be allowed to view the 
  content. Once this access is granted, the system’s Helix DNA servers stream 
  video into classrooms at 1.2 Mbps, a quality that is virtually indistinguishable 
  from physical DVDs. VideoTools even enables business school faculty and staff 
  members to create personal collections of video and multimedia content online, 
  then share that folder with others as part of a mini-portal, or portlet, off 
  the main VideoTools site. 
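The access-control gate can be pictured in a few lines of Python; the rule structure and names below are invented for illustration, since the article does not describe HBS’s actual policy engine:

```python
# Sketch of the access-control gate: before a stream URL is handed out,
# the request is checked against per-video rules. The rule structure and
# the Helix host name are invented for illustration.
STREAM_HOST = "rtsp://helix.example.edu/"       # hypothetical Helix server

ACCESS_RULES = {"vid-001": {"course": "Marketing 101"}}
ENROLLMENTS = {"prof_jones": {"Marketing 101"}}

def stream_url(user: str, video_id: str) -> str:
    """Return a streaming URL only if the user passes the access rules."""
    required = ACCESS_RULES.get(video_id, {}).get("course")
    if required and required not in ENROLLMENTS.get(user, set()):
        raise PermissionError(f"{user} may not view {video_id}")
    return STREAM_HOST + video_id

print(stream_url("prof_jones", "vid-001"))      # allowed, URL returned
```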
All of the content in the VideoTools system is searchable by common criteria 
  such as title, event location, and date. Every video also is cross-referenced 
  by the faculty members, courses, and cases with which it is associated most 
  frequently. Videos in the system have searchable transcripts, making the library 
  fertile ground for advanced searches and what Bouthillier refers to as “serendipitous 
  discoveries.” Even the text that appears in PowerPoint presentations and other 
  digital demonstrations becomes searchable in VideoTools. Bouthillier says that 
  at last check, the system contained more than 7,000 video files overall—literally 
  terabytes of content in various formats including MPEG-1, MPEG-2, and RealVideo. 
  Much like the knowledge of the Harvard Business School students, he adds, the 
  library is designed to grow every day. 
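To make the transcript-search idea concrete, here is a toy version using SQLite’s built-in full-text index; the engine VideoTools actually uses is not disclosed, and the sample row is invented:

```python
# Toy transcript search: SQLite's built-in full-text index stands in for
# whatever engine VideoTools really uses. Requires an SQLite build with
# FTS5 (standard in recent Python); the sample row is invented.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE transcripts USING fts5(video_id, body)")
db.execute("INSERT INTO transcripts VALUES ('vid-001', "
           "'a discussion of supply chain strategy at a retail firm')")

# Any phrase spoken on tape becomes a search key:
hits = db.execute("SELECT video_id FROM transcripts "
                  "WHERE transcripts MATCH 'supply'").fetchall()
print(hits)   # [('vid-001',)]
```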
“We’ve laid the groundwork for a system that can scale with us over time,” 
  he says. “When you’re talking about repositories for data of any kind, that’s 
  the ultimate goal.”
AT HARVARD BUSINESS SCHOOL, the VideoTools video portal has remarkable management capabilities.
 
Preserving the American West
 
At some schools, new digital repository efforts are as much about collaboration 
  as they are about preservation. Such is the case in the University of California 
  system, where a number of initiatives under the auspices of the California Digital 
  Library (CDL; www.cdlib.org) 
  highlight a shared-services approach to preserving digital content of various 
  kinds. Perhaps the most interesting of these initiatives is an effort dubbed 
  the American West—a digital repository of physical and electronic information 
  about the history of the Western US. In 2003, the CDL was awarded a three-year 
  grant from the William and Flora Hewlett Foundation (www.hewlett.org), 
  to assemble the collection. The project, which organizes material into 400 topics, 
  draws upon resources from a variety of major research institutions, including 
  Indiana University and the University of Washington, 
  to name a few. 
The repository contains information on everything from railroads to the Bonneville 
  Power Administration and Japanese internment camps. Spearheading the endeavor is 
  Robin Chandler, the CDL’s director of Built Content.
Chandler says that instead 
  of lifting actual files from participating institutions, the American West project 
  harvests metadata (or relevant data about data) from other schools, and then 
  presents links to the content files on their native servers. Because different 
  schools create metadata in different formats, the project includes software 
  designed to normalize these differences and present information on all material 
  in a uniform format. Down the road, says Chandler, CDL hopes to package this 
  information for libraries at other institutions to use to launch themed repositories 
  of their own. “The whole idea is to extend our commitment to shared services 
  by creating something that works for everyone,” she says. 
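A hedged sketch of what such normalization might involve, with invented field mappings standing in for the project’s real crosswalks:

```python
# Sketch of the normalization step: records arrive in different metadata
# formats and are mapped to one uniform shape that links back to the file
# on its home server. Field names and the target schema are invented.
def normalize(record: dict, source_format: str) -> dict:
    """Map a harvested metadata record to one uniform format."""
    # Each contributing school's format gets its own field mapping.
    mappings = {
        "dublin_core": {"title": "dc:title", "link": "dc:identifier"},
        "marc_like":   {"title": "245a",     "link": "856u"},
    }
    fields = mappings[source_format]
    return {
        "title": record[fields["title"]],
        # Link to the content file on its native server, not a copy:
        "url": record[fields["link"]],
    }

harvested = {"dc:title": "Bonneville Power photographs",
             "dc:identifier": "https://lib.example.edu/item/42"}
print(normalize(harvested, "dublin_core"))
```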
Another CDL repository, the California Recall Project, aims to record all electronic 
  content that was generated during California’s historic election to recall former 
  Gov. Gray Davis in 2003. Working in cooperation with the Stanford University 
  Computer Science Department and the federally funded San Diego Supercomputer Center (www.sdsc.edu), 
  the CDL has crawled and saved thousands of Web sites associated with the election. 
  According to Patricia Cruse, director of Digital Preservation for the CDL, the 
  next step for the project will be to explore possibilities for presentation 
  of these materials, and make them available to any school inside or outside 
  of California that wants access. 
Beyond these collaborative efforts, CDL embarked on another, more system-oriented 
  repository effort in July: the Digital Preservation Repository (DPR). Based 
  on a Java client library and a Web services interface that uses the Simple Object 
  Access Protocol (SOAP), the new system will establish a set of services for 
  the long-term care of a variety of digital content types, including doctoral 
  theses, lecture notes, research photographs, and more. Anyone in the UC system 
  can submit an object for digital preservation, as long as the submitter owns 
  the copyright to the item. According to Cruse, once the items have been submitted, 
  CDL staff is responsible for checking submission errors, controlling user access, 
  and retaining deposited objects and versions in perpetuity. 
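The article says nothing about the DPR’s actual interface beyond “SOAP,” but a submission call of that general shape might be hand-rolled like this (the endpoint, namespace, and operation name are all invented):

```python
# Hand-rolled sketch of a SOAP submission of the general shape the DPR's
# Web services interface implies. The endpoint, namespace, and operation
# name are all invented; the real service's contract isn't described here.
import urllib.request

ENVELOPE = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <submitObject xmlns="urn:example:dpr">
      <title>Doctoral thesis, chapter drafts</title>
      <owner>researcher@example.edu</owner><!-- submitter must own copyright -->
    </submitObject>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    "https://dpr.example.edu/soap",             # hypothetical endpoint
    data=ENVELOPE.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8",
             "SOAPAction": "urn:example:dpr#submitObject"})
# urllib.request.urlopen(request)  # would POST the submission
```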
“Libraries need to be able to collect this information,” says Cruse. “With 
  [the DPR] and some of our other efforts, we hope to help them do it a little 
  more easily.” 
 
  | DSpace / MIT |  
  | The DSpace initiative is one of the largest digital asset management projects in history. It enables academics to place their content in a free and trusted archive, get it indexed, get it on the Web, and make it easy to find associated metadata. It also manages that content over the long term, just as physical libraries would. |  
 
 
The Mother Ship
 
Not surprisingly, the biggest innovation in digital repositories today is underway 
  at MIT. There, the DSpace project (dspace.mit.edu) 
  is one of the largest digital asset management projects in history, an effort 
  that enables academics to place their content in a free and trusted archive, 
  get it indexed, get it on the Web, and easily find associated metadata.
DSpace 
  manages that content over the long term, just as physical libraries would. The 
  DSpace system makes two identical copies of all data, catalogs metadata about 
  the data, and gives each file a unique URL or Web address, much like the VideoTools 
  system at the Harvard Business School. The address sticks for life, even if 
  the archivist later wants to migrate a given file into a newer file format. 
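Reduced to a sketch, that storage rule might look like this; the paths, the catalog, and the URL scheme below are simplified inventions, not DSpace’s actual internals:

```python
# Sketch of the storage rule above: two identical copies of every file,
# a metadata record, and a permanent address that survives later format
# migrations. Paths, the catalog, and the URL scheme are simplified.
import pathlib
import shutil
import uuid

PRIMARY, MIRROR = pathlib.Path("store_a"), pathlib.Path("store_b")
for store in (PRIMARY, MIRROR):
    store.mkdir(exist_ok=True)

catalog: dict[str, dict] = {}                   # metadata about the data

def deposit(source: pathlib.Path, title: str) -> str:
    """Store two identical copies and return the file's permanent URL."""
    item_id = uuid.uuid4().hex
    shutil.copy2(source, PRIMARY / item_id)
    shutil.copy2(source, MIRROR / item_id)      # second identical copy
    catalog[item_id] = {"title": title, "format": source.suffix}
    # The address never changes, even if the bytes are later migrated:
    return f"https://repository.example.edu/item/{item_id}"
```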
DSpace began in 2000 as a partnership between MIT and high-tech vendor Hewlett-Packard. 
  At first, the effort was intended as a “breadth-first” approach to the problem 
  of digital preservation, a program designed to leave no stone unturned. The first 
  version was an end-to-end, out-of-the-box system containing functionality for 
  the capture of digital content and associated metadata, storage management, 
  indexing, and capabilities that enabled end users to search, browse and retrieve. 
  Gradually, the two partners migrated the system to the open source community 
  development model, allowing that community to add depth of its own. MacKenzie 
  Smith, associate director for Technology at MIT Libraries, hails DSpace as an 
  effort to reverse the trend of researchers losing their hold on physical research 
  data, a growing problem in government and academia alike. 
“Our faculty members are keeping their research under their desks, on lots 
  of disks, and praying that nothing happens to it,” she says, noting that earlier 
  this decade, some MIT researchers actually may have misplaced some of their 
  early studies and communications that led to the creation of the Internet itself. 
  “We have a long way to go,” she admits. 
Today, the key to DSpace is flexibility. Because the software behind the system 
  is open source, it is available for other institutions to adapt at their own 
  leisure, and nearly 100 colleges and universities have already done so. Another 
  benefit to the effort’s open source base is that users constantly are writing 
  improvements to the code. In the last year, for instance, researchers at MIT 
  and other schools have crafted new aspects of the software, such as an auditing 
  feature that can verify whether a file has been corrupted or tampered with, 
  and a system that checks accuracy when a file is migrated into a new format. 
  These programmers continually test the DSpace code for weaknesses, making sure 
  that the system is as “hardened” and secure as it possibly can be as it continues 
  to grow. 
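The auditing idea is simple to picture in code: record a checksum at deposit time, then re-hash later. This is an illustration of the technique only, not DSpace’s actual auditing module:

```python
# The audit idea in miniature: record a checksum at deposit time, then
# re-hash later to detect corruption or tampering. Illustration only.
import hashlib
import pathlib

def fingerprint(path: pathlib.Path) -> str:
    """Hash a file's bytes so any later change is detectable."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def audit(path: pathlib.Path, recorded: str) -> bool:
    """True if the file still matches the checksum taken at deposit."""
    return fingerprint(path) == recorded

# At deposit:  checksum = fingerprint(some_file)
# At audit:    assert audit(some_file, checksum), "corrupted or altered"
```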
Down the line, Smith predicts that the biggest challenge to the DSpace effort 
  will be a legal one: convincing (and subsequently reminding) faculty to retain 
  their rights to archive material when publishing agreements come up, so that 
  schools don’t have to shell out additional money to use work that quite rightfully 
  should be theirs to access free of charge. Understandably, researchers who publish 
  seek the best deals to publish their work, and these deals frequently require 
  them to fork over rights to a publisher. But change could be imminent: The National 
  Institutes of Health (www.nih.gov) has changed its public access guidelines to 
  include free electronic access to articles that result from NIH-funded research. 
  Smith says that if researchers 
  followed this lead and changed the terms of their copyright agreements to allow 
  for a copy in DSpace, the system could grow exponentially. 
“We understand that if we can’t capture content, we won’t have anything to 
  preserve and we’ll lose the scholarly record,” she says. “The next step for 
  us is to come up with language [that] researchers can use when they go to publishers 
  and say, ‘This is what we need to protect our work for the future.’” 
About the Author

Matt Villano is senior contributing editor of this publication.