Digital Repositories

Worth the Work

Even in the age of Google, digital repositories can add tremendous value to an institution. Yet creating and maintaining these collections is no small task. Three repository directors share their lessons learned and why they believe these vast online resources are…

In an era when Google ranks almost daily as the most-visited website in the world, the question has to arise for any higher education institution: Why create digital repositories?

Here’s an exercise to help answer that question: Go to Google and type in “Lewis and Clark.” Results (as of press time) number over 7 million. Add the words “and expedition” to your original search and you get about 1.2 million results. Google’s advanced search options will let you further narrow your results, such as limiting to those posted in English or within the last year. If you add these last two filters, your new Google search results for “Lewis and Clark and expedition” total a mere 653,000.

Now, go to encore.unl.edu and enter “Lewis and Clark” into the search field of Encore, the front end to the University of Nebraska-Lincoln’s digital collections. You’ll get about 2,150 records. But you’ll also get an interface (designed by library technology vendor Innovative Interfaces) that the university employs on top of its repository. You now have the option of using 10 additional (and graphically intuitive) menus to help limit your search, including format (electronic, print, or map), publication date, and a tag cloud with words that have been associated with your search terms. Searching Encore, then, for English-language resources on “Lewis and Clark” and the “Lewis and Clark expedition” tag brings up a manageable 229 results. And if you’re interested only in those published in 2004 (the bicentennial year of the expedition), you have a very targeted 23 items to explore.

But beyond offering a manageable number of results, a UNL Encore search yields items that have been vetted by historians, librarians, or other experts as having academic relevance. On the Google search, a student or faculty researcher would have to wade through genealogy and stock-photo sites alongside Smithsonian resources.

This exercise points out a key advantage of an institutional repository, says Dee Ann Allison, director of computer operations and research services for the UNL libraries. The metadata (those extra details attached to each item in the collection that work as search filters) can consist of whatever criteria the university deems important, making the information far more relevant to academic research than a Google search that is built on a hierarchy of user popularity.

That metadata has also presented one of the biggest challenges UNL has experienced in developing its repository: data consistency. In addition to standard library titles and web-based materials, the repository features many historically significant items housed by different university departments, as well as works authored by faculty and graduate students. That means that “we have a lot of partners working with us, each doing his own cataloging,” says Allison. As an example, she points to the people working the university’s tractor museum site. “These individuals aren’t necessarily metadata experts—they’re tractor experts. They don’t know metadata. They aren’t always as consistent as a cataloger would be in entering the data. So we have to look at the data and do some standardization of that.”
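The kind of standardization Allison describes can be pictured with a small sketch. Everything in it is an assumption made for illustration: the field names, the controlled vocabulary, and the sample record are hypothetical and do not come from UNL's actual cataloging workflow.

```python
import re

# Hypothetical controlled vocabulary: free-text labels a partner might
# type, mapped to the single value the repository wants to store.
FORMAT_VOCAB = {
    "photo": "image",
    "photograph": "image",
    "pdf": "text",
    "article": "text",
    "map": "map",
}


def normalize_record(record: dict) -> dict:
    """Return a cleaned copy of one partner-supplied record."""
    # Collapse stray whitespace in the title.
    title = " ".join(record.get("title", "").split())

    # Pull the first four-digit year (1500-2099) out of a free-text date.
    match = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", record.get("date", ""))
    year = match.group(1) if match else None

    # Map the format label onto the controlled vocabulary.
    raw_format = record.get("format", "").strip().lower()
    fmt = FORMAT_VOCAB.get(raw_format, "unknown")

    return {"title": title, "year": year, "format": fmt}


if __name__ == "__main__":
    sample = {"title": "  Farmall   Model H ", "date": "circa 1941", "format": "Photograph"}
    print(normalize_record(sample))
```

Records that come back with a format of "unknown" or no year would still need a cataloger's attention; the script only catches the routine inconsistencies.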

In addition, Allison admits that “we’re spending a lot of effort, resources, time, and, to some extent, budget creating these [repositories].” Why? “Because we don’t want this university information to be hidden,” she says. An institutional digital repository, she maintains, “is good publicity for what the university is producing. In this time when everyone is trying to get the most for their money, it’s a way of promoting what we’re doing as a university.”

Charles Bailey Jr., publisher of the popular online site Digital Scholarship, agrees. A repository that hosts faculty materials such as published articles, research, and other academic output is particularly important for a university, he says. “It increases the prestige of your institution if it’s seen as being a leader in scholarship. A repository makes it apparent what your institution is accomplishing.”

A similar justification drives the effort behind the New Media video repository at the Universidad Francisco Marroquín in Guatemala City, which hosts the university’s distance learning program and runs a publicly accessible video podcast site. “We support our professors, researchers, and students in the use, creation, and management of digital resources that complement their academic work,” the New Media website declares. But, adds Rebeca Zuniga, director of New Media at UFM, the value goes beyond prestige for this major university in a country that has seen its share of coups and civil war and doesn’t have many publicly available library collections. “The mission of this repository is to disseminate the idea of freedom for the university.”

Digital repository advocates will concede, however, that the challenges in building and maintaining these collections can daunt even the most intrepid supporters. CT talked to the directors of three different types of collections—cross-institutional, stand-alone university, and departmental—to learn how they manage the biggest challenges posed by the digital repositories that they believe add continuing value to their institutions.

Sorting out Storage

Repositories tend to be storage hogs. But storage systems that in the past might have been cobbled together with scattered hardware have given way to more sophisticated schemes with built-in tools to scale capacity and ensure redundancy.

For example, the 200 terabytes of data that make up the HathiTrust project (a repository of digital materials from a number of large research universities) are stored in their entirety in two separate locations, the main site at the University of Michigan and a mirrored site at the Indianapolis campus of Indiana University. Users don’t know which site they’re actually hitting when they do a search. If a problem surfaces at one site, the system will automatically fail over to the other site, so there’s no downtime.

The Repositories Profiled

HathiTrust began in 2007 when the Committee on Institutional Cooperation (a consortium of the Big Ten research universities along with the University of Chicago)—and later the University of California and the University of Virginia—decided to create a shared repository consisting of digital content the universities were receiving through an agreement with Google and its digital book project, along with their own locally digitized collections. hathitrust.org

The Universidad Francisco Marroquín in Guatemala hosts a media project that preserves videos of lectures, forums, discussions, and interviews, many of them featuring world leaders who come to the university to speak. Currently, the New Media repository has 1,259 videos covering economics, medicine, dentistry, ecology, business administration, engineering, and psychology. newmedia.ufm.edu

The University of Nebraska-Lincoln has developed an institutional repository that includes 200,000 records from 79 diverse collections: among them, source materials from famous alumna Willa Cather, as well as the Lewis and Clark journals and a Walt Whitman collection; slides of art and architecture from those departments on campus; musical performance recordings from the school of music; and PDF files of articles published by faculty and dissertations by students. encore.unl.edu

HathiTrust uses Isilon Systems’ clustered storage solution which, according to Jeremy York, HathiTrust project librarian, has integrity checking, good management tools, extraordinarily good read/write performance, powerful tools for syncing storage across distances, and excellent expansion capabilities. “It’s easy to tack on another node and you have many more terabytes of data,” York explains.

To put up the mirrored site at IU, HathiTrust chose to physically truck storage servers down to Indianapolis rather than tackle that replication process online. After all, York says, the latter approach was estimated to take between two and three weeks. The drive was faster.

At UNL, two parts of the school’s digital repository are stored locally—the harvested collections and a collection powered by CONTENTdm—and administrators do a daily backup. Copies of the backups are sent off site for disaster recovery. If a server goes down, Allison says, the university has procedures to bring it up, typically within 48 hours.

However, in-house hosting isn’t the only form of storage used by the university. Its DigitalCommons repository—which includes 11,100 PhD dissertations going back to 1902, along with 28,000 faculty- and student-generated documents—is stored in the cloud and managed by Berkeley Electronic Press (BePress). “We have never had any downtime with them, so if they had a problem, we’ve never noticed any disruption in service,” Allison says.

The cloud is tempting the New Media department at UFM in Guatemala. Officials there are pondering moving the department’s 5 terabyte video repository and streaming operations to Wowza Media Systems, hosted on the Amazon Elastic Compute Cloud. Offsite hosting would offer several advantages to New Media: UFM technicians wouldn’t have to fuss over visitor bandwidth restrictions since Wowza could sort out the streaming aspects, and the department would be able to avoid adding servers as the repository grows. Furthermore, Zuniga adds, the hosted solution would be “faster and more reliable.” The delay in making the move at this point is economic. “It’ll be a little bit expensive, so we’re waiting,” she says.

Grappling With Copyright

Digital repositories face a complex assortment of challenges related to licensing and rights management. For example, the HathiTrust project has integrated copyright analysis through scripts and routines that automatically determine whether a given item should be opened for a specific user. “It’s very conservative,” claims York. The system uses filters such as publication date, publisher, and publisher location to answer the basic question: “Is this in the public domain or not?” He explains, “If it’s published in the United States before 1923, it’s in the public domain. No problem. If it’s published outside the US before 1870, that’s no problem. But there’s this gray area for people accessing works published outside the US where they’re under different copyright laws in different countries. We have a status called PDUS—public domain in the US. Those volumes for somebody accessing [a work] from France or Morocco would be a limited view because we don’t have the resources to determine if it’s in copyright for that country.”
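The date-and-place rules York outlines can be sketched in a few lines. This is only an illustration under stated assumptions, not HathiTrust's actual scripts: the function names and status labels other than PDUS are hypothetical, the publisher filter is left out, and the sketch assumes that PDUS covers works published outside the US from 1870 up to 1923, a range the article implies but does not spell out.

```python
from enum import Enum


class Access(Enum):
    FULL_VIEW = "full view"
    LIMITED_VIEW = "limited view"


def rights_status(pub_year: int, published_in_us: bool) -> str:
    """Classify a volume using only publication year and place."""
    if published_in_us and pub_year < 1923:
        return "PD"  # US work before 1923: treated as public domain
    if not published_in_us and pub_year < 1870:
        return "PD"  # non-US work before 1870: treated as public domain
    if not published_in_us and pub_year < 1923:
        return "PDUS"  # assumption: the gray area York describes
    return "RESTRICTED"  # hypothetical label for everything else


def access_for(status: str, user_in_us: bool) -> Access:
    """Decide what a given user may see, defaulting to the limited view."""
    if status == "PD":
        return Access.FULL_VIEW
    if status == "PDUS" and user_in_us:
        return Access.FULL_VIEW
    return Access.LIMITED_VIEW


if __name__ == "__main__":
    status = rights_status(1900, published_in_us=False)
    print(status, access_for(status, user_in_us=False))
```

The conservative design shows in the last line of each function: anything the rules cannot positively clear falls through to the restricted case.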

Keeping Up With Video Formats

Since the New Media center at the Universidad Francisco Marroquín (Guatemala) began in 2001, formats for streaming its lecture videos online have changed. What the university began with, Windows Media Video (WMV), evolved into RealNetworks’ RealVideo, then to Adobe Flash Video (FLV), and most recently to the ISO-standard MP4. Each time, the migration was motivated by a desire to expand accessibility for New Media’s worldwide base of visitors, many of whom lack fast internet connectivity.

But the center doesn’t just offer the most recent recordings in the new format; it also has worked to convert its 5 terabytes worth of existing recordings. According to Rebeca Zuniga, director of New Media at UFM, the team is currently experimenting with MediaCoder, a free open source media transcoder. But, she says, “It’s very boring work. I hope we can find something better to do it with, but so far we have to do it step by step.”

New Media has learned that keeping a video clip’s metadata separate from the video itself allows for easier migration of the content. That metadata is described in XML, which means in the future it can be migrated to any new video players, browsers, or platforms. This protects the university’s intellectual investment in compiling those materials while still enabling it to move its video recordings onto newer formats.
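A minimal sketch of that separation might look like the following. The element names, file names, and sample values are hypothetical; the article says only that the metadata is described in XML, not what New Media's schema contains.

```python
import xml.etree.ElementTree as ET


def build_sidecar(title: str, speaker: str, recorded: str, media_files: dict) -> ET.Element:
    """Build a metadata record that points at video files instead of embedding them."""
    root = ET.Element("video")
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "speaker").text = speaker
    ET.SubElement(root, "recorded").text = recorded
    renditions = ET.SubElement(root, "renditions")
    for fmt, path in media_files.items():
        ET.SubElement(renditions, "file", format=fmt).text = path
    return root


def add_rendition(root: ET.Element, fmt: str, path: str) -> None:
    """Register a newly transcoded file; the descriptive fields stay untouched."""
    renditions = root.find("renditions")
    ET.SubElement(renditions, "file", format=fmt).text = path


if __name__ == "__main__":
    record = build_sidecar(
        "Sample lecture", "Sample speaker", "2005-03-01", {"flv": "lecture.flv"}
    )
    # After a format migration, only a new pointer is added.
    add_rendition(record, "mp4", "lecture.mp4")
    print(ET.tostring(record, encoding="unicode"))
```

Because the title, speaker, and date live outside the video file, a move to a new format touches only the list of renditions.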

At UNL, the copyright concerns are more focused on whether somebody is affiliated with the institution or is a public visitor. Some collections can be viewed in digital form only by somebody who can be authenticated through a university proxy server when he or she tries to access the resource. A public visitor would get just a limited view. The rules, according to Allison, are set up on a collection-by-collection basis.

UFM’s New Media department has each professor or speaker sign a release form before a taping begins, establishing whether the speaker has rights to images used in lectures. Even so, the content analyst in charge of preparing a lecture for inclusion in the digital repository often will have to determine if the lecturer has rights to reproduce particular images. Depending on the situation, the file may end up including the audio with the video, but not the slides, or the presentation will have specific slides removed.

Repositories that duplicate materials published by faculty in journals have another copyright challenge: making sure they have the right to republish the content. “The problem is that each of these publishers has its own policy that governs how the faculty member can do this,” explains Digital Scholarship’s Bailey. “Some say you can’t do this at all. Others say you can archive what’s called the pre-print—before it’s refereed. In the next level you can self-archive the article in the version that’s the final draft.”

So how does a faculty member with many years of publishing sort out the various stipulations for his or her content? There are a couple of solutions recommended by Bailey. Sherpa, a UK-based organization, offers RoMEO, a free service that analyzes publisher policies and provides a color-coding system for each type of permission, to speed up the work of determining republishing rights. Second, he notes that libraries have begun adding positions on their staffs dedicated to helping faculty sort out their archiving needs. That’s the case at UNL, where one person with a part-time helper handles the rights job, according to Allison.

People Challenges

If the point of a digital repository is to compile the articles and research published by faculty members, a common challenge is figuring out how to motivate those individuals to get their materials entered into the repository. After all, says Bailey, “Faculty are busy people with lots of competing demands on their time—research, teaching, publications, public service, committee work.” Short of having the faculty senate mandate participation, are there softer approaches to encouraging participation?

Duplicating Efforts

As a cross-institutional repository of content from research libraries across the US, HathiTrust wrestles with unique duplication problems, says project librarian Jeremy York. First and foremost: Google, which as part of an agreement feeds most of the repository content through its digital book project, is particular about what is being duplicated. “There are institutions sending volumes to Google, and [Google] will reject them if it has already scanned the material from another institution,” York explains. Then there’s the problem of ingesting different versions of the same volume. “Maybe there’s a locally scanned version that wasn’t done as part of a mass digitization project, that’s of extremely high resolution and very good quality. Which one do we keep, or do we keep them all?” ponders York. “How do we highlight them for users, if there’s one with beautiful color pictures and one without?”

Addressing this challenge is where a strong governance model comes into play. HathiTrust has an executive board made up of CIOs from its core institutions. This board manages the budget and finances and makes all major decisions. A strategic advisory board makes policy recommendations to that executive board, assembling working groups to tackle specific projects such as questions of content duplication.

“At times there are disagreements, but it always comes back to: What is our core mission and what are we here for?” York explains. He says that the advisory board measures the working groups’ recommendations against HathiTrust’s greater goals. In the case of deciding how to handle variations in quality, for example, York points out that “the core mission is to preserve intellectual content. This is not special collections scanning. It’s not important to maintain the ‘artifactual’ value of books in a lot of cases—the tint of the page or something like that. So, if we can derive the formats later, we’ll take the smaller package.”

One persuasive argument is simply that the repository increases the visibility of faculty articles. “Tenure decisions are driven by measures of citation usage,” Bailey notes. “So you are a faculty member, and you publish in [an] incredibly obscure journal that’s very expensive and increasingly unaffordable to many institutional libraries. How many people are going to read your article?” When that article is placed in the repository, it will not only be available within the institution, but it could also show up in public search engines such as Google and Bing, as well as specialized search engines such as OAIster, a database of 23 million digital repository records. “Your research becomes much more visible to the world and to specialists throughout the world,” Bailey points out. “That greatly increases the likelihood that you’re going to be cited.”

Another challenge of working with people to build digital repositories is simple communication. For instance, institutions involved in HathiTrust are scattered across the country. Most discussions take place by conference call. Early in the planning process, participants gathered by videoconference, says York, “but there were so many people in the conference rooms, everybody was very small [on screen] and only a few people ended up speaking out.” The problem with that, he explains, is that it’s hard to get a sense of people and build the relationships that are necessary to get things going in a repository project.

So HathiTrust held a two-day gathering at the University of Michigan during the summer of 2009. As a result, “We went leagues forward,” York reports, noting without any irony that “it made such a difference to have the people there and to be able to talk out the issues without the intervening technology.”

Resources

Amazon Elastic Compute Cloud: aws.amazon.com/ec2

The Berkeley Electronic Press: bepress.com/ir

CONTENTdm: contentdm.org

Digital Scholarship: digital-scholarship.org

Innovative Interfaces: iii.com

Isilon Systems: isilon.com

MediaCoder: mediacoderhq.com

OAIster: oaister.worldcat.org/advancedsearch

Sherpa RoMEO: www.sherpa.ac.uk/romeo

Wowza Media Systems: wowzamedia.com
