Taming the Digital Beast

Is your digital institutional repository out of control? It’s time to step back and look at contribution, access, rights, storage, and functionality—issues you don’t want to monkey with.

No one will dispute that academic institutions excel at generating and collecting knowledge and information, but when it comes to incorporating modern technologies, students have been farther ahead of the curve than their institutions. Too many schools are still mired in paper admissions processes, for instance, while their students are actively trading MP3 files across the school’s Internet connection.

Though more gradually than their charges, schools are moving to modern digital media as a means of archiving and accessing their vast stores of knowledge. And campus library sciences professionals are partnering with IT to lead the way as the data and information explosion propels the cause forward. Sharing content has been a leading driver of the digital repository initiative, because, simply put, unshared knowledge isn’t knowledge—it’s a secret.

“We’ve always been good at finding, selecting, acquiring, storing, and distributing content in a variety of non-print media formats,” says Peter Deekle, dean of University Library Services at Roger Williams University (RI). “But we realized that much of the content we were generating was unique and hadn’t been published in non-print formats. We thought: Shouldn’t we have a hand in publishing this content?” In the past two to three years, the same notion has occurred to other educators, as well.

In fact, an early adopter of digital repositories was Denison University (OH), which set up an institutional repository using CONTENTdm from DiMeMa, software first developed at the University of Washington. The Denison project was primarily used by the arts school to scan its large library of images, and served as a learning experience for future projects, says Scott Siddall, assistant provost and director of Instructional Technology at the school. Still, the ramp-up has been slow at Denison and elsewhere because professors haven’t been sold on the idea, he says. “It’s still a real push to get people to contribute material to a repository, based on the sense that they will be sharing it,” he explains. “They’ll say, ‘I have a unique collection of such-and-such. Why do I care if my colleague at XYZ University wants access? Why should I take three months to make it available?’”

The solution, he maintains, is a different kind of carrot. “Institutions have to say, ‘If you create a unique collection, digitize it, put it up online, and let people access it, that is scholarship; that is valued, and we’re going to count it in promotion and tenure.’ Then people will put it on their radar screen,” says Siddall.

Geneva Henry, Digital Library Federation distinguished fellow and executive director of the Digital Library Initiative at Rice University (TX), concurs. “You’ve got to have faculty buy into it. And getting faculty to agree to allow their publications to reside in an institutional repository is not easy, because publishers have them convinced not to do it. So there’s a lot of trust building that needs to go on with the faculty,” she says.

Public Domain, to Start

Just what is going into the repository is another matter. The potentially thorniest issues, those of copyright and surrounding concerns (who can access what; when rights or access should expire or content should be replaced or updated), have been ducked somewhat, at least for the start, by putting only public domain materials online.

Denison, for one, has used only public domain materials, or materials created by Denison faculty who have given permission or have released the works under a Creative Commons license. Creative Commons is a nonprofit organization founded in 2001 by Stanford University (CA) law professor and copyright scholar Lawrence Lessig and several colleagues. The CC license has a number of variations, and the goal is to provide flexible copyright options so that creators can specify conditions under which they will share their rights publicly—in essence stating, “some rights reserved.”

In Rice University’s Connexions repository, intellectual material is freely distributed, and can be used by any academic in his or her courseware. Currently, the repository contains more than 3,000 modules covering 143 different courses. A professor could quite literally assemble a courseware “book” on anything from the British Parliamentary system to digital signal processing, simply by mining the contributions in the Connexions database. On the flip side, contributors can write and submit a whole book or just a chapter on a given topic.

This means that a single book assembled from the Connexions database can comprise a dozen authors. If Rice had to deal with copyright issues for each author, the project would grow unwieldy and eventually would become unusable. “If you start getting restrictive, it gets viral, and more and more material is locked down,” says Henry. “It totally ruins reuse when you start [using copyrighted materials].”

Yet, there is quite a faculty education process that needs to be undertaken, because authoring material designed to be given away online is not what academics are used to doing, Henry points out. “The initial reaction is always negative, especially with the humanities,” she adds. Those in engineering and the sciences, where information changes so rapidly, “get it” much more quickly, she explains, and that is because those ever-changing fields have a harder time keeping their textbooks up-to-date, and are in general more willing to embrace a communally accessible concept like Connexions.

The aim of such repositories is to share knowledge others may not be aware of, and most are sticking with public domain or non-copyrighted information. According to Deekle at Roger Williams, that university, for example, is sharing its many unique resources about its namesake, the founding father of the state of Rhode Island. Other Rhode Island institutions will also be sharing their unique information: Brown University’s repository, for instance, will share its resources on public policy, and the University of Rhode Island will share its extensive collections in the biological sciences, a field in which it excels. Institutional repository projects can be broadly focused, or they can be highly specialized, like DialogPlus, a collaborative project of Pennsylvania State University, the University of Leeds (UK), UC-Santa Barbara, and the University of Southampton (UK). This particular project, launched in February 2003, was designed solely to share geological data.

The Project Team: Design, Functionality, Support

Most librarians and technologists agree that a digital repository must be a campuswide effort that involves administrators, campus technologists (particularly developers), Library Sciences administrators, and the heads of every department.

Of course, Library Sciences must be involved because these individuals are the experts in cataloging information, says Paul Fisher, director of the Teaching, Learning and Technology Center at Seton Hall University (NJ). “Ask yourself this question: Why are we putting these items in a digital repository? Answer: We want to give access to people, and those people are probably other academics doing research, either professors or students. But those who best know how to conduct research are librarians; that’s library science. So, not having librarians help you design the database and point out what data you need to collect would be a major flaw in any project,” he says. Clearly, without Library Sciences on the team, proper planning cannot take place. Planning at the start of a digital repository project will make the difference between a repository that grows and remains highly usable, and one that becomes an unwieldy monster.

Says Siddall at Denison: “I see a lot of subject-matter experts diving into a project of this sort, without involving the right people immediately. They start to catalog things into a metadata schema that’s incomplete, and end up having to go back and redo a lot of it later.” Proper planning and defining of metadata will help the repository remain easily searchable as it grows and more content is added.

Fisher at Seton Hall agrees. “If you’re not giving it that organization and metadata capability, it’s just a pile of junk.” Library sciences personnel are not the experts on the best ways to deliver the data, however; that’s where the technologists come in. They are the ones who’ll advise that the dream repository can or can’t be executed as envisioned; they’re also the arbiters of cost.

“You have to carry out planning and design with a database administrator,” says Fisher. “Forgetting that is a big mistake people make. Librarians are information experts, absolutely, but they’re not database administrators. Having the technology and information experts at the same table is critical.” If a database is well-designed from the start, there should be little maintenance needed unless something goes wrong, he adds. “A repository should have the capability to grow constantly, with only one maintenance concern: ‘We’re running out of storage.’”

But Henry at Rice believes there should be an individual dedicated to the task of maintaining the programming and databases. She has a full-time programmer dedicated to the university’s repositories. “I would strongly recommend someone dedicated to programming, because you will always run into new features you want to add,” she points out.

As to responsibility for content management and the determination of access parameters, those things should be left up to the departments, all those interviewed here agree.

“We [Library Sciences administrators] don’t want to be the exclusive gatekeepers with an absolute final say,” says Deekle. “That’s why the faculty have the responsibility to say, ‘This is really important and must be there,’ or, ‘You’ve made this expansively accessible and we don’t want everybody to access it.’”

Open Source vs. Packaged Software

Another concern when building a repository is the choice of software: packaged or open source? Siddall says the digital asset management software market is a $3 billion industry with almost 600 vendors and more than 1,000 products, but as consumers of these products, higher education is just a “little blip” compared to government contractors like the Department of Defense. Not surprisingly, cost issues are always a concern for any campus technology effort, and when it comes to keeping costs down, the advent of free, open source software has been a blessing for many schools. Because so many open source projects have their roots in academia, it’s also not surprising that there are some significant open source digital repository efforts—top among them, DSpace, developed by MIT and Hewlett-Packard. DSpace is in use at 138 universities and institutions worldwide, including at Rice.

“We looked at some commercial software as well,” says Henry, “but we’re very much committed to open source at Rice, and DSpace is becoming a more and more mature platform.” She also likes the way DSpace is designed with digital media for an academic environment, and she appreciates the fact that the management tools are structured for a university system. “You can control access in a number of ways and delegate authorization to submit materials at a number of levels,” she explains. “We don’t have a huge staff to support these projects, so I needed a system where I could push those privileges down as deep as possible into the organization.” Thus, DSpace allows the Chemistry department, for instance, to define what content it will accept, in what format, and who can access it.

Still, Roger Williams decided against DSpace for the same reason so many other institutions and corporations have shunned or minimized their use of open source: The software may be free up front, but you can get bitten on the back end, they complain. Says Deekle, “Open source is great from the acquisition standpoint, but the back-end support required to maintain some of these solutions (the custom programming needing programmers of different languages) was prohibitive. We don’t have the kinds of resources available to handle apps like that.”

Ultimately, Roger Williams went with a commercial package from ProQuest Company, which specializes in document management software for campuses. The university liked the software’s management functionality. “ProQuest will let us make access as restricted or permitted as the submitter requests. It handles public rights and access very well, and allows us to let the different departments define who can access what,” says Deekle.

However, the choices of software aren’t as important as the format used to store the data. Making sure to utilize open, widely used data formats (and lots of metadata) is what really matters, says Siddall. “What’s nice about standards is that there are so many of them,” he says, reciting the old joke. “As long as you use standards that are international in scope, you’ll be fine.” That means, he says, using the Dublin Core metadata standard, JPEG 2000 for images, and the Adobe Portable Document Format (PDF), among others.
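To make Siddall's point about metadata standards concrete, here is a minimal sketch of what a Dublin Core description of a single repository item might look like when built in Python. The namespace URI and the 15 element names come from the Dublin Core element set; the `build_dc_record` helper, the `record` wrapper element, and the sample item are hypothetical and are not drawn from any of the repositories discussed here.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def build_dc_record(fields):
    """Serialize a dict of Dublin Core fields to an XML string.

    `fields` maps an element name to one string or a list of strings.
    The wrapping <record> element is an illustrative assumption, not
    part of the Dublin Core standard itself.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")
    for name, values in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical item: a scanned image in a campus art collection.
sample = {
    "title": "Example scanned slide",
    "creator": "Example Faculty Member",
    "subject": ["art history", "slides"],
    "format": "image/jp2",
    "rights": "Public domain",
}
print(build_dc_record(sample))
```

Rejecting unknown element names up front is one way to catch the incomplete or ad hoc schemas Siddall warns about before they spread through a collection.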

You’re Set Up. Now, What Belongs in the Repository?

Deciding what to put in your repository can be a bigger task than some may realize. The initial temptation is to throw everything into it, but that impulse has to be tempered by the reality of bogging down your database with lengthy searches—not to mention scanning all of the data and attaching metadata information, the latter of which is prohibitively time-consuming, offers Siddall.

The reality is, “You have to fan the embers to get people to contribute because creation of metadata takes so much work,” he says. “Performance is an issue we can get around. The signal-to-noise ratio [i.e., hits vs. misses], where people find exact matches to what they are seeking, is critical. That’s a quality-of-metadata issue, and that’s why upfront planning is required.” Yet, even with optimal design, terabytes of storage, and fast computer systems, there is a functional consideration: Do you take the “library” approach, where everything is gathered in a single, large, central repository, or do you break up the information by schools or departments?

Seton Hall has taken the monolithic route, at least for now, says Fisher. “I might be making a big mistake, but it seems to me that part of the power is to search all Seton Hall publications for a keyword, or one publication for a keyword, and I don’t know how I would do that full-swoop search if they were separated,” he says. “If we had them separated, people wouldn’t be able to search them in a single search.” It all depends on your back end, he adds, pointing out that Seton Hall is using IBM blade technology and an Oracle database with a campus license. “I guess if I were building this on an Access database I’d be really worried, but I’m not working with itty bitty tools,” he jokes.

Management of the data is an important issue, however, because if academics scan in information, they want it, all of it, readily available. Lesser-used data can’t be relegated to slower servers or to magnetic tape; it must be as readily accessible as the most popular searches, says Joe Pangborn, CIO at Roger Williams.

“In an ideal world, data would be placed based on access statistics,” he says. Some institutions practice lifecycle management: Rarely used content is archived, while frequently used material remains on the fastest servers for quick access. “But the culture here doesn’t permit that,” says Pangborn. “Our faculty and staff want to have access to all their info at their fingertips at any time.” Like Seton Hall, Roger Williams has designed its repository as a single system, where all of the content can be searched from a single point of entry.
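The lifecycle management approach Pangborn describes, and that Roger Williams chose not to adopt, could be sketched roughly as follows. This is only an illustration of placing content by access statistics: the `assign_tiers` function, the tier names, and the threshold of 10 accesses are all assumptions, not details from any institution quoted here.

```python
def assign_tiers(access_counts, hot_threshold=10):
    """Map each item to a storage tier based on how often it is accessed.

    `access_counts` maps an item identifier to its access count over some
    reporting period. Items at or above `hot_threshold` stay on fast
    storage; everything else is marked for the archive tier. Both the
    threshold and the tier labels are hypothetical.
    """
    return {
        item: "fast" if count >= hot_threshold else "archive"
        for item, count in access_counts.items()
    }


# Hypothetical access statistics for three repository items.
counts = {"lecture-notes-01": 250, "slide-scan-17": 3, "thesis-2004": 10}
for item, tier in assign_tiers(counts).items():
    print(item, tier)
```

A campus with the culture Pangborn describes would in effect set the threshold to zero, so that every item stays on fast storage.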

Looking Forward

As more and more schools move to digital repositories, it seems inevitable that the knowledge accumulated across universities—yours, as well—will go digital. How smoothly your institution makes that transition will depend on how well your project is planned from inception, and how well you structure your data.

All of the administrators and educators we spoke with here say that they expect their future needs will only involve adding more disk space. Proper design and planning carried out by a combination of the faculty (the providers of the content), technologists (builders of the repository), and librarians (the cataloging experts), have created repositories that should be able to grow—without growing out of control.
