Taming the Digital Beast
Is your digital institutional repository out of control? It’s time to step back and look at contribution, access, rights, storage, and functionality—issues you don’t want to monkey with.
No one will dispute that academic institutions excel at generating
and collecting knowledge and information, but when it comes to
incorporating modern technologies, students have been farther
ahead of the curve than their institutions. Too many schools are
still mired in paper admissions processes, for instance, while their
students are actively trading MP3 files across the school’s Internet connection.
Though more gradually than their charges, schools are moving to modern
digital media as a means of archiving and accessing their vast stores of knowledge.
And campus library sciences professionals are partnering with IT to lead
the way as the data and information explosion propels the cause forward. Sharing
content has been a leading driver of the digital repository initiative,
because, simply put, unshared knowledge isn’t knowledge—it’s a secret.
“We’ve always been good at finding, selecting, acquiring, storing, and distributing
content in a variety of non-print media formats,” says Peter Deekle,
dean of University Library Services at Roger Williams University (RI). “But
we realized that much of the content we were generating was unique and
hadn’t been published in non-print formats. We thought: Shouldn’t we have a
hand in publishing this content?” In the past two to three years, the same
notion has occurred to other educators, as well.
In fact, an early adopter of digital repositories was Denison University (OH),
which set up an institutional repository using CONTENTdm from DiMeMa, software first developed at the University of Washington.
The Denison project was primarily used by the arts school to scan its large library
of images, and served as a learning experience for future projects, says Scott Siddall,
assistant provost and director of Instructional Technology at the school.
Still, the ramp-up has been slow at Denison and elsewhere because professors
haven’t been sold on the idea, he says. “It’s still a real push to get people to contribute material to a repository, based
on the sense that they will be sharing it,”
he explains. “They’ll say, ‘I have a unique
collection of such-and-such. Why do I
care if my colleague at XYZ University
wants access? Why should I take three
months to make it available?’”
The solution, he maintains, is a different
kind of carrot. “Institutions have to
say, ‘If you create a unique collection,
digitize it, put it up online, and let people
access it, that is scholarship; that is
valued, and we’re going to count it in
promotion and tenure.’ Then people will
put it on their radar screen,” says Siddall.
Geneva Henry, Digital Library Federation distinguished
fellow and executive director of the Digital
Library Initiative at Rice University
(TX), concurs. “You’ve got to have faculty
buy into it. And getting faculty to
agree to allow their publications to
reside in an institutional repository is
not easy, because publishers have them
convinced not to do it. So there’s a lot of
trust building that needs to go on with
the faculty,” she says.
Public Domain, to Start
Just what is going into the repository is
another matter. The potentially thorniest
issues, those of copyright and surrounding
concerns (who can access
what; when rights or access should
expire or content should be replaced or
updated), have been ducked somewhat,
at least for the start, by putting only public
domain materials online.
Denison, for one, has used only public
domain materials, or materials created
by Denison faculty who have given permission
or have released the works under
a Creative Commons license. Creative Commons is
a nonprofit organization founded in 2001
by Stanford University (CA) law professor
and copyright scholar Lawrence
Lessig and several colleagues. The CC
license has a number of variations, and
the goal is to provide flexible copyright
options so that creators can specify conditions
under which they will share their
rights publicly—in essence stating,
“some rights reserved.”
In Rice University’s Connexions
repository, intellectual
material is freely distributed, and can be
used by any academic in his or her
courseware. Currently, the repository
contains more than 3,000 modules covering
143 different courses. A professor
could quite literally assemble a courseware
“book” on anything from the
British Parliamentary system to digital
signal processing, simply by mining the
contributions in the Connexions database.
On the flip side, contributors can
write and submit a whole book or just a
chapter on a given topic.
This means that a single book assembled
from the Connexions database can
comprise a dozen authors. If Rice had to
deal with copyright issues for each
author, the project would grow unwieldy
and eventually would become unusable.
“If you start getting restrictive, it gets
viral, and more and more material is
locked down,” says Henry. “It totally
ruins reuse when you start [using copyrighted
materials].”
Yet, there is quite a faculty education
process that needs to be undertaken,
because authoring material designed to
be given away online is not what academics
are used to doing, Henry points out.
“The initial reaction is always negative,
especially with the humanities,” she adds.
Those in engineering and the sciences,
where information changes so rapidly,
“get it” much more quickly, she explains,
and that is because those ever-changing
fields have a harder time keeping their
textbooks up-to-date, and are in general
more willing to embrace a communally
accessible concept like Connexions.
The aim of such repositories is to share
knowledge others may not be aware of,
and most are sticking with public domain/
non-copyrighted information. Roger Williams,
for example, is sharing its many
unique resources about its namesake, the
founding father of the state of Rhode
Island, says Deekle. Other Rhode Island institutions
will also be sharing their unique information:
Brown University’s repository,
for instance, will share its resources on
public policy, and the University of
Rhode Island will share its extensive
collections in the biological sciences, a
field in which it excels. Institutional
repository projects can be broadly
focused, or they can be highly specialized,
like DialogPlus, a collaborative
project of Pennsylvania State University,
the University of Leeds (UK), UC-Santa
Barbara, and the University of
Southampton (UK). This particular project,
launched in February 2003, was
designed solely to share geological data.
The Project Team: Design, Functionality, Support
Most librarians and technologists agree
that a digital repository must be a campuswide
effort that involves administrators,
campus technologists (particularly
developers), Library Sciences administrators,
and the heads of every department.
Of course, Library Sciences must be
involved because these individuals are the
experts in cataloging information, says
Paul Fisher, director of the Teaching,
Learning and Technology Center at Seton Hall University (NJ). “Ask yourself
this question: Why are we putting
these items in a digital repository?
Answer: We want to give access to people—
and those people are probably other
academics doing research, either professors
or students. But those who best know
how to conduct research are librarians;
that’s library science. So, not having
librarians help you design the database
and point out what data you need to collect
would be a major flaw in any project,”
he says. Clearly, without Library Sciences
on the team, proper planning cannot
take place. Planning at the start of a
digital repository project will make the
difference between a repository that
grows and remains highly usable, and one
that becomes an unwieldy monster.
Says Siddall at Denison: “I see a lot
of subject-matter experts diving into a
project of this sort, without involving
the right people immediately. They start
to catalog things into a metadata schema
that’s incomplete, and end up having to
go back and redo a lot of it later.” Proper
planning and defining of metadata
will help the repository remain easily
searchable as it grows and more content
is added.
Fisher at Seton Hall agrees. “If you’re
not giving it that organization and metadata
capability, it’s just a pile of junk.”
Library sciences personnel are not the
experts on the best ways to deliver the
data, however; that’s where the technologists
come in. They are the ones who’ll
advise that the dream repository can or
can’t be executed as envisioned; they’re
also the arbiters of cost.
“You have to carry out planning and
design with a database administrator,”
says Fisher. “Forgetting that is a big
mistake people make. Librarians are
information experts, absolutely, but
they’re not database administrators.
Having the technology and information
experts at the same table is critical.” If a
database is well-designed from the start,
there should be little maintenance needed
unless something goes wrong, he
adds. “A repository should have the
capability to grow constantly, with only
one maintenance concern: ‘We’re running
out of storage.’”
But Henry at Rice believes there
should be an individual dedicated to the
task of maintaining the programming and
databases. She has a full-time programmer
dedicated to the university’s repositories.
“I would strongly recommend
someone dedicated to programming,
because you will always run into new features
you want to add,” she points out.
As to responsibility for content management
and the determination of access
parameters, those things should be left up
to the departments, all those interviewed
here agree.
“We [Library Sciences administrators]
don’t want to be the exclusive gatekeepers
with an absolute final say,” says Deekle.
“That’s why the faculty have the responsibility
to say, ‘This is really important
and must be there,’ or, ‘You’ve made this
expansively accessible and we don’t want
everybody to access it.’”
Open Source vs. Packaged Software
Another concern when building a repository
is the choice of software: packaged
or open source? Siddall says the digital
asset management software market is a $3
billion industry with almost 600 vendors
and more than 1,000 products, but as consumers
of these products, higher education
is just a “little blip” compared to
government contractors like the Department
of Defense. Not surprisingly, cost
issues are always a concern for any campus
technology effort, and when it comes
to keeping costs down, the advent of
free, open source software has been a
blessing for many schools. Because so
many open source projects have their
roots in academia, it’s also not surprising
that there are some significant open
source digital repository efforts—top
among them, DSpace, developed by MIT
and Hewlett-Packard.
DSpace is in use at 138 universities and
institutions worldwide, including at Rice.
“We looked at some commercial software
as well,” says Henry, “but we’re
very much committed to open source at
Rice, and DSpace is becoming a more
and more mature platform.” She also
likes the way DSpace is designed for
digital media in an academic environment,
and she appreciates the fact that
the management tools are structured for
a university system. “You can control
access in a number of ways and delegate
authorization to submit materials at a
number of levels,” she explains. “We
don’t have a huge staff to support these
projects, so I needed a system where I
could push those privileges down as
deep as possible into the organization.”
Thus, DSpace allows the Chemistry
department, for instance, to define what
content it will accept, in what format,
and who can access it.
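
That kind of delegation can be sketched in a few lines. This is a minimal illustration, not DSpace's actual API: the class name, field names, formats, and sample users are all hypothetical, chosen only to show a department owning its own submission and access rules.

```python
from dataclasses import dataclass, field


@dataclass
class DepartmentPolicy:
    """Hypothetical per-department policy; not a DSpace class."""
    accepted_formats: set = field(default_factory=set)
    submitters: set = field(default_factory=set)
    readers: set = field(default_factory=set)  # "*" means anyone may read

    def can_submit(self, user: str, file_format: str) -> bool:
        # A submission needs both an authorized user and an accepted format.
        return user in self.submitters and file_format in self.accepted_formats

    def can_read(self, user: str) -> bool:
        return "*" in self.readers or user in self.readers


# Each department maintains its own entry; the library does not gatekeep.
policies = {
    "chemistry": DepartmentPolicy(
        accepted_formats={"pdf", "jp2"},
        submitters={"chem_faculty_1"},
        readers={"*"},
    ),
}
```

The point of the design is that the repository staff only maintains the mapping; the decisions inside each policy belong to the department.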
Still, Roger Williams decided against
DSpace for the same reason so many
other institutions and corporations have
shunned or minimized their use of open
source: The software may be free up
front, but you can get bitten on the back
end, they complain. Says Deekle, “Open
source is great from the acquisition standpoint, but the back-end support
required to maintain some of these solutions,
the custom programming needing
programmers of different languages,
was prohibitive. We don’t have the
kinds of resources available to handle
apps like that.”
Ultimately, Roger Williams went with
a commercial package from ProQuest
Company,
which specializes in document management
software for campuses. The university
liked the software’s management
functionality. “ProQuest will let us
make access as restricted or permitted
as the submitter requests. It handles
public rights and access very well, and
allows us to let the different departments
define who can access what,”
says Deekle.
However, the choices of software
aren’t as important as the format used to
store the data. Making sure to utilize
open, widely used data formats (and lots
of metadata) is what really matters, says
Siddall. “What’s nice about standards is
that there are so many of them,” he says,
reciting the old joke. “As long as you use
standards that are international in scope,
you’ll be fine.” That means, he says,
using the Dublin Core metadata standard,
JPEG 2000 for images, and the
Adobe Portable Document
Format (PDF), among others.
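
For readers who have not seen one, a Dublin Core record is simply a set of up to 15 named elements (title, creator, rights, and so on) attached to an item. The sketch below builds such a record as XML with Python's standard library. The `record` wrapper element and the sample values are assumptions made for illustration; a real repository would use whatever container format its software expects.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements of the simple Dublin Core element set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def build_dc_record(fields):
    """Return an XML string for a dict of {element name: [values]}."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")  # hypothetical wrapper element
    for name, values in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        for value in values:  # every element is repeatable
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample item.
xml_record = build_dc_record({
    "title": ["Campus chapel, north facade"],
    "creator": ["University Archives"],
    "format": ["image/jp2"],
    "rights": ["Public domain"],
})
```

Rejecting unknown element names up front is one small way to catch the incomplete or ad hoc schemas Siddall warns about before they spread through a collection.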
You’re Set Up. Now, What Belongs in the Repository?
Deciding what to put in your repository
can be a bigger task than some may realize.
The initial temptation is to throw
everything into it, but that impulse has
to be tempered by the reality of bogging
down your database with lengthy
searches—not to mention scanning all
of the data and attaching metadata information,
the latter of which is prohibitively
time-consuming, offers Siddall.
The reality is, “You have to fan the
embers to get people to contribute
because creation of metadata takes so
much work,” he says. “Performance is an
issue we can get around. The signal-to-noise
ratio [i.e., hits vs. misses], where
people find exact matches to what they
are seeking, is critical. That’s a
quality-of-metadata issue, and that’s why upfront
planning is required.” Yet, even with
optimal design, terabytes of storage, and
fast computer systems, there is a functional
consideration: Do you take the
“library” approach, where everything is
gathered in a single, large, central repository,
or do you break up the information
by schools or departments?
Seton Hall has taken the monolithic
route, at least for now, says Fisher. “I
might be making a big mistake, but it
seems to me that part of the power is to
search all Seton Hall publications for a
keyword, or one publication for a keyword,
and I don’t know how I would do
that full-swoop search if they were separated,”
he says. “If we had them separated,
people wouldn’t be able to search
them in a single search.” It all depends
on your back end, he adds, pointing out
that Seton Hall is using IBM blade technology and an Oracle database with a campus
license. “I guess if I were building
this on an Access database I’d be really
worried, but I’m not working with itty
bitty tools,” he jokes.
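
The search Fisher describes, across every publication or within just one, is straightforward when all records live in one store and each carries a publication field. The following is a toy sketch of that idea using an in-memory list; the record layout, publication names, and sample text are hypothetical, and it says nothing about how Seton Hall's Oracle-backed system is actually built.

```python
# Hypothetical records held in one central store.
RECORDS = [
    {"publication": "Law Review", "title": "Copyright on Campus",
     "text": "Faculty copyright and repository licensing."},
    {"publication": "Alumni Magazine", "title": "Library Goes Digital",
     "text": "The library opens a digital repository."},
    {"publication": "Law Review", "title": "Tenure Standards",
     "text": "How promotion committees weigh scholarship."},
]


def search(records, keyword, publication=None):
    """Return titles whose text contains keyword.

    With publication=None the search spans every publication;
    otherwise it is limited to the named one.
    """
    needle = keyword.lower()
    return [
        r["title"]
        for r in records
        if needle in r["text"].lower()
        and (publication is None or r["publication"] == publication)
    ]


everywhere = search(RECORDS, "repository")
law_only = search(RECORDS, "repository", publication="Law Review")
```

Because the publication is just another field on each record, the same query serves both cases, which is the convenience a single repository buys.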
Management of the data is an important
issue, however, because if academics
scan in information, they want all of it
readily available. Lesser-used
data can’t be relegated to slower
servers or to magnetic tape; it must be
as readily accessible as the most popular
searches, says Joe Pangborn, CIO at
Roger Williams.
“In an ideal world, data would be
placed based on access statistics,” he
says. Some institutions practice lifecycle
management: Rarely used content is
archived, while frequently used material
remains on the fastest servers for quick
access. “But the culture here doesn’t permit
that,” says Pangborn. “Our faculty
and staff want to have access to all their
info at their fingertips at any time.” Like
Seton Hall, Roger Williams has designed
its repository as a single system, where all
of the content can be searched from a
single point of entry.
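
The difference between the two storage policies can be shown with a small sketch. It assumes a simple rule, a fixed access-count threshold, along with made-up collection names and counts; real lifecycle management tools use richer criteria, so treat this only as an illustration of the trade-off Pangborn describes.

```python
def assign_tiers(access_counts, hot_threshold):
    """Map each item to "fast" or "archive" by how often it was accessed."""
    return {
        item: "fast" if count >= hot_threshold else "archive"
        for item, count in access_counts.items()
    }


# Hypothetical monthly access counts per collection.
counts = {"campus-photos": 120, "1974-yearbook": 2, "thesis-archive": 15}

# Lifecycle management: rarely used content moves to archive storage.
lifecycle = assign_tiers(counts, hot_threshold=10)

# The all-at-your-fingertips policy: a threshold of 0 keeps everything fast.
everything_fast = assign_tiers(counts, hot_threshold=0)
```

Setting the threshold to zero reproduces the policy the faculty culture at Roger Williams demands, at the cost of keeping every item on the fastest storage.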
Looking Forward
As more and more schools move to
digital repositories, it seems inevitable
that the knowledge accumulated across
universities—yours, as well—will go
digital. How smoothly your institution
makes that transition will depend on
how well your project is planned from
inception, and how well you structure
your data.
All of the administrators and educators
we spoke with here say that they expect
their future needs will only involve
adding more disk space. Proper design
and planning carried out by a combination
of the faculty (the providers of the
content), technologists (builders of the
repository), and librarians (the cataloging
experts), have created repositories
that should be able to grow—without
growing out of control.