The beauty of the Internet is in the quantity of data that can be found on
it. The bane of the Internet is that the vast majority isn’t what you
want. Search tools have made it easier for the user to find information, and
they have been remarkably successful, at least if what you’re looking
for are basic products or services. Today’s leading search tool has become
so successful it has reached the status of a verb in colloquial speech (“Hey,
let’s ‘Google’ that and see what we find...”). Google
makes finding things on the Web fast and efficient.
Of course, we know that the search robots, including Google’s, can only
discover a fraction of what is out on the net. The Internet’s just too
big. Further, much of what is on the Web is hidden from search engines: finding
it requires specific knowledge about how to locate and interpret resources stored
in local digital repositories. Different repositories use different methods
to expose their contents.
Trying to pull everything together in one place may have been reasonable once,
but not any more. Instead, it makes sense to offer various ways of giving some
degree of structure to the content on the Web, particularly if the meaning and
value the content provides are defined by specific communities of interest.
Hiding Resources from Search Engines
Search engines like Google or AltaVista won’t pick up information about
digital repositories because their robots don’t query the interfaces,
such as CGI programs, that bring the resource information to the Web. These
resources are accessible only to those who know to look in specific digital
library repositories, generally those at collaborating universities that are
developing collections and have agreed upon a particular interface by which
the data in these repositories can be exposed.
Making information about useful things accessible and getting it delivered
to the communities that can really use it can be a circular challenge. It requires
considerable effort to describe things so they can be found, additional effort
to build the infrastructure to conduct the searches in useful ways across data
stores, and still more effort to make all of this easy to implement. User demand
drives adoption, but getting to the point that users have something to use requires
demand. Lowering the barrier to getting tools into the users’ hands is
the rock on which great ideas founder.
There are good technologies that have been in place for quite some time to
help find information on the Web, but their very complexity has limited their
value. Take Z39.50, for example, a protocol that allows records describing
library holdings to be searched and the results communicated back to
the user, even when the data are in a system other than the one from which she
is initiating the search. Libraries throughout higher education have largely
adopted this technology, making their online catalogs accessible to queries from
users at other schools with different library systems. It works because the
library automation vendor community has built this capability into online public
access catalog products.
As good as library holdings are, however, they are just a fraction of what
is of potential interest out on the Internet. Hence, the benefit of something
like Z39.50 (or its next-generation version) is restricted. Building
a low-threshold (easy-to-adopt) mechanism to get search interoperability would
bring to the surface the intellectual richness that is out there, making it
possible to mine the “deep Web.”
Modeling Viral Infections As a Distribution System
A novel approach to the problem has been launched, modeled after viral infections—first
you make something that is targeted for a host that is widely distributed in
the population; then you make it incredibly easy to transmit it symbiotically,
that is, in a way that does nothing to detract from the host’s general
health. It just adds a capability that some will find helpful (a positive selective
advantage). If all goes well, you’ve jumpstarted dispersal of a new feature
that will have staying power because it is low-cost and high-value.
Apache and Mod_OAI
What is the most widely used Web server on the Internet today? If you answered
Apache, you’re right. Nearly 64 percent of Web sites worldwide are served
up using Apache. There’s the target host. One of the features of Apache
that makes it so attractive is its ability to easily deploy new functionality
through simple-to-install modules. These modules make it possible for Apache
to take a request written for a scripting language, say Perl, and execute it.
The mod_perl module has added extraordinary flexibility to Apache Web servers.
There are dozens of modules that add functionality to Apache’s basic Web
server (see http://httpd.apache.org/docs-2.0/mod/
for examples). There will soon be one more.
According to its mission statement, “The Open Archives Initiative (OAI)
develops and promotes interoperability standards that aim to facilitate the
efficient dissemination of content.” Efficient dissemination of content
is what we need, so this group is definitely addressing the problem. One of
the tools that it has developed to help achieve this goal is the OAI-Protocol
for Metadata Harvesting (OAI-PMH). OAI-PMH defines a mechanism for harvesting
records containing metadata from digital repositories. The OAI-PMH gives a simple
technical option for data providers to make their metadata available to services,
based on the open standards HTTP (HyperText Transfer Protocol) and XML (Extensible
Markup Language). The metadata that is harvested may be in any format that is
agreed upon by a community. [The unqualified Dublin Core metadata standard is
specified to provide a basic level of interoperability.] Thus, metadata from
many sources can be gathered together in one database, and services can be provided
based on this centrally harvested or “aggregated” data. Resources harvested
by OAI-PMH are the kinds of in-depth technical, research, and community-specific
academic information that search bots like Google’s pass by.
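
To make the harvesting idea concrete, here is a minimal Python sketch of what a harvester does: build a ListRecords request and pull Dublin Core fields out of the XML that comes back. The verb, the metadataPrefix parameter, and the XML namespaces follow the OAI-PMH 2.0 specification. The repository address, the function names, and the sample response are assumptions made up for illustration, and nothing is actually fetched over the network.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListRecords request; an OAI-PMH request is an ordinary HTTP GET."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # harvest only records changed since this date
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract identifier, title, and creator from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind("oai:ListRecords/oai:record", NS):
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        if dc is None:
            continue  # e.g. a deleted record that carries no metadata
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": dc.findtext("dc:title", namespaces=NS),
            "creator": dc.findtext("dc:creator", namespaces=NS),
        })
    return records


# Hypothetical response, trimmed to the elements the parser reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.edu:1001</identifier>
        <datestamp>2004-01-15</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample field recording</dc:title>
          <dc:creator>Example, Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    # The base URL is a placeholder, not a real repository.
    print(list_records_url("http://repository.example.edu/oai"))
    for record in parse_records(SAMPLE_RESPONSE):
        print(record)
```

A real harvester would also follow the resumptionToken that a repository returns when a result list is too long for one response; that step is left out here to keep the sketch short.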
Tying It All Together:
Low-Threshold Digital Searching of the Deep Web
With Apache servers as the host, the viral package to distribute is the high-performance
federated digital search service based on OAI-PMH implemented as an Apache module,
mod_OAI. Large amounts of data stored in digital repositories can then be found
by students or faculty from their Web browsers, with minimal implementation
overhead. What might this enable? From the browser of your choice you might
be able to find articles, research reports, images of paintings from the Renaissance,
or sound files of Woody Guthrie’s original performances collected from
the federated digital repositories of libraries around the Internet. All of
this depends upon the continued hard work of creating the metadata that describes
these resources, which librarians and information professionals are doing every
day. They need more support for their valuable efforts. Making what has already
been done accessible and useful reinforces the value of this work and helps
deep resources surface.
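
As a rough illustration of what a search service over harvested metadata might do, the Python sketch below pools records from two repositories into one keyword index and answers a query against it. It is not how mod_OAI or any particular service works; the record fields, the repository names, and the sample records are all hypothetical.

```python
from collections import defaultdict


def build_index(records):
    """Map each lowercased word in a record's title or creator to record positions."""
    index = defaultdict(set)
    for position, record in enumerate(records):
        text = record["title"] + " " + record["creator"]
        for word in text.lower().split():
            index[word.strip(".,;:")].add(position)
    return index


def search(index, records, query):
    """Return the records that contain every word of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return [records[i] for i in sorted(hits)]


# Hypothetical metadata, as if already harvested from two separate repositories.
repository_a = [
    {"source": "library-a", "title": "Renaissance painting study", "creator": "Example, A."},
]
repository_b = [
    {"source": "library-b", "title": "Field recording of folk songs", "creator": "Example, B."},
    {"source": "library-b", "title": "Research report on folk music", "creator": "Example, C."},
]

aggregated = repository_a + repository_b  # the centrally "aggregated" data
index = build_index(aggregated)

if __name__ == "__main__":
    for record in search(index, aggregated, "folk songs"):
        print(record["source"], "-", record["title"])
```

The point of the sketch is only that, once the metadata sits in one place, a single query can reach material held in many repositories.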
Using Apache modules as the distribution mechanism lowers the adoption threshold
for this technology. It’s another example of how creative ideas, supported
by insightful leadership from funding agencies, in this case the Andrew W. Mellon
Foundation, are helping to shape the future of the Web. It’s a brilliant strategy,
elegant in its simplicity, and powerful in its potential. The ratio of signal
to noise on the Net has just gotten higher.