Infectious Adoption

The beauty of the Internet is in the quantity of data that can be found on it. The bane of the Internet is that the vast majority isn’t what you want. Search tools have made it easier for the user to find information, and they have been remarkably successful, at least if what you’re looking for are basic products or services. Today’s leading search tool has become so successful that it has reached the status of a verb in colloquial speech (“Hey, let’s ‘Google’ that and see what we find...”). Google makes finding things on the Web fast and efficient.

Of course, we know that the search robots, including Google’s, can only discover a fraction of what is out on the net. The Internet’s just too big. Further, much of what is on the Web is hidden from search engines: reaching it requires specific knowledge about how to find and interpret resources stored in local digital repositories. Different repositories use different methods to expose their contents.

Trying to pull everything together in one place may have been reasonable once, but not any more. Instead, it makes sense to offer various ways of giving some degree of structure to the content on the Web, particularly if the meaning and value of the content are defined by specific communities of interest.

Hiding Resources from Search Engines

Search engines like Google or AltaVista won’t pick up information about digital repositories because their robots don’t implement the necessary interfaces, such as CGI, that bring the resource information to the Web. These resources are accessible only to those who know to look in specific digital library repositories, generally those at collaborating universities who are developing collections and have agreed upon a particular interface by which the data in these repositories can be exposed.

Making information about useful things accessible and getting it delivered to the communities that can really use it can be a circular challenge. It requires considerable effort to describe things so they can be found, additional effort to build the infrastructure to conduct the searches in useful ways across data stores, and still more effort to make all of this easy to implement. User demand drives adoption, but getting to the point that users have something to use requires demand. Lowering the barrier to getting tools into the users’ hands is the rock on which great ideas founder.

There are good technologies that have been in place for quite some time to help find information on the Web, but their very complexity has limited their value. Take Z39.50, for example: a protocol that allows records describing library holdings to be searched, and the results communicated back to the user, even when the data are in a system other than the one from which she is initiating the search. Libraries throughout higher education have largely adopted this technology, making their online catalogs accessible to queries from users at other schools with different library systems. It works because the library automation vendor community has built this capability into online public access catalog products.

As good as library holdings are, however, they are just a fraction of what is of potential interest out on the Internet. Hence, the benefit of something like Z39.50 (or its next-generation version) is restricted. Building a low-threshold (easy-to-adopt) mechanism to get search interoperability would bring to the surface the intellectual richness that is out there, making it possible to mine the “deep Web.”


“Data describing context, content, and structure of records and their management through time”; that is, metadata describes properties of semi-structured data records. [From ISO 15489-1:2001, Information and Documentation - Records Management - Part 1: General]

Modeling Viral Infections As a Distribution System

A novel approach to the problem has been launched, modeled after viral infections. First you make something that is targeted for a host that is widely distributed in the population; then you make it incredibly easy to transmit it symbiotically, that is, in a way that does nothing to detract from the host’s general health. It just adds a capability that some will find helpful (a positive selective advantage). If all goes well, you’ve jumpstarted dispersal of a new feature that will have staying power because it is low-cost and high-value.

Apache and Mod_OAI

What is the most widely used Web server on the Internet today? If you answered Apache, you’re right. Nearly 64 percent of Web sites worldwide are served up using Apache. There’s the target host. One of the features of Apache that makes it so attractive is its ability to easily deploy new functionality through simple-to-install modules. These modules make it possible for Apache to take a request written for a scripting language, say Perl, and execute it. The mod_perl module has added extraordinary flexibility to Apache Web servers. There are dozens of modules that add functionality to Apache’s basic Web server (see Apache Foundation for examples). There will soon be one more.

According to its mission statement, “The Open Archives Initiative (OAI) develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.” Efficient dissemination of content is what we need, so this group is definitely addressing the problem. One of the tools that it has developed to help achieve this goal is the OAI Protocol for Metadata Harvesting (OAI-PMH). OAI-PMH defines a mechanism for harvesting records containing metadata from digital repositories. It gives data providers a simple technical option for making their metadata available to services, based on the open standards HTTP (Hypertext Transfer Protocol) and XML (Extensible Markup Language). The metadata that is harvested may be in any format that is agreed upon by a community. [The unqualified Dublin Core metadata standard is specified to provide a basic level of interoperability.] Thus, metadata from many sources can be gathered together in one database, and services can be provided based on this centrally harvested or “aggregated” data. Resources harvested by OAI-PMH are the kinds of in-depth technical, research, and community-specific academic information that search bots like Google’s pass by.
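To make the harvesting idea concrete, here is a minimal Python sketch of what an OAI-PMH exchange looks like: an ordinary HTTP GET URL carrying a `verb` argument, and an XML response holding unqualified Dublin Core records. The repository address, the sample record, and the helper names `list_records_url` and `parse_records` are assumptions made up for illustration; the sketch parses an embedded sample response rather than contacting a real server, and it ignores details a real harvester would need, such as resumption tokens and error responses.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "http://repository.example.edu/oai"

# XML namespaces used by OAI-PMH 2.0 responses carrying unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up sample response; a real harvester would fetch this over HTTP.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.edu:1</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample Research Report</dc:title>
          <dc:creator>Example, Author</dc:creator>
          <dc:type>Text</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the HTTP GET URL for an OAI-PMH ListRecords request."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Return one dict per harvested record: its identifier plus Dublin Core fields."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        entry = {
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS
            )
        }
        # Each Dublin Core element may repeat, so collect values in lists.
        for element in rec.iterfind(".//oai_dc:dc/*", NS):
            field = element.tag.split("}", 1)[1]  # drop the "{namespace}" prefix
            entry.setdefault(field, []).append(element.text or "")
        records.append(entry)
    return records


if __name__ == "__main__":
    print(list_records_url(BASE_URL))
    for record in parse_records(SAMPLE_RESPONSE):
        print(record)
```

Because the request is a plain URL and the response is plain XML, a service provider needs nothing more exotic than an HTTP client and an XML parser to harvest from any compliant repository.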

Tying It All Together: Low-Threshold Digital Searching of the Deep Web

With Apache servers as the host, the viral package to distribute is a high-performance federated digital search service based on OAI-PMH, implemented as an Apache module, mod_OAI. Large amounts of data stored in digital repositories can then be found by students or faculty from their Web browsers, with minimal implementation overhead. What might this enable? From the browser of your choice you might be able to find articles, research reports, images of paintings from the Renaissance, or sound files of Woody Guthrie’s original performances collected from the federated digital repositories of libraries around the Internet. All of this depends upon the continued hard work of creating the metadata that describes these resources, which librarians and information professionals are doing every day. They need more support for their valuable efforts. Making what has already been done accessible and useful reinforces the value of this work and helps deep resources surface.
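As a rough illustration of the federated search idea, the Python sketch below pools metadata records harvested from several repositories into one index and answers keyword queries against it. Everything here is an assumption for illustration: the repository names, the sample records, and the functions `build_index` and `search` are hypothetical and do not describe how mod_OAI or any real search service is built.

```python
from collections import defaultdict

# Made-up harvested records, keyed by hypothetical repository name.
HARVESTED = {
    "library-a.example.edu": [
        {"identifier": "oai:library-a.example.edu:17",
         "title": "Renaissance Painting Images"},
    ],
    "library-b.example.edu": [
        {"identifier": "oai:library-b.example.edu:4",
         "title": "Folk Song Field Recordings"},
        {"identifier": "oai:library-b.example.edu:9",
         "title": "Research Report on Folk Music"},
    ],
}


def build_index(harvested):
    """Map each lowercase title word to the (repository, identifier) pairs that contain it."""
    index = defaultdict(set)
    for repository, records in harvested.items():
        for record in records:
            for word in record["title"].lower().split():
                index[word].add((repository, record["identifier"]))
    return index


def search(index, query):
    """Return sorted hits whose titles contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


if __name__ == "__main__":
    index = build_index(HARVESTED)
    for hit in search(index, "folk"):
        print(hit)
```

The point of the design is that the searcher never needs to know which repository holds a record: once the metadata has been aggregated, one query reaches all of them.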

Using Apache modules as the distribution mechanism lowers the adoption threshold for this technology. It’s another example of how creative ideas, supported by insightful leadership from funding agencies, in this case the Andrew W. Mellon Foundation, are helping to shape the future of the Web. It’s a brilliant strategy, elegant in its simplicity and powerful in its potential. The ratio of signal to noise on the Net has just gotten higher.


Apache Foundation. Last accessed 5-29-04.

Dublin Core Metadata Initiative (DCMI), Using Dublin Core, 2001. Last accessed 5-28-04.

Metadata Resources Guide. Last accessed 5-23-04.

Mod_OAI: Getting OAI-PMH for free. Last accessed 5-27-04.

Open Archives Initiative (OAI), 2002.