
CT Visionary :: Michael Keller

DIGITAL REPOSITORIES: A GLOBAL WORK EFFORT

Stanford's Keller weighs in on petabyte-scale digital object storage systems.

Stanford University (CA) Librarian Michael Keller was among the leading digital archiving experts who headed to Paris this past November for the inaugural meeting of the Sun Preservation and Archiving Special Interest Group (Sun PASIG), a Sun Microsystems-sponsored community dedicated to the unique problems of storage and data management, workflow, and architecture for very large digital repositories. Sun PASIG brings together a large group of organizations for an ongoing global discussion of their research and for sharing best practices for preservation and archiving. Here, CT asks Keller for his perspectives on the effort and on Sun PASIG's overall goals.

What sparked your professional interest in the work of Sun PASIG? More than 10 years ago, we in the library profession began to realize that we had to take responsibility for preserving, both for the long term and for access, the digital objects that were coming to us in ever-larger waves and flows from varying sources. Over those 10 years, many developments took place and many projects launched, but none of them were particularly large-scale, at least among the ones anybody can talk about. We know that the government and secret agencies are doing a lot of large-scale gathering, but we don't know whether they are preserving anything. So we need software and hardware technology that can [work well] across very complex hardware arrays, but can also ingest a very wide variety of data formats and what we might call "digital genres."

What is your own institution's perspective on that need? At Stanford, we recognized about five or six years ago that the university was producing various kinds of digital information on the order of 40 terabytes per year, and consuming information at about the same rate. Of course, that number has only increased in the intervening five years. Within that period, Stanford also signed on for the Google Book Search project, which, if our original ambitions are realized, will initially yield something on the order of a petabyte and a half of digital information: an initial database at Stanford of the books sent forward for Google to digitize. And that would be just the first copy of the material, before we do anything to it. So, with those kinds of numbers floating around, we realized that we had to have a comprehensive solution to the problem of preserving bits and bytes, the problem of keeping access copies of those files [for redundancy], and the problem of ingesting at a very, very high rate in order to get the digital goods into the digital repository.

Starting four years ago, we acquired a big tape robot and some spinning-disk technology intended to help us understand how to manage the huge flow of digital objects that we need to ingest and preserve. We found that what was missing was a truly effective spinning-disk technology. We'd experimented with a few of them, and frankly, most had points of failure that revealed themselves in operation. But Honeycomb [a storage technology recently introduced by Sun Microsystems], which we've tested very extensively, subjecting it to the same stress tests and the same experiences as the other technologies, has proven to be quite robust. In fact, it was when Sun came up with the Honeycomb technology, which we started beta testing a little more than a year and a half ago, that we realized we had the last piece of what I hope will be the first generation of hardware architecture able to handle very, very large digital archives.

Is it true that the Sun Honeycomb technology combines the storage disk array with compute functions? Yes, Honeycomb has CPUs that can run programs within the array. From my perspective, that's the beginning of creating interoperable information objects: the storage array can compute on the objects it holds. I'm not sure how far that goes with this version of Honeycomb, but it seems clear to me that's where it's heading.
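To make the idea of computing on stored objects concrete, here is a minimal, hypothetical sketch; it is not Sun's Honeycomb API, just an illustration of a storage node that can run a function, such as a fixity check, directly against the objects it holds rather than shipping them to an external client.

```python
# Hypothetical sketch of "compute where the data lives": a storage node
# exposes an operation that runs caller-supplied code against stored
# objects inside the node. Not Honeycomb's actual interface.
import hashlib

class StorageNode:
    def __init__(self):
        self.objects = {}  # object_id -> bytes

    def put(self, object_id, data):
        self.objects[object_id] = data

    def compute(self, object_id, fn):
        """Run fn on the stored bytes inside the node and return the result."""
        return fn(self.objects[object_id])

node = StorageNode()
node.put("scan-0001.tif", b"...digitized page image bytes...")

# The fixity check runs on the node; only the small digest travels back.
digest = node.compute("scan-0001.tif", lambda b: hashlib.sha256(b).hexdigest())
print(digest)
```

The design point is that for archive-scale collections, moving a small computation to the data is far cheaper than moving petabytes of data to the computation.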

And what about Honeycomb's approach to redundancy? That's an important matter. Think of Honeycomb as an array of 32 spinning disks. The firmware that runs those disks takes files in and distributes several copies of each file across the disks, so that if one disk fails, others still contain the digital object you put in. It's an instantaneous, self-managing redundant storage solution: it handles the redundancy without us having to manage it.
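The placement behavior Keller describes can be illustrated with a simple sketch, assuming a fixed replica count and random placement on distinct disks; the firmware's actual algorithm is not described in the interview, so this only shows why a single disk failure does not lose any object.

```python
# Illustrative replica placement across an array of disks (assumed policy:
# 3 copies per object, each on a different disk). Losing any single disk
# still leaves surviving copies of every object on other disks.
import random

NUM_DISKS = 32
REPLICAS = 3

def place(object_id, num_disks=NUM_DISKS, replicas=REPLICAS):
    """Choose distinct disks for an object's copies (hypothetical policy)."""
    rng = random.Random(object_id)  # deterministic per object for the demo
    return rng.sample(range(num_disks), replicas)

placements = {f"obj-{i}": place(f"obj-{i}") for i in range(5)}
failed_disk = 7
for obj, disks in placements.items():
    survivors = [d for d in disks if d != failed_disk]
    print(obj, "copies on disks", disks, "-> after disk 7 fails:", survivors)
```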

Even so, do you need more than one approach to protect such vast amounts of important data? We know that we have to have a combination of magnetic disk technology, near-line tape storage, and offline tape storage, all of which we will have to manage carefully for the very long term, until different technologies become available to us, or become more robust and more appropriate to the various missions we have set for ourselves. At Stanford, we're doing near-line, online, and offline storage to protect us over the long haul. We're also looking for one or two storage partners outside North America, so that we can support one another against catastrophic failure in a kind of warm failover arrangement.

Finally, from a broader perspective, what has the Sun PASIG set out to accomplish? This past June, at an initial meeting [to lay the groundwork for] the preservation and archiving special interest group, we had a dozen institutions in attendance, including the British Library, Oxford University (UK), the Bibliothèque Nationale de France, The Johns Hopkins University (MD), and the National Library of Sweden. The goal for that gathering was to serve as a kind of instant peer review group, and the meeting focused mainly on business drivers and architecture. The [full, inaugural meeting this past November] was much larger in terms of institutions and people, and its concerns expanded to workflows, policy, and use cases. We still wanted to spend time on architecture specifications, design specifications, and software and hardware choices, but our intent was to broaden the conversation, because there are serious issues around what will certainly become very, very large digital archives.

Previous developments like DSpace and Fedora give us evidence that if the initiating institutions work hard and produce experiences to discuss in our professional publications and meetings, then in the next generation we may reach a point where institutions without the same kind of IT prowess, without great IT support in the form of programmers, database analysts, systems administrators, and managers, may be able to run equally large digital archives without needing that initial big investment.

:: WEB EXTRAS ::
More on Sun's Honeycomb technology. Exclusive interview with Michele Kimpton, executive director of the DSpace Foundation.
