The Need for Digital Archiving Standards

Campus tour guides at Yale University are known to tell a story about Yale's Beinecke Rare Book and Manuscript Library, one of the world's great document repositories and home to a copy of the Gutenberg Bible, the first Western book printed from movable type. Standing outside the Beinecke, the guides describe a remarkable mechanism that, should the terrible need arise, would cause its glass-encased central tower and its 780,000 volumes to withdraw deep underground, tucked away from any possible threat of destruction.

It isn't true, but when visitors see the illuminated pages of the Gutenberg Bible or peruse the papers of Samuel Clemens, the idea of a subterranean vault sounds prudent.

Librarians and archivists at colleges across the country couldn't agree more. Storing important works so future scholars have access is a vital part of what many university libraries and museums do. But as print-based publishing is outpaced by an onslaught of digital material, traditional archival methods are facing challenges.

The job of digitally storing and sharing that content is increasingly complicated. The Web, as just one example, is the largest living document ever created. At four billion public pages (and another 550 billion pages accessible via the "deep Web"), it is 55 times larger than the entire contents of the Library of Congress. Only 10 years old, the Web already is a fundamental resource for students and faculty, who find they are moving from a print-based world to one saturated with ever-changing digital content. The Web adds seven million new pages every day, but on average those pages disappear in 44 days (Lyman 2002).

Fortunately, universities and research libraries are committed to keeping up. They increasingly are incorporating digital information sources into their collections and curricula, while making digital records of physical archives. Yet they are also finding that the old, universally accepted ways to catalog and access information often no longer apply. For the archivist, librarian or faculty member conducting research or teaching, archives are no longer purely physical places.

The Standards Conundrum
We can all be thankful that, through the ages, we've collectively decided what form a book should take. In Western civilizations, we know that we can find the title on a book's spine; we safely assume the contents will be printed on sheets of paper, ordered from left to right; we know how to use the college library with varying degrees of success.

Now, imagine if these little details weren't decided at all, leaving it to individual publishers to decide what a book really is. Or imagine that each college librarian created his or her own cataloging method, forcing us to completely relearn the process of locating information from one library to the next.

It would, of course, be chaos, which is precisely what standards are designed to prevent. Nowhere is this more relevant than in the creation of digital archives, whose future is dependent upon standards—community standards that define common procedures, and technology standards that uniformly enable digital storage and retrieval. Pinpointing best practices and technologies for digital archiving is a core initiative of RLG, a consortium of 160 universities, national libraries, and other institutions. "Without standards, there is really no hope for digital archives to be usable many years from now," notes Merrilee Proffitt, program officer at RLG.

The discussion of standards can get overwhelming in no time. For instance, consider the quandary of the digital archivist, who must determine the right processes and technologies to create digital records of printed works, films, audio tapes, or images. Those creating archiving structures also must define easy ways for students, with little or no training, to quickly peruse their options and locate exactly what they need—preferably without first having to open a single digital file.

These issues tie directly into questions about technology. What format that exists today will sustain digital records for 20, 50, even 100 years? Content must be easy to capture, and it must be viewable using tools that are readily available. What's more, a file created in 2003 must be viewable in 2099.

SMU Dives into Databases

Law students at SMU's Dedman School of Law are undergoing their own digital revolution, in the wake of a similar transformation among leading professional law firms. Adjunct Professor Steve Kardell, who is also a corporate governance attorney in Dallas, supplements classroom materials with content from the online LexisNexis database and Web pages captured via the Web capture feature in Acrobat 5. "It's amazing for students to see how much content is out there in addition to standard classroom material," says Kardell.

Acrobat also simplifies the process of keeping archived materials up to date. "These are living documents, so we need a way to keep them current," says Kardell, who also is evaluating requirements for a permanent SMU digital law archive. "With Acrobat, authorized editors just go into the portal from time to time and update the material. You can highlight and annotate material in ways that people don't realize."

In the transient world of computer technology, that's a tall order. For instance, there's ASCII, the only electronic document format recognized by the National Archives and Records Administration. ASCII does a fine job of recording and displaying text, so long as it isn't Asian text, which requires more than ASCII's set of 128 characters. In fact, ASCII's usefulness stops well short of many materials headed for digital archives. It can't, for instance, accurately render a Web page featuring photos and reports from the World Trade Center attacks on Sept. 11, 2001.
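The 128-character ceiling is easy to see in practice. The following is a minimal sketch in Python, not part of any archiving toolkit mentioned here; the helper name fits_in_ascii and the sample strings are assumptions chosen purely for illustration.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True only if every character is one of ASCII's 128 code points."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # At least one character falls outside the ASCII range.
        return False


# Hypothetical sample strings: one English phrase, one Japanese word.
samples = {
    "English": "Digital archive",
    "Japanese": "文書",
}

for label, text in samples.items():
    print(label, fits_in_ascii(text))
```

The English sample passes and the Japanese sample does not, which is the practical meaning of ASCII's limit for any archive that holds non-Latin text.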

HTML, the language used to display Web text, goes one better by formatting text into layouts and identifying areas for photos or links. But not all HTML is created in precisely the same way: an HTML page may appear differently on your Web browser than on mine. When researchers and students need to study an item exactly as it appeared, the limitations of HTML become apparent.

In settling on the right technologies, archivists must balance expected longevity with visual fidelity. Peter Ullmann, a key participant in standards efforts at Adobe, puts it simply: "What's going to give you the sense that you're looking at something real?"

Though lacking a universally accepted standard, many schools already are hard at work building digital archives for curricular and administrative use. Typically, they establish their own processes and select the technologies that best meet their needs. Many of these systems incorporate the MARC (MAchine Readable Cataloging) system developed in the 1960s by the Library of Congress. The MARC system provides electronic access to bibliographical information for a library's inventory, and may serve as a model for much of the "metadata"—or contextual information—that will tag digital archives of tomorrow.

But that's tomorrow. Many universities already have successfully created distinctly different digital archives with Adobe Acrobat software, a low-cost authoring tool that easily generates any document in Portable Document Format (PDF) (see "Electronic Archives at Whitman College"). PDF is a broadly accepted, open specification for final-format documents that can be viewed using freely available Acrobat Reader software. PDF retains the format of the original document or Web page, so elements like pagination, photographs, and hyperlinks remain true to the original.

Electronic Archives at Whitman College

Professors at this small, liberal arts school in southeast Washington often require their students to use materials that are not available in the public stacks at the 385,000-volume Penrose Library. "A professor might have material he's gleaned over the years that he wants his students to have access to," says Michael Quiner, director of administrative technology at Whitman. "The old approach is to make photocopies, and then students go to the Reserve Desk to read them." For greater flexibility, Whitman administrators devised eReserve, a growing digital archive of reserve materials viewable online from students' dorm rooms.

In two years, the eReserve program has archived some 400 articles as PDF files that authorized students can read using a standard Acrobat Reader. In fact, Quiner says, several professors have started their own efforts to archive class material in PDF so students can conduct research online.

To College Librarian Henry M. Yaple, the benefits extend beyond allowing a student to read documents at all hours: "It's a matter of preservation," Yaple says. "eReserve allows a lot of people to read the documents without handling the primary material." Yaple's argument rings especially true with documents like Yale's Gutenberg Bible, whose rare pages could be viewed digitally by any religious studies student without ever touching the book.

Where We're Headed
While organizations like RLG work to define digital archiving standards, certain technologies are likely to find themselves at the forefront of the debate. Their prominence suggests that they will play at least some role as standards evolve.

One of these technologies is XML (eXtensible Markup Language), which is rapidly gaining popularity across many applications. XML allows information to quickly come together from various locations to form a Web document that can be easily read, and features an advanced approach to tagging content so that its components appear in their logical order once they reach their destination. XML appears to be an excellent candidate for supplying the technical backbone of a digital indexing system. "XML schema language can provide the universal structure that allows any school to look at technical metadata," explains RLG's Proffitt.
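To make the idea of XML-tagged metadata concrete, here is a small sketch using Python's standard xml.etree.ElementTree module. The element names (record, title, creator, format, created) and the sample values are hypothetical; they do not come from RLG, MARC, or any published schema, and a real archive would follow whatever schema its community agrees on.

```python
import xml.etree.ElementTree as ET


def build_record(title: str, creator: str, file_format: str, created: str) -> str:
    """Serialize a few descriptive fields as a small XML metadata record."""
    record = ET.Element("record")  # hypothetical root element name
    fields = (
        ("title", title),
        ("creator", creator),
        ("format", file_format),
        ("created", created),
    )
    for tag, value in fields:
        ET.SubElement(record, tag).text = value
    return ET.tostring(record, encoding="unicode")


def read_field(xml_text: str, tag: str):
    """Look up one field by tag name without touching the archived file itself."""
    return ET.fromstring(xml_text).findtext(tag)


# Hypothetical sample values for illustration only.
xml_text = build_record(
    "Sample reserve article", "Example Author", "application/pdf", "2003-01-22"
)
print(xml_text)
print(read_field(xml_text, "format"))
```

Because the tags describe the record rather than its appearance, any institution that knows the tag names can read the fields, which is the kind of shared structure Proffitt describes.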

Maintaining accurate page format of the paper document, however, is not among XML's many strengths. This doesn't matter if a student is reading the text of Dr. Martin Luther King Jr.'s historic "I Have a Dream" speech. But XML is at a disadvantage when a student must view an original document.

While XML excels at transporting information, PDF excels at displaying visually rich information. PDF preserves the pagination integrity of original documents, even when they are viewed on PDAs or next-generation wireless phones. Digital archiving is a marriage of data and documents. The two must live together, and for a very long time.

Adobe's recent development around PDF recognizes this. Acrobat 5.0 exports XML along with PDF, resulting in an XML-tagged document that retains its pagination no matter how it is viewed. This would allow a journalism student to look up a story from yesterday's Los Angeles Times on her handheld. She knows the story appeared on page 16 in the print version. If the page is stored in PDF and tagged with XML, that's exactly where she'll find it on her PDA.

As new display devices become popular, these capabilities will be necessary to find and view records that were archived years before. Wharton's Kendall Whitehouse already has proven this is possible with PDF, viewing a document archived in 1995 on three different platforms: a desktop computer, a Palm OS handheld, and a Compaq iPAQ Pocket PC. When Whitehouse archived the document, those two handheld devices did not even exist (see "Wharton Students Straight to PDF").

Wharton Students Straight to PDF

In 1993, Wharton administrators began digitally preserving all school publications, catalogs, course materials, and faculty research papers. Yet the school's efforts don't stop at archiving. Some faculty members also have their students electronically submit their assignments as PDF files, which are then annotated with comments and corrections and re-posted for the student to view.

According to Kendall Whitehouse, Wharton's director of advanced technology, a successful archiving format must faithfully represent the original work without requiring complicated technical back flips. "A lot of formats depend upon the material being created in a certain way," says Whitehouse. "But Acrobat and PDF are completely agnostic. You can view the files on Windows, Mac, Unix, and handheld devices. None of these require manual altering or modification."

Archives must be easily viewable for years to come, a constraint that automatically narrows the technology field. "Our original PDF documents from 1993 have actually improved with age because the Acrobat Reader has evolved," Whitehouse says. "It's hard to list other formats that are both backward and forward compatible."

An indexing feature called e-Binding also allows educators to combine content in multiple formats (images from a photo essay, maps from an atlas, spreadsheets, and text documents) into a single PDF file. And several legacy documents can be batch-processed to create a multi-layered work that can be searched for key words or phrases. For faculty, this offers the chance to create an online "course pack" for students that is easily indexed, searchable, and updated over time.

While PDF itself has become a de facto industry standard, two industry organizations are jointly working to establish an official archiving standard based on PDF technology, called PDF/A (see "Making the Case for PDF/A"). The groups are working to see PDF/A recognized by the International Organization for Standardization (ISO) as a global standard for document archiving.

For now, digital archivists seem focused on tackling the issue of electronically documenting old and rare printed works, or capturing a Web page before it changes only a few hours later. And as immensely useful as digital archiving standards undoubtedly will be, educators point out that a ubiquitous system for higher education won't likely replace fixtures like research librarians anytime soon. "Tracking down information correctly is a tricky business, and it requires a skilled professional to do it," says Whitman College Librarian Henry Yaple. "That's what librarians do." With solid digital archiving standards, that job may become considerably easier.

Making the Case for PDF/A

A recent study estimates that the world's total production of information amounts to about 250MB (some 100,000 pages) for each man, woman, and child on earth. Printed documents make up only 0.003 percent of the total (Lyman and Varian 2000).

99.997 percent of all information is digital—and it's growing fast. Some futurists anticipate that someday the world's knowledge will double every 900 days. The Census Bureau, for example, has accumulated 600 million pages of information from the 2000 Census that it will be transferring to the National Archives and Records Administration (NARA)—equaling 10TB of data. That's more than five times the amount of data that NARA has captured and fully processed in its entire 30-year history.

Yet because of their historical value, billions of documents need to be managed, preserved, and made accessible for future generations. This daunting task requires a solution that recognizes the wide range of information systems, technologies, and formats in which records are generated.

To a growing number of industry groups and users, one solution is PDF—a broadly accepted standard for the delivery of final-format documents. More than 20 million PDF documents are publicly available on the Internet, and almost half a billion copies of the free Acrobat Reader have been downloaded. PDF retains the content, look, and feel of the document exactly as it was created, ensuring document integrity and security, while also allowing documents to be searched. In fact, some countries already have accepted PDF as an archive standard. However, PDF has evolved to provide a number of functions that, while beneficial to users who share documents, are not ideal for long-term archiving: password-based security of documents, optional (rather than required) embedding of specific fonts, the ability to embed multimedia in other formats, and the ability to launch other applications from within PDF.

Consequently, a subset of PDF—PDF/A, with the "A" standing for archive—is being developed for archiving and preserving digital documents. PDF/A—a joint initiative by the Association for Suppliers of Printing, Publishing and Converting Technologies (NPES) and the Association for Information and Image Management, International (AIIM)—will address the growing need to electronically archive documents to ensure preservation of their contents over an extended period of time. PDF/A will also ensure that those documents can be retrieved and rendered with a consistent and predictable result far into the future.

PDF/A proponents (a working group composed of industry, government, and academic institutions working with AIIM and NPES) are aiming to have PDF/A officially recognized by the International Organization for Standardization (ISO, of ISO 9000 fame) within approximately 18 months. Their efforts are directed at solving a serious and increasingly urgent problem. The lack of a recognized and accepted electronic standard for records preservation, particularly as new generations of hardware and software have made previous digital technology obsolete, has led to the loss of significant amounts of valuable information over the past several decades. Military files from the Vietnam War, records from the Viking Mars Mission, Census Bureau data, and land use records have been lost due to the inability to read data formats and the deterioration of magnetic tapes used to store that data.

The list of organizations that mandate or use PDF as a de facto standard is growing to include the U.S. Courts, the National Science Foundation for grant submission, and the Food and Drug Administration for drug submissions. A common PDF/A standard will give librarians and educators the confidence that their records can be readily accessed far into the future.

Resources

Lyman, Peter, "Archiving the World Wide Web." LOOP: AIGA Journal of Interaction Design Education, December 2002, Number 6. Retrieved from http://loop.aiga.org/content.cfm?ContentID=100 on Jan. 22, 2003.

Lyman, Peter and Varian, Hal R., "How Much Information," 2000. Retrieved from www.sims.berkeley.edu/how-much-info on Jan. 26, 2003.
