Digital Libraries

Google Book Search: The Good, the Bad, & the Ugly

1/1/2008

Processing Book Files

Image compression is no small issue in mass digitization projects, says Mark McKinney, VP of business development for LuraTech, a company that produces compression software. Besides consuming storage, the files produced from scanned books need to be delivered across the web with no perceptible delay. LuraTech powers the work done by the Open Content Alliance, which uses the JPEG 2000 format for compression; JPEG 2000 is a powerful long-term archival format that reduces a large color file to about a hundredth of its original size, says McKinney. In that effort, he says, workers run the process from digitization stations called "Scribes" that photograph a page, color-correct the image, and then "OCR" it (apply optical character recognition) so it becomes searchable. Once the operator has captured all the individual pages, metadata is added to the book through a user interface. But the metadata (title, author, copyright, description, etc.) isn't necessarily added via human effort.
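To put McKinney's "about a hundredth" figure in concrete terms, here is a rough back-of-the-envelope sketch. The page count and the uncompressed page size below are illustrative assumptions, not figures from LuraTech or the Open Content Alliance, and the function name is hypothetical.

```python
def book_storage_mb(pages: int, raw_page_mb: float, ratio: float = 100.0) -> float:
    """Estimate the compressed size of one scanned book, in megabytes.

    pages       -- number of scanned page images (assumed value)
    raw_page_mb -- uncompressed size of one color page image in MB (assumed value)
    ratio       -- compression ratio; 100 reflects the "about a hundredth"
                   figure quoted in the article
    """
    return pages * raw_page_mb / ratio


if __name__ == "__main__":
    # Hypothetical book: 300 pages at 25 MB per uncompressed color scan.
    raw_total = 300 * 25.0
    compressed = book_storage_mb(300, 25.0)
    print(f"uncompressed: {raw_total:.0f} MB, compressed: {compressed:.0f} MB")
```

Under those assumptions, a book that would occupy several gigabytes as raw scans shrinks to tens of megabytes, which is the difference that matters when pages must be served over the web.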

According to Brian Tingle, a technical leader for the Digital Special Collections (part of UC's CDL), much of the metadata is already cataloged as part of the online public access catalog (OPAC), known in the pre-digital era as the card catalog. Tingle's team works with a metadata object format, a standard for encoding and transmission of digital objects. "Those objects get turned into those formats and that's how we ingest them into our system," he explains. It's a different level of metadata that enables the linking together of objects, such as the pages of a book.

The automation of data capture is certainly something in which a former NASA AI expert like Google's Clancy would excel. "If you look on our book reference page, you'll find related works identified; books with some relationship to the book you're looking at," he points out. "Or you'll find something we call 'Popular Passages,' where we've extracted passages that are seminal or popular and mentioned in a number of different books. We use that as a way to link some of these books together." Achieving those connections, he says, is a programming job. "It may not be perfect, but this is 100 percent how we've done it. We don't have people picking out related books; we use lots of different signals. We just don't talk about which signals we use."
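Google does not disclose how it builds these links, so the following is only a minimal sketch of one generic way to find passages shared between books: break each text into overlapping runs of words and keep the runs that turn up in more than one book. The function names, the five-word window, and the sample texts are all assumptions made for illustration; this is not Google's method.

```python
import re
from collections import defaultdict


def shingles(text: str, n: int = 5) -> set[str]:
    """Return the set of overlapping n-word sequences in a text (lowercased)."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_passages(books: dict[str, str], n: int = 5, min_books: int = 2) -> dict[str, set[str]]:
    """Map each n-word passage to the titles containing it,
    keeping only passages found in at least `min_books` books."""
    seen: dict[str, set[str]] = defaultdict(set)
    for title, text in books.items():
        for passage in shingles(text, n):
            seen[passage].add(title)
    return {p: titles for p, titles in seen.items() if len(titles) >= min_books}


if __name__ == "__main__":
    # Hypothetical sample texts standing in for OCR output.
    books = {
        "A": "it was the best of times it was the worst of times",
        "B": "he wrote that it was the best of times indeed",
        "C": "nothing in common here at all today",
    }
    for passage, titles in sorted(shared_passages(books).items()):
        print(passage, "->", sorted(titles))
```

A real system would need to cope with OCR errors, near-duplicate wording, and vastly larger collections, which is presumably where the undisclosed signals come in.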

Beyond Automation

Not surprisingly, when the wizard behind the curtain is Google, the same kind of secrecy applies to search. Where UC's Tingle is highly forthcoming about the search product his team has developed at CDL (the eXtensible Text Framework, or XTF, which is based on Lucene, an Apache open source search engine), Google's Clancy prefers to focus on search outcomes. "We're all familiar with how search works on the web," says Clancy. "You type in a keyword phrase and suddenly it seems to find just the document you want; people create link structures that relate two things together. Well, as soon as you do that [with a book], you're giving us more information about that book. Eventually, you can imagine people linking to books from their web pages and other things. A book should be like the web: People should link directly into the book when it's relevant to them."


