Google Book Search: The Good, the Bad, & the Ugly -- Campus Technology

Google Book Search: The Good, the Bad, & the Ugly

By Dian Schaffhauser
01/01/08

Yes, Google is opening up whole new worlds for internet surfers and researchers everywhere, even before the model is ready.

FORGET EVERYTHING YOU BELIEVE about Google's book digitization project. Once you get past the freakishly high numbers bandied about, the two-dozen-plus distinguished institutions that have signed on, the legal paranoia, and the ultra-ultra-secret processes and technologies involved, you'll find that Book Search (from the fifth most valuable company in America) is simply another high-cost effort that is simultaneously visionary and crude. It doesn't even have to succeed in order to influence the transformation of scholarly activity.

Here's the magic: Type "sonoma" and "mission" into books.google.com and choose "Full view" to eliminate those books that haven't granted permission to be fully displayed or that are still in copyright because they were published post-1923. About 550 titles show up, almost all of which you can view in text format or as a PDF file. Perhaps the oldest reference that will appear is a volume titled An Overland Journey Round the World During the Years 1841 and 1842 by Sir George Simpson, governor-in-chief of The Hudson's Bay Company's territories. Google digitized the 1847 volume from the collection of the New York Public Library, as the bar code on the cover shows (along with a small portion of what looks to be a human arm, probably belonging to the person scanning that particular title).

As a reader, you might consider the discovery of this long-lost tome a modern-day miracle, akin to stumbling on the bones of a previously unknown dinosaur while digging in your garden. And even though you never have to leave your keyboard to read the contents, you could click a link on the page, enter your ZIP code, and find the nearest library that has the book in its collection, in case you're the kind of person who likes to touch actual pages and take in the perfume of old book stock.

Somehow (although the details are mostly sketchy to those outside the company), thousands of books are working their way through the project every business day to join the millions of other publications already included in Book Search. But let's take a look at how Google is working with one of its partners, the University of California system, to keep the process humming along.

Google Book Search: Typical User View

GOOGLE BOOK SEARCH, which is still in beta after several years of testing, offers the ubiquitous Google search box on its home page. It also has categories of books as well as book cover images that refresh every time the home page is refreshed.

Once the user searches for a book and pulls up its record, one of two screens appears, depending on whether the book is under copyright or not. Copyrighted books display a limited number of pages ("snippets"), including the cover and back cover, the table of contents, the index, and some content pages. For Steven Levitt and Stephen Dubner's Freakonomics (William Morrow, 2006), for instance, Google shares 29 pages of the 256-page tome. But users can pull up a full-screen view of any of the 29 pages, write a review of the book, add it to their online libraries, view the table of contents and some popular passages, search the contents (only page numbers may show up), click through to other editions of the same work, click to sources where you can buy the book or find it in a library (a link that connects you to WorldCat), and read paid sponsor links.

For books no longer under copyright (current criteria: those works published earlier than 1923, such as an edition of Adam Smith's The Wealth of Nations published in 1895 [T. Nelson and Sons], or books that the publishers have granted full viewing rights to), the same features exist, though not all include paid sponsor links. For some books, users can view a PDF edition (with download size shown) or view a plain-text version of each page.

When read in full-text mode, the non-copyrighted books allow Book Search users to view a single page or two facing pages at once, and to make an annotation to be included in Google Blogger or Google Notebook. Perhaps more helpfully, users can copy a link provided by Google and forward it to others. When used correctly, that function can take another reader to the same page and to a specific clip captured by the original reader, with a dashed line marking off the content. Some books even include a Google Maps mashup showing "Places mentioned in this book."

Quality vs. Quantity?

The UC system consists of more than 100 libraries spread across 10 campuses around the state, containing more than 34 million volumes, which inhabit 3.6 million square feet of library building space. According to the UC-Berkeley library website, in North America the holdings of the state university system are surpassed in scale only by the Library of Congress. Just under a third of the collection is housed in two regional facilities: one located on the campus of UCLA, serving the southern campuses, and the other in an industrial area of Richmond outside of San Francisco, serving the northern schools. This makes the digital effort easier, since neither Google nor the UC librarians need to scurry from campus to campus to obtain books to scan. Yet it also suggests the possibility that Book Search is actually a project all about numbers over quality, an implication that neither UC nor Google denies.

In fact, when UC signed its contract with Google in July 2006, UC agreed to provide no less than 2.5 million volumes to the digitization effort over the course of the agreement's six-year period. That's just under 420,000 books a year, or less than a tenth of the annual circulation of materials throughout the UC library system, which circulated 4.7 million items in the 2005-2006 academic year. According to the contract, after an initial ramp-up of a couple of months of delivering 600 books per day, UC was obligated to crank up delivery to 3,000 per day. And that, says Robin Chandler, former director of data acquisitions for UC's California Digital Library (CDL), is exactly what UC's goal has been. (This month, Chandler is moving to a digital library position at the University of California-San Diego.)
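The contract arithmetic can be checked in a few lines. The sketch below is only a back-of-the-envelope calculation using the figures quoted above; the variable names are illustrative, and the final figure rests on an assumption not stated in the contract details given here: that the ramp-up period is ignored and delivery holds at the full rate on every delivery day.

```python
import math

# Figures quoted in the article
CONTRACT_VOLUMES = 2_500_000      # minimum volumes UC agreed to provide
CONTRACT_YEARS = 6                # length of the agreement
ANNUAL_CIRCULATION = 4_700_000    # items circulated in the 2005-2006 academic year
FULL_RATE_PER_DAY = 3_000         # books per day after the ramp-up

volumes_per_year = CONTRACT_VOLUMES / CONTRACT_YEARS
share_of_circulation = volumes_per_year / ANNUAL_CIRCULATION

# Assumption for illustration: ignore the ramp-up and deliver at the full rate each day
days_at_full_rate = math.ceil(CONTRACT_VOLUMES / FULL_RATE_PER_DAY)

print(f"Volumes per year: {volumes_per_year:,.0f}")
print(f"Share of annual circulation: {share_of_circulation:.1%}")
print(f"Delivery days needed at full rate: {days_at_full_rate}")
```

Under those assumptions, the commitment works out to roughly 417,000 volumes a year, a little under 9 percent of annual circulation, and about 834 delivery days at the full daily rate.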

Chandler, who held the CDL role for seven years, worked with a multitude of libraries inside and outside the UC system to help guide their digitization efforts. Those efforts include Calisphere, a public gateway to 150,000 digitized items (including diaries, photographs, political cartoons, and other cultural artifacts representing the history and culture of California), as well as the Online Archive of California, which brings together historical materials from a number of California institutions including museums, historical societies, and archives.

Do Authors Want to Be Digitized?

GREG SCHULZ, FOUNDER OF AND SENIOR ANALYST for The StorageIO Group (and interviewed for this article), is also the author of the book Resilient Storage Networks (Digital Press, 2004). It doesn't bother him in the least that Google Book Search might scan his entire book and make pieces of it available.

"I'm fine with that," he declares. "If it allows my work to be more widely known so that people buy the book or engage in other related services, I'm all for it. I'll gladly give up some book sales if it leads to something else."

The sticking point? When Google or other book-search projects "start leveraging the work, or doing things with it. Then it gets into another dimension," says Schulz, pointing to the recent Writers Guild strike. "At the center of that is: How do new media efforts affect royalties? What happens?" In other words, a green light for now doesn't mean a green light forever.

"I've worked on a lot of projects that have had complex partnership [components]," Chandler says. "So [CDL] asked me to work on the mass digitization activities." That mandate surfaced two years ago, first with the Open Content Alliance, a nonprofit effort that's part of the Internet Archive project; then with Microsoft as part of its Windows Live Book Search; and most recently with Google Book Search. Acting as the program liaison for those projects, she says, consumed about 75 to 80 percent of her time at CDL.

How does UC deliver 3,000 books a day to Google? It isn't by being overly selective. And it doesn't involve rare materials that aren't part of the circulating collection. "All of the libraries are talking about that, in the sense of what might be the most interesting materials to scan," says Chandler. "But I'll be very frank: There's a real balance point between volume and selection, especially when looking at these numbers. UC is trying to meet the needs of the contract it's signed."

Ultimately, the library has to perform bulk selection, "which means choosing both in-copyright and out-of-copyright," she says. "So without having to worry about publication dates and such, you're literally able to clear shelves."

KIRTAS TECHNOLOGIES' APT BookScan 2400 Gold robotic scanner is capable of digitizing 1,344 books a week.

The issue of copyright was something Google stumbled over early in its founding of Book Search (previously dubbed the Google Print Library Project, even though users apparently weren't allowed to print anything). After a brief hiatus, the site was modified to reflect the more copyright-holder-friendly practices of competitive offerings from Microsoft and the Open Content Alliance. It comes down to this: Full text is available for out-of-copyright materials and for copyrighted books from publishers who allow it; limited content is displayed for newer books. Lawsuits are still pending.

As Chandler describes it, a staff member removes the entire shelf of books, places the books on a book truck, then moves on to the next shelf, "until, essentially, the quota for a day is reached. Then they're checked out." Although Google doesn't have a UC library card per se, the books heading off for scanning go through the same checkout process as any volume leaving the facility. Their bar codes are read and a manifest is compiled, "to be able to account for a day's shipment," she explains. "It's very important not to lose a book anywhere along the way."

GOOGLE'S SCAN OF this page from an 1888 edition of Plato's The Trial and Death of Socrates suggests that humans, and human error, are a large part of the Book Search digitization process.

Behind Closed Doors…

Once the volumes move through the checkout, they're purported to be loaded onto another truck, one that takes the volumes to an undisclosed location where the Google scanning facility is set up. At that point, operations become a black box. (It's possible that the scanning occurs at the regional UC facilities themselves, but UC staffers aren't talking. Citing proprietary concerns, Chandler declines to answer questions about scanning operations, and Dan Clancy, the engineering director for Google in charge of leading the Book Search team, is just as cagey.)

"When it first started, the technical challenge was simply building a scanning device that worked," Clancy says. "The next technical challenge was being able to run this scanning process at scale. We would have been quite happy to use commercial scanning technologies if they were adequate to scale to this. We only built our own scanning process because that was the way to make this project achievable for Google."

Book Search Today: A Researcher's View

SUSAN FARMA IS WORKING ON her master of arts in humanities at California State University-Dominguez Hills. Because she's working full-time as an application manager for the Los Angeles Philharmonic, Farma was thrilled when she learned about Google Book Search. "Anything that saved time I considered a boon," she says. "But it has not provided the help in research that I hoped." Her complaint: "I find it has severe limitations. For instance, if the book is not in the public domain, the snippet view only shows you the word you searched for and a few words around that word. There is no way to tell from the half a sentence that [Google shows you], whether buying or borrowing the book in question would be able to advance your thesis." In addition, says Farma, "books in the public domain online are few and far between yet, and most of them are extremely old. [Book Search] works great for classics in, say, literature, but not for individual subjects that you may be interested in researching."

Let's look at what Google may have rejected as inadequate to do the job: The APT BookScan 2400 Gold, the fastest commercial offering from digitization vendor Kirtas Technologies, scans books at a rate of 2,400 pages per hour. The product costs between $100,000 and $175,000 and includes two cameras, each pointing downward at a 45-degree angle. In a video on the company's website, a worker is shown placing a book in a cradle and making adjustments for its size. As both facing pages are photographed simultaneously, a robotic arm that looks like a waffle iron adheres to a page and flips it for the next shot.

The question is: Is that fast enough to keep up with Google's demands? At an average size of 300 pages per book (a count cited by UC's Chandler), Kirtas equipment is capable of scanning eight books an hour, or 64 books in an eight-hour work shift. If scanning operations were running around the clock and staffers never took breaks, called in sick, or experienced equipment outages, the tally would reach 1,344 books a week per machine. Keeping pace with the 15,000 books a week fed by UC (3,000 per business day) would require 12 of the Golds. Yet apparently, Google is using something else it considers superior.
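The throughput estimate above is simple enough to reproduce. The following is a minimal sketch of that arithmetic, not a model of any real scanning operation; the function names are illustrative, and the five-day delivery week is an assumption drawn from the article's reference to books moving "every business day."

```python
import math

PAGES_PER_HOUR = 2_400        # rated speed of the APT BookScan 2400 Gold
PAGES_PER_BOOK = 300          # average book size cited by UC's Chandler
UC_BOOKS_PER_DAY = 3_000      # UC's contracted delivery rate
BUSINESS_DAYS_PER_WEEK = 5    # assumption: books are delivered on business days only


def books_per_hour() -> int:
    """Books one machine can scan in an hour at its rated speed."""
    return PAGES_PER_HOUR // PAGES_PER_BOOK


def books_per_week_nonstop() -> int:
    """Books one machine can scan in a week of uninterrupted 24-hour operation."""
    return books_per_hour() * 24 * 7


def machines_needed() -> int:
    """Machines required to absorb UC's weekly delivery, rounded up."""
    weekly_intake = UC_BOOKS_PER_DAY * BUSINESS_DAYS_PER_WEEK
    return math.ceil(weekly_intake / books_per_week_nonstop())


print(books_per_hour(), books_per_week_nonstop(), machines_needed())
```

Because the weekly figure assumes no downtime at all, 12 machines is a floor; any realistic allowance for breaks or outages would push the count higher.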

When a Text Isn't Text

THE PAGES OF TEXT SHOWN through Book Search are actually images, not text. Although as part of Google's digitizing process a conversion takes place to turn a scanned page into text, the publicly offered results are less stellar than those made possible by the better-known OCR applications such as Abbyy FineReader, which is used by compression software provider LuraTech as part of its PDF conversion solution.

Frequently, an out-of-copyright book in Google will include a "View plain text" function, but the user will be shown a page displaying only "No text" at the top, meaning that Google was unable to convert that particular page into plain text. And if a user's keyword search turns up such a page, Book Search still succeeds in locating and highlighting the search terms, even if it can't seem to display the page in plain-text form. It's almost as if two separate optical character recognition systems are in play: one for the search engine, and another for converting scanned pages into plain text. This inconsistency may not trouble most readers, but it does trouble those who are print-disabled and rely on a screen reader or text-to-speech software.

Susan Gerhart holds a doctorate in computer science and has worked in research and management in software engineering and technology transfer at Duke University (NC), NASA, the National Science Foundation, USC's Information Sciences Institute, and Embry-Riddle Aeronautical University (FL). Gerhart is also legally blind. As she points out in her blog, As Your World Changes, her experiments with Book Search turned up an anomaly when images were turned off in her browser. "I got a snippet of page text, a big empty block of missing image, and various book metadata, including where to buy or borrow," she says. When she tried turning images on, "Ouch, was it bright," she recalls.

She writes: "There's nothing in, around, or any way out of the image into screen readable mode. The image might as well have been a lake, a building, or porn for all the information I could glean from it. I wondered why the omnipotent Google toolbar, gathering data about my searches, and offering me various extra search information, could not also be the reader." Gerhart is doubtless not alone in her frustration.

Linda Becker, the VP of sales and marketing for Kirtas, doesn't believe that Google has somehow created a faster digitization process. "I do know what they're doing, and I can't comment on it," she says. "But what I can say is this: They're not scanning faster, they're not digitizing faster, and they don't have the quality controls that the user deserves."

She may be right: In an ongoing online debate about whether Google is using robotic machinery or human beings to flip the pages, bloggers have poked fun at the search giant's quality control methods (or lack of them) by posting screenshots that reveal hands, fingers, and arms in Book Search results. Becker suggests that those screenshots may not be anomalies. "If you go into Google [Book Search] and look at any book, you'll be able to see by the number of body parts and fingerprints that [the pages] are being turned manually."

Although Clancy won't describe the actual process or equipment being used for Book Search, he does point out why he was recruited by Google (from a lengthy career at NASA): "One, I had a strong AI [artificial intelligence] background. Two, I had a lot of experience dealing with complex systems that had lots of mechanical components along with software components. And three, I had the ability to do things to scale, an important part of the Books project. There are a lot of software complexities [in that]," he concedes, "but also a lot of people complexities."

Parlez Vous… Telugu?

BOOK DIGITIZATION PROJECTS aren't new. Carnegie Mellon University's (PA) Universal Digital Library (UDL), which has been in the works since 2001, recently announced that it had digitized 1.5 million books, including 971,594 titles in Chinese, 49,695 in Telugu, 39,265 in Arabic, and 20,931 in Kannada (Telugu and Kannada are both languages of southern India), among other languages. That emphasis on multiple languages sets UDL apart from other book digitizing efforts. The volumes are being scanned by universities and organizations in multiple countries (the US, China, Egypt, and India) and are made available free in three formats: HTML, TIFF, and DjVu (a PDF alternative). Although the details may differ, the goal of the initiative sounds familiar to those who follow such matters: "to create a universal library which will foster creativity and free access to all human knowledge."

There's more than that at stake, insists Kirtas' Becker. The actual scanning process isn't what's important in these projects, she points out. "People get confused between digitizing and scanning. When you scan a book, you get what you get. Digitizing is what Kirtas does. Once we scan a book, we take it through a digitizing process." That encompasses multiple steps, she maintains: segmenting the book (converting pages to black and white from color, if that's how they started out), performing background cleanup, converting type size (such as for applications for the visually impaired), changing the book size for printing purposes, and moving the digitized content into other file formats such as for online reading or PDF viewing.

"Right now," says Becker, "scanning is irrelevant. What is relevant is this: How do you create the highest quality digital file with the smallest file size that's repurposable so that you can extend the life of it?" She believes that's what the Google Book Search project is missing: a focus on quality. "If you were to go to the Google site, you'd see that one out of every five pages is either missing, or has fingers in it, or is cut off, or is blurry."

Still, avoiding at all costs the odd missing page or disembodied digit may not be a driving force behind Book Search right now. UC's Chandler notes the qualitative differences between the output produced by the three mass digitization efforts she was involved in at CDL. "The actual presentation of the book is quite different. If you look at what Google does, it's really a bitonal representation. It's as if the book were brand new, which is just to say that the page is white [and] the ink for the font is black. Whereas if you look at the Microsoft [Windows Live Book Search] presentation, it's a color image, so you get the sense of it as an artifact."

In point of fact, it's possible that a good number of books submitted for scanning are out-and-out rejected in the Google process. According to UC's Southern Regional Library Facility annual report, it had scanned 33,503 volumes for the Open Content Alliance project in one year. (Neither Google's nor Microsoft's numbers were provided.) An additional 16,988 volumes pulled off the shelves for scanning were rejected, mostly because they had tight margins, large foldouts, or brittle paper. For every two books successfully scanned, another one was rejected and added to a list tagged "With the hope of going back." There's little reason to believe that Google's success rate is dramatically different.
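The "two scanned for every one rejected" figure follows directly from the annual report numbers. The short sketch below only restates that arithmetic; the variable names are illustrative, and it says nothing about Google's own rejection rate, which was not provided.

```python
SCANNED = 33_503     # volumes scanned for the Open Content Alliance in one year
REJECTED = 16_988    # volumes pulled for scanning but rejected

pulled = SCANNED + REJECTED                # everything taken off the shelves
scanned_per_rejected = SCANNED / REJECTED  # successes for each rejection
rejection_rate = REJECTED / pulled         # share of pulled volumes rejected

print(f"Volumes pulled: {pulled:,}")
print(f"Scanned per rejected volume: {scanned_per_rejected:.2f}")
print(f"Rejection rate: {rejection_rate:.1%}")
```

Put another way, about one in three volumes pulled from the shelves did not make it through scanning.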

Storing a World of Files

Google's Clancy says the current database of books in Book Search contains "millions and millions" of volumes. That requires a lot of storage, which is no new challenge for Google.

Greg Schulz, founder and senior analyst for The StorageIO Group, observes, "Knowing Google, they're storing it the same way they store all their other data: They're using clustered nodes of processors with drives in them, a Google storage grid." The Google data center model, which has been well documented and marveled over, is to leverage commodity x86 servers (including AMD-based machines) in volume. These are servers, says Schulz, "that you can buy very, very inexpensively, and that give you a good balance of performance and storage capacity for a low cost." And when Schulz says volume, he means tens of thousands, possibly hundreds of thousands, of servers.

On top of that hardware runs Google software: the Google File System, the Google storage management tools, and other layers "for monitoring and making sure the hardware is running efficiently, that it's healthy, and that the data is protected. That way, if a server fails, the others can pick up the workload," Schulz says.

The actual data recorded by the scanning process is probably maintained in Bigtable, a distributed storage system for structured data. As described in a white paper published by Google in 2006, "Bigtable is designed to reliably scale to petabytes of data and thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability." While the paper doesn't mention Book Search by name, it does state that Bigtable is used by more than 60 Google products and projects.

Although the equipment and software for Book Search have evolved to become a string of proprietary systems, Clancy insists that his company uses commercial standards where they work. That includes image and document format standards such as JPEG (and its successor JPEG 2000), PNG, TIFF, and PDF.

Processing Book Files

Image compression isn't a small issue in mass digitization projects, says Mark McKinney, VP of business development for LuraTech, a company that produces compression software. Besides consuming storage, the files developed out of scanned books need to be delivered across the web with no perceivable delays. LuraTech powers the work done by the Open Content Alliance, which uses the JPEG 2000 format for compression; JPEG 2000 is a powerful long-term archival format that reduces a large color file to about a hundredth of its original size, says McKinney. In that effort, he says, workers run the process from digitization stations called "Scribes" that take a picture of each page, color-correct it, and then "OCR" it (apply optical character recognition) so it becomes searchable. Once the operator has captured all the individual pages, metadata is added to the book through a user interface. But the metadata (title, author, copyright, description, etc.) isn't necessarily added via human effort.

According to Brian Tingle, a technical leader for the Digital Special Collections (part of UC's CDL), much of the metadata is already cataloged as part of the online public access catalog (OPAC), known in the pre-digital era as the card catalog. Tingle's team works with a metadata object format, a standard for encoding and transmission of digital objects. "Those objects get turned into those formats and that's how we ingest them into our system," he explains. It's a different level of metadata that enables the linking together of objects, such as the pages of a book.

The automation of data capture is certainly something in which a former NASA AI expert like Google's Clancy would excel. "If you look on our book reference page, you'll find related works identified: books with some relationship to the book you're looking at," he points out. "Or you'll find something we call 'Popular Passages,' where we've extracted passages that are seminal or popular and mentioned in a number of different books. We use that as a way to link some of these books together." Achieving those connections, he says, is a programming job. "It may not be perfect, but this is 100 percent how we've done it. We don't have people picking out related books; we use lots of different signals. We just don't talk about which signals we use."

Beyond Automation

Not surprisingly, when the wizard behind the curtain is Google, the same kind of secrecy applies to search. UC's Tingle is highly forthcoming about the search product his team has developed at CDL: the eXtensible Text Framework (XTF), which is based on Lucene, an Apache open source search engine. Google's Clancy, by contrast, prefers to focus on search outcomes. "We're all familiar with how search works on the web," says Clancy. "You type in a keyword phrase and suddenly it seems to find just the document you want; people create link structures that relate two things together. Well, as soon as you do that [with a book], you're giving us more information about that book. Eventually, you can imagine people linking to books from their web pages and other things. A book should be like the web: People should link directly into the book when it's relevant to them."

What form would that take for the person doing the search? According to Clancy, Google features to make book links possible have just begun to surface. One tool introduced (in the interface for books that are under no copyright restrictions) allows the reader to capture a section of a full page and copy it as text or an image to a Blogger page or Notebook, both Google services. "If there's a particular quote you like, you can go ahead and create a clipping of that quote, stick it on your blog and say something like, 'This is where Abe Lincoln first asserted his desire to free the slaves.'"

Eventually, he says, authors will be able to represent "not just the conclusions and assertions they're making, but also the data upon which they base those assertions." He describes somebody reading David McCullough's 1776 (Simon & Schuster, 2005) being able to click through to primary sources such as George Washington's diary or the letters written by John Adams. But, "Now, I've gone beyond what Google is going to do," Clancy says. "As you open up all this content, these are research challenges for libraries, for the research communities, and for Google to say: How does this change scholarship?" Clancy envisions a day when users of online catalogs such as Melvyl (UC's OPAC) can find the record of a book and immediately link over to the content, whether that material is hosted by Google, UC, or some other institution with which the university has affiliated itself.

Still, without the profit-driven motives of a company such as Google (or Microsoft, for that matter), UC would never have had the funds to scan its materials on such a broad scale, maintains Chandler. "Strategically, it's really an important opportunity to take advantage of." Ultimately, she says, "We utilize the environment in which our faculty and students are working, and more and more obviously, it's digital."

When it comes down to it, then, this brave new world of book search probably needs to be understood as Book Search 1.0. And maybe participants should not get so hung up on quality that they obstruct the flow of an astounding amount of information. Right now, say many, the conveyor belt is running and the goal is to manage quantity, knowing that with time the rest of what's important will follow. Certainly, there's little doubt that in five years or so, Book Search as defined by Google will be very different. The lawsuits will have been resolved, the copyright issues sorted out, the standards settled, the technologies more broadly available, the integration more transparent.

"One thing we've learned," says Clancy: "We don't try to anticipate how people will make use of something. We're just at the beginning of the marathon."

::WEBEXTRAS ::
The CIC's Richard Ekman weighs in on the Google Book Search controversy: www.campustechnology.com/articles/41199.

Dian Schaffhauser covers technology and business for various print and online publications.
