Digital Libraries

Google Book Search: The Good, the Bad, & the Ugly

1/1/2008

In fact, a good number of the books submitted for scanning may be rejected outright in the Google process. According to the annual report of UC's Southern Regional Library Facility, the facility scanned 33,503 volumes for the Open Content Alliance project in one year. (Neither Google's nor Microsoft's numbers were provided.) An additional 16,988 volumes pulled off the shelves for scanning were rejected, mostly because they had tight margins, large foldouts, or brittle paper. In other words, for every two books successfully scanned, roughly one more was rejected and added to a list tagged "With the hope of going back." There's little reason to believe that Google's success rate is dramatically different.
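The "two scanned for every one rejected" figure follows directly from the two numbers in the annual report. A quick check of the arithmetic, using only those two figures (the variable names are ours):

```python
# Figures from the UC Southern Regional Library Facility annual report,
# as quoted above (Open Content Alliance project, one year).
scanned = 33_503    # volumes successfully scanned
rejected = 16_988   # volumes pulled for scanning but rejected

pulled = scanned + rejected                # every volume taken off the shelf
rejection_rate = rejected / pulled         # share of pulled volumes rejected
scanned_per_rejected = scanned / rejected  # scanned volumes per rejected one

print(f"volumes pulled:       {pulled:,}")                    # 50,491
print(f"rejection rate:       {rejection_rate:.1%}")          # 33.6%
print(f"scanned per rejected: {scanned_per_rejected:.2f}")    # 1.97
```

About one volume in three pulled from the shelves did not make it through scanning.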

Storing a World of Files

Google's Clancy says the current database of books in Book Search contains "millions and millions" of volumes. Storing that many files takes a great deal of work, but storage at that scale is no new challenge for Google.

Greg Schulz, founder and senior analyst for The StorageIO Group, observes, "Knowing Google, they're storing it the same way they store all their other data: They're using clustered nodes of processors with drives in them, a Google storage grid." The Google data center model, which has been well documented and marveled over, is to leverage commodity x86 servers (built on Intel and AMD processors) in volume. These are servers, says Schulz, "that you can buy very, very inexpensively, and that give you a good balance of performance and storage capacity for a low cost." And when Schulz says volume, he means tens, possibly hundreds, of thousands of servers.

On top of that hardware runs Google software: the Google File System, the Google storage management tools, and other layers "for monitoring and making sure the hardware is running efficiently, that it's healthy, and that the data is protected. That way, if a server fails, the others can pick up the workload," Schulz says.
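The idea Schulz describes, that other servers pick up the workload when one fails, can be illustrated with a toy model. The sketch below is not Google's software; the class, the node names, and the placement rule are all hypothetical, and the choice of three copies per chunk is an assumption made for the illustration.

```python
class ToyCluster:
    """Toy model of replicated storage on commodity nodes (illustrative only)."""

    def __init__(self, node_names, replicas=3):
        if replicas > len(node_names):
            raise ValueError("need at least as many nodes as replicas")
        self.nodes = list(node_names)
        self.alive = set(node_names)
        self.replicas = replicas
        self.placement = {}  # chunk id -> list of nodes holding a copy

    def store(self, chunk_id):
        # Pick a deterministic starting node, then place copies on the
        # next `replicas` distinct nodes in the list (wrapping around).
        start = sum(chunk_id.encode("utf-8")) % len(self.nodes)
        self.placement[chunk_id] = [
            self.nodes[(start + i) % len(self.nodes)]
            for i in range(self.replicas)
        ]

    def fail(self, node):
        # Mark a node as down; its copies become unreachable.
        self.alive.discard(node)

    def read_from(self, chunk_id):
        # Serve the read from the first replica that is still alive.
        for node in self.placement[chunk_id]:
            if node in self.alive:
                return node
        raise RuntimeError(f"all replicas of {chunk_id!r} are down")


cluster = ToyCluster(["node-a", "node-b", "node-c", "node-d"])
cluster.store("vol0001/p0001.img")
first = cluster.read_from("vol0001/p0001.img")
cluster.fail(first)                                # simulate a server failure
fallback = cluster.read_from("vol0001/p0001.img")  # another node takes over
print(first, "->", fallback)
```

Because every chunk lives on several inexpensive machines, losing any one of them changes which node answers, not whether the data is available.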

The actual data recorded by the scanning process is probably maintained in Bigtable, a distributed storage system for structured data. As described in a white paper published by Google in 2006, "Bigtable is designed to reliably scale to petabytes of data and thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability." While the paper doesn't mention Book Search by name, it does state that Bigtable is used by more than 60 Google products and projects.
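The same white paper describes Bigtable's data model as a sparse, sorted map indexed by a row key, a column key, and a timestamp. The sketch below is a small in-memory imitation of that model, not Bigtable's API; the class, the row keys, and the column names for scanned pages are hypothetical, since Google has not said how (or whether) Book Search data is laid out in Bigtable.

```python
class ToyTable:
    """In-memory imitation of a (row, column, timestamp) -> value map."""

    def __init__(self):
        self._cells = {}  # (row key, column key) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        # Each write adds a new timestamped version of the cell.
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column):
        # Return the most recent version of the cell, or None if absent.
        versions = self._cells.get((row, column))
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, row_prefix):
        # Rows come back in sorted key order, so all pages of one
        # volume sit next to each other.
        for row, column in sorted(self._cells):
            if row.startswith(row_prefix):
                yield row, column, self.get(row, column)


table = ToyTable()
table.put("vol0001/p0002", "page:ocr_text", 1, "Chapter One")
table.put("vol0001/p0001", "page:ocr_text", 1, "Title page")
table.put("vol0001/p0001", "page:ocr_text", 2, "Title Page")  # corrected OCR

rows = list(table.scan("vol0001/"))
for row in rows:
    print(row)
```

Keeping older versions under earlier timestamps means a rescanned or re-processed page can replace the current text without erasing what was stored before.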

Although the equipment and software for Book Search have evolved into a string of proprietary systems, Clancy insists that his company uses commercial standards where they work. That includes image and document format standards such as JPEG (and its successor, JPEG 2000), PNG, TIFF, and PDF.