Google Book Search: The Good, the Bad, & the Ugly
- By Dian Schaffhauser
- 01/01/08
Yes, Google is opening up whole new worlds for internet surfers
and researchers everywhere, even before the model is ready.
FORGET EVERYTHING YOU BELIEVE about Google's book digitization
project. Once you get past the freakishly high numbers bandied about, the two-dozen-plus distinguished
institutions that have signed on, the legal paranoia, and the ultra-ultra-secret processes and technologies
involved, you'll find that Book Search (from the fifth most valuable company in America) is simply another
high-cost effort that is simultaneously visionary and crude. It doesn't even have to succeed in order to
transform how scholarship gets done.
Here's the magic: Type "sonoma" and "mission" into books.google.com and choose "Full view" to filter out
in-copyright books (those published after 1923) whose publishers haven't granted permission for full
display. About 550 titles show up, almost all of which you can view in text format
or as a PDF file. Perhaps the oldest reference that will appear is a volume titled An Overland Journey Round
the World During the Years 1841 and 1842 by Sir George Simpson, governor-in-chief of The Hudson's Bay
Company's territories. Google digitized the 1847 volume from the collection of the New York Public
Library, as the bar code on the cover shows (along with a small portion of what looks to be a human arm,
probably belonging to the person scanning that particular title).
LISTEN IN
Download our podcast interview with Robin Chandler, former director of data acquisitions for UC's California Digital Library.
As a reader, you might consider the
discovery of this long-lost tome a
modern-day miracle, akin to stumbling
on the bones of a previously unknown
dinosaur while digging in your garden.
And even though you never have to
leave your keyboard to read the contents,
you could click a link on the page,
enter your ZIP code, and find the nearest
library that has the book in its collection,
in case you're the kind of person
who likes to touch actual pages and take
in the perfume of old book stock.
Somehow (although the details are
mostly sketchy to those outside the
company), thousands of books are
working their way through the project
every business day to join the millions
of other publications already included
in Book Search. But let's take a look at
how Google is working with one of
its partners, the University of California system, to keep the process
humming along.
Google Book Search: Typical User View
GOOGLE BOOK SEARCH, which is still in beta after several years of
testing, offers the ubiquitous Google search box on its home page. It also has categories of
books, as well as book cover images that change each time the home page is reloaded.
Once the user searches for a book and pulls up its record, one of two screens appears,
depending on whether the book is under copyright or not. Copyrighted books display a limited
number of pages ("snippets"), including the cover and back cover, the table of contents,
the index, and some content pages. For Steven Levitt and Stephen Dubner's Freakonomics (William Morrow, 2006), for instance, Google shares 29 pages of the 256-page tome. But
users can pull up a full-screen view of any of the 29 pages, write a review of the book, add it
to their online libraries, view the table of contents and some popular passages, search the
contents (only page numbers may show up), click through to other editions of the same
work, click through to sources where they can buy the book or find it in a library (a link that connects
to WorldCat), and read paid sponsor links.
For books no longer under copyright (current criteria: works published before 1923,
such as an edition of Adam Smith's The Wealth of Nations published in 1895
[T. Nelson and Sons], or books that the publishers have granted full viewing rights to), the
same features exist, though not all include paid sponsor links. For some books, users can
view a PDF edition (with download size shown) or view a plain-text version of each page.
When read in full-text mode, the non-copyrighted books allow Book Search users to view a
single page or facing pages simultaneously, and make an annotation to be included in Google Blogger or Google Notebook. Perhaps more
helpfully, users can copy a link provided by Google and forward it to others. When used correctly,
that function can take another reader to the same page and a specific clip captured by
the original reader with a dashed line marking off the content. Some books even include a
Google Maps mashup showing "Places mentioned in this book."
Quality vs. Quantity?
The UC system consists of more than 100
libraries spread across 10 campuses
around the state, containing more than 34
million volumes, which inhabit 3.6 million
square feet of library building space.
According to the UC-Berkeley library
website, in North
America the holdings of the state university
system are surpassed in scale only by
the Library of Congress. Just under a
third of the collection is housed in two
regional facilities: one located on the
campus of UCLA, serving the southern
campuses, and the other in an industrial
area of Richmond, outside San Francisco,
serving the northern schools.
This makes the digitization effort easier,
since neither Google nor the UC librarians
need to scurry from campus to campus
to obtain books to scan. Yet it also
suggests the possibility that Book Search
is actually a project all about numbers
over quality, an implication that neither
UC nor Google denies.
In fact, when UC signed its contract
with Google in July 2006, UC agreed to
provide no fewer than 2.5 million volumes
to the digitization effort over the course of
the agreement's six-year period. That's
just under 420,000 books a year, or less
than a tenth of the annual circulation of
materials throughout the UC library system,
which circulated 4.7 million items in
the 2005-2006 academic year. According
to the contract, after an initial ramp-up
of a couple of months of delivering 600
books per day, UC was obligated to crank
up delivery to 3,000 per day. And that,
says Robin Chandler, former director of
data acquisitions for UC's California Digital
Library (CDL), is exactly what UC's
goal has been. (This month, Chandler is
moving to a digital library position at the
University of California-San Diego.)
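For readers keeping score, the contract arithmetic checks out. Here is a quick Python sketch, using only the figures cited above:

```python
# Sanity check of the contract figures cited above.
volumes_promised = 2_500_000      # UC's commitment over the contract term
contract_years = 6
annual_circulation = 4_700_000    # UC system circulation, 2005-2006

per_year = volumes_promised / contract_years
print(f"{per_year:,.0f} books a year")         # 416,667: "just under 420,000"
print(f"{per_year / annual_circulation:.1%}")  # 8.9%: "less than a tenth" of circulation
```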
Chandler, who held the CDL role for
seven years, worked with a multitude of
libraries inside and outside the UC system
to help guide their digitization
efforts. That includes Calisphere, a
public gateway to 150,000 digitized
items (including diaries, photographs,
political cartoons, and other cultural
artifacts representing the history and
culture of California), as well as the
Online Archive of California, which brings together historical
materials from a number of California
institutions including museums,
historical societies, and archives.
Do Authors Want to Be Digitized?
GREG SCHULZ, FOUNDER OF AND SENIOR ANALYST for The StorageIO Group (and interviewed for this article), is also the author of the book Resilient Storage Networks (Digital Press, 2004). It doesn't bother him in the least that Google
Book Search might scan his entire book and make pieces of it available.
"I'm fine with that," he declares. "If it allows my work to be more widely
known so that people buy the book or engage in other related services, I'm
all for it. I'll gladly give up some book sales if it leads to something else."
The sticking point? When Google or other book-search projects "start leveraging
the work, or doing things with it. Then it gets into another dimension," says Schulz, pointing
to the recent Writers Guild strike. "At the center of that is: How do new media efforts affect royalties?
What happens?" In other words, a green light for now doesn't mean a green light forever.
"I've worked on a lot of projects that
have had complex partnership [components],"
she says. "So [CDL] asked me to
work on the mass digitization activities."
That mandate surfaced two years ago,
first with the Open Content Alliance, a nonprofit effort that's part of the Internet
Archive project; then
with Microsoft as
part of its Windows Live Book Search;
and most recently with Google Book
Search. Acting as the program liaison for
those projects, she says, consumed about
75 to 80 percent of her time at CDL.
How does UC deliver 3,000 books a
day to Google? It isn't by being overly
selective. And it doesn't involve rare
materials that aren't part of the circulating
collection. "All of the libraries are
talking about that, in the sense of what
might be the most interesting materials
to scan," says Chandler. "But I'll be
very frank: There's a real balance point
between volume and selection, especially
when looking at these numbers. UC
is trying to meet the needs of the contract
it's signed."
Ultimately, the library has to perform
bulk selection, "which means choosing
both in-copyright and out-of-copyright,"
she says. "So without having to worry
about publication dates and such, you're
literally able to clear shelves."
KIRTAS TECHNOLOGIES' APT BookScan 2400 Gold robotic
scanner is capable of digitizing 1,344 books a week.
The issue of copyright was something
Google stumbled over early in its founding
of Book Search (previously dubbed
the Google Print Library Project, even
though users apparently weren't allowed
to print anything). After a brief hiatus, the
site was modified to reflect the more
copyright-holder-friendly practices of
competitive offerings from Microsoft and
the Open Content Alliance. It comes
down to this: Full text is available for out-of-copyright materials and for copyrighted
books from publishers who allow it;
limited content is displayed for newer
books. Lawsuits are still pending.
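Stated as code, the display rule is simple. The function below is a paraphrase of the policy just described, not Google's actual logic:

```python
# The display policy described above, paraphrased as a function.
# Not Google's actual code; just the stated rule.
def view_mode(publication_year, publisher_granted_full_view):
    if publication_year < 1923:       # out of copyright under then-current criteria
        return "full view"
    if publisher_granted_full_view:   # in copyright, but the publisher allows it
        return "full view"
    return "limited view (snippets)"  # newer books get restricted display

print(view_mode(1895, False))  # full view (e.g., the 1895 Wealth of Nations)
print(view_mode(2006, False))  # limited view (snippets)
```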
As Chandler describes it, a staff member
removes the entire shelf of books,
places the books on a book truck, then
moves on to the next shelf, "until, essentially,
the quota for a day is reached. Then
they're checked out." Although
Google doesn't have a UC library
card per se, the books heading off
for scanning go through the same
checkout process as any volume
leaving the facility. Their bar codes
are read and a manifest is compiled,
"to be able to account for a day's
shipment," she explains. "It's very
important not to lose a book anywhere
along the way."
GOOGLE'S SCAN OF this page from an 1888 edition of Plato's The Trial and Death of Socrates suggests that humans, and human error, are a large part of the Book Search digitization process.
Behind Closed Doors…
Once the volumes move through the
checkout, they're purported to be
loaded onto another truck, one
that takes the volumes to an
undisclosed location where the
Google scanning facility is set up.
At that point, operations become a
black box. (It's possible that the
scanning occurs at the regional UC
facilities themselves, but UC
staffers aren't talking. Citing proprietary
concerns, Chandler declines to
answer questions about scanning operations,
and Dan Clancy, Google's engineering
director in charge of the Book Search
team, is just as cagey.)
"When it first started, the technical
challenge was simply building a scanning
device that worked," Clancy says. "The
next technical challenge was being able to
run this scanning process at scale. We
would have been quite happy to use commercial
scanning technologies if they
were adequate to scale to this. We only
built our own scanning process because
that was the way to make this project
achievable for Google."
Book Search Today: A Researcher's View
SUSAN FARMA IS WORKING ON her master of arts in humanities at California State
University-Dominguez Hills. Because she's working full-time as an application manager for
the Los Angeles Philharmonic, Farma was thrilled when she learned about Google Book
Search. "Anything that saved time I considered a boon," she says. "But it has not provided the
help in research that I hoped." Her complaint: "I find it has severe limitations. For instance, if
the book is not in the public domain, the snippet view only shows you the word you searched
for and a few words around that word. There is no way to tell from the half a sentence that
[Google shows you] whether buying or borrowing the book in question would
advance your thesis." In addition, says Farma, "books in the public domain online are few and
far between yet, and most of them are extremely old. [Book Search] works great for classics
in, say, literature, but not for individual subjects that you may be interested in researching."
Let's look at what Google may have
rejected as inadequate to do the job: The
APT BookScan 2400 Gold, the fastest
commercial offering from digitization
vendor Kirtas Technologies, scans books at a rate of
2,400 pages per hour. The product costs
between $100,000 and $175,000 and
includes two cameras, each pointing
downward at a 45-degree angle. In a
video on the company's website, a worker
is shown placing a book in a cradle and
making adjustments for its size. As both
pages are photographed simultaneously,
a robotic arm that looks
like a waffle iron adheres to a page and
flips it for the next shot.
The question is: Is that fast enough to
keep up with Google's demands? At an
average size of 300 pages per book (a
count cited by UC's Chandler), Kirtas
equipment is capable of scanning eight
books an hour, or 64 books in an
eight-hour shift. If scanning operations
were running around the clock and staffers never took breaks, called in
sick, or experienced equipment outages,
the tally would reach 1,344 books a
week per machine. Keeping up the pace
of those 15,000 books a week fed by UC
would require 12 of the Golds. Yet
apparently, Google is using something
else it considers superior.
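For those who want to check the math, here is the scanner arithmetic worked through as a short Python sketch; the inputs are the figures cited in this article:

```python
# The scanner arithmetic from this article, worked through explicitly.
import math

pages_per_hour = 2_400    # Kirtas APT BookScan 2400 Gold throughput
pages_per_book = 300      # average book length cited by UC's Chandler
uc_books_per_day = 3_000  # UC's contractual delivery rate

books_per_hour = pages_per_hour / pages_per_book  # 8
books_per_shift = books_per_hour * 8              # 64 in an eight-hour shift
books_per_week_nonstop = books_per_hour * 24 * 7  # 1,344 running around the clock

uc_books_per_week = uc_books_per_day * 5          # 15,000 over five business days
print(math.ceil(uc_books_per_week / books_per_week_nonstop))  # 12 machines needed
```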
When a Text Isn't Text
THE PAGES OF TEXT SHOWN through Book Search are actually images, not text. Although as
part of Google's digitizing process a conversion takes place to turn a scanned page into text,
the publicly offered results are less stellar than those made possible by the better-known OCR
applications such as Abbyy FineReader, which is used by compression software
provider LuraTech as part of its PDF conversion solution.
Frequently, an out-of-copyright book in Google will include a "View plain text" function,
but the user will be shown a page displaying only "No text" at the top, meaning that
Google was unable to convert that particular page into plain text. And if a user's keyword
search turns up such a page, Book Search still succeeds in locating and highlighting the
search terms, even if it can't seem to display the page in plain-text form. It's almost as if
two separate optical character recognition systems are in play: one for the search engine,
and another for converting scanned pages into plain text. This inconsistency may not trouble
most readers, but those who are print-disabled and need to use a screen reader or
text-to-speech conversion may say otherwise.
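The two-systems hypothesis is easy to model. In the toy Python sketch below (the page contents are invented purely for illustration, since Book Search's internals are not public), the search index holds text for every page, while the plain-text view only has something to show when the conversion succeeded:

```python
# Toy model of the "two OCR passes" hypothesis: one text stream feeds the
# search index; a separate, lower-quality one feeds the plain-text view.
pages = {
    12: {"index_text": "the trial and death of socrates",
         "display_text": "the trial and death of socrates"},
    13: {"index_text": "socrates answered his accusers",
         "display_text": None},  # conversion failed: the viewer shows "No text"
}

def search(term):
    # The index can match a term on any page it has text for...
    return [n for n, p in pages.items() if term in p["index_text"]]

for n in search("socrates"):
    # ...even when the plain-text view for that page has nothing to offer.
    print(f"page {n}:", pages[n]["display_text"] or "No text")
```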
Susan Gerhart holds a doctorate in computer science and has worked in research and management
in software engineering and technology transfer at Duke University (NC), NASA, the
National Science Foundation, USC's Information Sciences Institute, and Embry-Riddle Aeronautical
University (FL). Gerhart is also legally blind. As she points out in her blog, As Your
World Changes, her experiments in using Book Search
turned up an anomaly when browsing with images turned off in her browser: "I got a snippet
of page text, a big empty block of missing image, and various book metadata, including where
to buy or borrow," she says. When she tried turning images on, "Ouch, was it bright," she recalls.
She writes: "There's nothing in, around, or any way out of the image into screen readable
mode. The image might as well have been a lake, a building, or porn for all the information
I could glean from it. I wondered why the omnipotent Google toolbar, gathering data about
my searches, and offering me various extra search information, could not also be the reader."
Gerhart is doubtless not alone in her frustration.
Linda Becker, the VP of sales and marketing
for Kirtas, doesn't believe that
Google has somehow created a faster digitization
process. "I do know what they're
doing, and I can't comment on it," she
says. "But what I can say is this: They're
not scanning faster, they're not digitizing
faster, and they don't have the quality
controls that the user deserves."
She may be right: In an ongoing online
debate about whether Google is using
robotic machinery or human beings to
flip the pages, bloggers have poked fun
at the search giant's quality control
methods (or lack of them) by posting
screenshots that reveal hands, fingers,
and arms in Book Search results. Becker
suggests that those screenshots may not
be anomalies. "If you go into Google
[Book Search] and look at any book,
you'll be able to see by the number of
body parts and fingerprints that [the
pages] are being turned manually."
Although Clancy won't describe the
actual process or equipment being used
for Book Search, he does point out that
one of the reasons he was recruited by
Google (from a lengthy career at
NASA) was because, "One, I had a
strong AI [artificial intelligence] background.
Two, I had a lot of experience
dealing with complex systems that had
lots of mechanical components along
with software components. And three, I
had the ability to do things to scale, an
important part of the Books project.
There are a lot of software complexities
[in that]," he concedes, "but also a lot of
people complexities."
Parlez-Vous… Telugu?
BOOK DIGITIZATION PROJECTS aren't new. Carnegie Mellon University's (PA) Universal Digital
Library (UDL), which has been in the works since 2001, recently
announced that it had digitized 1.5 million books, including 971,594 titles in Chinese, 49,695
in Telugu, 39,265 in Arabic, and 20,931 in Kannada (Telugu and Kannada are both languages
of southern India), among other languages. That emphasis on multiple languages sets UDL
apart from other book digitizing efforts. The volumes are being scanned by universities and
organizations in multiple countries (the US, China, Egypt, and India), and are made available
free in three formats: HTML, TIFF, and DjVu (a PDF alternative). Although the details may differ,
the goal of the initiative sounds familiar to those who follow such matters: "to create a universal
library which will foster creativity and free access to all human knowledge."
There's more than that at stake, insists
Kirtas' Becker. The actual scanning
process isn't what's important in these
projects, she points out. "People get confused
between digitizing and scanning.
When you scan a book, you get what you
get. Digitizing is what Kirtas does. Once
we scan a book, we take it through a digitizing
process." That encompasses multiple
steps, she maintains: segmenting the
book (converting pages to black and
white from color, if that's how they started
out), performing background cleanup,
converting type size (for applications
serving the visually impaired, for
instance), changing the book size for
printing purposes, and moving the digitized
content into other file formats such
as for online reading or PDF viewing.
"Right now," says Becker, "scanning is
irrelevant. What is relevant is this: How
do you create the highest quality digital
file with the smallest file size that's repurposable
so that you can extend the life of
it?" She believes that's what the Google
Book Search project is missing: a focus
on quality. "If you were to go to the
Google site, you'd see that one out of
every five pages is either missing, or has
fingers in it, or is cut off, or is blurry."
Still, avoiding at all costs the odd missing
page or disembodied digit may not be
a driving force behind Book Search right
now. UC's Chandler notes the qualitative
differences between the output produced
by the three mass digitization efforts she
was involved in at CDL. "The actual presentation
of the book is quite different. If
you look at what Google does, it's really
a bitonal representation. It's as if the book
were brand new, which is just to say that
the page is white [and] the ink for the font
is black. Whereas if you look at the
Microsoft [Windows Live Book Search]
presentation, it's a color image, so you get
the sense of it as an artifact."
In point of fact, it's possible that a good
number of books submitted for scanning
are out-and-out rejected in the Google
process. According to UC's Southern
Regional Library Facility annual report, the facility scanned 33,503 volumes for the
Open Content Alliance project in one
year. (Neither Google's nor Microsoft's
numbers were provided.) An additional
16,988 volumes pulled off the shelves for
scanning were rejected, mostly because
they had tight margins, large foldouts, or
brittle paper. For every two books successfully
scanned, another one was
rejected and added to a list tagged "With
the hope of going back." There's little
reason to believe that Google's success
rate is dramatically different.
Storing a World of Files
Google's Clancy says the current database
of books in Book Search contains
"millions and millions" of volumes.
That requires a lot of storage work, no
new challenge for Google.
Greg Schulz, founder and senior analyst
for The StorageIO Group, observes, "Knowing
Google, they're storing it the same way
they store all their other data: They're
using clustered nodes of processors
with drives in them, a Google storage
grid." The Google data center model,
which has been well documented and
marveled over, is to leverage commodity
servers (x86 and AMD) in volume.
These are servers, says Schulz, "that
you can buy very, very inexpensively,
and that give you a good balance of performance
and storage capacity for a low
cost." And when Schulz says volume, he
means tens, possibly hundreds, of
thousands of servers.
On top of that hardware runs Google
software: the Google File System, the
Google storage management tools, and
other layers-"for monitoring and making
sure the hardware is running efficiently,
that it's healthy, and that the
data is protected. That way, if a server
fails, the others can pick up the workload,"
Schulz says.
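Schulz's description (commodity nodes, with survivors picking up a failed server's workload) boils down to replication. Here is a toy Python sketch of that idea; it is a cartoon of the principle, not Google's file system:

```python
# Toy sketch of replicated storage on a grid of commodity nodes.
# A cartoon of the principle Schulz describes, not Google's actual system.
import random

REPLICAS = 3
nodes = {f"node{i}": {} for i in range(8)}  # a miniature "storage grid"

def put(key, value):
    for name in random.sample(sorted(nodes), REPLICAS):
        nodes[name][key] = value            # write to several nodes

def get(key):
    for store in nodes.values():            # any surviving replica will do
        if key in store:
            return store[key]

put("book:0042/page:7", b"scanned page bytes")
failed = next(n for n in nodes if "book:0042/page:7" in nodes[n])
del nodes[failed]                           # simulate a server failure
print(get("book:0042/page:7"))              # still retrievable from a replica
```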
The actual data recorded by the
scanning process is probably maintained
in Bigtable, a distributed storage
system for structured data. As described
in a white paper published by Google
in 2006, "Bigtable is
designed to reliably scale to petabytes of
data and thousands of machines. Bigtable
has achieved several goals: wide applicability,
scalability, high performance,
and high availability." While the paper
doesn't mention Book Search by name, it
does state that Bigtable is used by more
than 60 Google products and projects.
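The data model the Bigtable paper describes is compact: a sparse, sorted map from (row key, column, timestamp) to an uninterpreted string of bytes. The Python toy below renders that model; the book-scan row and column names are purely hypothetical, since Google has never described Book Search's actual tables:

```python
# Minimal rendering of Bigtable's data model from the 2006 paper:
# (row key, column, timestamp) -> bytes. The schema below is hypothetical.
import time

class ToyBigtable:
    def __init__(self):
        self.cells = {}  # {(row, column): [(timestamp, value), ...]}

    def put(self, row, column, value):
        self.cells.setdefault((row, column), []).append((time.time(), value))

    def get(self, row, column):
        versions = self.cells.get((row, column), [])
        return max(versions)[1] if versions else None  # newest version wins

table = ToyBigtable()
# Hypothetical layout: one row per scanned page, one column per artifact.
table.put("book:ucb123/page:0042", "image:jp2", b"...compressed scan...")
table.put("book:ucb123/page:0042", "ocr:text", b"the trial and death of socrates")
print(table.get("book:ucb123/page:0042", "ocr:text"))
```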
Although the equipment and software
for Book Search have evolved to become a
string of proprietary systems, Clancy
insists that his company uses commercial
standards where they work. That includes
image compression and file format standards
such as JPEG (and its successor, JPEG 2000),
PNG, TIFF, and PDF.
Processing Book Files
Image compression isn't a small issue in
mass digitization projects, says Mark
McKinney, VP of business development
for LuraTech, a
company that produces compression
software. Besides consuming storage,
the files developed out of scanned books
need to be delivered across the web with
no perceivable delays. LuraTech powers
the work done by the Open Content
Alliance, which uses the JPEG 2000
format for compression; JPEG 2000 is a
powerful long-term archival format that
reduces a large color file to about a hundredth
of its original size, says McKinney.
In that effort, he says, workers run
the process from digitization stations
called "Scribes" that take the picture
from a page, color-correct it, and then
"OCR" it (apply optical character
recognition) so it becomes searchable.
Once the operator has captured all the
individual pages, metadata is added to
the book through a user interface. But
the metadata-title, author, copyright,
description, etc.-isn't necessarily added
via human effort.
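A stripped-down version of that capture-correct-OCR-compress pipeline can be sketched with off-the-shelf open source tools. The snippet below uses Pillow and Tesseract as stand-ins; it is not the OCA's actual Scribe software or LuraTech's components:

```python
# Scribe-style page pipeline with stand-in open source tools:
# capture -> correct -> OCR -> compress. Not the OCA's actual stack.
from PIL import Image, ImageOps  # pip install Pillow (JPEG 2000 needs OpenJPEG)
import pytesseract               # pip install pytesseract (needs Tesseract installed)

page = Image.open("page_0042.tif")        # raw camera capture of one page
page = ImageOps.autocontrast(page)        # crude stand-in for color correction
text = pytesseract.image_to_string(page)  # OCR so the page becomes searchable

# JPEG 2000 at roughly 100:1, the archival ratio McKinney cites.
page.save("page_0042.jp2", "JPEG2000",
          quality_mode="rates", quality_layers=[100])

with open("page_0042.txt", "w") as f:     # keep the recognized text alongside
    f.write(text)
```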
According to Brian Tingle, a technical
leader for the Digital Special Collections
(part of UC's CDL), much of the metadata
is already cataloged as part of the
online public access catalog (OPAC), known in the pre-digital era as the
card catalog. Tingle's team works
with a metadata object format, a standard
for encoding and transmission
of digital objects. "Those objects get
turned into those formats and that's
how we ingest them into our system,"
he explains. It's a different level of
metadata that enables the linking
together of objects, such as the pages
of a book.
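Tingle doesn't name the format here, but the description matches METS, the Metadata Encoding and Transmission Standard widely used by digital libraries. A minimal, hypothetical record linking a book's pages might look like the sketch below, built with Python's standard library (element names are simplified; a real METS document is far more elaborate):

```python
# Hypothetical METS-like wrapper: descriptive metadata plus a structural
# map linking page files into one book object. Names are simplified.
import xml.etree.ElementTree as ET

mets = ET.Element("mets", OBJID="ark:/12345/overland-journey")  # invented ID
dmd = ET.SubElement(mets, "dmdSec")  # descriptive metadata from the OPAC record
ET.SubElement(dmd, "title").text = "An Overland Journey Round the World"
ET.SubElement(dmd, "author").text = "Sir George Simpson"

struct = ET.SubElement(mets, "structMap")  # the layer that links pages together
book = ET.SubElement(struct, "div", TYPE="book")
for n in (1, 2, 3):
    ET.SubElement(book, "div", TYPE="page", ORDER=str(n),
                  FILEID=f"page_{n:04d}.jp2")

print(ET.tostring(mets, encoding="unicode"))
```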
The automation of data capture is
certainly something in which a former
NASA AI expert like Google's Clancy
would excel. "If you look on our book reference
page, you'll find related works
identified: books with some relationship
to the book you're looking at," he points
out. "Or you'll find something we call
‘Popular Passages,' where we've extracted
passages that are seminal or popular
and mentioned in a number of different
books. We use that as a way to link some
of these books together." Achieving those
connections, he says, is a programming
job. "It may not be perfect, but this is 100
percent how we've done it. We don't have
people picking out related books; we use
lots of different signals. We just don't talk
about which signals we use."
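Google won't say what those signals are, but one plausible family of techniques is shared-shingle detection: slide a fixed-length window of words across every book and flag windows that recur across titles. The toy Python version below illustrates the idea only; it is emphatically not Google's undisclosed method:

```python
# Toy shared-passage finder: flag word n-grams that recur across books.
# One plausible approach, not Google's undisclosed method.
from collections import defaultdict

def shingles(text, n=8):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def popular_passages(books, n=8, min_books=2):
    seen = defaultdict(set)  # shingle -> set of titles containing it
    for title, text in books.items():
        for s in shingles(text, n):
            seen[s].add(title)
    return {s: titles for s, titles in seen.items() if len(titles) >= min_books}

books = {  # invented snippets, for illustration only
    "Book A": "four score and seven years ago our fathers brought forth a new nation",
    "Book B": "he wrote four score and seven years ago our fathers brought forth then",
}
for passage, titles in popular_passages(books).items():
    print(passage, "->", sorted(titles))
```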
Beyond Automation
Not surprisingly, when the wizard behind
the curtain is Google, the same kind of
secrecy applies to search. Where UC's
Tingle is highly forthcoming about the
search product his team has developed
at CDL (eXtensible Text Framework, or
XTF, which is based on Lucene, the
Apache open source search engine), Google's Clancy
prefers to focus on search outcomes.
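For the curious, the Lucene model that XTF builds on is easy to get a feel for. The Python sketch below uses Whoosh, a library patterned on Lucene's design (the titles and text are invented; XTF itself is a Java application):

```python
# A taste of Lucene-style indexing via Whoosh (pip install Whoosh), whose
# design follows Lucene's. Invented documents; XTF itself is Java on Lucene.
import os
from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

schema = Schema(title=ID(stored=True), body=TEXT)
os.makedirs("bookindex", exist_ok=True)
ix = index.create_in("bookindex", schema)

writer = ix.writer()
writer.add_document(title="Overland Journey", body="the mission at sonoma")
writer.add_document(title="Wealth of Nations", body="the division of labour")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("body", ix.schema).parse("sonoma")
    for hit in searcher.search(query):
        print(hit["title"])  # -> Overland Journey
```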
"We're all familiar with how search
works on the web," says Clancy. "You
type in a keyword phrase and suddenly it
seems to find just the document you
want; people create link structures that
relate two things together. Well, as soon
as you do that [with a book], you're giving
us more information about that book.
Eventually, you can imagine people linking
to books from their web pages and
other things. A book should be like the
web: People should link directly into the
book when it's relevant to them."
What form would that take for the person
doing the search? According to Clancy,
Google features to make book links
possible have just begun to surface. One
tool introduced (in the interface for books
that are under no copyright restrictions)
allows the reader to capture a section of a
full page and copy it as text or an image to
a Blogger page or
Notebook,
both Google services. "If there's a particular
quote you like, you can go ahead and
create a clipping of that quote, stick it on
your blog and say something like, ‘This is
where Abe Lincoln first asserted his
desire to free the slaves.'"
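Those clips travel as ordinary URLs, which is what makes the link-into-a-book vision practical. A sketch of assembling one is below; the id value is invented, and the query parameters simply follow the patterns visible in Book Search URLs of the period, so treat them as illustrative:

```python
# Assembling a deep link to one page of a book. The volume id is invented;
# the parameter names mirror patterns seen in Book Search URLs, shown here
# for illustration only.
from urllib.parse import urlencode

def book_page_link(volume_id, page):
    params = urlencode({"id": volume_id, "pg": f"PA{page}"})
    return f"https://books.google.com/books?{params}"

print(book_page_link("AbC123xyz", 57))
# https://books.google.com/books?id=AbC123xyz&pg=PA57
```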
Eventually, he says, authors will be
able to represent "not just the conclusions
and assertions they're making, but also
the data upon which they base those
assertions." He describes somebody
reading David McCullough's 1776 (Simon & Schuster, 2005) being able to
click through to primary sources such as
George Washington's diary or the letters
written by John Adams. But, "Now, I've
gone beyond what Google is going to do,"
Clancy says. "As you open up all this content,
these are research challenges for
libraries, for the research communities,
and for Google to say: How does this
change scholarship?" Clancy envisions a
day when users of online catalogs such as
Melvyl (UC's OPAC)
can find the record of a book and immediately
link over to the content, whether
that material is hosted by Google, UC, or
some other institution with which the university
has affiliated itself.
Still, without the profit-driven motives
of a company such as Google (or
Microsoft, for that matter), UC would
never have had the funds to scan its materials
on such a broad scale, maintains
Chandler. "Strategically, it's really an
important opportunity to take advantage
of." Ultimately, she says, "We utilize the
environment in which our faculty and students
are working, and more and more
obviously, it's digital."
When it comes down to it, then, this
brave new world of book search probably
needs to be understood as Book Search
1.0. And maybe participants should not
get so hung up on quality that they
obstruct the flow of an astounding
amount of information. Right now, say
many, the conveyor belt is running and
the goal is to manage quantity, knowing
that with time the rest of what's important
will follow. Certainly, there's little
doubt that in five years or so, Book
Search as defined by Google will be very
different. The lawsuits will have been
resolved, the copyright issues sorted out,
the standards settled, the technologies
more broadly available, the integration
more transparent.
"One thing we've learned," says Clancy:
"We don't try to anticipate how people
will make use of something. We're just at
the beginning of the marathon."
:: WEB EXTRAS ::
The CIC's Richard Ekman weighs
in on the Google Book Search controversy:
www.campustechnology.com/articles/41199.
Dian Schaffhauser covers technology
and business for various print and
online publications.