If You Want Information to Stay in the Deep Web, Use Technology, Not Lawyers
Imagine our world in 2004 if everyone who published a Web site or transmitted
an e-mail newsletter (like this one) with a hyperlink in it was legally required
to obtain "permission in writing" from the owner of the page at the
other end of the hyperlink. That might sound absurd to you, but it wasn't that
long ago that there were movements to require such permission. There are still
a few people out there who are outraged at the thought that others are linking
to pages deep inside their Web sites without asking permission first.
The issue bubbles to our attention every couple of years and I think it's going
to bubble up again later this year. Why? Because some companies are working
hard to explore the Deep Web and bring to light information many of us think
of as "hidden."
Trivia question: What do the Church of Scientology and Ticketmaster have in
common? In the past they've each brought lawsuits to try to keep people from
"deep linking" to parts of their Web sites. (They both lost, too.)
As recently as two years ago, the U.S. Ninth Circuit Court of Appeals briefly
ruled (and then took it back, thank goodness) that linking to a copyrighted
photo on a Web site without permission violates the copyright owner's "public
display" rights. Even one of my favorite entities in the world, National
Public Radio, recently made a strong effort to assert that anyone linking to
pages other than its front page needed to get its permission first. (It lost,
too.)
These battles keep being fought because (a) some people don't "get it"
and (b) their lawyers can make money from them. And, believe it or not,
some of their lawyers don't get it either.
Those battles keep being won by users (and lost by content providers) because
the concept of "deep linking" is integral to the functioning of the
World Wide Web. (That phrase, World Wide Web, is starting to sound kind of quaint,
isn't it?) Basically, it's kind of strange to put information in a publicly
accessible place with a publicly available address and then expect that you
can tell people not to share the address. If you extend the protection notion
out a bit more, then why stop with Web publishing? The ultimate offenders in
sharing public addresses of information are librarians and academics, with their
extensive footnoting and production of bibliographies. Let's jail them all!
What's going to bring up the issue again soon are the ongoing efforts to bring
up information from the "Deep Web." Most current search engines only
index and serve up the "Surface Web" that consists mostly of static
pages linked to each other by relatively unchanging hyperlinks. What used to
be called the "Invisible Web" (but Deep Web is a better phrase because
the stuff is not invisible, it's just deeper in the Web) includes a lot of stuff
that search engines don't find easily - like graphic images, or Word and Excel
documents, and especially information that is available through the Web but
is hosted in databases that we think require human brains to exploit.
Already, lots of things on the Web that were "invisible" via search
engines a few years ago are findable now.
For example, Google actually translates
PDF documents into HTML versions and provides them for users as part of the
search results. I recall, not too long ago, helping a Web acquaintance
research an organization so that she could best present herself as a potential
vendor. I discovered the meeting minutes of an organization she had previously
worked with, in which the board of directors was discussing how overpriced she was!
(The board was shocked when I let them know that their minutes were so easily
available to anyone with a Web browser. She got some insight into their view
of her.)
But that's the point I want to make this week: We all have lots of stuff on
the Web that we don't view as findable. Some of it may be stuff we've forgotten
and left up on a server, but quite a bit of it might be in online databases
that until now search engines could not mine. Estimates are that there is
500 to 1,000 times as much content online in the Deep Web as in the Surface
Web.
Well, starting now or very soon, people will be able to enter search queries
and dredge a lot of Deep Web information up into the Surface Web. Check out
the Web sites of, or do a Google search on, Bright Planet or Dipsie for
examples of companies working on this. Perhaps we can look forward to a
series of exposés during 2004-05
that reveal lots of sensitive institutional data that people thought was hidden
- but is not. How would you feel, for example, if someone went to a "white
pages" search engine and was able to get links about your university's
staff contact information that went right into your database without using the
front door?
Institutions that want to think ahead a little bit might already be taking
a look around and inventorying the kinds of data they have in non-password-protected
databases that they'd just as soon not be findable via public search
engines - and then moving that stuff behind better security. That's the technology
fix.
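For anyone who wants to start that inventory, here is a minimal Python sketch
of one way to go about it: request each database-backed address with no
credentials at all and see what comes back. Everything named here
(classify_status, probe, inventory, and the "public"/"protected" labels) is my
own invention for illustration, not part of any existing tool, and the mapping
from HTTP status codes to labels is a simplifying assumption.

```python
import urllib.error
import urllib.request


def classify_status(status: int) -> str:
    """Label an HTTP status code returned to a request sent WITHOUT credentials.

    Assumption: a 2xx answer means anyone (including a crawler) can read the
    page; 401 or 403 means some access control is in place.
    """
    if 200 <= status < 300:
        return "public"
    if status in (401, 403):
        return "protected"
    return "other"


def probe(url: str, timeout: float = 5.0) -> str:
    """Send one unauthenticated HEAD request and report how exposed url looks."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 401 or 404.
        return classify_status(err.code)
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and so on.
        return "unreachable"


def inventory(urls, probe_fn=probe):
    """Return a {url: label} report; probe_fn can be swapped out for testing."""
    return {url: probe_fn(url) for url in urls}
```

Calling inventory() on a list of your own addresses returns a dictionary you
can scan for "public" entries. Treat the result as a first pass only: urlopen
follows redirects, so a page that bounces visitors to a login form will still
show up as "public," and some servers do not answer HEAD requests the same way
they answer ordinary ones.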
I am sure we'll see lots of media coverage about outraged data publishers who
bring suit against search engines that penetrate their Deep Web information.
That's the legal "fix." But the legal "fix" didn't work
before and I sure hope it doesn't work now, or in the future.