If You Want Information to Stay in the Deep Web, Use Technology, Not Lawyers

Imagine our world in 2004 if everyone who published a Web site or transmitted an e-mail newsletter (like this one) with a hyperlink in it was legally required to obtain "permission in writing" from the owner of the page at the other end of the hyperlink. That might sound absurd to you, but it wasn't that long ago that there were movements to require such permission. There are still a few people out there who are outraged at the thought that others are linking to pages deep inside their Web sites without asking permission first.

The issue bubbles to our attention every couple of years and I think it's going to bubble up again later this year. Why? Because some companies are working hard to explore the Deep Web and bring to light information many of us think of as "hidden."

Trivia question: What do the Church of Scientology and Ticketmaster have in common? In the past they've each brought lawsuits to try to keep people from "deep linking" to parts of their Web sites. (They both lost, too.) As recently as two years ago, the U.S. Ninth Circuit Court of Appeals briefly ruled (and then took it back, thank goodness) that linking to a copyrighted photo on a Web site without permission violates the copyright owner's "public display" rights. Even one of my favorite entities in the world, National Public Radio, recently made a strong effort to assert that anyone linking to pages other than its front page needed to get its permission first. (It lost, too.)

These battles keep being fought because (a) some people don't "get it" and (b) their lawyers can make money from them. And, believe it or not, some of their lawyers don't get it either.

Those battles keep being won by users (and lost by content providers) because the concept of "deep linking" is integral to the functioning of the World Wide Web. (That phrase, World Wide Web, is starting to sound kind of quaint, isn't it?) Basically, it's kind of strange to put information in a publicly accessible place with a publicly available address and then expect that you can tell people not to share the address. If you extend the protection notion out a bit more, then why stop with Web publishing? The ultimate offenders when it comes to sharing public addresses of information are librarians and academics, with their extensive footnoting and production of bibliographies. Let's jail them all!

What's going to raise the issue again soon is the ongoing effort to bring up information from the "Deep Web." Most current search engines only index and serve up the "Surface Web," which consists mostly of static pages linked to each other by relatively unchanging hyperlinks. What used to be called the "Invisible Web" (but Deep Web is a better phrase because the stuff is not invisible, it's just deeper in the Web) includes a lot of stuff that search engines don't find easily, like graphic images, or Word and Excel documents, and especially information that is available through the Web but is hosted in databases that we think require human brains to exploit.

Already, lots of things on the Web that were "invisible" via search engines a few years ago are findable now. For example, Google actually translates PDF documents into HTML versions and provides them for users as part of the search results. I recall, not too long ago, that I was helping a Web acquaintance research an organization to work out how best to present herself as a potential vendor. I discovered the meeting minutes of an organization she had previously worked with, and that organization's board of directors was discussing how overpriced she was! (The board was shocked when I let them know that their minutes were so easily available to anyone with a Web browser. She got some insight into their perspective on her.)

But that's the point I want to make this week: We all have lots of stuff on the Web that we don't view as findable. Some of it may be stuff we've forgotten and left up on a server, but quite a bit of it might be in online databases that until now search engines could not mine. Estimates are that the Deep Web holds between 500 and 1,000 times as much online content as the Surface Web.

Well, starting now or very soon, people will be able to enter search queries and dredge a lot of Deep Web information up into the Surface Web. Check out the Web sites of, or do a Google search on, Bright Planet or Dipsie for examples of companies working on this. Perhaps we can look forward to a series of exposés during 2004-05 that reveal lots of sensitive institutional data that people thought was hidden but is not. How would you feel, for example, if someone went to a "white pages" search engine and was able to get links to your university's staff contact information that went right into your database without using the front door?

Institutions that want to think ahead a little bit might already be taking a look around and inventorying the kinds of data they have in non-password-protected databases that they'd just as soon not be findable via public search engines, and then moving that stuff behind better security. That's the technology fix.
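For readers who want to see what such an inventory might look like in practice, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the Resource record, the exposed_resources function, the paths and the sample robots.txt rules are all hypothetical, not any institution's real setup. It lists the resources flagged as sensitive that sit outside a login, and notes whether the site's robots.txt at least asks crawlers to stay away. Keep in mind that robots.txt is only a polite request to well-behaved crawlers; it is not security.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser


@dataclass
class Resource:
    """One entry in a hypothetical inventory of Web-reachable data."""
    path: str             # URL path on the institution's site
    requires_login: bool  # True if a password is needed to reach it
    sensitive: bool       # True if the institution would rather it not be found


def exposed_resources(resources, robots_txt, user_agent="*"):
    """Return (path, crawl_allowed) pairs for sensitive, unprotected resources.

    crawl_allowed reports only whether robots.txt permits the given
    crawler to fetch the path. A False value means the crawler has been
    asked to stay out, not that the data is actually protected.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    report = []
    for resource in resources:
        if resource.sensitive and not resource.requires_login:
            crawl_allowed = parser.can_fetch(user_agent, resource.path)
            report.append((resource.path, crawl_allowed))
    return report


# Hypothetical sample data, invented for this example.
SAMPLE_ROBOTS_TXT = "User-agent: *\nDisallow: /staff-directory/\n"

SAMPLE_INVENTORY = [
    Resource("/about", requires_login=False, sensitive=False),
    Resource("/staff-directory/search", requires_login=False, sensitive=True),
    Resource("/board/minutes-2003.pdf", requires_login=False, sensitive=True),
    Resource("/intranet/payroll", requires_login=True, sensitive=True),
]

if __name__ == "__main__":
    for path, crawl_allowed in exposed_resources(SAMPLE_INVENTORY, SAMPLE_ROBOTS_TXT):
        status = "crawlers permitted" if crawl_allowed else "crawlers asked to stay out"
        print(f"{path}: no login required, {status}")
```

Anything this kind of report turns up is a candidate for moving behind a password, whatever robots.txt says about it.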

I am sure we'll see lots of media coverage about outraged data publishers who bring suit against search engines that penetrate their Deep Web information. That's the legal "fix." But the legal "fix" didn't work before and I sure hope it doesn't work now, or in the future.
