Toronto's Now Magazine
Thursday, February 13, 1997

Oddball archivist aims to save the Net

The ambitious plan is to preserve ocean of chatter by saving literally everything

by Colman Jones, colman@ican.net

As you read this, millions of bytes of information are flowing across the Internet - texts, images and software - much of which will disappear into the black hole of cyberspace, lost forever.

But a group of computer scientists in California has now embarked on a most ambitious plan to preserve this ocean of electronic chatter - by saving literally everything online.

The project, known as the Internet Archive, involves systematically making and storing copies of everything accessible over the Internet, including the World Wide Web, newsgroups and even downloadable software.

The archive's Web site claims it will "provide researchers, scholars, and others access to this vast collection of data (reaching 10 terabytes) and ensure the longevity of the information." (A terabyte is a million megabytes, an amount of data that would take 2,000 500-MB hard drives to store.)

Chief curator

The operation is being headed up by Brewster Kahle, a computer scientist who has adopted the role of chief curator of the world's digital history. "Our goal is to construct a digital library", Kahle says. "What a library does is stand behind published materials, so that when they fall off the newsstand or bookstore shelf there's still somebody there who will provide them."

The job of methodically scanning vast regions of cyberspace is done by an automated software program (a robot), using techniques similar to those employed by popular search engines like AltaVista, Excite, or OpenText.
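For readers curious what such a robot does, here is a minimal sketch of the general technique: start at one page, save it, follow its links, and repeat. It is not the archive's actual software. The names `crawl`, `LinkExtractor`, `fetch` and `max_pages` are hypothetical, and the `fetch` function is supplied by the caller, so the sketch can be tried against a handful of in-memory pages without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lowercase.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl from start_url.

    `fetch(url)` is a caller-supplied function (an assumption of this
    sketch) that returns the page's HTML text, or None if unavailable.
    Returns a dict mapping each saved URL to its HTML.
    """
    archive = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html_text = fetch(url)
        if html_text is None:
            continue
        archive[url] = html_text  # "save a copy" of the page
        extractor = LinkExtractor()
        extractor.feed(html_text)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return archive
```

A real robot would also need to pause between requests, honour site owners' exclusion requests and cope with malformed pages, all of which this sketch leaves out.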

Kahle has financed most of the project himself out of a small fortune he amassed when he sold his Web publishing company, Wide Area Information Servers (WAIS), to America Online. Since setting up the archive last year in a renovated U.S. army hospital in San Francisco, Kahle says he has already stored two terabytes of data, a mountain of information that's increasing at the rate of about half a gigabyte a day.

This unprecedented attempt to capture the ocean of data on the Net has, however, provoked concern among copyright specialists and privacy activists. "There are a lot of heavy-duty serious problems in terms of copyright", says Marcel Mongeon, an information technology lawyer and president of the Hamilton-Wentworth Freenet.

Mongeon points to the increasing number of Web pages sporting copyright notices, forbidding prospective surfers from plucking text or graphics to use for other purposes. "The (archivist) would be violating the copyright of that person by making a copy onto his system, and by letting other people have access to it."

In Detroit, Wayne State University law professor Jessica Litman, one of the founders of the Digital Future Coalition and a specialist in copyright law, says that the mere act of gathering and storing information is unlikely to attract litigation - search engines do it all the time. The critical question, she says, is who can then access this treasure trove of information.

Much trickier

Litman notes, "While it's an infringement to make that copy, I think it gets much trickier when he starts making it available to people. When you're making money, you should be asking permission and sharing some of that money with the author or the copyright owner of that material."

Even if Kahle doesn't charge users for subsequent access to this resource, he's still potentially liable for infringement, says Mongeon.

"Copyright is breached even by a mere copying of the material."

As it turns out, Kahle seems remarkably unperturbed by the legal challenges his project may eventually face.

"There are many issues that are unresolved in this whole area", he admits, "as I'm sure they were unresolved at the beginning of the print era. What we're trying to do is help demonstrate that there's value in the content of the Net that's beyond just the ephemeral nature new media tend to accumulate. They didn't collect a lot of the early printed work, they didn't collect a lot of the early films, and we think that was really too bad."

Asked directly if he's going to be charging people for access to the archive, Kahle says, "The basis of the copyright model is to protect the business model of the rights holders, and we're not trying to go into conflict with that at all. We're very sensitive to those issues and the wishes of rights owners. If people don't want us to have something, we'll purge it."

Kahle points out that anyone can insert codes, called HTML meta-tags, in their documents, which will prevent the robots from saving that particular page. All they need to do, he says, is include a line that reads <META NAME="ROBOTS" CONTENT="NOARCHIVE"> at the top of the Web pages they don't want archived. (Those anxious to have their site archived on Kahle's hard drives, on the other hand, can fill out a form at http://www.archive.org/invite.html.)
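As an illustration of how a robot might honour that request, the sketch below scans a page for a ROBOTS meta-tag and reports whether archiving is permitted. It is a guess at the general approach, not the archive's own code. The names `RobotsMetaParser` and `may_archive` are hypothetical, and the sketch assumes the directive may be written in any letter case, with or without a space ("NOARCHIVE" or "NO ARCHIVE").

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <META NAME="ROBOTS" CONTENT="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lowercase.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").strip().lower() != "robots":
            return
        for token in (attr.get("content") or "").split(","):
            # Assumption: drop internal spaces and case so that
            # "NO ARCHIVE" and "noarchive" are treated alike.
            normalized = "".join(token.split()).lower()
            if normalized:
                self.directives.add(normalized)


def may_archive(html_text):
    """Return False if the page opts out of archiving via a robots meta-tag."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "noarchive" not in parser.directives
```

For example, `may_archive('<META NAME="ROBOTS" CONTENT="NOARCHIVE">')` returns `False`, while a page with no such tag returns `True`.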


Copyright © 1997 by Now Magazine. All Rights Reserved. Reprinted with permission.