Full Story
Like other online publishers, The New York Times charges readers to access articles on its Web site. But why pay when you can use Google instead?
Through a caching feature on the popular Google search site, people can sometimes call up snapshots of archived stories at NYTimes.com and other registration-only sites. The practice has proved a boon for readers hoping to track down Web pages that are no longer accessible at the original source, for whatever reason. But the feature has recently been putting Google at odds with some unhappy publishers.
"We are working with Google to fix that problem--we're going to close it so when you click on a link it will take you to a registration page," said Christine Mohan, a spokeswoman at New York Times Digital, the publisher of NYTimes.com. "We have established these archived links and want to maintain consistency across all these access points."
Google offers publishers a simple way to opt out of its temporary archive, and scuffles have yet to erupt into open warfare or lawsuits. Still, Google's cache links illustrate a slippery side of innovation on the Web, where cool new features that seem benign on the surface often carry unintended consequences.
The issue is particularly relevant at Google, a company that prides itself on creativity and routinely floats trial balloons for new features and services. Its culture of innovation may become increasingly risky as Google, which draws millions of visitors to its site daily and redirects them to others through secretive search formulas, cements its position as one of the most popular and powerful companies on the Web.
At the heart of Google's caching dilemma lies a thorny legal problem involving a core Web technology: When is it acceptable to copy someone else's Web page, even temporarily?
A phantom life for dead pages
Google's cache, a feature introduced in 1997, is unique among commercial search engines, but it's not unlike other archival sites on the Web that keep digital copies of Web pages. Google's relatively little-known feature lets people access a copy of almost any Web page, within Google's own site, in the form it was in whenever last indexed by the search giant. That could mean the page accessed is either minutes or months old, depending on when Google last crawled it.
Unlike formal Web archive projects, Google says its cache feature does not attempt to create a permanent historical record of the Web. Rather, the company actively seeks to delete dead links; once a Web page disappears, the search engine seeks to purge that record and any related cached page as quickly as possible.
Still, Google's cached pages have proven to be a treasure trove for investigators seeking to recover data pulled from public Web sites. In one high-profile example, security and privacy expert Richard Smith copied Web pages detailing the backgrounds of Dr. John Poindexter, head of the Pentagon's Information Awareness Office (IAO), and other officials, from the Google cache days after they were removed from the IAO Web site. The pages were deleted after public reports surfaced on the office's development of a massive computer system to spy on Americans and potential terrorists.
"When something's been yanked, Google cache is a good place to grab it and save for posterity, because you don't know how long Google will have it," said Smith.
Google claims its caching feature benefits Web surfers by letting them access a site that may be malfunctioning or offline. Also, its cached pages highlight terms that match a search query "to make it easier for users to find relevant information," according to a spokesman at the Mountain View, Calif.-based company.
Lawyers, start your search engines
As seemingly benign and beneficial as it is, some Web site operators take issue with the feature and digitally prevent Google from recording their pages in full by adding special code to their sites. Among other arguments, they say that cached pages at Google have the potential to detour traffic from their own site, or, at worst, constitute trademark or copyright violations. In the case of an out-of-date news page in Google's cache, a Web publisher could even face legal troubles because of false data remaining on the Web but corrected at its own site.
For this reason, search experts and copyright lawyers expect the issue to come up in a court of law, joining the leagues of copyright disputes that have surfaced because of technology innovation.
"It's very much an issue that has yet to be tested, and I fully expect that it will be," said Danny Sullivan, industry pundit and editor of Search Engine Watch.
Admittedly, Google's cache is like any number of backdoors to information on the Web. For example, proxy servers can be the keys to a site that is banned by a visitor's hosting Web server. And technically, any time a Web surfer visits a site, that visit could be interpreted as a copyright violation, because the page is temporarily cached in the user's computer memory.
The digital universe is constantly changing, but its content can be either fleeting or permanent. Several Web sites, including the Internet Archive Wayback Machine and the Sept. 11 Digital Archive, have surfaced to preserve information on the Web and to keep permanent historical accounts of events and Web pages. Yet, many more pages, and even those in Google's cache, are eventually lost in the digital ether. The average lifespan of a Web site is 100 days, according to estimates by the Internet Archive.
Still, copyright lawyers and industry experts say that there are legally uncharted waters around a commercial caching service.
"Many of us copyright lawyers have been waiting for this issue to come up: Google is making copies of all the Web sites they index and they're not asking permission," said Fred von Lohman, an attorney at the Electronic Frontier Foundation. "From a strict copyright standpoint, it violates copyright."
Most search engines make a statistical record of a Web page when they "spider" it, or use "robots" to scan the page for meaning or context to related queries. For example, the engine can point to specific information contained on a page that's related to a search term, but it often doesn't have the complete picture of the page. Google goes one step beyond, however, by taking a digital picture of pages and making it available to visitors in cached links. Those pictures exist temporarily on its site until the next time Google crawls that particular page, which can happen in a few days or in six weeks or more.
Legally, what could differentiate Google from other archival sites that record pages is that it is a commercial site and that it has enormous scope and influence on the Web.
But what's kept the feature off most Web sites' radar is that, anecdotally, most people don't click on the cache. Even Google says people only "occasionally" click its cached links. If more people did, Web publishers might lose visitors--and potentially advertising dollars, which no one can afford to lose as Web publishing gets back on its feet.
Practically speaking, Web sites can "opt out," or include code in their pages that bars Google from caching the page. A tag to exclude "robots" such as "www.nytimes.com/robots.txt" or "NOARCHIVE" typically does the job. And that's largely what's kept the cache feature from being controversial.
Search Engine Watch's Sullivan said that, even though some publishers are wary of the caching feature, many don't block Google's robots for fear of losing favor in the company's powerful search rankings. He said some Webmasters believe there's a stigma associated the "no cache" tag, because many sites that use it have been accused of attempting to use banned methods to manipulate Google's rankings. Google said the "no cache" tag does not affect rankings.