The Patry Copyright Blog: The Way Back Machine and Robots.txt

Tuesday, July 12, 2005

On July 8th, a complaint was filed in the United States District Court for the Eastern District of Pennsylvania, Healthcare Advocates, Inc. v. Harding, Early, Follmer & Frailey, et al. This is such an extraordinary document that I will break with my usual practice of not commenting on complaints or motions. Those who decry the DMCA as an (attempted) tool of oppression will find more than ample support in this effort. Other laws are implicated too, including some I venture to guess most IP lawyers have never heard of, at least in the IP context; for example, a Greta Garbo-like claim for "Intrusion upon Seclusion." Others, such as the Computer Fraud & Abuse Act and trespass to chattels, have become better known recently but are invoked here in a novel way, to say the least. In my opinion (and all this is opinion whether denominated as such or not), the Healthcare Advocates complaint represents a misuse of the legal process.

The complaint appears to be the result of an earlier failed suit brought by Kevin Flynn and Healthcare Advocates (Flynn is the President) against Health Advocate, Inc. and others for various trademark and related claims. Three opinions in that case should be noted: 2004 U.S. Dist. LEXIS 293 (E.D. Pa. January 13, 2004) (dismissing a number of claims), 2004 U.S. Dist. LEXIS 12536 (E.D. Pa. July 8, 2004) (denying plaintiff's motion to amend complaint and denying defendant's motion for in camera review of the documents in question), and 2005 U.S. Dist. LEXIS 1704 (E.D. Pa. Feb. 8, 2005) (dismissing remaining federal claims and declining to exercise pendent jurisdiction over state fraud claim).

During the investigation of plaintiff's claims, a law firm for some of the defendants utilized the not-for-profit Internet Archive Wayback Machine. The Wayback Machine lets one access archived versions of websites. You type in the URL, select a date range, and presto, you can surf an archived version of the web page in question. It is a phenomenally important archive, useful to people throughout the world, including parties in lawsuits who want to find out what their adversary was saying in the past on a website that has been updated or revised potentially hundreds of times since the events in question. The Wayback Machine contains about 1 petabyte of data, more than that in the Library of Congress, even though the archiving only began in 1996. The archiving is accomplished by the Alexa webcrawler.

The Wayback Machine is not, however, interested in archiving material website administrators don't want archived, so it has developed a number of ways for people to say, "Please don't collect our stuff." You could telephone the Internet Archive and tell them not to. Or you can utilize the SRE (Standard for Robot Exclusion) to specify files or directories that cannot be crawled. This is accomplished by a file called robots.txt. (Here is a short article on the Wayback Machine and robot exclusion from Wikipedia, and here is a more technical explanation, Robots.txt.) Use of robots.txt is entirely voluntary and many webcrawlers do not utilize it, although the Alexa webcrawler is programmed to obey the robots.txt instructions, and in fact is constructed so as to block, retroactively, files in existence before the instructions were inserted.
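
For readers who have never seen one, a robots.txt file is just a few lines of plain text, and a well-behaved crawler consults it before fetching a page. The sketch below uses the parser in Python's standard library to show the idea. The file contents, the "ia_archiver" user-agent token, and the example.com URLs are illustrative assumptions, not taken from the complaint or from plaintiff's actual site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption for illustration only).
# The first record asks one named crawler to stay out of the whole site;
# the second asks every other crawler to skip a single directory.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    site = "http://www.example.com"  # placeholder domain
    checks = [
        ("ia_archiver", "/index.html"),
        ("SomeOtherBot", "/index.html"),
        ("SomeOtherBot", "/private/notes.html"),
    ]
    for agent, path in checks:
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, site + path) else "asked to stay out"
        print(f"{agent} -> {path}: {verdict}")
```

Note that nothing in the file enforces anything. The check above is one the crawler chooses to run against itself, and a crawler that never calls it can still fetch every page.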

Back to the Healthcare Advocates case. The complaint in the earlier suit against Health Advocate, Inc. was filed on June 26, 2003. Healthcare Advocates had been operating a website, www.healthcareadvocates.com, since 1998. On July 8, 2003, the robots.txt instructions were inserted. The next day, it is alleged, defendant's law firm tried to access archived Healthcare Advocates website material. In the court's July 8, 2004 opinion, an allegation is recited that between July 8, 2003 and July 15, 2003, 849 attempts were made to access the archived information, of which about 112 attempts were successful. Presumably, all of the material was pre-July 8, 2003 information.

Plaintiff sought to amend the complaint to bring claims against the law firm for this activity, but the court denied the motion. After plaintiff's complaint was dismissed as noted above, this new complaint against the law firm, its members and employees, and the Internet Archive was brought last Friday, July 8th.

There are 12 counts, too many to recite on this already too long blog. I will only talk about one, the DMCA claim, an alleged violation of Section 1201(a): "No person shall circumvent a technological measure that effectively controls access to a work protected under this title." It is alleged that the robots.txt denial text string is a technological measure and that defendant law firm circumvented it. This claim, in my opinion, is factually and legally wrong. Factually, at least from the complaint, it does not appear that the law firm "circumvented" anything, if by circumvent we mean devised a mousetrap to bypass the denial text string. Instead, it seems as if defendant kept banging on the URL until, for whatever reason, the denial failed to be recognized. This is like going down a row of houses and trying doors to see if they are open. If they aren't, you move on until you find one that is. If it is open, you walk in, but you certainly haven't circumvented an access control mechanism.

But just as importantly, I don't see how the robots.txt can meet the 1201(b)(2)(B) definition of a technological measure: it is a voluntary protocol, operated, if at all, not by the copyright owner but by a third party, and not all third parties have agreed to use it. The definition of a technological measure is one that "effectively protects a right of a copyright owner ... if the measure, in the ordinary course of its operation, prevents, restricts, or otherwise limits the exercise of a right of the copyright owner under this title."

In the ordinary course of the operation of plaintiff's website only those webcrawlers that had voluntarily agreed to do so would restrict access, and many don't. That can hardly meet the effective protection standard contemplated in the definition. And as a policy matter, plaintiff's theory would encourage good government archivists like the Internet Archive not to use voluntary measures on pain of a DMCA violation. Nor can one say that there was any quid pro quo here: the webpages in question were publicly available long before plaintiff decided to restrict access in conjunction with a much later filed lawsuit. And that is the worst policy of all.

38 comments:

Anonymous 1:03 PM
I'd like to suggest a different interpretation of their DMCA claim (while acknowledging that the complaint is not clear): that the robots.txt file operates as a TPM as used by the Internet Archive. The standing provisions of the DMCA have been interpreted broadly, so perhaps the plaintiff here is arguing that the Internet Archive has implemented a TPM that controls access to its archived materials. The robots.txt file is intended to block external access to these materials, and was bypassed by the defendants. (I'll admit, this sounds like the Archive's claim to bring, not the plaintiffs', but the DMCA's standing provision has been stretched before.)

I think the claim still fails for the other reason you note. But I don't think the complaint need necessarily be construed as arguing that robots.txt is a TPM generally.

joebeone 2:12 PM
The link to the complaint seems to have exceeded its bandwidth... if someone would care to send a copy to joehall@pobox.com, I'd be happy to repost it and put a link here in the comments.

Anonymous 2:57 PM
I uploaded it:

http://www.kevinwimberly.com/Healthcare_Advocates_v._Harding_Complaint__FINAL.pdf.pdf

Anonymous 7:23 PM
I have a pet peeve on an analogy you used and that is frequently used by others when dealing with Internet security issues: "This is like going down a row of houses and trying doors to see if they are open. If they aren't you move on until you find one that is. If it is open you walk in, but you certainly haven't circumvented an access control mechanism"

Almost every state considers the unprivileged opening of an unlocked front door to be an unlawful entry. It is most certainly a trespass. The point of unlawful entry or trespass occurs at the moment the door is opened and the threshold is crossed. The whole purpose of a house, locked or not, is to protect the contents and to act as "an access control mechanism." The DMCA anti-circumvention provisions -- conceptually a stupid idea for a copyright act (you might as well bootstrap all contractual disputes with respect to a copyrighted work into the copyright law, too) -- are designed to protect copyrighted works when the owner puts them inside a house, so to speak. Even leaving the door ajar doesn't suggest that anyone can come in and help themselves to the silverware.

William Patry 7:41 PM
Anonymous:

You are right to take me to task on the precise analogy. A better way of putting it would be doors to buildings that are, unless locked, open to the public.

Anonymous 8:35 PM
Is ROT13 an access control mechanism? Because this seems even less complicated than that.

Anonymous 3:51 AM
ROT13 is an access control mechanism.
And following the same logic, encrypting twice with ROT13 makes it even more so :-) .

William Patry 11:02 AM
Here's a link to an article about the suit in today's NY Times: http://www.nytimes.com/2005/07/13/technology/13suit.html

Anonymous 11:49 AM
This adds an interesting twist...

Anonymous 12:24 PM
To make a somewhat different version of Fred's point: I see the complaint as arguing that plaintiff's robots.txt file was a TPM because, in the normal course of affairs, it prevented people from accessing plaintiff's content via the Internet Archive, and therefore it "in the ordinary course of its operation . . . limit[ed]" an exercise of a right of the copyright holder.

Anonymous 12:41 PM
Oops -- like Bill, I quoted 1201(b) rather than 1201(a). The relevant definition of a TPM is a measure that "in the ordinary course of its operation, requires the application of information, or a process or a treatment, with the authority of the copyright owner, to gain access to the work." The theory of the complaint, I think, is that the robots.txt file does that because it normally precludes access via the Internet Archive. But does that hold together?

Harry Metcalfe 2:03 PM
How can a convention - adherence to which is optional, by definition - be considered a protection measure? Isn't that rather like having a lock with a lever that says 'open me', as well as having a hole for a key?

Robots.txt is advisory, not mandatory...

Anonymous 2:30 PM
The problem seems to be the retroactive hiding of pages by the Internet Archive, something that isn't specified by the robots.txt definition. The normal definition of blocking access via robots.txt is that *future* crawls will skip those pages, not that all previous references to that page are deleted. Why the Internet Archive takes this additional step is beyond me.

In any case, it is hardly an effective TPM, since they could have gotten the page from the Google cache, which doesn't retroactively delete.

I hope that this legal action sets a precedent on this matter, so that future lawsuits of this nature are harder to bring.

William Patry 3:53 PM
There is a text version of the complaint at: http://www.ip-wars.net/story/2005/7/12/185442/034

On Fred von Lohmann's and Jon Weinberg's point, isn't that reducing the definition to the point of absurdity: so long as something is a TPM for one person, the DMCA applies? And then of course it will always be effective for that one person. I had assumed, whether rightly or not, that Congress was referring to a TPM of general application.

BTW, when I was still working on the Hill, a company that shall go nameless but that provides anti-circumvention protection for the motion picture industry asked us to put an anti-circumvention measure in the GATT. We said, but don't you have a patent? They said yes. We said, and isn't what you're complaining about (at that time) an infringement of your patent? They said yes. We said, so why not sue for patent infringement? They said it's too expensive. It was a nice lesson that some in the private sector view Congress as a cheap alternative to patent litigation.

Mike 4:24 PM
I don't care for the complaint as I think that they got it legally and technically wrong.

That said, however, I think that there is something important that is often overlooked in these archiving schemes which does not sit right. Under the WBM's terms of use, the author has to opt out, not opt in. Whether you like the DMCA or not, that doesn't sound like traditional copyright at all. The WBM isn't just excerpting sections; it is copying everything verbatim and redistributing it. Worse yet, it may be "taking" content and authors may not even know it.

William Patry 4:47 PM
mmmbeer:

About 5 years ago, I litigated in the SDNY and Second Circuit Register.com v. Verio (for plaintiff). Opt-in versus opt-out was a big issue for terms of use and privacy in that case.

With publicly available websites (by which I mean non-password or otherwise protected), though, it seems there should be a healthy implied license, understanding that the license will be defined by a number of things, like custom. I would hope that at least something like the Internet Archive would fall within such a license. Amendment to 17 USC 108 is another option.

Mike 6:14 PM
If the WBM is relying on 17 USC 108, they're going to run into problems regarding copying. While a library or an archive in a traditional sense could conceivably avail themselves of such a thing and then make it open to the public, it's difficult to see how a digital archive like archive.org may equally be safe harbored. The language states pretty plainly that it protects the archiver when reproducing "no more than one copy or phonorecord of a work". We know from some copyright law that computers make many copies in the process. This is especially true for a site that then serves it to the public (retrieving a page once by one client then creates two--one on archive.org and the other in the user's cache).

That seems mighty dangerous.

William Patry 8:43 PM
I agree with mmmbeer that 108 doesn't cover everything the Internet Archive is doing; my point was that perhaps an amendment to 108 might be an approach. And with such a proposal, we could have a good public debate with policy makers about what types of uses we should permit.

Anonymous 11:31 PM
Circumvention of WHAT? A robots.txt file is merely netiquette, a means of asking the bots to "please leave this one alone." While most of the major search engines' crawlers abide by this, there are plenty of others that do not. It was never understood that obeying a robots file was mandatory. Will we go down the DMCA slippery slope and see actions against anyone whose crawler "circumvents" someone's robots file? Give me a break.

Mike 9:17 AM
kevin -
For starters, that's why I think that the complaint is technically (as in technologically) deficient. But, the more I think about it, the more I think that what archive.org does IS probably wrong.

Conceptually, archive.org's policy seems to place the "burden" on the wrong party. The burden, as I understand it, shouldn't be on the copyright holder to do anything more to prevent the wholesale, exact copying and distribution of everything on a site (adding a robots.txt or calling them or e-mailing them)--except insofar as litigation is required. We, in fact, would expect parties engaged in similar "real world" behavior to obtain either consent or license. As noted above, 17 USC has a number of exceptions, but they all seem deficient or simply not applicable.

As Mr. Patry suggested, archive.org might get a pass on an implied license of some sort (indeed, one could evidence a number of sites that do similar things: Google Images, etc). However, I'm not sure that even an implied license would go so far as to permit wholesale copying of everything on a site, in repetition, with unlimited reproduction and distribution rights. That seems a bit beyond what someone might expect another has the rights to do with their property. At least, it would seem that a good lawyer could poke holes in any such defense pretty quickly. More plausible might be a judge's more liberal reading of archiver exceptions.

Sounds like a law review article waiting to happen.

William Patry 10:03 AM
I agree with Kevin Brady on the netiquette remark for the DMCA claim, and I also agree with mmmbeer with his concerns about the copyright issue: one can, after all, defeat an implied license defense to copyright infringement just by saying "No more from now on out" and that is what the denial string request was. There remains, of course, a fair use defense.

Anonymous 10:34 AM
Mmmbeer:

What you and Mr. Patry said about an implied license sounds like a good idea. Be interesting to see how the court interprets this case and if they apply such a concept. However, I think your idea of an "archiver exception" is better. There needs to be some definitive fair use coverage for this sort of thing. If plaintiff's counsel can poke holes in an implied licensing defense, and the courts buy into it, a lot of search engines will be in big trouble, as an adverse ruling could also impact their ability to cache web pages (which is also the wholesale copying of sites).

I think of this as somewhat like billboards placed along the highway. We can all look at them as we drive by, just like we can surf into websites whether they are indexed or not. A robots tag is like saying "please don't photograph my billboard." Will that stop people from taking photos? To some extent, but you can bet that many still will. Then the question becomes, is photographing that billboard actionable? The answer may depend on the use of the taking, and that's where fair use should be looked at. Indexing and caching websites should be an activity that is encompassed within the fair use doctrine, or else the whole utility of search engines is greatly diminished.

The bottom line is: if you don't want crawlers getting into your online content but you want your users to, there is an easy technological cure. Use an image code verification script that requires the user to manually enter a randomly-generated character string. This is commonly used to exclude bots.

Mike 11:49 AM
I see problems with your analogies.

First, photographing billboards is plainly NOT the same as the slavish, wholesale copying of everything an advertising company or commercial entity has created, which is then reproduced and redistributed without limit. It would be a problem, and I'm sure that Mr. Patry would agree, if you did go around copying every billboard, compiling the copies, and subsequently making them available for free to everyone on your terms.

Second, I'm not sure that the temporary caching (as in Google's cache) of the latest version of a website really is the same thing either. Most obviously, the cache is temporary, marked up as a cached copy, and does not necessarily contain every element (as the cache usually does not contain stylesheets, images, scripts, etc).

Finally, again, I'm not sure that making the copyright holder work harder is really the correct burden (as in the making of silly verify-as-human tricks). Moreover, most websites WANT to be found, spidered, and indexed. That seems plainly within a custom and usage implied license. That doesn't mean that they want or expect the aforementioned "archiving."

Seth Finkelstein 11:38 AM
I've been writing about this at length over at my own blog (Infothought), with some technical speculations and a counter-argument:

Internet Archive DMCA Circumvention Lawsuit
http://sethf.com/infothought/blog/archives/000877.html

Internet Archive DMCA "Circumvention" - Access vs. Copying
http://sethf.com/infothought/blog/archives/000878.html

Proposition: OPT-OUT controls are not DMCA access controls
http://sethf.com/infothought/blog/archives/000879.html