Google Print: Hypos for My Copyright Class
In my last post on Google Print—since rechristened Google Book Search—I raised a particular law and economics concern about the project (see Fair Use and Inefficient Bundling). That post generated a flurry of comments—thanks!—and now I want to try a different approach to assessing the project: the world of the law school hypothetical. Try two different versions of what Google Book Search is doing to evaluate the actual project.
Google Index: In truth, for most books, the index at the back of the book leaves much to be desired (academic speak for "the average index stinks"). In the rush to finish the book, the author or the publisher slaps a list of terms and page numbers together, and voilà, the index is done. Larry and Sergey know this, so they announce the Google Index project. Google will scour the world for the best indexers, promise them as much free chicken-apple sausage as they can eat, and give them each a stack of books. Read the book, create an index, and put the index online. (We could also imagine a wiki-index project if you prefer less centralization.)
Google Index would be searchable using the standard Google search structure and could be advertising supported. But here is how Google Index would differ from Google Book Search. Google Book Search returns one of three results depending on the copyright access Google has to the work. In some cases, you get the whole book, and your search takes you to the relevant page. In other cases, you can only move back and forward a couple of pages from the found page. And in the narrowest result, what Google calls the “snippet view,” you see only fragments of text—the search term in a limited text context—and the page numbers associated with that text. That view includes a link addressing the missing text. Says Google:
We respect copyright law and the tremendous creative effort authors put into their work. So, unless any given book’s publisher has given us permission to show sample pages, you'll only be able to see the Snippet View which, like a card catalog, shows information about the book plus a few snippets—a few sentences of your search term in context. If the book isn’t under copyright at all, you can browse the entire book in the Full Book View, but the aim of Google Book Search is to help you discover books and learn where to buy or borrow them, not read them from start to finish. It’s like going to a bookstore and browsing—with a Google twist.
The hypothetical Google Index would take one step back from the snippet view. It would return just the basic info on the book—author, title, ISBN and perhaps a link to Amazon to buy the book—and the page number relevant to the search, just like a paper index.
Google Digital Index. Version 2 of the hypo. Enough with breakfast, says Google. Instead of human indexers, Google takes physical copies of the books, digitizes them, sics high-end software on the digital copies, and produces an index for the books. Google destroys the digital copies, returns the physical books to the libraries, and opens for business. Again, a search on Google Digital Index generates only author/title/ISBN info and the page number in the physical book relevant to the search.
Where does this put Google? Is the index a derivative work? Does the presence of interim copies in the second version matter (I think Bill Patry says no)? Does the fair use analysis change if page numbers are returned as a search result rather than limited amounts of actual text plus page numbers as in the snippet view?
This is a great thought experiment! I think your two index examples would both pretty clearly be fair use. The courts have upheld the creation of "intermediate copies" (in Kelly v. Arriba Soft and Sega v. Accolade, for example), in cases where the creation of the copies was an essential step toward a use that is ultimately fair. This was a recognition, I think, that making copies is what computers do. If you apply a simplistic "no copying" rule to the digital world, you end up prohibiting a lot of otherwise fair uses based solely on the fact that they create intermediate copies as an incidental part of the operation of the software. I don't think that makes a lot of sense.
So I think most people would agree that the "digital index" version--in which the intermediate copy is discarded--is fair use. The more difficult question is whether it's a fair use if that copy is kept, not distributed, but used to create "snippets." I think the answer ought to be yes, for roughly the same reasons as I mentioned above: the copies are still "intermediate" in the sense that they are never made available in full to human beings, and the use to which they are being put (displaying snippets) is likely to be a fair use itself.
Here's something else to consider: Depending on the internal format Google uses, there might not be much practical difference between an "index" and a "digital copy." If Google Print's index included every word in the book along with a list of every spot in the book where that word appeared, you might be able to reconstruct the book from the index. (After all, a really good digital index should be able to search for words adjacent to each other, and that's impossible if the index merely contains words and page numbers.) What's the difference between a comprehensive index (which could be used to reconstruct and display the book if you had the right software) and a "digital copy" (which human beings can't see without the right software)?
So it's not obvious there's even a clear distinction to be drawn between a "digital copy" and an "index." It would probably be a bad idea for fair use determinations to hinge on the precise format of the index.
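To make the reconstruction point concrete, here is a minimal sketch in Python. It assumes a hypothetical index that stores, for every word, each position where the word occurs; nothing here reflects Google's actual internal format, and the function names are made up for illustration.

```python
from collections import defaultdict


def build_positional_index(text):
    """Map each word to the list of word positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(text.split()):
        index[word].append(position)
    return dict(index)


def reconstruct(index):
    """Rebuild the word sequence from a positional index alone."""
    slots = {}
    for word, positions in index.items():
        for position in positions:
            slots[position] = word
    # Every position from 0 to len(slots) - 1 is filled, so read them in order.
    return " ".join(slots[i] for i in range(len(slots)))


text = "the quick fox saw the slow fox"
index = build_positional_index(text)
rebuilt = reconstruct(index)
```

In this toy case `rebuilt` is identical to `text`: an index detailed enough to support adjacent-word searches carries the same information as the text itself, just arranged differently.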
Posted by: Tim Lee | November 18, 2005 at 08:55 AM
Provocative hypotheticals!
I think the argument for treating your Google Index entries as derivative works of the books indexed is weak. I suppose the argument would be something like, "The Index material includes creative expression (key words and phrases) copied from the indexed work. The fact that the copied creative expression is rearranged as an index is of no moment." The problem here is that the typical index entry is usually just a word or two (or three) long. Can the book author claim a protectable interest in creative expression that is two or three words long? In other contexts (e.g., cases about book titles or advertising phrases), courts have rejected such copyrightability claims. Perhaps the book author is on stronger ground against your Google Index because so many bits of expression are copied cumulatively in the Index (after all, the more comprehensive, the better the index).
Tim Lee raises, in a sense, a turbocharged version of that last point: If your Google Digital Index were powerful (e.g., allowing searches for phrases, even long phrases, and not just words), couldn't one reconstruct the book from the index? In the limit, isn't a digital index just the book in a different form? I suppose it matters whether, given what's in the (hypothetical) database, I can get more than merely a list of the pages on which a given word or short phrase occurs. Moreover, even if I could get a list of all the words that appear on a given page, it seems I wouldn't yet have the book author's creative expression (because I don't know the order in which the author put them).
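The contrast with a page-level index can be sketched the same way. The Python below is only an illustration under the hypo's assumptions (an index that returns page numbers and nothing else); the function name and sample sentences are invented for the example.

```python
from collections import defaultdict


def build_page_index(pages):
    """Map each word to the sorted list of page numbers it appears on."""
    index = defaultdict(set)
    for page_number, page_text in enumerate(pages, start=1):
        for word in page_text.split():
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


# Two "books" with the same words on each page, in a different order.
original = ["the dog bit the man", "the man sued"]
scrambled = ["the man bit the dog", "sued the man"]

same_index = build_page_index(original) == build_page_index(scrambled)
```

Here `same_index` is true: both texts produce an identical page index, so the index alone cannot tell you which ordering the author actually wrote.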
Posted by: Joe Miller | November 18, 2005 at 10:20 AM
The talk about indexing is interesting, in that it highlights confusion over the basic idea of copyright. Some people think of copyright as a control right to prevent people from using a work at all (Bridgeport). Others think copyright is a market-protection scheme, and there is yet another view that copyright protects instantiations of the identifiable elements of a work.
I think the last view is the most accurate; indexes aren't taking the purpose of the work, and therefore don't seem bad.
As far as each bit of a work being part of the work, that is true, but the parts aren't equal. If I tell you about a book I read involving young people who discover they are wizards and have to fight off evil without letting the normal people know they exist, you'll probably think of Harry Potter. But that describes several books -- those elements are not protectable. It's tough to draw a line as to what is and isn't protected, but an index seems to fall short.
I like your example of taking all the words and scrambling the order; it shows that some elements of a work matter more than others, and that the protected content is something more than the sum of its parts.
Finally, I don't see that the ability to copy/obtain a work (by conglomerating Google book searches) makes a difference. It's possible to copy a rented DVD, but Blockbuster isn't culpable for copyright violation if I make a copy. Besides, the barrier to getting a book in digital form is fairly low: it just takes one person to spread copies, and this happens within a day or two for books like Harry Potter.
Posted by: D Conrad | November 18, 2005 at 08:57 PM
My understanding of the Google Digital Index as presented is that Google would scan the entire book, run the algorithms, and then ditch the digital copy. As a result, I think there is a problem with the second hypo that is not present in the first.
This problem is, essentially, why should Google be allowed to create complete digital copies in order to create indexes? After all, do human authors require the full text to create an index? I do not believe so, because an index entry generally refers to only a limited portion of text. Thus, copying the book in its entirety could pose problems for fair use, as the portion copied is larger than is necessary to the task.
While someone might think Kelly v. Arriba Soft is applicable to this question, this view would be in error, as the technologies and processes in Kelly and this hypo are considerably different. Kelly was about digital images, which due to compression algorithms can only be observed and shrunk once the whole file is copied and interpreted. In comparison, there is no such compression algorithm to contend with when examining books, nor does anyone likely require more than a handful of paragraphs or pages at a time to create an index.
More importantly, while the Google Digital Index hypothetical claims to destroy the copies, let's be honest: complete digital copies of books are pretty tempting things to keep around. Even if Google (or its competitors) is effective in eliminating the digital copies on its servers, there is always a possibility that an employee (or an employee of a competing service) will share a copy with a friend, who shares it with another friend, who does the same and so on until everyone's friend on the Internet has a copy. Therefore, this hypo poses exactly the same problem as Lichtman pointed out with Google Book Search, which is that the process involved poses a considerable risk of digital piracy toward protected works.
Equally important to our consideration is that, in creating an index from a book, there is likely no fair use benefit from creating complete digital copies of books when smaller portions will suffice. Regardless of its legality, Google Book Search requires that Google retain complete digital copies, but no such need likely exists for Google Digital Index. As a result, fair use should not protect an entirely unnecessary process that creates such considerable risks to copyright holders, even if the end result of the process is not infringing.
Posted by: Cory Hojka | November 22, 2005 at 07:45 PM