Publishers, AI, and copyright. Some ideas for moving forward

Summary: The article discusses the ongoing dispute between publishers and artificial intelligence (AI) in the context of copyright law. It highlights the historical influence of copyright anarchists on the internet, emphasizing Google’s bias towards free content. The benefits of copyright law are outlined for individual creators, corporations, and society, while also acknowledging the importance of exceptions like “fair use” and libraries. The article raises questions about the use of large language models (LLMs) and suggests the need for rules to calculate similarity to copyrighted works and source citation in AI-generated content. It also recommends considering Creative Commons licenses as an alternative approach for publishers in addition to pursuing copyright challenges.

I’m not an expert on copyright law, but I can provide some background and context, and maybe some ideas, about how copyright plays into the ongoing dispute between publishers and artificial intelligence.

The first thing to recognize is that the internet, and the tech space generally, was strongly influenced by copyright anarchists. There was a lot of talk about how “information wants to be free,” and the internet was seen as a mechanism for that.

Google is biased in that direction. Their mission is “to organize the world’s information and make it universally accessible and useful.” And if you read what they say about that mission, it definitely shows a bias towards free content. They’ve also shown very little care about copyright.

Who benefits from copyright law?

  • Individual creators benefit because they have a chance to make a profit from their work. There’s little incentive to create a new work if some corporation — with a much larger marketing and production budget — can steal your idea.
  • Corporations benefit when they buy or license content from creators. (E.g., Disney lobbied to extend copyright protections so they could continue to make money off Mickey Mouse, but they also use works in the public domain, like The Hunchback of Notre Dame.)
  • Society as a whole benefits. Copyright laws enable investment in new ideas, but when copyright laws are too strict they can slow down innovation.

There are exceptions to copyright, like “fair use,” which allows limited use of copyrighted material without getting permission from the author. Fair use is important for research, education, criticism, commentary, and building on pre-existing concepts. It’s very important in academia.

Libraries and other forms of lending are also a kind of exception to copyright, and they serve the social benefit of giving the poor access to material they can’t afford.

Libraries raise an interesting question about LLMs. If a large language model is trained on 1000 books, how is that different from a person going to the library, reading 1000 books, then writing what he thinks about the topic?

It’s an interesting question, but I think it’s not the best analogy.

  1. A human author still has to abide by rules about exact quotes. LLMs have been found to quote the exact language of copyrighted works.
  2. A human does not have perfect memory of what he read. An LLM can memorize and reproduce some passages from its training data word for word.
  3. Publishers allow their content to be loaned by a library — at least in part — because it helps to promote sales of the work, through word of mouth, reviews, and such. There doesn’t seem to be an analogy to how an LLM uses the content.

The purpose of copyright is to encourage people to create new works by giving them a way to profit from those works. This is good for society because it fuels innovation. If authors know their work will immediately be stolen by AI, there is less chance for profit.

But copyright isn’t forever. At some point a work enters the public domain, and there are legitimate arguments over how long a copyright should persist. Rules governing AI need to take into account public domain vs. copyrighted works.

Copyright varies country to country, but the internet is international. This complicates the matter.

Solutions

There has to be a way to calculate the proximity of one set of words to another, and the extent to which copyrighted material is used in a chat response. Rules could be set around this. For example, within a low range of proximity, the LLM doesn’t need to cite its sources. Within a medium range of proximity, it needs to cite and link to the source. A high range of proximity would be illegal if copyrighted material makes up a certain percentage of the response.
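To make the idea concrete, here is a rough sketch of how such a proximity score and its three ranges might be computed. It is only one possible approach, not an established standard: it measures what fraction of the response's word sequences (n-grams) also appear in a source text. The function names, the n-gram length, and the two cutoff values are all assumptions made up for illustration; real rules would have to come from lawmakers or courts.

```python
def word_ngrams(text, n=5):
    """Return the set of n-word sequences in a text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def proximity(response, source, n=5):
    """Fraction of the response's word n-grams that also appear in the source."""
    response_grams = word_ngrams(response, n)
    if not response_grams:
        return 0.0  # response is shorter than n words; nothing to compare
    shared = response_grams & word_ngrams(source, n)
    return len(shared) / len(response_grams)


# Hypothetical cutoffs, chosen only for illustration.
LOW_CUTOFF = 0.10
HIGH_CUTOFF = 0.50


def classify(response, source, n=5):
    """Map a proximity score onto the three ranges described above."""
    score = proximity(response, source, n)
    if score < LOW_CUTOFF:
        return "no citation needed"
    if score < HIGH_CUTOFF:
        return "cite and link to source"
    return "too close to the source"


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog near the river bank"
    response = "a quick brown fox jumps over a sleeping cat"
    print(proximity(response, source, n=3), classify(response, source, n=3))
```

A simple overlap count like this only catches near-verbatim copying. It would miss close paraphrase, and it says nothing about whether a given use counts as fair use, so it could be at most one input into the kind of rules described here.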

Creative Commons licensing might offer another option. This allows the author to specify how their content may be used. Creative Commons licenses were created to work internationally, so that avoids some of the shortfalls of copyright.

Publishers should continue to pursue copyright challenges to LLMs, but they should also consider using the Creative Commons path — that is, to specify the terms under which their content may be used.
