Google vs. publishers. Part 1 — Copyright

The tug of war between Google and publishers

Introduction

In this 4-part series I’m going to try to give some perspective on how the publishing industry got into its current mess with artificial intelligence.

Part 1 covers some background information about copyright. In Part 2 I’ll discuss the search bargain – why publishers allowed tech giants to crawl and index their content. In Part 3 I’ll talk about how generative AI has changed the equation. In Part 4 I’ll discuss what comes next.

Part 1 – Copyright

As I’ll discuss in more depth later, Google is a copyright anarchist. They take other people’s work and do what they please with it. This is partly a matter of business operations and partly a result of their mission statement.

Technology disrupts copyright

To some extent you can think of copyright as “the right to copy.” As publishing technology got better, it became easier and easier to make reproductions. Before the dawn of modern copyright with the Statute of Anne in the early 18th century, the right to copy was controlled by royal charters or privileges granted to specific printers or publishers. The Statute of Anne vested that right in the author.

The more relevant point for our discussion is how easy it has become, over the years, to copy things.

When I was a lad it was common to buy an album and copy it to a cassette tape. Sometimes we’d make copies for friends. That was a violation of the music company’s copyright, but we didn’t care, and nobody enforced it.

Copyright Anarchists

With every advance in technology it gets easier to produce content – and to make copies – and this has contributed to the idea of universally accessible information.

In the early days of the Internet, a lot of people saw that dream coming true. The Internet could give everyone in the world access to all the world’s information. A substantial subset of the technically inclined people and businesses promoted an attitude that “information should be free” and freely accessible.

Napster was a good example. They created a file-sharing site that took what I was doing as a kid – making copies of albums on cassettes – and shot it to the moon. Everyone who joined Napster had access to everyone else’s digital music library. It was a clear copyright violation and put a big dent in music sales.

Eventually they were shut down, but Napster was emblematic of the problem – the clash between free access to content and copyright.

A friend was a big fan of Napster and I challenged her on the ethics of taking copyrighted content like that. (By this time in my life I had given up on my former thieving ways and believed in copyright.) She said that taking a physical object is stealing, because it robs the owner of the object. But copying a digital object doesn’t harm the owner in any way. He keeps what he had.

When I asked about the harm done to the musicians and music companies, she said they were greedy bastards and deserved it anyway.

Google is the ultimate corporate expression of this “information wants to be free” attitude. Their mission is “to organize the world’s information and make it universally accessible and useful.” That means free, which they’ve confirmed multiple times in their annual reports and other public statements.

Google wants content to be free, supported by ads that they control.

Who Benefits from Copyright?

We’re going to see a pattern here: when Big Tech companies do what they want with content, they end up undermining the very thing they’re trying to do.

Copyright is an important legal concept that protects individual creators and corporations, and helps society as a whole.

It helps individual creators by giving them the right to determine how their content will be used. If there were no such protection, creators would have very little incentive to make anything because big companies would just take it and use their greater resources to outcompete the creator.

It helps companies in a similar way. When a company buys the rights to some content, they don’t want their competitors stealing their ideas.

It helps society by providing an engine for creativity, new ideas, and advances – in technology, in art, in science, and in many other ways.

Exceptions to Copyright

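Copyright isn’t absolute. We grant exceptions for fair use, where an author can use a portion of somebody else’s copyrighted material, usually a small one, without their consent for purposes such as commentary, criticism, or teaching. Giving credit to the original creator is good practice, but attribution alone doesn’t make a use fair.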

There are special provisions in copyright law that allow libraries to do things that would otherwise infringe on copyright. The goal is to balance the interests of copyright holders and the general public.

Lending a book is not considered a copyright violation due to things such as the first-sale doctrine, which also allows a book owner to give away or sell a book.

These legal principles were all established before the advent of digital content, and some of them don’t apply to digital content as well as they apply to physical media. For example, lending a digital book is a very different thing than lending a physical book.

AI and Copyright

The way a computer processes information is very different from the way a human does. For example, when a traditional chess program plays, it examines millions of possible sequences of moves from the current configuration of the board, scores the positions they lead to, and picks the move with the best score. I’m no expert on chess, but that’s very different from the way a human plays the game.

In the same way, the way AI processes words is very different from the way a human does. A word like “king” is represented in an AI system by a multidimensional vector, something like [0.2, -0.4, 0.7, …]. There might be hundreds of dimensions, and the values are assigned so that words with similar meanings are located close to one another in this multidimensional space. Words might be close along one axis and not along another. “King” and “queen” are close in the context of being a ruler, but they’re not close in the context of sex, while “king” and “duke” are close in the context of sex, but a little farther apart in rank.

These sorts of vectors allow a computer to use math to figure out words, like king – man + woman = queen.
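To make that arithmetic concrete, here is a minimal sketch in Python. The three dimensions (royalty, maleness, rank) and every number in the table are made up for illustration; real systems learn hundreds of dimensions from text, and none of them come with tidy labels like these.

```python
import math

# Hypothetical hand-assigned vectors: (royalty, maleness, rank).
# These values are assumptions for illustration, not learned embeddings.
TOY_VECTORS = {
    "king":  [1.0,  1.0, 1.0],
    "queen": [1.0, -1.0, 1.0],
    "duke":  [1.0,  1.0, 0.6],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to a - b + c, skipping the three input words."""
    target = [
        x - y + z
        for x, y, z in zip(TOY_VECTORS[a], TOY_VECTORS[b], TOY_VECTORS[c])
    ]
    candidates = [w for w in TOY_VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, TOY_VECTORS[w]))


print(analogy("king", "man", "woman"))  # prints "queen"
```

Subtracting “man” removes the maleness, adding “woman” puts the opposite value back, and the nearest remaining word is “queen.” Nothing in the code knows what a king is. It is only comparing lists of numbers.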

AI models don’t understand anything. They just have a complicated mathematical representation of words and phrases that is derived from processing huge amounts of text.

Some people claim that allowing a large language model like ChatGPT to be trained on a library of information is no different from a human going to a library, reading 100 books, and writing a new book on that topic.

There are some important differences.

  • A human author does not have perfect recall. A computer can.
  • A human processes the information in a completely different way. In other words, all the balancing and tinkering and rules we’ve developed about libraries were constructed with humans in mind, not computers. It’s not likely they’ll all apply in the same way.
  • When a human relies on a particular source extensively, he’s obligated to cite it. AI doesn’t do that. It doesn’t give any credit to the original author.

Possible solutions

In addition to rethinking all our previous assumptions about copyright in light of AI, we need to make distinctions between works in the public domain and works that are still under copyright. For example, it might be acceptable for AI to have unlimited access to works in the public domain, while it would have to get permission from the copyright owner for works that are still protected.

The very technology that makes a large language model work could be used to create expectations about when the AI needs to give credit to the original author.

Remember those multidimensional vectors I mentioned? A similar technology could determine how similar the output of an AI system is to any given work, and different rules could be established for degrees of similarity. For example, if the output is very similar to pre-existing text, the AI might have to provide a citation, or even to pay for the use of that content.
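Here is a rough sketch of how such a rule might look in Python. It is an illustration under stated assumptions, not a proposal for the actual numbers: the two thresholds and the three action names are invented, and simple word counts stand in for the learned vectors a real system would use.

```python
import math
import re
from collections import Counter

# Hypothetical thresholds. Real values would be set by law or by contract.
LICENSE_THRESHOLD = 0.9   # nearly a copy: pay for the use
CITE_THRESHOLD = 0.5      # substantially similar: credit the source


def word_counts(text):
    """Turn text into a bag of lowercase words (a stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def similarity(output_text, source_text):
    """Cosine similarity between the word counts of two texts, from 0.0 to 1.0."""
    a, b = word_counts(output_text), word_counts(source_text)
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def required_action(output_text, source_text):
    """Map a similarity score to one of three illustrative obligations."""
    score = similarity(output_text, source_text)
    if score >= LICENSE_THRESHOLD:
        return "license"
    if score >= CITE_THRESHOLD:
        return "cite"
    return "none"


source = "the cat sat on the mat"
print(required_action("the cat sat on the mat", source))   # license
print(required_action("the cat lay on the rug", source))   # cite
print(required_action("stocks fell sharply today", source))  # none
```

Word counts are a crude measure, since they ignore word order and meaning, but the shape of the rule is the point: measure how close the output is to an existing work, then attach an obligation to each band of closeness.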

Finally, relying on copyright law by itself isn’t the only way to regulate access to content. There’s also something called Creative Commons, a set of standardized licenses that copyright owners can use to set the terms under which their creation can be used. The current versions of the licenses are written to work internationally rather than being tied to any one jurisdiction’s laws.

Takeaways

While issues about copyright are being worked out in the legal system, publishers should strengthen their company’s position on copyright and challenge any violations. Specifically, they should change their terms and conditions to make it clear that their content may not be used to train large language models without the publisher’s explicit consent.

It would also be a good idea to brush up on Creative Commons licensing to see if that might be a useful tool for defining how your content could or could not be used in the future.
