Have you ever seen a futuristic movie or TV show where a character asks a computer a question, and the computer replies “here’s a list of articles to read”?
Of course you haven’t, because while the search engine results page (SERP) was a wonderful improvement on directories and link farms and such, it’s obviously a transitional thing. It was a step toward what large language models now let us do, which is to answer the question directly.
But in order to answer all these questions, the LLM has to be trained on a lot of information. And right now – right this very minute – companies are training their LLMs on your content. They’re crawling your site, slurping up the intellectual property that you rely on to make a living, with the goal of displacing you and making you irrelevant.
Are you going to let that happen? I hope not. So here are the four things you need to do right now.
- Modify your robots.txt file to exclude these monsters (there’s a sample robots.txt after this list).
That file tells crawlers which parts of your site they’re allowed to crawl. But since you can’t trust them to honor what you say, you also need to …
- Have your IT folks block access at the server level (a sample server config follows the list too).
Those two steps are designed to stop them from crawling your content, but you also need to make your prohibition both clear and legally enforceable. So …
- Change your terms and conditions to say that your content cannot be used to train LLMs.
- Start training your own LLM on your content (a rough sketch of that follows as well).
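To make the first step concrete, here’s roughly what a robots.txt opt-out looks like. The user-agent names below are ones the AI companies have published for their training crawlers, but the list keeps growing, so check their current documentation.

```
# robots.txt -- block the published AI-training crawlers
# (verify current user-agent names in each vendor's documentation)
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /
```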
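For the server-level block, here’s one way to do it if your site happens to run on nginx; Apache and the various CDNs have their own equivalents. The user-agent list is illustrative and should stay in sync with your robots.txt.

```nginx
# Goes inside the http block of your nginx configuration.
# Returns 403 Forbidden to any request whose User-Agent matches
# a known AI-training crawler.
map $http_user_agent $ai_crawler {
    default 0;
    "~*(GPTBot|Google-Extended|CCBot|anthropic-ai)" 1;
}

server {
    listen 80;
    server_name example.com;   # placeholder

    if ($ai_crawler) {
        return 403;
    }

    # ... the rest of your existing site configuration ...
}
```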
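And for the last step, “training your own LLM” in practice usually means fine-tuning an existing open model on your archive rather than building one from scratch. Here’s a minimal Python sketch using the Hugging Face libraries; the model name, file paths, and hyperparameters are placeholders, not recommendations.

```python
# Minimal fine-tuning sketch: adapt a small open causal LM to your own archive.
# Assumes your articles have been exported as plain-text files under articles/.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # placeholder; use whichever open model you license
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Your own content -- the whole point is that it stays under your control.
dataset = load_dataset("text", data_files={"train": "articles/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="my-house-llm",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```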
Someone might say, “It’s hard for you to kick against the pricks. Stop trying to fight it. This is just the way it’s going to be.” Or they’ll give some other lame excuse for why it’s in your best interests to do what’s actually in the best interests of Google, OpenAI, Bing, etc.
Tell those people to soak their heads.
There’s nothing inevitable about where this is all going. The lawyers have hardly gotten started yet. Right now you need to work to protect your stake in the future.
It’s possible, for example, that there might be a central place where people ask their questions, and that system routes the question to some other LLM – like yours – to get a detailed answer based on your specific expertise.
So rather than Google (or whatever will replace Google) slurping up your content and providing the answer, the Google engine will say, “Ah, this is a question about 401ks. I need to ask the Kiplinger LLM about that one.”
Then Kiplinger can make an arrangement with Google to monetize that search.
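To make that scenario concrete, here’s a hypothetical sketch of the routing idea; the topics, endpoints, and classifier are all invented for illustration, not a description of any real system.

```python
# Hypothetical router: the front-end engine classifies the question by topic
# and forwards it to a specialist publisher's LLM endpoint.
import requests

SPECIALISTS = {
    "retirement": "https://llm.example-publisher.com/answer",  # a Kiplinger-style endpoint
    "gardening":  "https://llm.example-magazine.com/answer",
}

def classify_topic(question: str) -> str:
    # Stand-in for whatever classifier the front-end engine would actually use.
    if "401(k)" in question or "retirement" in question.lower():
        return "retirement"
    return "gardening"

def route(question: str) -> str:
    topic = classify_topic(question)
    # The publisher's endpoint answers from its own content and gets paid
    # per query under whatever arrangement it negotiates with the engine.
    resp = requests.post(SPECIALISTS[topic], json={"question": question}, timeout=30)
    return resp.json()["answer"]

print(route("How much can I put into my 401(k) this year?"))
```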
I’m not predicting things will turn out that way. There are lots of possibilities. I’m only saying that’s one possible future that protects the publisher’s intellectual property.
The way we’re going now, there will be no protections. Publishers will lose. Case closed.
So if you want to have any chance whatsoever to maintain your rights to your content, do the four things I mention above.
If you want to chat about it, give me a call.
Links
The New York Times prohibits scraping of its content for AI training.
Any media company banking on legal intervention to protect copyright might be disappointed.