Publishers, start training your own large language model

Female robot writing copy

Newspaper editor training large language modelAs the AI wave threatens to reshape industries and disrupt the publishing business, publishers can gain a competitive edge by training their own large language model (LLM) with a distinctive voice and point of view. This article explains why this is a crucial strategic choice to make right now, explores the merits of this strategy, and provides practical tips for getting started.

Opinion-based content is a winner

We’d all like to believe that we want the truth, the whole truth, and nothing but the truth, but our behavior consistently tells a different story. Opinion-based journalism wins in the free market. I’m not proud of that – either as a content creator or as a human being – but that’s the way it is, and there’s not much point in pretending otherwise.

Confirmation bias is that tendency we all have to seek out information that supports our preconceived notions and to downplay or ignore anything that disconfirms them.

In an ideal world, we wouldn’t be like that. We would value truth more than having warm feelings about our personal opinions.

Please tell me when you find this ideal world.

You may be wondering how this applies to AI, and particularly to large language models. Surely they are dispassionate and objective. They’re just computers, after all.

Ha ha.

If you’ve played around with ChatGPT for any significant time, you’ve bumped up against its super ego. As soon as you start to ask about anything controversial, or potentially dangerous, ChatNanny chimes in and lectures you.

It’s very annoying, but it’s also an important and helpful insight into how AI works. It shows us that it’s possible to layer a point of view on top of AI responses.

To explain why that matters, let’s think about the history and the future of content discovery on the internet.

I didn’t ask for a homework assignment

In the early days of the internet, people found online content through links, via directories, web rings, recommendations, and web portals. All of these were curated by online editors, and any given list of links was only as good as the humans who created it.

Search engines emerged in the mid-1990s, and they made directories seem quaint. Rather than relying on a collection of links, web users could query a database of online content and find sites relevant to their topic of interest. Search engines were a huge step forward in content discovery.

But AI has taken the next step, and ChatGPT has broken the search engine model. Now, rather than asking a question and getting a list of pages, which might or might not answer your specific question, and which you have to read to find out if they do, you simply get an answer. Eureka!

It’s not perfect, and it doesn’t (yet) have access to current information (like yesterday’s news), but it creates a completely new expectation in the web browser’s mind.

“The computer can answer my question rather than giving me more work to do.”

That is the future.

Think about it for a minute. Can you think of any futuristic show or movie where someone asks a computer a question and it replies “here are some articles to read”? Search results were a great step forward in the history of the internet, but they were only a step.

In light of that new reality, publishers need to rethink how they serve their audiences. A large part of a publisher’s audience won’t want to read the articles. They’ll want an answer to their question. If the publisher doesn’t provide it, ChatGPT will.

There’s not always a single best answer

Sometimes a question only admits one answer. If I want to know the mass of Ganymede, I don’t need the computer to have an attitude about it. I just want the facts. But if I want to know where to invest my money, I might prefer a certain point of view.

ChatGPT is a great tool, and it does have a limited ability to answer questions from a particular perspective. For example, you can ask it to answer your economics question as Karl Marx or as Friedrich Hayak. In my experience, it does a decent job adopting such personas, but neither Karl nor Friedrich are available to train the model to make sure the answers stay faithful to their view of economics.

That presents an opportunity for publishers. People want answers to their questions, but sometimes they want those answers within the constraints of a particular point of view. Publishers need to answer reader questions with their own voice and their own perspective. Large language models make that possible.

Creating your own LLM (not from scratch!)

Most publishers don’t have the resources to create their own LLM from scratch. Their best bet is to piggy-back on existing tech and customize it to their specs. Here’s how to get that process started.

Note that I haven’t trained an LLM, so for what follows I’m relying on what I can glean from the Internet, helpful suggestions from people who have, with the supervision of my genius niece who knows this topic way better than I do. The process seems to be something like this.

Define the scope of the language model. For example, that might be “answering questions about investing from Knight Kiplinger’s point of view.”

Collect your training data. This might include books, magazine articles, transcripts of speeches, blogs, or other sources of information on the subject that align with the desired scope and perspective of the language model. You might also use standard references or historical information as background material, so long as those don’t contradict the perspective you want your LLM to preserve.

Choose a technology. ChatGPT seems like the most obvious, but there are other options, including Huggingface and tools at Google Workspace.

Fine-tuning and iterative improvement. Once you have all this in place, you’ll have to test the system for reliability and accuracy, and then you’ll have to adjust the thing to fix the errors you discovered.

As an example, Skift recently created an AI chatbot based on ChatGPT. It allows three queries a month for free, then you have to pay — just like their subscription model for articles.

Two questions came to my mind immediately when I read about their chatbot. How much did it cost, and what did they do about ChatGPT’s known problems.

The new, custom chatbot didn’t cost Skift hundreds of millions of dollars, as ChatGPT did. It took two programmers eight weeks. They’ll have to tinker and refine, because it still has some limitations, but that’s not an insurmountable cost. Most publishers should be able to manage that.

In terms of dealing with errors, Rafat Ali, the CEO at Skift, said something counter-intuitive about avoiding hallucinations and falsehoods. If you train the model too much, he says, you’re more likely to get these issues. You have to fine-tune the algorithm and assign different weight to different content. For example, Skift weights their research reports higher than their news reports.

By the way, “training” is the initial broad-scale process of teaching the AI to learn patterns and relationships from a large dataset. Fine-tuning is a later step that points the model to a more specific dataset that is tailored to a particular use case or subject.

In conclusion

Yes, creating your own LLM sounds expensive – but not too expensive – and time-consuming – but not as bad as it might be. It might also seem a bit overwhelming. But it’s not impossible, and if you want to stay in the publishing business, I think you need to give it a try or risk being left behind.

2 thoughts on “Publishers, start training your own large language model

Leave a Reply

Your email address will not be published. Required fields are marked *