I started a long response to this but, as is my tendency, I got into the weeds with a bunch of details most people don't care about. LLMs are a big part of what I do lately, so I get excited when the subject comes up. Here's my attempt at a shorter version (yes, this is the shorter version), responding to the idea that:

> Some experts think the future of AI is not in Large Language Models like we have now, but in more specialized and limited models. Imagine, for example, having a model trained on Rod Collins’ and Nigel Calder’s writing, without the cruft of also knowing everything that’s been written on forums about boat wiring. Couldn’t answers from something like that be useful for at least some questions that come up here? And if it were trained on a more specific set like that, maybe you could avoid some of the other problems the article talked about, like the influence of geographically or stylistically different sources.
An LLM is basically a model trained to predict the next token in a sequence of text. Think of giving a computer a pile of sheet music so it learns common note patterns. Then you give it the first part of a song it’s never seen before; it completes the rest based on what’s "most likely" given what it has learned. Music follows patterns, so it’s surprisingly predictable. Human language does too.
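If it helps to see that intuition in miniature, here's a toy sketch. This is not how a real LLM works internally (real models use neural networks over sub-word tokens, not word counts), and the tiny "corpus" is made up, but it shows the "predict the next thing from counted patterns" idea:

```python
from collections import Counter, defaultdict

# Toy "training": count which word tends to follow which word in a tiny corpus.
# A real LLM learns far richer patterns, but the objective is the same flavor:
# predict the next token given what came before.
corpus = "the bilge pump drains the bilge the pump needs a strainer".split()

next_word_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_word_counts[current][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word` seen during 'training'."""
    followers = next_word_counts.get(word)
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("the"))   # -> 'bilge', because it followed 'the' most often
print(predict_next("pump"))  # -> whichever follower was counted first in a tie
```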
It's important to distinguish between LLMs and GPTs. A GPT (like the models behind ChatGPT, Gemini, etc.) is a type of LLM. (Here, I am using "GPT" somewhat loosely because most people don't care about the details.) An LLM can be trained on any language corpus. Some LLMs, like a GPT, are trained on a massive and diverse corpus so they can respond better across a variety of topics. So, a GPT is just an LLM with really broad training that isn't domain-specific.
Now to your idea: What if we trained a model only on a handful of really solid boating sources (Calder, etc.) and nothing else?
If you trained a model from scratch only on those books, you'd end up with something that "knows" a bit about boats but is pretty bad at language in general. To read and understand a book on boating, you first have to have read and understood countless other texts. Such a model wouldn't have seen enough variety to handle the questions people actually ask, much less make sense of what it reads, and it would be very brittle outside the phrasings and patterns in that small corpus. It would behave more like a slightly fancy search function over those books than like a flexible and intelligent assistant.
What we would actually do instead is:
- first train a big general model on a huge, messy corpus so it gets good at language in general,
- then specialize it on boating material by either:
- fine-tuning it on boating texts so it leans toward those (a rough sketch of that follows this list), or
- keeping it general but wiring it to search your boating library and then summarize what it finds (retrieval-augmented generation).
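For the fine-tuning route, here's roughly what that looks like with the Hugging Face libraries. Treat it as a sketch: the base model name, file paths, and training settings are placeholders, and a real run needs far more data and care than this.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "gpt2"  # stand-in for whatever general model you start from
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

# Load the trusted boating corpus (Calder etc.) as plain text files.
corpus = load_dataset("text", data_files={"train": "boating_corpus/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Same next-token objective as pre-training, just continued on our texts.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
args = TrainingArguments(output_dir="boating-model", num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```

The point is that the model keeps all of its general language ability; the extra training just nudges it toward the vocabulary and style of the trusted sources.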
So, a specialized boating assistant might still be a general LLM under the hood, but it's being steered to rely heavily on those trusted boating sources when it constructs its answer.
That still doesn't fix the core issue of "information pollution", though. The problem is that an LLM is optimized to produce a plausible continuation of a conversation, not to admit "I don’t know." So it behaves a bit like that guy who always has an answer, even when he doesn't know. Often useful, sometimes just making confident noise. The difference is that the model doesn't have intentions or an ego. It’s not lying; it’s just pattern-matching, and sometimes it latches onto the wrong patterns or simply hasn't seen enough training to complete them correctly.
The reason tools like ChatGPT are still useful is that the system around the model can be stricter than the model itself. It can, for example, run a web or document search over known-good boating sources, feed those results into the model, and then have the model synthesize an answer with citations. You still need to sanity-check the result when it matters, but as has been noted several times, the same is true of human advice. At least with an LLM, there’s no pride or agenda in the mix.
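To make that retrieval step concrete, here's a bare-bones sketch using simple TF-IDF matching in place of the fancier search a production system would use. The passages, source labels, and question are all made up for illustration, and the final "hand the prompt to the model" step is just a print:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A stand-in "library" of trusted passages; in practice these would be chunks
# of Calder, Collins, ABYC standards, etc., each tagged with its source.
passages = [
    ("Calder, ch. 4", "Battery banks should be sized for the expected daily load plus a reserve."),
    ("Collins, wiring article", "Crimp connections on marine wiring should use tinned, stranded conductors."),
    ("Calder, ch. 6", "Galvanic isolators protect underwater metals when connected to shore power."),
]

question = "What kind of wire should I use for crimp connections on a boat?"

# Rank passages by TF-IDF similarity to the question and pick the best match.
texts = [text for _, text in passages]
vectorizer = TfidfVectorizer().fit(texts + [question])
scores = cosine_similarity(vectorizer.transform([question]),
                           vectorizer.transform(texts))[0]
best = max(range(len(passages)), key=lambda i: scores[i])

source, text = passages[best]
prompt = ("Answer the question using only the source below, and cite it.\n\n"
          f"Source ({source}): {text}\n\nQuestion: {question}")
print(prompt)  # this is what would be handed to the LLM to synthesize an answer
```

The model then writes its answer from the retrieved passage rather than from whatever it half-remembers from its training data, which is what keeps it tied to the trusted sources.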