AI and the SBO Forums--Part 1

Apr 25, 2024
712
Fuji 32 Bellingham
Some experts think the future of AI is not in Large Language Models like we have now, but in more specialized and limited models. Imagine, for example, having a model trained on Rod Collins’ and Nigel Calder’s writing, without the cruft of also knowing everything that’s been written on forums about boat wiring. Couldn’t answers from something like that be useful for at least some questions that come up here? And if it were trained on a more specific set like that, maybe you could avoid some of the other problems the article talked about that come from the influence of geographically or stylistically different sources.
I started a long response to this but, as is my tendency, I got into the weeds with a bunch of details most people don't care about. LLMs are a big part of what I do lately, so I get excited when the subject comes up. Here's my attempt at a shorter version (yes, this is the shorter version):

An LLM is basically a model trained to predict the next token in a sequence of text. Think of giving a computer a pile of sheet music so it learns common note patterns. Then you give it the first part of a song it’s never seen before; it completes the rest based on what’s "most likely" given what it has learned. Music follows patterns, so it’s surprisingly predictable. Human language does too.
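
To make "predict the next token" concrete, here's a toy sketch in Python - my own illustration, nothing like how a real LLM is built, with a made-up corpus and made-up function names. It just counts which word tends to follow which in a few sentences, then continues a prompt with the most frequent follower. A real model uses a neural network over subword tokens and an enormous corpus, but the underlying job - continue the sequence plausibly - is the same.

```python
# Toy illustration of "predict the next token" (made-up corpus, not a real LLM).
from collections import Counter, defaultdict

# A tiny "training corpus": the model only ever sees which word follows which.
corpus = (
    "check the battery terminals for corrosion . "
    "check the bilge pump before leaving the dock . "
    "tighten the battery terminals and check the charger ."
).split()

followers = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    followers[current][nxt] += 1          # count how often nxt follows current

def continue_text(prompt, steps=6):
    """Extend the prompt by repeatedly picking the most frequent next word."""
    words = prompt.split()
    for _ in range(steps):
        candidates = followers.get(words[-1])
        if not candidates:                # never saw this word during "training"
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

print(continue_text("check the"))   # e.g. "check the battery terminals for corrosion . check"
```

Scale that idea up by many orders of magnitude, and replace the word counts with a neural network, and you have the core loop of an LLM.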

It's important to distinguish between LLMs and GPTs. A GPT (the kind of model used by ChatGPT, Gemini, etc.) is a type of LLM. (Here, I am using "GPT" somewhat loosely because most people don't care about the details.) An LLM can be trained on any language corpus. Some LLMs, like a GPT, are trained on a massive and diverse corpus so they can respond better across a variety of topics. So, a GPT is just an LLM with really broad training that isn't domain-specific.

Now to your idea: What if we trained a model only on a handful of really solid boating sources (Calder, etc.) and nothing else?

If you trained a model from scratch only on those books, you'd end up with something that "knows" a bit about boats but is pretty bad at language in general. Reading and understanding a book on boating requires having read and understood countless other texts first. Such a model wouldn't have seen enough variety to handle the questions people actually ask, much less make sense of what it reads, and it would be very brittle outside the phrasings and patterns in that small corpus. It would behave more like a slightly fancy search function over those books than like a flexible and intelligent assistant.

What we would actually do instead is:
  • first train a big general model on a huge, messy corpus so it gets good at language in general,
  • then specialize it on boating material by either:
    • fine-tuning it on boating texts so it leans toward those, or
    • keeping it general but wiring it to search your boating library and then summarize what it finds (retrieval-augmented generation).
(I actually do this all the time. Rough sketches of both options follow below.)
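
For the curious, the fine-tuning route looks roughly like this in code. This is a generic sketch assuming the Hugging Face transformers and datasets libraries; the base model ("gpt2") and the corpus file name are placeholders I made up for illustration, and a real run would involve far more data, evaluation, and tuning.

```python
# Rough sketch of the fine-tuning option, assuming the Hugging Face
# transformers and datasets libraries. "gpt2" and "boating_corpus.txt"
# are placeholders, not recommendations.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

base = "gpt2"                                   # any small causal LM works for a sketch
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token       # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

# One plain-text file holding the boating material (placeholder path).
dataset = load_dataset("text", data_files={"train": "boating_corpus.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="boating-llm",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    # mlm=False means ordinary next-token (causal) language modeling
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()   # the resulting model now "leans toward" the boating texts
```

The point isn't the specific libraries; it's that you start from a model that already speaks the language and only nudge it toward the boating material.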

So, a specialized boating assistant might still be a general LLM under the hood, but it's being steered to rely heavily on those trusted boating sources when it constructs its answer.

That still doesn't fix the core issue of "information pollution", though. The problem is that an LLM is optimized to produce a plausible continuation of a conversation, not to admit "I don’t know." So it behaves a bit like that guy who always has an answer, even when he doesn't know. Often useful, sometimes just making confident noise. The difference is that the model doesn't have intentions or an ego. It’s not lying; it’s just pattern-matching in a way that focuses on the wrong patterns or lacks sufficient training to correctly complete the pattern.

The reason tools like ChatGPT are still useful is that the system around the model can be stricter than the model itself. It can, for example, run a web or document search over known-good boating sources, feed those results into the model, and then have the model synthesize an answer with citations. You still need to sanity-check the result when it matters but, as has been noted several times, the same is true of human advice. At least with an LLM, there’s no pride or agenda in the mix.
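
For anyone curious what that "search, then synthesize" step looks like, here's a bare-bones sketch in Python. It's my own illustration, not how any particular product works: the "search" is naive keyword overlap (real systems usually use embeddings and a vector index), the source names and excerpts are placeholders, and the prompt it builds is what the surrounding system would hand to the general model along with the request for citations.

```python
# Bare-bones sketch of the retrieval step: find relevant excerpts, then build
# the prompt that a general model would be asked to answer with citations.
# Keyword overlap stands in for real retrieval (usually embeddings), and the
# "library" entries are placeholders.
def retrieve(question, library, k=2):
    """Return the k excerpts sharing the most words with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(library.items(),
                    key=lambda kv: len(q_words & set(kv[1].lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, library):
    """Pack the retrieved excerpts into a prompt that demands citations."""
    passages = retrieve(question, library)
    sources = "\n".join(f"[{i}] ({name}) {text}"
                        for i, (name, text) in enumerate(passages, start=1))
    return ("Answer using only the numbered sources below and cite them by "
            "number. If they do not cover the question, say so.\n\n"
            f"{sources}\n\nQuestion: {question}")

# Placeholder library; in practice these would be chunks of the trusted texts.
library = {
    "Trusted source A": "placeholder excerpt about battery bank wiring",
    "Trusted source B": "placeholder excerpt about anchor selection",
}
print(build_prompt("What gauge wire for my battery bank?", library))
```

The model never "knows" the library; it just gets handed the relevant excerpts each time and is asked to stay inside them.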
 

dLj

Mar 23, 2017
4,759
Belliure 41 Back in the Chesapeake
There are two fundamental concerns I see in all of the above.

1) Lack of "original thought" through AI, and what could easily become a squelching of advances in original thought, given the power of repetitively stating the same "truths" over and over.

2) There is an underlying assumption that the knowledge already exists in the "known-good boating sources". In fact, many of the current "known-good sources" are inherently flawed. A simple example is the book "The Boatowner's Guide to Corrosion" by Everett Collier. While that is considered "the best we have" at the moment on this subject, it was written by an electrical engineer who didn't really know corrosion at the state-of-the-art level the field of corrosion had reached at the time of its writing. Furthermore, any field that has fundamental reference material associated with it advances over time, and those advancements require updating previously held beliefs. I don't see any methodology within AI to support this.

dj
 
Apr 25, 2024
712
Fuji 32 Bellingham
There are two fundamental concerns I see in all of the above.
Neither of those points is really relevant to what I posted. Maybe you are responding to the whole of the thread?

I was describing how LLMs work - and whether or not one could train them on a small corpus and still get good results. You are commenting on two other issues - I guess the first being whether or not it would be a good idea to try, in the grand scheme of things, socially. That's a valid concern, but not related to what I wrote.

The second, I guess, is kind of relevant, but it demonstrates another reason one would not want to train a model on a narrow set of texts, but instead on a broader corpus. In a way, LLMs democratize "truth", provided they are trained on a sufficiently broad corpus. That is, a fringe opinion can be seen as fringe when a contrary opinion is more prevalent. Again, the model knows nothing about what is true - it only knows what people have said is true. (And, strictly speaking, it doesn't even know that.)

So, if one "trusted" text says one thing, but the overwhelming body of evidence says something else, the LLM is not easily fooled. Humans, on the other hand, buy into things like climate change denial, when the overwhelming consensus is contrary.

LLMs have to be forced to be biased. They are inherently unbiased. Again, they are just machines with one task - to come up with the next token in a sequence. Or, in lay terms, to say what is most likely to be said next, in a conversation. They can only be biased when their training sources are selective or unintentionally biased - or if they are directed to enforce a particular bias.

If you understand them as a tool to summarize a consensus - not as a truth engine - then you're fine. It is a tool that gives you insight into a staggeringly massive amount of written and spoken language - not one that will reliably tell you hard facts.

The main short-term hazard is that they are mostly correct most of the time and present everything with confident language. Humans are easily fooled by anyone that sounds like they know what they're talking about. We are not well-equipped to deal with people who fabricate with complete confidence and without hesitation. We misinterpret the product and want it to be something it isn't.

Longer term, there is an unknown social cost/benefit calculus - something you partially touch on with concern #1. That is well outside the scope of what we can discuss here, but is something I think a lot about. I can say that most of the things people are concerned about just demonstrate a lack of understanding. But, there are some things that people really should worry about that they simply aren't aware of. I sometimes compare it to hiding from the bogeyman in poison ivy.
 

dLj

Mar 23, 2017
4,759
Belliure 41 Back in the Chesapeake
Neither of those points is really relevant to what I posted. Maybe you are responding to the whole of the thread?
I felt they were relevant to your post. But I do also try to keep my posts relevant to the entire thread so there is that component.

I was describing how LLMs work - and whether or not one could train them on a small corpus and still get good results. You are commenting on two other issues - I guess the first being whether or not it would be a good idea to try, in the grand scheme of things, socially. That's a valid concern, but not related to what I wrote.
I understand how LLMs work. I've actually built them (well, I've been consulted on how to build them by others that are doing the building) for specific focused applications.

The second, I guess, is kind of relevant, but it demonstrates another reason one would not want to train a model on a narrow set of texts, but instead on a broader corpus. In a way, LLMs democratize "truth", provided they are trained on a sufficiently broad corpus. That is, a fringe opinion can be seen as fringe when a contrary opinion is more prevalent. Again, the model knows nothing about what is true - it only knows what people have said is true. (And, strictly speaking, it doesn't even know that.)
You are missing my point - I did not say to use a narrow set of texts; I simply used a specific example showing how a trusted source can be erroneous.

You are actually confirming part of the essence of what I'm addressing. You state, "In a way LLMs democratize 'truth'." What you are pointing out is that these models aggregate broad sets of data and essentially take the most commonly held belief and consider that to be the "right" answer. In some cases that may be correct; in others it may not be. All original ideas that dispel currently held but incorrect beliefs are, by definition, a "fringe opinion".

So, if one "trusted" text says one thing, but the overwhelming body of evidence says something else, the LLM is not easily fooled. Humans, on the other hand, buy into things like climate change denial, when the overwhelming consensus is contrary.
You say the LLM is not easily fooled. I strongly disagree with that. It is neither "fooled" nor does it "find the truth". It simply regurgitates the most commonly held beliefs and presents them as "the truth".

LLMs have to be forced to be biased. They are inherently unbiased. Again, they are just machines with one task - to come up with the next token in a sequence. Or, in lay terms, to say what is most likely to be said next, in a conversation. They can only be biased when their training sources are selective or unintentionally biased - or if they are directed to enforce a particular bias.
You state "They are inherently unbiased." That is not correct. Just in setting them up, the people doing the setup have their own inherent biases, and those biases are transferred to the model. Of course they can also be set up to have an even stronger bias - as you talk about above.

If you understand them as a tool to summarize a consensus - not as a truth engine - then you're fine. It is a tool that gives you insight into a staggeringly massive amount of written and spoken language - not one that will reliably tell you hard facts.
This statement supports my entire post.

The main short-term hazard is that they are mostly correct most of the time and present everything with confident language. Humans are easily fooled by anyone that sounds like they know what they're talking about. We are not well-equipped to deal with people who fabricate with complete confidence and without hesitation. We misinterpret the product and want it to be something it isn't.
I think your description of the short-term hazard is itself biased; the real short-term hazard is that you have no idea whether it is mostly correct or mostly incorrect.

Longer term, there is an unknown social cost/benefit calculus - something you partially touch on with concern #1. That is well outside the scope of what we can discuss here, but is something I think a lot about. I can say that most of the things people are concerned about just demonstrate a lack of understanding. But, there are some things that people really should worry about that they simply aren't aware of. I sometimes compare it to hiding from the bogeyman in poison ivy.
I agree we really have no idea of the long-term social cost/benefit of this technology. About the only thing we know is that there will be good things and not-so-good things that come out of it, just like any other new technology. But I think this technology has the potential to impact us in more ways than we are even thinking of yet.

That is one of the fundamental reasons why, in my opinion, AI should always be clearly referenced when it is used as part of anyone's response.

dj