
Imnimo t1_j9w6m9c wrote

How do you distinguish behavior that is incentivized by the training objective from behavior that is the result of an optimization shortcoming, and why is it obvious to you that this is the former?

1

Hyper1on t1_j9wbysn wrote

Well, the obvious optimisation shortcoming is overfitting. We can't distinguish the two rigorously without access to the model weights, but we do have a good idea of what overfitting looks like in both pretraining and RL finetuning: in both cases it tends to produce commonly repeated text strings and a strong lack of diversity in output, a sort of pseudo mode collapse. We can test for this by giving Bing GPT the same question multiple times and checking whether it has a strong bias towards particular completions, as in the sketch below -- having played with it a bit, I don't think this holds for the original version, before Microsoft limited it in response to criticism a few days ago.
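To make that test concrete, here's a minimal sketch of the kind of repeated-prompt diversity check I mean; `query_model` is a hypothetical placeholder for however you call the model, not a real client:

```python
from collections import Counter

# Hypothetical: query_model stands in for whatever API call reaches the chat
# model under test; it is not a real client library.
def query_model(prompt: str) -> str:
    """Placeholder for a single sampled completion from the model under test."""
    raise NotImplementedError

def diversity_check(prompt: str, n_samples: int = 20) -> dict:
    """Send the same prompt n_samples times and measure how repetitive the answers are."""
    completions = [query_model(prompt) for _ in range(n_samples)]

    # How often does the single most common completion recur verbatim?
    counts = Counter(completions)
    most_common_frac = counts.most_common(1)[0][1] / n_samples

    # Distinct-bigram ratio pooled over all completions; a low value suggests
    # mode-collapse-like repetition of the same phrases.
    bigrams = []
    for text in completions:
        toks = text.split()
        bigrams.extend(zip(toks, toks[1:]))
    distinct_bigram_ratio = len(set(bigrams)) / max(len(bigrams), 1)

    return {
        "most_common_completion_fraction": most_common_frac,
        "distinct_bigram_ratio": distinct_bigram_ratio,
    }
```

If the most common completion dominates, or the distinct-bigram ratio is very low, that would look like the pseudo mode collapse I'm describing; from my informal poking at the original Bing GPT, it didn't.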

Meanwhile, the alternative hypothesis I raised seems very plausible and fits with prior work on emergent capabilities of LLMs (https://arxiv.org/abs/2206.07682): it seems only natural that when you optimise a powerful system hard enough for an objective, it will learn instrumental behaviours that help it optimise that objective, potentially up to and including appearing to simulate various "personalities" and other strange outputs.

Personally, as a researcher who works on RL finetuned large language models and has spent time playing with many of them, my intuition is that Bing GPT is not RL finetuned at all but is simply pretrained and then finetuned on dialogue data, and the behaviour we see is fairly likely to arise by default given Bing GPT's particular architecture, datasets, and prompted interaction with the Bing Search API.

3

Imnimo t1_j9x01v0 wrote

Overfitting is just one of many possible optimization failures. While these models might over-memorize portions of their training data, they're also badly underfit in many other respects (as evidenced by their frequent inability to answer questions humans would find easy).

If Bing is so well-optimized that it has learned these strange outputs as some sort of advanced behavior to succeed at the LM or RLHF tasks, why is it so weak in so many other respects? Is simulating personalities either so much more valuable or so much easier than simple multi-step reasoning, which these models struggle terribly with?

1

Hyper1on t1_j9y3vz1 wrote

I mean, I don't see how you get a plausible explanation of BingGPT from underfitting either. As you say, these models are underfit on some types of data, but I think the key here is the finetuning procedure, whether ordinary supervised finetuning or RLHF, which optimises for a particular kind of dialogue data in which the model is asked to act as an "Assistant" to a human user.

Part of the reason I suspect my explanation is right is that ChatGPT and BingGPT were almost certainly finetuned on large amounts of dialogue data collected from interactions with users. Yet most of the BingGPT failure modes that made the media are not of the form "we asked it to solve this complex reasoning problem and it failed horribly"; they instead come from prompts that are very much in distribution for dialogue data, such as asking the model what it thinks about X, or asking it to pretend it is Y -- dialogues you would expect the model to have seen the start of many times before. I find underfitting on this data to be quite an unlikely explanation.

3