Submitted by alpha-meta t3_10rpj0f in MachineLearning

Aligned LLMs such as InstructGPT and ChatGPT are trained via supervised fine-tuning after the initial self-supervised pretraining. Then, the researchers train a reward model on responses ranked by humans.

If I understand correctly, they let the LLM generate responses that humans then have to rank on a scale from 1-5. Then, they train a reward model (I suppose in a supervised fashion?) on these ranked outputs. Once that's done, they use reinforcement learning (RL) with proximal policy optimization (PPO) to update the LLM.

My question is: why do they use RL with PPO for this last step? Why don't they fine-tune the LLM using regular supervised learning, where the human-ranked outputs represent the labels? Since these are labels in the range 1-5, this could be a ranking or ordinal regression loss for supervised learning.

54
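For concreteness, a minimal sketch (PyTorch, assuming a HuggingFace-style causal LM; the helper names are made up for illustration) of the kind of supervised ranking objective the question has in mind: push the sequence log-likelihood of a higher-ranked response above that of a lower-ranked one.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the supervised alternative described in the question:
# use the human rankings directly as labels for a pairwise ranking loss on the
# LLM's own sequence log-likelihoods, with no reward model or PPO involved.

def sequence_logprob(model, input_ids, response_mask):
    """Sum of token log-probs over the response part of the sequence."""
    logits = model(input_ids).logits[:, :-1, :]            # next-token predictions
    targets = input_ids[:, 1:]
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_logp * response_mask[:, 1:]).sum(dim=-1)

def supervised_ranking_loss(model, better_ids, better_mask, worse_ids, worse_mask,
                            margin=1.0):
    # Encourage the higher-ranked response to be more likely than the lower-ranked one.
    lp_better = sequence_logprob(model, better_ids, better_mask)
    lp_worse = sequence_logprob(model, worse_ids, worse_mask)
    target = torch.ones_like(lp_better)                     # "first argument should rank higher"
    return F.margin_ranking_loss(lp_better, lp_worse, target, margin=margin)
```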

Comments


koolaidman123 t1_j6wtmdj wrote

  1. Outputs are not ranked 1-5, they're ranked two at a time head to head, and the RM predicts which one is more favored by humans
  2. Empirically they found RL outperformed supervised fine-tuning (SFT) on human evaluations, meaning humans generally preferred the RLHF model vs the SFT model. The SFT model was fine-tuned on the top-ranked answer

As to why RL outperforms SFT, not a lot of orgs have the resources to test this (yet). I've heard a plausible theory from AI2 that the main difference comes from the fact that SFT uses a token-level loss, whereas the RL loss takes the entire sentence into account, so maybe instead of RL being "better" it's just that the next-token prediction task is worse.

Researchers I've spoken with don't believe RL is the critical component to enable these models, and think we could eventually discover the right training regime to enable SFT to perform on par with (or better than) RL.

58

alpha-meta OP t1_j6wvgbr wrote

Thanks for the response! I just double-checked the InstructGPT paper and you were right regarding the rankings -- they are pairwise, and I am not sure why I thought otherwise.

Regarding the updates on a sentence level, that makes sense. That would also be more of a discrete problem, for which you probably can't backpropagate (otherwise, you would be back to the token level).

8

was_der_Fall_ist t1_j6xz6wj wrote

ChatGPT had labelers rank outputs from best to worst, not head to head. (Different than InstructGPT, maybe?)

“A prompt and several outputs are generated. A labeler ranks the outputs from best to worst.”

https://openai.com/blog/chatgpt/

2

koolaidman123 t1_j6y07he wrote

have you even read the instructGPT paper?

>In Stiennon et al. (2020), the RM is trained on a dataset of comparisons between two model outputs on the same input. They use a cross-entropy loss, with the comparisons as labels—the difference in rewards represents the log odds that one response will be preferred to the other by a human labeler. In order to speed up comparison collection, we present labelers with anywhere between K = 4 and K = 9 responses to rank. This produces $\binom{K}{2}$ comparisons for each prompt shown to a labeler. Since comparisons are very correlated within each labeling task, we found that if we simply shuffle the comparisons into one dataset, a single pass over the dataset caused the reward model to overfit. Instead, we train on all $\binom{K}{2}$ comparisons from each prompt as a single batch element. This is much more computationally efficient because it only requires a single forward pass of the RM for each completion (rather than $\binom{K}{2}$ forward passes for K completions) and, because it no longer overfits, it achieves much improved validation accuracy and log loss. Specifically, the loss function for the reward model is: $\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} \, \mathbb{E}_{(x, y_w, y_l)\sim D}\left[\log\left(\sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right)\right]$, where $r_\theta(x, y)$ is the scalar output of the reward model for prompt $x$ and completion $y$ with parameters $\theta$, $y_w$ is the preferred completion out of the pair of $y_w$ and $y_l$, and $D$ is the dataset of human comparisons.

you know that figure you're referencing comes from the instructgpt paper... right?

−4
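For reference, a minimal sketch of the quoted loss (PyTorch; `rewards_ranked` is assumed to hold the scalar reward-model outputs for the K completions of one prompt, ordered from most to least preferred):

```python
import torch
import torch.nn.functional as F

# Pairwise reward-model loss from the quoted equation:
#   loss(theta) = -1/C(K,2) * E[ log sigma( r(x, y_w) - r(x, y_l) ) ]
# All C(K,2) comparisons from one prompt are treated as a single batch element.

def rm_loss(rewards_ranked):
    """rewards_ranked: 1-D tensor of K scalar rewards, ordered best to worst."""
    K = rewards_ranked.shape[0]
    losses = []
    for w in range(K):
        for l in range(w + 1, K):         # y_w is preferred over y_l
            losses.append(-F.logsigmoid(rewards_ranked[w] - rewards_ranked[l]))
    return torch.stack(losses).mean()     # the mean supplies the 1/C(K,2) factor
```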

[deleted] t1_j6x0948 wrote

[deleted]

1

koolaidman123 t1_j6x2b05 wrote

sure? you can have multiple ways of ranking, but:

  1. the instructGPT paper strictly uses pairwise ranking
  2. asking annotators to rank however many passages 1-k in one shot is much more difficult and noisier than asking for pairwise comparisons

5

crt09 t1_j6y5x4t wrote

This paper seems very relevant: https://arxiv.org/abs/2205.13636 I haven't read it closely enough to give strong opinions with confidence, but it seems to beat PPO with a token-level loss that works similarly to the Upside-Down Reinforcement Learning paper: you give a target reward between 1 and 5 as an input token before the prompt and train the model to output a response of corresponding quality, using the standard LM loss on an existing target output with the given 1-5 reward rank. Then during inference you just prepend the best reward token to the prompt and it outputs a high-quality response.

1
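A rough sketch of the reward-conditioned idea described above (not the exact recipe from the linked paper; the reward token strings and the "best rank" convention are made up for illustration):

```python
# Reward-conditioned fine-tuning sketch: prepend a reward token, train with the
# ordinary LM loss on (prompt, response) pairs, then condition on the token for
# the best rank at inference time. Token strings here are purely illustrative.

REWARD_TOKENS = {rank: f"<reward_{rank}>" for rank in range(1, 6)}  # ranks 1-5
BEST_RANK = 5  # assumption for this sketch; the paper's convention may differ

def build_training_text(prompt, response, reward_rank):
    # e.g. "<reward_4> Explain PPO briefly. PPO is ..." is fed to the usual
    # next-token-prediction objective.
    return f"{REWARD_TOKENS[reward_rank]} {prompt} {response}"

def build_inference_prompt(prompt):
    # At inference time, ask for a continuation of the highest quality.
    return f"{REWARD_TOKENS[BEST_RANK]} {prompt}"
```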

mtocrat t1_j6zin88 wrote

Supervised fine-tuning seems inherently limited here: you regress to the best answer in the set, but that's it. RLHF can improve beyond that, up to the point where the generalization capabilities of the reward model fail.

1

Jean-Porte t1_j6wvy2p wrote

The traditional language modeling loss (negative log-likelihood) is misaligned with human expectations. One negation radically changes the meaning of a sentence. It doesn't radically change the log-likelihood. It isn't more important than a "the" or a superfluous word.

With RLHF, important words have important impact, and the loss is directly aligned with human interests.

22
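A toy illustration of that point (the per-token losses below are made up, not from any real model): with a token-level average, the negation is just one of N equally weighted terms, even though it flips the meaning of the sentence.

```python
# Made-up per-token negative log-likelihoods for two targets that differ only
# by a negation. The token-level loss barely changes; the meaning is inverted.

nll_without_not = {"the": 0.8, "drug": 1.2, "is": 0.5, "safe": 1.0}
nll_with_not = {"the": 0.8, "drug": 1.2, "is": 0.5, "not": 1.1, "safe": 1.0}

loss_without_not = sum(nll_without_not.values()) / len(nll_without_not)  # 0.875
loss_with_not = sum(nll_with_not.values()) / len(nll_with_not)           # 0.92
# A sequence-level reward model can instead score the two statements very
# differently, because it judges the completed sentence as a whole.
```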

alpha-meta OP t1_j6x1r2j wrote

But isn't this only the case if you train it on that loss (negative log-likelihood) via next-word prediction, i.e., what they do during pretraining?

If you use the ranks (from having users rank the outputs) to compute the loss, instead of using the words as labels, would that still be the case?

4

Jean-Porte t1_j6x8oyx wrote

Yes, but the LM has to take many steps to produce the text.

We need to train the LM to maximize a far-away reward, and we need RL to do that.

4
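For reference, the PPO-stage objective in the InstructGPT paper has (as far as I recall) this shape: a sequence-level reward from the RM, a KL penalty that keeps the policy close to the SFT model, and an optional pretraining-mix term:

```latex
\text{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}
    \left[ r_\theta(x,y)
           - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right]
  + \gamma \, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}
    \left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
```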

alpha-meta OP t1_j6xylk8 wrote

Could you help me understand what the far-away rewards represent in this context? Are the steps generating the individual words? If so, do you mean words that occur early in the text? Couldn't a weighting scheme for the cross-entropy loss components be used in that case?

2

Jean-Porte t1_j6y0djg wrote

The beginning of the best possible answer might not be the best beginning. It's the final outcome, the complete answer that counts, so it makes sense to evaluate that. The reward is the feedback on the complete answer.

4

_Arsenie_Boca_ t1_j6z24n6 wrote

Since it wasn't mentioned so far: RL does not require the loss/reward to be differentiable. This enables us to learn from complete generated sentences (LM sampling is not differentiable) rather than just at the token level.

8
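A stripped-down sketch of that point (REINFORCE-style rather than full PPO; a HuggingFace-style `generate` API and a scalar-output reward model are assumed): the sampling step carries no gradient, and learning happens by re-scoring the sampled sequence and weighting its log-probability by the sequence-level reward.

```python
import torch

# Stripped-down policy-gradient step (REINFORCE-style, not full PPO) showing why
# non-differentiable sampling is not a problem: gradients flow through the
# log-probabilities of the sampled tokens, not through the sampling itself.
# `policy`, `reward_model`, and their APIs are assumptions for this sketch.

def rl_step(policy, reward_model, prompt_ids, optimizer, max_new_tokens=64):
    with torch.no_grad():                                   # sampling: no gradient
        completion_ids = policy.generate(prompt_ids, do_sample=True,
                                         max_new_tokens=max_new_tokens)
        reward = reward_model(completion_ids)               # one scalar per sequence

    logits = policy(completion_ids).logits[:, :-1, :]       # differentiable re-scoring
    targets = completion_ids[:, 1:]
    token_logp = torch.log_softmax(logits, dim=-1).gather(
        -1, targets.unsqueeze(-1)).squeeze(-1)
    seq_logp = token_logp.sum(dim=-1)                       # log-prob of the whole sample
    loss = -(reward * seq_logp).mean()                      # push up high-reward sequences

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # PPO additionally clips the probability ratio and adds a KL penalty
    # towards the SFT model; this sketch omits both.
```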

alpha-meta OP t1_j72dpto wrote

Good point. So you mean they incorporate things like beam search, temperature changes, top-k sampling, and nucleus sampling in the RL PPO-based optimization?

1

_Arsenie_Boca_ t1_j72g4g4 wrote

I'm not sure if they vary the sampling hyperparameters. The point is that language modelling objectives are to some degree ill-posed, because we calculate the loss on intermediate results rather than on the final output that we care about.

1

bigabig t1_j6z3a6d wrote

I thought this was also because you do not need as much supervised training data, because you 'just' have to train the reward model in a supervised fashion?

2

alpha-meta OP t1_j72dxx7 wrote

I think it's probably the non-differentiable nature of the sampling techniques. If it were just about limited training data and using the reward model, you could also use weakly supervised learning with that reward model.

1

mtocrat t1_j6zk1ka wrote

Let's say your initial model is quite racist and outputs only extremely or moderately racist choices. If you rank those against each other and do supervised training on that dataset, you train it to mimic the moderately racist style. You might, however, plausibly train a model from this that can judge what racism is and extrapolate to judge answers free of it as even better. Then you optimize with respect to that model to get that style.

2

scraper01 t1_j6zyjd7 wrote

The RL loss landscape is richer.

2

hblarm t1_j712p9u wrote

For tasks like summarisation and abstractive question answering, there is no single correct way to phrase the target sequence/answer.

“Some of the cups contained brown liquid” means almost the same as “A few vessels had brown fluid in them”. Now imagine how many different ways you could phrase a 4 paragraph essay on globalisation.

In SL, the model is forced to learn the precise answer you feed it, and metrics like ROUGE penalise the use of synonyms. This causes models to perform badly when evaluated on human preference. The only reliable way to train/evaluate a model to impress humans is to directly incorporate human preferences into training.

This doesn’t lend itself to SL very well, due to the unlimited possible phrasings of sentences, so instead the authors train a reward function that can estimate human preference, and use RL to update model weights to create better and better predictions. Any valid, nicely written phrasing will now get a good score.

Importantly, the model they start with is almost SOTA on the summarisation tasks they are learning. So RL can take them further and further towards human preferences.

In a nutshell, RL allows human preference to be trained on directly, which allows the model to exhibit remarkable creativity.

2

plocco-tocco t1_j6yc79c wrote

I would also like to know, from anyone who might have a clue: can RLHF offer any significant boost to machine translation, i.e., better language-to-language translation?

1

gamerx88 t1_j70rs5v wrote

Without referring to the paper again, my intuition is that a pairwise loss over final outputs does not gel well with how the model is auto-regressively generating the text.

Generation with GPT is basically a token-by-token decoding process, with the previous time steps taken into account. Think about the difference between a supervised learning problem and a reinforcement learning one: the former ignores the step-by-step nature of the generation scheme and is a poorer fit for a decoding problem.

1
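For anyone less familiar with the decoding loop being described, a minimal greedy version (a HuggingFace-style causal LM interface is assumed):

```python
import torch

# Token-by-token decoding: each new token is chosen from a distribution that is
# conditioned on everything generated so far. Greedy selection keeps it simple.

@torch.no_grad()
def greedy_decode(model, input_ids, max_new_tokens=32, eos_token_id=None):
    for _ in range(max_new_tokens):
        next_logits = model(input_ids).logits[:, -1, :]        # next-token distribution
        next_token = next_logits.argmax(dim=-1, keepdim=True)  # most likely token
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if eos_token_id is not None and (next_token == eos_token_id).all():
            break
    return input_ids
```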

prototypist t1_j71p3d6 wrote

You can fine-tune language models on a dataset, and that's essentially how people have typically been doing NLP with transformer models. It's only more recently that research has been having success with RL for these kinds of tasks. So whatever rationale and answers you get here, the main reason is that they were doing supervised learning before, and the RL people started getting better results.

1

blimpyway t1_j742oes wrote

I guess the point of the reward model is to approximate human feedback: instead of hiring humans to actually rank (e.g.) the 1 billion chats needed to update the LLM, train a reward model on 1% of them, then use it to simulate human evaluators for the other 99%.

1