Submitted by alpha-meta t3_10rpj0f in MachineLearning
Aligned LLMs such as InstructGPT and ChatGPT are trained via supervised fine-tuning after the initial self-supervised pretraining. Then, the researchers train a reward model on responses ranked by humans.
If I understand correctly, they let the LLM generate responses that humans have to rank on a scale from 1-5. Then, they train a reward model (I suppose in a supervised fashion?) on these ranked outputs. Once that's done, they use reinforcement learning (RL) with proximal policy optimization (PPO) to update the LLM.
My question is: why do they use RL with PPO for this last step? Why don't they fine-tune the LLM using regular supervised learning, where the human-ranked outputs serve as the labels? Since these are labels in the range 1-5, this could be a ranking or ordinal regression loss for supervised learning.
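For concreteness, here is a minimal sketch of what I mean by training the reward model in a supervised fashion on ranked responses. The pairwise ranking loss below is one common formulation (comparisons between a preferred and a rejected response); the function names and tensor shapes are just illustrative, not anyone's actual implementation:

```python
import torch
import torch.nn.functional as F

# Illustrative only: reward_model maps (prompt + response) token ids to a scalar score.
# A human ranking of k responses yields one training pair per (preferred, rejected) combination.
def pairwise_ranking_loss(reward_model, preferred_ids, rejected_ids):
    r_preferred = reward_model(preferred_ids)   # shape: (batch,)
    r_rejected = reward_model(rejected_ids)     # shape: (batch,)
    # Push the score of the preferred response above that of the rejected one:
    # loss = -log sigmoid(r_preferred - r_rejected)
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```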
koolaidman123 t1_j6wtmdj wrote
As to why RL outperforms SFT: not a lot of orgs have the resources to test this (yet), but I've heard a plausible theory from AI2 that the main difference comes from the fact that SFT uses a token-level loss, whereas the RL loss considers the entire sequence. So maybe instead of RL being "better", it's just that the next-token prediction task is worse. (A rough sketch of the contrast is below.)
Researchers I've spoken with don't believe RL is the critical component enabling these models, and think we could eventually discover the right training regime for SFT to perform on par with (or better than) RL.
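To illustrate the token-level vs sequence-level distinction mentioned above, here is a rough sketch. The second loss is a REINFORCE-style simplification rather than actual PPO, and the shapes and names are just for illustration:

```python
import torch
import torch.nn.functional as F

# Token-level SFT loss: every token of the target response contributes its own
# cross-entropy term, regardless of how good the response is as a whole.
def sft_loss(logits, target_ids):
    # logits: (batch, seq_len, vocab), target_ids: (batch, seq_len)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))

# Sequence-level RL-style loss (REINFORCE sketch, much simpler than PPO):
# a single scalar reward for the whole sampled response scales the summed
# log-probability of that response.
def reinforce_loss(logits, sampled_ids, sequence_reward):
    # sampled_ids: (batch, seq_len), sequence_reward: (batch,)
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    return -(sequence_reward * token_log_probs.sum(dim=-1)).mean()
```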