was_der_Fall_ist t1_j6xz6wj wrote
Reply to comment by koolaidman123 in [D] Why do LLMs like InstructGPT and LLM use RL instead of supervised learning to learn from the user-ranked examples? by alpha-meta
ChatGPT had labelers rank outputs from best to worst, not head-to-head. (Different from InstructGPT, maybe?)
>A prompt and several outputs are generated. A labeler ranks the outputs from best to worst.
koolaidman123 t1_j6y07he wrote
Have you even read the InstructGPT paper?
>In Stiennon et al. (2020), the RM is trained on a dataset of comparisons between two model outputs on the same input. They use a cross-entropy loss, with the comparisons as labels; the difference in rewards represents the log odds that one response will be preferred to the other by a human labeler. In order to speed up comparison collection, we present labelers with anywhere between K = 4 and K = 9 responses to rank. This produces $\binom{K}{2}$ comparisons for each prompt shown to a labeler. Since comparisons are very correlated within each labeling task, we found that if we simply shuffle the comparisons into one dataset, a single pass over the dataset caused the reward model to overfit. Instead, we train on all $\binom{K}{2}$ comparisons from each prompt as a single batch element. This is much more computationally efficient because it only requires a single forward pass of the RM for each completion (rather than $\binom{K}{2}$ forward passes for K completions) and, because it no longer overfits, it achieves much improved validation accuracy and log loss. Specifically, the loss function for the reward model is:
>
>$$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} \, \mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log\left(\sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right)\right] \tag{1}$$
>
>where $r_\theta(x, y)$ is the scalar output of the reward model for prompt $x$ and completion $y$ with parameters $\theta$, $y_w$ is the preferred completion out of the pair of $y_w$ and $y_l$, and $D$ is the dataset of human comparisons.
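For concreteness, here is a minimal PyTorch sketch of that pairwise loss (my own illustration, not the paper's code). It assumes the reward model has already produced one scalar per completion for a single prompt, with the completions pre-sorted by the labeler from best to worst, and it averages the logistic loss over all $\binom{K}{2}$ pairs as one batch element:

```python
import torch
import torch.nn.functional as F
from itertools import combinations

def reward_model_loss(rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss over K completions for one prompt.

    `rewards` holds the scalar reward-model outputs r_theta(x, y) for the
    K completions of a single prompt, sorted by the labeler's ranking
    (index 0 = most preferred). All C(K, 2) pairs from this prompt are
    treated as one batch element, so the RM runs once per completion.
    """
    losses = []
    for w, l in combinations(range(rewards.shape[0]), 2):
        # rewards[w] is the preferred completion in this pair
        losses.append(-F.logsigmoid(rewards[w] - rewards[l]))
    return torch.stack(losses).mean()  # mean over the C(K, 2) comparisons

# Example with K = 4 completions and made-up reward-model outputs
rewards = torch.tensor([1.3, 0.9, 0.2, -0.5], requires_grad=True)
loss = reward_model_loss(rewards)
loss.backward()
```

So a single best-to-worst ranking of K = 4 outputs already yields 6 head-to-head comparisons; that's the whole point of collecting rankings instead of isolated pairs.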
You know that figure you're referencing comes from the InstructGPT paper... right?