crt09 t1_j6y5x4t wrote
Reply to comment by koolaidman123 in [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
This paper seems very relevant: https://arxiv.org/abs/2205.13636. I haven't read it closely enough to give strong opinions with confidence, but it seems to beat PPO with a token-level loss that works similarly to the Upside-Down Reinforcement Learning paper: you give a target reward rank between 1 and 5 as an input token before the prompt and train the model, with the standard LM loss on an existing target output, to produce a response of the corresponding quality for that 1-5 rank. Then during inference you just prepend 1 (the top rank) to the prompt and it outputs a high-quality response.
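
Rough idea in code (a minimal sketch, not the paper's implementation): it assumes a HuggingFace causal LM with gpt2 as a placeholder, and the `<reward:N>` conditioning format and toy data are made up for illustration.

    # Hedged sketch of reward-conditioned fine-tuning ("upside-down RL" style).
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_name = "gpt2"  # placeholder; any causal LM would do
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    # Toy data: (reward rank 1-5 with 1 = best, prompt, existing target output).
    data = [
        (1, "Explain photosynthesis.", "Plants convert light into chemical energy..."),
        (4, "Explain photosynthesis.", "idk plants eat sun"),
    ]

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    model.train()
    for rank, prompt, response in data:
        # Condition on the reward rank by prepending it as text before the prompt,
        # then train with the ordinary LM cross-entropy loss. A real setup would
        # likely use dedicated special tokens and mask the prompt from the loss.
        text = f"<reward:{rank}> {prompt} {response}{tokenizer.eos_token}"
        batch = tokenizer(text, return_tensors="pt")
        out = model(**batch, labels=batch["input_ids"])
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Inference: ask for the top rank (1) and generate a response.
    model.eval()
    query = tokenizer("<reward:1> Explain photosynthesis.", return_tensors="pt")
    generated = model.generate(**query, max_new_tokens=50)
    print(tokenizer.decode(generated[0], skip_special_tokens=True))

The point is just that the "RL" signal enters only through the conditioning token, so training stays plain supervised LM fine-tuning.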