
hblarm t1_j712p9u wrote

For tasks like summarisation and abstractive question answering, there is no single correct way to phrase the target sequence/answer.

“Some of the cups contained brown liquid” means almost the same as “A few vessels had brown fluid in them”. Now imagine how many different ways you could phrase a 4 paragraph essay on globalisation.

In SL, the model is forced to learn the precise answer you feed it, and metrics like ROUGE penalise the use of synonyms. This causes models to perform badly when evaluated on human preference. The only reliable way to train/evaluate a model to impress humans is to directly incorporate human preferences into training.
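To see the synonym problem concretely, here is a quick sketch using the `rouge-score` package (one common ROUGE implementation, not necessarily what the authors used) on the two sentences from above. The only overlapping content word is "brown", so the score is low even though the meaning is nearly identical:

```python
# Illustrative only: assumes the rouge-score package (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "Some of the cups contained brown liquid"
candidate = "A few vessels had brown fluid in them"

# Very little word overlap, so ROUGE is low despite the matching meaning.
print(scorer.score(reference, candidate))
```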

Human preference doesn’t lend itself well to SL, because of the unlimited possible phrasings of sentences, so instead the authors train a reward function that can estimate human preference, and use RL to update the model weights to produce better and better predictions. Any valid, nicely written phrasing will now get a good score.
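Here is a toy, self-contained sketch of that loop (a simple REINFORCE-style update against a stand-in reward model, not the actual setup or code from the paper; the models, vocabulary, and reward are all made up for illustration):

```python
# Toy RLHF-style loop: sample from a policy, score with a learned-reward
# stand-in, and push up the probability of high-reward outputs.
import torch
import torch.nn as nn

vocab, seq_len = 100, 8

class TinyPolicy(nn.Module):
    """Stand-in for the summarisation model: emits a token distribution."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(seq_len, vocab))

    def sample(self):
        dist = torch.distributions.Categorical(logits=self.logits)
        tokens = dist.sample()                       # one "summary"
        return tokens, dist.log_prob(tokens).sum()   # tokens + total log-prob

def reward_model(tokens):
    """Stand-in for the learned human-preference model."""
    return (tokens < 50).float().mean()  # pretend humans prefer low token ids

policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=0.1)

for step in range(200):
    tokens, log_prob = policy.sample()
    reward = reward_model(tokens)
    loss = -reward * log_prob            # REINFORCE: reinforce high-reward samples
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key point is that the gradient comes from the reward model's estimate of human preference, not from token-by-token matching against a single reference summary.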

Importantly, the model they start with is already near SOTA on the summarisation tasks in question, so RL can push it further and further towards human preferences.

In a nutshell, RL allows human preference to be trained on directly, which allows the model to exhibit remarkable creativity.
