gamerx88 t1_j70rs5v wrote
Without referring to the paper again, my intuition is that a pairwise loss over final outputs does not gel well with how the model autoregressively generates text.
Generation with GPT is basically a token-by-token decoding process, where each step conditions on all of the previous time steps. Think about the difference between a supervised learning problem and a reinforcement learning one: a pairwise loss over final outputs, like supervised learning, ignores the step-by-step nature of the generation scheme, which makes it a poorer fit for a sequential decoding problem (see the sketch below).
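To make that step-by-step structure concrete, here's a minimal greedy-decoding sketch (assuming Hugging Face `transformers` and GPT-2; none of this is from the paper being discussed). The point is just that every token is a separate decision conditioned on everything generated so far, which a loss computed only over the completed output never sees directly:

```python
# Minimal sketch of autoregressive, token-by-token decoding with a
# GPT-style model (assumes the Hugging Face transformers library).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Hypothetical prompt, just for illustration.
input_ids = tokenizer.encode("The pairwise loss", return_tensors="pt")

with torch.no_grad():
    for _ in range(20):  # generate 20 tokens, one step at a time
        logits = model(input_ids).logits            # (1, seq_len, vocab)
        next_token = logits[:, -1, :].argmax(-1)    # greedy pick at this step
        # Each step conditions on the full history decoded so far,
        # so generation is a sequence of decisions, not one shot.
        input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

An RL view treats each of those per-step picks as an action, whereas a pairwise loss only scores the finished sequence.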