Training language models to follow instructions with human feedback
Created on 2023-04-16T01:04:16-05:00
Authors: Long Ouyang and Jeff Wu and Xu Jiang and Diogo Almeida and Carroll L. Wainwright and Pamela Mishkin and Chong Zhang and Sandhini Agarwal and Katarina Slama and Alex Ray and John Schulman and Jacob Hilton and Fraser Kelton and Luke Miller and Maddie Simens and Amanda Askell and Peter Welinder and Paul Christiano and Jan Leike and Ryan Lowe
Year: 2022
- Human contractors are shown multiple model responses to the same prompt and rank them, selecting a winner
- A "reward model" is trained to predict which response the humans preferred, assigning each response a scalar score (see the loss sketch after this list)
- The GPT model is then "fine-tuned" with Proximal Policy Optimization (PPO), using the reward model's scores as the reward signal (see the PPO reward sketch at the end of these notes)
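
A minimal sketch of the pairwise comparison loss the paper uses to train the reward model: the model is pushed to score the human-preferred response above the rejected one. The names `pairwise_reward_loss`, `chosen_ids`, and `rejected_ids` are illustrative, and the sketch assumes a PyTorch reward model that maps a (prompt + response) token sequence to a scalar score.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise comparison loss for training the reward model.

    chosen_ids / rejected_ids: (batch, seq_len) token ids for the
    prompt plus the human-preferred / rejected response.
    """
    r_chosen = reward_model(chosen_ids)      # (batch,) scalar rewards
    r_rejected = reward_model(rejected_ids)  # (batch,) scalar rewards
    # loss = -log sigmoid(r_chosen - r_rejected): maximize the margin
    # by which the preferred response outscores the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```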
- The resulting model is referred to as "InstructGPT."
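
And a rough sketch of the reward used during PPO fine-tuning: the paper combines the reward model's score with a per-token KL penalty that keeps the fine-tuned policy close to the original supervised model. The function name and the `beta` coefficient value here are assumptions, not from the paper's code.

```python
import torch

def ppo_sequence_reward(rm_score: torch.Tensor,
                        logprobs_policy: torch.Tensor,
                        logprobs_ref: torch.Tensor,
                        beta: float = 0.02) -> torch.Tensor:
    """Reward for one sampled response during PPO fine-tuning.

    rm_score:        (batch,) scalar scores from the trained reward model
    logprobs_policy: (batch, resp_len) log-probs of the response tokens
                     under the policy being fine-tuned
    logprobs_ref:    (batch, resp_len) log-probs of the same tokens under
                     the frozen pre-RL model
    beta:            KL-penalty coefficient (illustrative value)
    """
    # The summed per-token log-ratio estimates the KL divergence from the
    # reference model over this response; penalizing it keeps the policy
    # from drifting far from the original model while chasing reward.
    kl = (logprobs_policy - logprobs_ref).sum(dim=-1)
    return rm_score - beta * kl
```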