Training language models to follow instructions with human feedback
Created on 2023-04-16T01:04:16-05:00
Authors: Long Ouyang and Jeff Wu and Xu Jiang and Diogo Almeida and Carroll L. Wainwright and Pamela Mishkin and Chong Zhang and Sandhini Agarwal and Katarina Slama and Alex Ray and John Schulman and Jacob Hilton and Fraser Kelton and Luke Miller and Maddie Simens and Amanda Askell and Peter Welinder and Paul Christiano and Jan Leike and Ryan Lowe
Year: 2022
- Human contractors are shown multiple model responses to the same prompt and rank them, selecting a winner
- A "reward model" is trained to predict which response the humans preferred, assigning each response a scalar score (see the loss sketch after this list)
- The GPT model is then "fine-tuned" with Proximal Policy Optimization (PPO), using the reward model's scores as the reward signal (see the PPO reward sketch at the end of these notes)
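
A minimal sketch of the pairwise comparison loss the paper uses to train the reward model: the model is pushed to score the human-preferred response above the rejected one. The names `pairwise_reward_loss`, `chosen_ids`, and `rejected_ids` are illustrative, and the sketch assumes a PyTorch reward model that maps a (prompt + response) token sequence to a scalar score.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise comparison loss for training the reward model.

    chosen_ids / rejected_ids: (batch, seq_len) token ids for the
    prompt plus the human-preferred / rejected response.
    """
    r_chosen = reward_model(chosen_ids)      # (batch,) scalar rewards
    r_rejected = reward_model(rejected_ids)  # (batch,) scalar rewards
    # loss = -log sigmoid(r_chosen - r_rejected): maximize the margin
    # by which the preferred response outscores the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```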
- The resulting model is referred to as "InstructGPT."
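
And a rough sketch of the reward used during PPO fine-tuning: the paper combines the reward model's score with a per-token KL penalty that keeps the fine-tuned policy close to the original supervised model. The function name and the `beta` coefficient value here are assumptions, not from the paper's code.

```python
import torch

def ppo_sequence_reward(rm_score: torch.Tensor,
                        logprobs_policy: torch.Tensor,
                        logprobs_ref: torch.Tensor,
                        beta: float = 0.02) -> torch.Tensor:
    """Reward for one sampled response during PPO fine-tuning.

    rm_score:        (batch,) scalar scores from the trained reward model
    logprobs_policy: (batch, resp_len) log-probs of the response tokens
                     under the policy being fine-tuned
    logprobs_ref:    (batch, resp_len) log-probs of the same tokens under
                     the frozen pre-RL model
    beta:            KL-penalty coefficient (illustrative value)
    """
    # The summed per-token log-ratio estimates the KL divergence from the
    # reference model over this response; penalizing it keeps the policy
    # from drifting far from the original model while chasing reward.
    kl = (logprobs_policy - logprobs_ref).sum(dim=-1)
    return rm_score - beta * kl
```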