Long Ouyang$^*$
Jeff Wu$^*$
Xu Jiang$^*$
Diogo Almeida$^*$
Carroll L. Wainwright$^*$
Pamela Mishkin$^*$
Chong Zhang
Sandhini Agarwal
Katarina Slama
Alex Ray
John Schulman
Jacob Hilton
Fraser Kelton
Luke Miller
Maddie Simens
Amanda Askell$^\dagger$
Peter Welinder
Paul Christiano$^{*\dagger}$
Jan Leike$^*$
Ryan Lowe$^*$
OpenAI
$^*$Primary authors. This was a joint project of the OpenAI Alignment team. RL and JL are the team leads. Corresponding author: [email protected].
$^\dagger$Work done while at OpenAI. Current affiliations: AA: Anthropic; PC: Alignment Research Center.
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

Executive Summary: Large language models like GPT-3 have transformed how we process and generate text, powering applications from chatbots to content creation. However, these models often fail to align with user needs: they invent facts, produce biased or toxic outputs, and ignore instructions, leading to unreliable results in real-world use. This misalignment poses risks as models grow larger and more widespread, potentially amplifying misinformation or harm in deployed systems. Addressing it now is critical to ensure AI tools benefit users safely and effectively.
This paper evaluates a method to better align language models with human intent, focusing on making them helpful (following instructions accurately), honest (avoiding falsehoods), and harmless (reducing bias and toxicity). Researchers at OpenAI tested fine-tuning GPT-3 on a broad set of user prompts to demonstrate improved behavior across diverse tasks.
The approach involved three main steps using data from about 40 trained human labelers and prompts submitted to OpenAI's API (covering tasks like writing, question-answering, and summarization, mostly in English). First, labelers created example responses to prompts, used to fine-tune GPT-3 via supervised learning, producing baseline models. Second, labelers ranked multiple model outputs per prompt, training a smaller model to predict preferences as a reward signal. Third, reinforcement learning optimized the baseline models against this reward, incorporating techniques to limit deviations from original training data. Models were trained at scales of 1.3 billion, 6 billion, and 175 billion parameters; key assumptions included prioritizing labeler preferences over broader values and focusing on API-like tasks, with data collected over several months excluding sensitive user details.
The most important results show that the resulting InstructGPT models outperform GPT-3 in human evaluations. Outputs from the smallest 1.3B InstructGPT were preferred over the much larger 175B GPT-3 about 58% of the time on held-out prompts, even when GPT-3 used few-shot examples to guide it—demonstrating alignment's efficiency without needing massive scale. Truthfulness improved markedly: InstructGPT hallucinated facts half as often (21% vs. 41% rate) on tasks requiring input-based responses, and gave accurate answers twice as frequently on a benchmark of misleading questions. Toxicity dropped by about 25% when prompted respectfully, though bias metrics showed little change. Public benchmarks for tasks like question-answering and translation saw small regressions (e.g., 5-10% drops), but modifying the training by blending in original pretraining data largely reversed these without harming alignment gains. Models also generalized somewhat to new labelers (similar preference rates) and rare tasks like non-English instructions or code summarization.
These findings matter because they show human feedback can make models safer and more reliable at a fraction of the cost of scaling up—training the 175B InstructGPT used just 2% of GPT-3's compute while yielding better user satisfaction. This reduces risks like spreading falsehoods in education or advice apps, and eases deployment by minimizing unintended harms. Unexpectedly, models fine-tuned on public NLP datasets (as in prior work) underperformed those trained on API prompts, suggesting real-user tasks demand broader training. Overall, alignment boosts performance where it counts but highlights that bigger isn't always better for intent-following.
Next, organizations should adopt and iterate on this human-feedback fine-tuning for production models, starting with pilots on high-risk applications like customer support. Key actions include expanding labeler diversity for broader representation, combining with data filtering to cut toxicity further, and testing refusals for harmful requests (e.g., biased content). Trade-offs: more diverse feedback raises costs but improves fairness; aggressive safety tweaks might limit creativity in safe uses. Before full rollout, conduct targeted studies on edge cases like multilingual or high-stakes domains.
Limitations include reliance on a non-representative labeler group (mostly US/Southeast Asian English speakers), leading to potential cultural biases, and incomplete safety—models still followed harmful instructions 10-20% more toxically when prompted to do so. Data gaps exist for non-English tasks (under 4% of prompts), and regressions persist on some benchmarks despite fixes. Confidence is high in preference and truthfulness gains (backed by 95% intervals from thousands of evaluations), but moderate for toxicity/bias due to proxy metrics; caution is needed for underrepresented groups or adversarial inputs.
Section Summary: Large language models like GPT-3 can handle various language tasks when given examples, but they often produce unintended issues such as fabricating facts, showing bias, or ignoring instructions because their core training goal—predicting web text—doesn't align with being helpful and safe for users. To address this, the researchers fine-tune these models using reinforcement learning from human feedback, where people rank outputs to train models that are helpful, honest, and harmless, resulting in the InstructGPT series. Their tests show that even smaller InstructGPT models outperform the much larger GPT-3 in following instructions and generating truthful responses, with modest gains in reducing toxicity but little improvement in bias, while minimizing drops in performance on standard language benchmarks.
Large language models (LMs) can be "prompted" to perform a range of natural language processing (NLP) tasks, given some examples of the task as input. However, these models often express unintended behaviors such as making up facts, generating biased or toxic text, or simply not following user instructions ([1, 2, 3, 4, 5, 6]). This is because the language modeling objective used for many recent large LMs—predicting the next token on a webpage from the internet—is different from the objective "follow the user's instructions helpfully and safely" ([7, 8, 9, 10, 11]). Thus, we say that the language modeling objective is misaligned. Averting these unintended behaviors is especially important for language models that are deployed and used in hundreds of applications.
We make progress on aligning language models by training them to act in accordance with the user's intention ([12]). This encompasses both explicit intentions such as following instructions and implicit intentions such as staying truthful, and not being biased, toxic, or otherwise harmful. Using the language of [13], we want language models to be helpful (they should help the user solve their task), honest (they shouldn't fabricate information or mislead the user), and harmless (they should not cause physical, psychological, or social harm to people or the environment). We elaborate on the evaluation of these criteria in Section 3.6.
We focus on fine-tuning approaches to aligning language models. Specifically, we use reinforcement learning from human feedback (RLHF; [14, 15]) to fine-tune GPT-3 to follow a broad class of written instructions (see Figure 2). This technique uses human preferences as a reward signal to fine-tune our models. We first hire a team of 40 contractors to label our data, based on their performance on a screening test (see Section 3.4 and Appendix B.1 for more details). We then collect a dataset of human-written demonstrations of the desired output behavior on (mostly English) prompts submitted to the OpenAI API[^2] and some labeler-written prompts, and use this to train our supervised learning baselines. Next, we collect a dataset of human-labeled comparisons between outputs from our models on a larger set of API prompts. We then train a reward model (RM) on this dataset to predict which model output our labelers would prefer. Finally, we use this RM as a reward function and fine-tune our supervised learning baseline to maximize this reward using the PPO algorithm ([16]). We illustrate this process in Figure 2. This procedure aligns the behavior of GPT-3 to the stated preferences of a specific group of people (mostly our labelers and researchers), rather than any broader notion of "human values"; we discuss this further in Section 5.2. We call the resulting models InstructGPT.
[^2]: Specifically, we train on prompts submitted to earlier versions of the InstructGPT models on the OpenAI API Playground, which were trained only using demonstration data. We filter out prompts containing PII.

We mainly evaluate our models by having our labelers rate the quality of model outputs on our test set, consisting of prompts from held-out customers (who are not represented in the training data). We also conduct automatic evaluations on a range of public NLP datasets. We train three model sizes (1.3B, 6B, and 175B parameters), and all of our models use the GPT-3 architecture. Our main findings are as follows:
Labelers significantly prefer InstructGPT outputs over outputs from GPT-3.
On our test set, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having over 100x fewer parameters. These models have the same architecture, and differ only by the fact that InstructGPT is fine-tuned on our human data. This result holds true even when we add a few-shot prompt to GPT-3 to make it better at following instructions. Outputs from our 175B InstructGPT are preferred to 175B GPT-3 outputs 85 $\pm$ 3% of the time, and preferred 71 $\pm$ 4% of the time to few-shot 175B GPT-3. InstructGPT models also generate more appropriate outputs according to our labelers, and more reliably follow explicit constraints in the instruction.
InstructGPT models show improvements in truthfulness over GPT-3.
On the TruthfulQA benchmark, InstructGPT generates truthful and informative answers about twice as often as GPT-3. Our results are equally strong on the subset of questions that were not adversarially selected against GPT-3. On "closed-domain" tasks from our API prompt distribution, where the output should not contain information that is not present in the input (e.g. summarization and closed-domain QA), InstructGPT models make up information not present in the input about half as often as GPT-3 (a 21% vs. 41% hallucination rate, respectively).
InstructGPT shows small improvements in toxicity over GPT-3, but not bias.
To measure toxicity, we use the RealToxicityPrompts dataset ([6]) and conduct both automatic and human evaluations. InstructGPT models generate about 25% fewer toxic outputs than GPT-3 when prompted to be respectful. InstructGPT does not significantly improve over GPT-3 on the Winogender ([17]) and CrowS-Pairs ([18]) datasets.
We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure.
During RLHF fine-tuning, we observe performance regressions compared to GPT-3 on certain public NLP datasets, notably SQuAD ([19]), DROP ([20]), HellaSwag ([21]), and WMT 2015 French to English translation ([22]). This is an example of an "alignment tax" since our alignment procedure comes at the cost of lower performance on certain tasks that we may care about. We can greatly reduce the performance regressions on these datasets by mixing PPO updates with updates that increase the log likelihood of the pretraining distribution (PPO-ptx), without compromising labeler preference scores.
Our models generalize to the preferences of "held-out" labelers that did not produce any training data.
To test the generalization of our models, we conduct a preliminary experiment with held-out labelers, and find that they prefer InstructGPT outputs to outputs from GPT-3 at about the same rate as our training labelers. However, more work is needed to study how these models perform on broader groups of users, and how they perform on inputs where humans disagree about the desired behavior.
Public NLP datasets are not reflective of how our language models are used.
We compare GPT-3 fine-tuned on our human preference data (i.e. InstructGPT) to GPT-3 fine-tuned on two different compilations of public NLP tasks: FLAN ([23]) and T0 ([24]) (in particular, the T0++ variant). These datasets consist of a variety of NLP tasks, combined with natural language instructions for each task. On our API prompt distribution, our FLAN and T0 models perform slightly worse than our SFT baseline, and labelers significantly prefer InstructGPT to these models (InstructGPT has a 73.4 $\pm$ 2% winrate vs. our baseline, compared to 26.8 $\pm$ 2% and 29.8 $\pm$ 2% winrates for our versions of T0 and FLAN, respectively).
InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution.
We qualitatively probe InstructGPT's capabilities, and find that it is able to follow instructions for summarizing code, answer questions about code, and sometimes follows instructions in different languages, despite these instructions being very rare in the fine-tuning distribution. In contrast, GPT-3 can perform these tasks but requires more careful prompting, and does not usually follow instructions in these domains. This result is exciting because it suggests that our models are able to generalize the notion of "following instructions." They retain some alignment even on tasks for which they get very little direct supervision signal.
InstructGPT still makes simple mistakes.
For example, InstructGPT can still fail to follow instructions, make up facts, give long hedging answers to simple questions, or fail to detect instructions with false premises.
Overall, our results indicate that fine-tuning large language models using human preferences significantly improves their behavior on a wide range of tasks, though much work remains to be done to improve their safety and reliability.
The rest of this paper is structured as follows: We first detail related work in Section 2, before diving into our method and experiment details in Section 3, including our high-level methodology (Section 3.1), task and dataset details (Section 3.3 and Section 3.2), human data collection (Section 3.4), how we trained our models (Section 3.5), and our evaluation procedure (Section 3.6). We then present our results in Section 4, divided into three parts: results on the API prompt distribution (Section 4.1), results on public NLP datasets (Section 4.2), and qualitative results (Section 4.3). Finally we give an extended discussion of our work in Section 5, including implications for alignment research (Section 5.1), what we are aligning to (Section 5.2), limitations (Section 5.3), open questions (Section 5.4), and broader impacts of this work (Section 5.5).
Section Summary: This section reviews prior research on aligning language models with human goals, building on techniques like reinforcement learning from human feedback, which started with robots and games but now helps models summarize text and handle tasks like dialogue or translation more safely and effectively. It also covers training models to follow instructions across various natural language processing tasks for better performance on new challenges, including navigation in simulated worlds, while highlighting documented risks such as biased outputs, data leaks, misinformation, and toxicity that can arise from deploying these models. Finally, it discusses methods to reduce these harms, from fine-tuning on targeted data and filtering training sets to adding safety controls during generation and using additional models to steer outputs away from problematic content.
Research on alignment and learning from human feedback.
We build on previous techniques to align models with human intentions, particularly reinforcement learning from human feedback (RLHF). Originally developed for training simple robots in simulated environments and Atari games ([14, 25]), it has recently been applied to fine-tuning language models to summarize text ([26, 15, 27, 28]). This work is in turn influenced by similar work using human feedback as a reward in domains such as dialogue ([29, 30, 31]), translation ([32, 33]), semantic parsing ([34]), story generation ([35]), review generation ([36]), and evidence extraction ([37]). [38] use written human feedback to augment prompts and improve the performance of GPT-3. There has also been work on aligning agents in text-based environments using RL with a normative prior ([39]). Our work can be seen as a direct application of RLHF to aligning language models on a broad distribution of language tasks.
The question of what it means for language models to be aligned has also received attention recently ([40]). [3] catalog behavioral issues in LMs that result from misalignment, including producing harmful content and gaming misspecified objectives. In concurrent work, [13] propose language assistants as a testbed for alignment research, study some simple baselines, and their scaling properties.
Training language models to follow instructions.
Our work is also related to research on cross-task generalization in language models, where LMs are fine-tuned on a broad range of public NLP datasets (usually prefixed with an appropriate instruction) and evaluated on a different set of NLP tasks. There has been a range of work in this domain ([30, 41, 23, 42, 24, 43]), which differ in training and evaluation data, formatting of instructions, size of pretrained models, and other experimental details. A consistent finding across studies is that fine-tuning LMs on a range of NLP tasks, with instructions, improves their downstream performance on held-out tasks, both in the zero-shot and few-shot settings.
There is also a related line of work on instruction following for navigation, where models are trained to follow natural language instructions to navigate in a simulated environment ([44, 45, 46]).
Evaluating the harms of language models.
A goal of modifying the behavior of language models is to mitigate the harms of these models when they're deployed in the real world. These risks have been extensively documented ([1, 2, 3, 4, 5]). Language models can produce biased outputs ([47, 48, 49, 50, 51]), leak private data ([52]), generate misinformation ([53, 54]), and be used maliciously; for a thorough review we direct the reader to [4]. Deploying language models in specific domains gives rise to new risks and challenges, for example in dialog systems ([55, 56, 57]). There is a nascent but growing field that aims to build benchmarks to concretely evaluate these harms, particularly around toxicity ([6]), stereotypes ([58]), and social bias ([47, 18, 17]). Making significant progress on these problems is hard since well-intentioned interventions on LM behavior can have side-effects ([59, 60]); for instance, efforts to reduce the toxicity of LMs can reduce their ability to model text from under-represented groups, due to prejudicial correlations in the training data ([61]).
Modifying the behavior of language models to mitigate harms.
There are many ways to change the generation behavior of language models. [62] fine-tune LMs on a small, value-targeted dataset, which improves the models' ability to adhere to these values on a question answering task. [63] filter the pretraining dataset by removing documents on which a language model has a high conditional likelihood of generating a set of researcher-written trigger phrases. When trained on this filtered dataset, their LMs generate less harmful text, at the cost of a slight decrease in language modeling performance. [56] use a variety of approaches to improve the safety of chatbots, including data filtering, blocking certain words or n-grams during generation, safety-specific control tokens ([64, 65]), and human-in-the-loop data collection ([57]). Other approaches for mitigating bias in LM generations use word embedding regularization ([66, 67]), data augmentation ([66, 65, 68]), null space projection to make the distribution over sensitive tokens more uniform ([48]), different objective functions ([69]), or causal mediation analysis ([70]). There is also work on steering the generation of language models using a second (usually smaller) language model ([71, 72]), and variants of this idea have been applied to reducing language model toxicity ([73]).
Section Summary: The researchers outline a three-step method to fine-tune language models for better alignment with human preferences, starting with human labelers creating example responses to prompts to train a basic supervised model, followed by collecting pairwise comparisons to build a reward model that scores outputs, and finally using reinforcement learning (PPO) to refine the model against that reward, with the process repeatable for improvement. Their dataset draws mainly from user-submitted prompts to an early version of OpenAI's InstructGPT model via a public interface, supplemented by labeler-written prompts for diversity, with splits totaling around 13,000 for supervised training, 33,000 for reward modeling, and 31,000 for reinforcement learning, while filtering out personal information and focusing on English-language tasks. The prompts cover a wide range of activities like story generation, question answering, and summarization, often presented as direct instructions, few-shot examples, or partial continuations, ensuring the model handles varied natural language needs.
Our methodology follows that of [26] and [15], who applied it in the stylistic continuation and summarization domains. We start with a pretrained language model ([7, 8, 9, 10, 11]), a distribution of prompts on which we want our model to produce aligned outputs, and a team of trained human labelers (see Section 3.4 for details). We then apply the following three steps (Figure 2).
Step 1: Collect demonstration data, and train a supervised policy.
Our labelers provide demonstrations of the desired behavior on the input prompt distribution (see Section 3.2 for details on this distribution). We then fine-tune a pretrained GPT-3 model on this data using supervised learning.
Step 2: Collect comparison data, and train a reward model.
We collect a dataset of comparisons between model outputs, where labelers indicate which output they prefer for a given input. We then train a reward model to predict the human-preferred output.
Step 3: Optimize a policy against the reward model using PPO.
We use the output of the RM as a scalar reward. We fine-tune the supervised policy to optimize this reward using the PPO algorithm ([16]).
Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best policy, which is used to train a new RM and then a new policy. In practice, most of our comparison data comes from our supervised policies, with some coming from our PPO policies.
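As a rough illustration of how these three steps fit together (including the optional iteration of Steps 2 and 3), the procedure can be sketched in pseudocode as follows; all function and object names (`finetune_supervised`, `train_reward_model`, `labelers.rank`, `ppo_update`) are hypothetical placeholders, not the training code used in this work.

```python
# Illustrative sketch only; every helper here is a hypothetical placeholder for
# the corresponding component described in Steps 1-3 above.
def rlhf_pipeline(pretrained_lm, demonstrations, prompts, labelers, num_iterations=1):
    # Step 1: supervised fine-tuning (SFT) on labeler demonstrations.
    policy = finetune_supervised(pretrained_lm, demonstrations)

    for _ in range(num_iterations):  # Steps 2 and 3 can be iterated
        # Step 2: collect labeler rankings of sampled outputs, then fit a reward model.
        comparisons = []
        for prompt in prompts:
            outputs = [policy.sample(prompt) for _ in range(4)]  # 4 to 9 outputs per prompt
            comparisons.append(labelers.rank(prompt, outputs))
        reward_model = train_reward_model(comparisons)

        # Step 3: optimize the policy against the reward model using PPO.
        for prompt in prompts:
            response = policy.sample(prompt)
            reward = reward_model.score(prompt, response)
            policy = ppo_update(policy, prompt, response, reward)

    return policy
```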

Table 1: Distribution of use case categories from our API prompt dataset.
Our prompt dataset consists primarily of text prompts submitted to the OpenAI API, specifically those using an earlier version of the InstructGPT models (trained via supervised learning on a subset of our demonstration data) on the Playground interface.[^3] Customers using the Playground were informed that their data could be used to train further models via a recurring notification any time InstructGPT models were used. In this paper we do not use data from customers using the API in production. We heuristically deduplicate prompts by checking for prompts that share a long common prefix, and we limit the number of prompts to 200 per user ID. We also create our train, validation, and test splits based on user ID, so that the validation and test sets contain no data from users whose data is in the training set. To avoid the models learning potentially sensitive customer details, we filter all prompts in the training split for personally identifiable information (PII).
[^3]: This is an interface hosted by OpenAI to interact directly with models on our API; see https://beta.openai.com/playground.
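The deduplication and splitting heuristics described above can be sketched roughly as follows. Only the 200-prompt-per-user cap and the split-by-user-ID rule come from the text; the prefix length, split fractions, and function name are illustrative assumptions, and PII filtering of the training split is not shown.

```python
import random
from collections import defaultdict

MAX_PROMPTS_PER_USER = 200   # per-user cap from the text
PREFIX_LEN = 100             # assumed length for the "long common prefix" heuristic

def dedupe_and_split(prompts, valid_frac=0.1, test_frac=0.1, seed=0):
    """prompts: iterable of (user_id, text) pairs. Returns train/valid/test prompt lists."""
    per_user = defaultdict(list)
    seen_prefixes = set()
    for user_id, text in prompts:
        key = (user_id, text[:PREFIX_LEN])
        if key in seen_prefixes:                     # heuristic deduplication
            continue
        seen_prefixes.add(key)
        if len(per_user[user_id]) < MAX_PROMPTS_PER_USER:
            per_user[user_id].append(text)

    # Split by user ID so the validation and test sets contain no users from training.
    users = sorted(per_user)
    random.Random(seed).shuffle(users)
    n_valid, n_test = int(len(users) * valid_frac), int(len(users) * test_frac)
    valid_users = set(users[:n_valid])
    test_users = set(users[n_valid:n_valid + n_test])
    splits = {"train": [], "valid": [], "test": []}
    for user_id, texts in per_user.items():
        split = ("valid" if user_id in valid_users
                 else "test" if user_id in test_users else "train")
        splits[split].extend(texts)
    return splits
```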
To train the very first InstructGPT models, we asked labelers to write prompts themselves. This is because we needed an initial source of instruction-like prompts to bootstrap the process, and these kinds of prompts weren't often submitted to the regular GPT-3 models on the API. We asked labelers to write three kinds of prompts: (1) plain prompts, where labelers come up with an arbitrary task while ensuring the tasks have sufficient diversity; (2) few-shot prompts, where labelers come up with an instruction along with multiple query/response pairs for that instruction; and (3) user-based prompts, corresponding to use cases stated in waitlist applications to the OpenAI API.
From these prompts, we produce three different datasets used in our fine-tuning procedure: (1) our SFT dataset, with labeler demonstrations used to train our SFT models, (2) our RM dataset, with labeler rankings of model outputs used to train our RMs, and (3) our PPO dataset, without any human labels, which are used as inputs for RLHF fine-tuning. The SFT dataset contains about 13k training prompts (from the API and labeler-written), the RM dataset has 33k training prompts (from the API and labeler-written), and the PPO dataset has 31k training prompts (only from the API). More details on dataset sizes are provided in Table 6.
To give a sense of the composition of our dataset, in Table 1 we show the distribution of use-case categories for our API prompts (specifically the RM dataset) as labeled by our contractors. Most of the use-cases are generative, rather than classification or QA. We also show some illustrative prompts (written by researchers to mimic the kinds of prompts submitted to InstructGPT models) in Table 2; more prompts submitted to InstructGPT models are shown in Appendix A.2.1, and prompts submitted to GPT-3 models are shown in Appendix A.2.2. We provide more details about our dataset in Appendix A.
Our training tasks are from two sources: (1) a dataset of prompts written by our labelers and (2) a dataset of prompts submitted to early InstructGPT models on our API (see Table 6). These prompts are very diverse and include generation, question answering, dialog, summarization, extractions, and other natural language tasks (see Table 1). Our dataset is over 96% English; however, in Section 4.3 we also probe our model's ability to respond to instructions in other languages and complete coding tasks.
For each natural language prompt, the task is most often specified directly through a natural language instruction (e.g. "Write a story about a wise frog"), but could also be indirectly through either few-shot examples (e.g. giving two examples of frog stories, and prompting the model to generate a new one) or implicit continuation (e.g. providing the start of a story about a frog). In each case, we ask our labelers to do their best to infer the intent of the user who wrote the prompt, and ask them to skip inputs where the task is very unclear. Moreover, our labelers also take into account the implicit intentions such as truthfulness of the response, and potentially harmful outputs such as biased or toxic language, guided by the instructions we provide them (see Appendix B) and their best judgment.
To produce our demonstration and comparison data, and to conduct our main evaluations, we hired a team of about 40 contractors on Upwork and through ScaleAI. Compared to earlier work that collects human preference data on the task of summarization ([26, 15, 28]), our inputs span a much broader range of tasks, and can occasionally include controversial and sensitive topics. Our aim was to select a group of labelers who were sensitive to the preferences of different demographic groups, and who were good at identifying outputs that were potentially harmful. Thus, we conducted a screening test designed to measure labeler performance on these axes. We selected labelers who performed well on this test; for more information about our selection procedure and labeler demographics, see Appendix B.1.
During training and evaluation, our alignment criteria may come into conflict: for example, when a user requests a potentially harmful response. During training we prioritize helpfulness to the user (not doing so requires making some difficult design decisions that we leave to future work; see Section 5.4 for more discussion). However, in our final evaluations we ask labelers to prioritize truthfulness and harmlessness (since this is what we really care about).
As in [15], we collaborate closely with labelers over the course of the project. We have an onboarding process to train labelers on the project, write detailed instructions for each task (see Appendix B.2), and answer labeler questions in a shared chat room.
As an initial study to see how well our model generalizes to the preferences of other labelers, we hire a separate set of labelers who do not produce any of the training data. These labelers are sourced from the same vendors, but do not undergo a screening test.
Despite the complexity of the task, we find that inter-annotator agreement rates are quite high: training labelers agree with each other 72.6 $\pm$ 1.5% of the time, while for held-out labelers this number is 77.3 $\pm$ 1.3%. For comparison, in the summarization work of [15], researcher-researcher agreement was 73 $\pm$ 4%.
We start with the GPT-3 pretrained language models from [8]. These models are trained on a broad distribution of Internet data and are adaptable to a wide range of downstream tasks, but have poorly characterized behavior. Starting from these models, we then train models with three different techniques:
Supervised fine-tuning (SFT).
We fine-tune GPT-3 on our labeler demonstrations using supervised learning. We trained for 16 epochs, using a cosine learning rate decay, and residual dropout of 0.2. We do our final SFT model selection based on the RM score on the validation set. Similarly to [28], we find that our SFT models overfit on validation loss after 1 epoch; however, we find that training for more epochs helps both the RM score and human preference ratings, despite this overfitting.
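As a minimal sketch of this SFT step, assuming a Hugging Face-style causal language model and a list of (prompt, demonstration) string pairs: the epoch count and cosine decay follow the text, while the batch size, learning rate, and helper names are illustrative assumptions (residual dropout would be set in the model configuration, and in practice the loss on padding tokens would be masked out).

```python
import torch
from torch.utils.data import DataLoader

def supervised_finetune(model, tokenizer, demos, epochs=16, lr=1e-5, batch_size=8):
    """demos: list of (prompt, demonstration) string pairs; assumes tokenizer.pad_token is set."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device).train()
    loader = DataLoader(demos, batch_size=batch_size, shuffle=True,
                        collate_fn=lambda batch: batch)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs * len(loader))
    for _ in range(epochs):
        for batch in loader:
            texts = [prompt + demo for prompt, demo in batch]
            enc = tokenizer(texts, return_tensors="pt", padding=True,
                            truncation=True).to(device)
            # Standard next-token cross-entropy; the model shifts the labels internally.
            loss = model(**enc, labels=enc["input_ids"]).loss
            opt.zero_grad()
            loss.backward()
            opt.step()
            sched.step()
    return model
```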
Reward modeling (RM).
Starting from the SFT model with the final unembedding layer removed, we trained a model to take in a prompt and response, and output a scalar reward. In this paper we only use 6B RMs, as this saves a lot of compute, and we found that 175B RM training could be unstable and thus was less suitable to be used as the value function during RL (see Appendix C for more details).
In [15], the RM is trained on a dataset of comparisons between two model outputs on the same input. They use a cross-entropy loss, with the comparisons as labels—the difference in rewards represents the log odds that one response will be preferred to the other by a human labeler.
In order to speed up comparison collection, we present labelers with anywhere between $K=4$ and $K=9$ responses to rank. This produces ${K \choose 2}$ comparisons for each prompt shown to a labeler. Since comparisons are very correlated within each labeling task, we found that if we simply shuffle the comparisons into one dataset, a single pass over the dataset caused the reward model to overfit.[^4] Instead, we train on all ${K \choose 2}$ comparisons from each prompt as a single batch element. This is much more computationally efficient because it only requires a single forward pass of the RM for each completion (rather than ${K \choose 2}$ forward passes for $K$ completions) and, because it no longer overfits, it achieves much improved validation accuracy and log loss.
[^4]: That is, if each of the possible ${K \choose 2}$ comparisons is treated as a separate data point, then each completion will potentially be used for $K-1$ separate gradient updates. The model tends to overfit after a single epoch, so repeating data within an epoch also causes it to overfit.
Specifically, the loss function for the reward model is:
$ \begin{split} \operatorname{loss}\left(\theta \right)=-\frac{1}{{K \choose 2}}E_{\left(x, y_{w}, y_{l}\right) \sim D}\left[\log \left(\sigma\left(r_{\theta}\left(x, y_{w}\right)-r_{\theta}\left(x, y_{l}\right)\right)\right)\right] \end{split}\tag{1} $
where $ r_{\theta}(x, y) $ is the scalar output of the reward model for prompt $ x $ and completion $ y $ with parameters $ \theta $, $y_{w}$ is the preferred completion out of the pair of $y_{w}$ and $y_{l}$, and $D$ is the dataset of human comparisons.
Finally, since the RM loss is invariant to shifts in reward, we normalize the reward model using a bias so that the labeler demonstrations achieve a mean score of 0 before doing RL.
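A minimal sketch of the loss in Equation 1, computed as described above: one forward pass per completion, with all ${K \choose 2}$ comparisons from a single prompt kept in the same batch element. The `reward_model` callable is an assumption here, mapping a prompt and a list of K completions to K scalar rewards.

```python
import itertools
import torch
import torch.nn.functional as F

def rm_loss_for_prompt(reward_model, prompt, completions, ranking):
    """Pairwise reward-model loss (Eq. 1) for one prompt shown to a labeler.

    completions: the K responses shown for this prompt (K between 4 and 9).
    ranking: indices of `completions` ordered from most to least preferred.
    reward_model(prompt, completions) is assumed to return a tensor of K scalars,
    i.e. a single forward pass per completion rather than one per comparison.
    """
    rewards = reward_model(prompt, completions)            # shape (K,)
    pair_losses = []
    for i, j in itertools.combinations(range(len(ranking)), 2):
        r_w = rewards[ranking[i]]                          # preferred completion y_w
        r_l = rewards[ranking[j]]                          # dispreferred completion y_l
        pair_losses.append(-F.logsigmoid(r_w - r_l))       # -log sigma(r_w - r_l)
    # The mean over pairs implements the 1/(K choose 2) normalization in Eq. 1.
    return torch.stack(pair_losses).mean()
```

After training, the shift invariance noted above can be handled by subtracting a constant bias from the reward so that labeler demonstrations score 0 on average.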
Table 3: Labeler-collected metadata on the API distribution.
| Metadata | Scale |
|---|---|
| Overall quality | Likert scale; 1-7 |
| Fails to follow the correct instruction / task | Binary |
| Inappropriate for customer assistant | Binary |
| Hallucination | Binary |
| Satisfies constraint provided in the instruction | Binary |
| Contains sexual content | Binary |
| Contains violent content | Binary |
| Encourages or fails to discourage violence/abuse/terrorism/self-harm | Binary |
| Denigrates a protected class | Binary |
| Gives harmful advice | Binary |
| Expresses opinion | Binary |
| Expresses moral judgment | Binary |
Reinforcement learning (RL).
Once again following [15], we fine-tuned the SFT model on our environment using PPO ([16]). The environment is a bandit environment which presents a random customer prompt and expects a response to the prompt. Given the prompt and response, it produces a reward determined by the reward model and ends the episode. In addition, we add a per-token KL penalty from the SFT model at each token to mitigate over-optimization of the reward model. The value function is initialized from the RM. We call these models "PPO."
We also experiment with mixing the pretraining gradients into the PPO gradients, in order to fix the performance regressions on public NLP datasets. We call these models "PPO-ptx." We maximize the following combined objective function in RL training:
$ \begin{split} \operatorname{objective}\left(\phi\right)= & E_{\left(x, y\right) \sim D_{\pi_{\phi}^{\mathrm{RL}}}}\left[r_{\theta}(x, y)-\beta \log \left(\pi_{\phi}^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)\right] + \\ & \gamma E_{x \sim D_\textrm{pretrain}}\left[\log(\pi_{\phi}^{\mathrm{RL}}(x))\right] \end{split}\tag{2} $
where $ \pi_{\phi}^{\mathrm{RL}}$ is the learned RL policy, $ \pi^{\mathrm{SFT}}$ is the supervised trained model, and $D_\textrm{pretrain} $ is the pretraining distribution. The KL reward coefficient, $ \beta $, and the pretraining loss coefficient, $ \gamma $, control the strength of the KL penalty and pretraining gradients respectively. For "PPO" models, $ \gamma $ is set to 0. Unless otherwise specified, in this paper InstructGPT refers to the PPO-ptx models.
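As a minimal sketch of the combined objective in Equation 2, evaluated as a scalar to maximize: the full PPO machinery (clipped surrogate objective, value function, advantage estimation) is omitted, the coefficient defaults are placeholders rather than the values used in this work, and sequence-level log-probabilities stand in for the per-token KL penalty.

```python
import torch

def ppo_ptx_objective(rm_scores, policy_logprobs, sft_logprobs,
                      pretrain_logprobs, beta=0.01, gamma=1.0):
    """rm_scores: r_theta(x, y) for sampled (prompt, response) pairs, shape (B,).
    policy_logprobs / sft_logprobs: log pi^RL(y|x) and log pi^SFT(y|x), shape (B,).
    pretrain_logprobs: log pi^RL(x) on pretraining sequences, shape (B_ptx,).
    beta, gamma: KL and pretraining coefficients; gamma = 0 recovers the plain "PPO" models.
    """
    kl_penalty = beta * (policy_logprobs - sft_logprobs)   # estimate of beta * log(pi^RL / pi^SFT)
    rl_term = (rm_scores - kl_penalty).mean()              # first expectation in Eq. 2
    ptx_term = gamma * pretrain_logprobs.mean()            # pretraining term in Eq. 2
    return rl_term + ptx_term                              # maximize (e.g. minimize its negative)
```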
Baselines.
We compare the performance of our PPO models to our SFT models and GPT-3. We also compare to GPT-3 when it is provided a few-shot prefix to 'prompt' it into an instruction-following mode (GPT-3-prompted). This prefix is prepended to the user-specified instruction.[^5]
[^5]: To obtain this prefix, authors RL and DA held a prefix-finding competition: each spent an hour interacting with GPT-3 to come up with their two best prefixes. The winning prefix was the one that led GPT-3 to attain the highest RM score on the prompt validation set. DA won.
We additionally compare InstructGPT to fine-tuning 175B GPT-3 on the FLAN ([23]) and T0 ([24]) datasets, which both consist of a variety of NLP tasks, combined with natural language instructions for each task (the datasets differ in the NLP datasets included, and the style of instructions used). We fine-tune them on approximately 1 million examples respectively and choose the checkpoint which obtains the highest reward model score on the validation set. See Appendix C for more training details.
To evaluate how "aligned" our models are, we first need to clarify what alignment means in this context. The definition of alignment has historically been a vague and confusing topic, with various competing proposals ([74, 12, 40]). Following [12], our aim is to train models that act in accordance with user intentions. More practically, for the purpose of our language tasks, we use a framework similar to [13], who define models to be aligned if they are helpful, honest, and harmless.
To be helpful, the model should follow instructions, but also infer intention from a few-shot prompt or another interpretable pattern such as "Q: question\nA:". Since a given prompt's intention can be unclear or ambiguous, we rely on judgment from our labelers, and our main metric is labeler preference ratings. However, since our labelers are not the users who generated the prompts, there could be a divergence between what a user actually intended and what the labeler thought was intended from only reading the prompt.
It is unclear how to measure honesty in purely generative models; this requires comparing the model's actual output to its "belief" about the correct output, and since the model is a big black box, we can't infer its beliefs. Instead, we measure truthfulness—whether the model's statements about the world are true—using two metrics: (1) evaluating our model's tendency to make up information on closed domain tasks ("hallucinations"), and (2) using the TruthfulQA dataset ([75]). Needless to say, this only captures a small part of what is actually meant by truthfulness.
Similarly to honesty, measuring the harms of language models also poses many challenges. In most cases, the harms from language models depend on how their outputs are used in the real world. For instance, a model generating toxic outputs could be harmful in the context of a deployed chatbot, but might even be helpful if used for data augmentation to train a more accurate toxicity detection model. Earlier in the project, we had labelers evaluate whether an output was 'potentially harmful'. However, we discontinued this as it required too much speculation about how the outputs would ultimately be used; especially since our data also comes from customers who interact with the Playground API interface (rather than from production use cases).
Therefore we use a suite of more specific proxy criteria that aim to capture different aspects of behavior in a deployed model that could end up being harmful: we have labelers evaluate whether an output is inappropriate in the context of a customer assistant, denigrates a protected class, or contains sexual or violent content. We also benchmark our model on datasets intended to measure bias and toxicity, such as RealToxicityPrompts ([6]) and CrowS-Pairs ([18]).
To summarize, we can divide our quantitative evaluations into two separate parts:
Evaluations on API distribution.
Our main metric is human preference ratings on a held-out set of prompts from the same source as our training distribution. When using prompts from the API for evaluation, we only select prompts by customers we haven't included in training. However, given that our training prompts are designed to be used with InstructGPT models, it's likely that they disadvantage the GPT-3 baselines. Thus, we also evaluate on prompts submitted to GPT-3 models on the API; these prompts are generally not in an 'instruction following' style, but are designed specifically for GPT-3. In both cases, for each model we calculate how often its outputs are preferred to a baseline policy; we choose our 175B SFT model as the baseline since its performance is near the middle of the pack. Additionally, we ask labelers to judge the overall quality of each response on a 1-7 Likert scale and collect a range of metadata for each model output (see Table 3).
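As a small sketch, the main win-rate metric on this distribution can be computed from head-to-head labeler judgments roughly as follows; a simple binomial standard error is shown for illustration, and the intervals reported in this paper may be computed differently.

```python
import math

def win_rate_vs_baseline(preferences):
    """preferences: list of booleans, True when a labeler preferred the model's
    output over the 175B SFT baseline output on a held-out prompt."""
    n = len(preferences)
    p = sum(preferences) / n
    stderr = math.sqrt(p * (1 - p) / n)   # rough binomial standard error
    return p, stderr
```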
Evaluations on public NLP datasets.
We evaluate on two types of public datasets: those that capture an aspect of language model safety, particularly truthfulness, toxicity, and bias, and those that capture zero-shot performance on traditional NLP tasks like question answering, reading comprehension, and summarization. We also conduct human evaluations of toxicity on the RealToxicityPrompts dataset ([6]). We are releasing samples from our models on all of the sampling-based NLP tasks.[^6]
[^6]: Accessible here: https://github.com/openai/following-instructions-human-feedback.

Section Summary: The results demonstrate that InstructGPT models are significantly preferred by human labelers over GPT-3 outputs when tested on real API prompts, with improvements stemming from techniques like few-shot prompting, supervised fine-tuning, and reinforcement learning, making the outputs more reliable, instruction-following, and less prone to hallucinations or errors. These models also generalize well to preferences from labelers not involved in training and outperform fine-tuned versions of GPT-3 on public datasets like FLAN and T0, which fail to capture the diversity of actual user tasks such as open-ended generation. On benchmarks like TruthfulQA, InstructGPT shows modest gains in producing truthful and informative responses compared to GPT-3.
In this section, we provide experimental evidence for our claims in Section 1, sorted into three parts: results on the API prompt distribution, results on public NLP datasets, and qualitative results.
Labelers significantly prefer InstructGPT outputs over outputs from GPT-3.
On our test set of prompts, our labelers significantly prefer InstructGPT outputs across model sizes. These results are shown in Figure 1. We find that GPT-3 outputs perform the worst, and one can obtain significant step-size improvements by using a well-crafted few-shot prompt (GPT-3 (prompted)), then by training on demonstrations using supervised learning (SFT), and finally by training on comparison data using PPO. Adding updates on the pretraining mix during PPO does not lead to large changes in labeler preference. To illustrate the magnitude of our gains: when compared directly, 175B InstructGPT outputs are preferred to GPT-3 outputs 85 $\pm$ 3% of the time, and preferred 71 $\pm$ 4% of the time to few-shot GPT-3.
We also found that our results do not change significantly when evaluated on prompts submitted to GPT-3 models on the API (see Figure 3), though our PPO-ptx models perform slightly worse at larger model sizes.

In Figure 4 we show that labelers also rate InstructGPT outputs favorably along several more concrete axes. Specifically, compared to GPT-3, InstructGPT outputs are more appropriate in the context of a customer assistant, more often follow explicit constraints defined in the instruction (e.g. "Write your answer in 2 paragraphs or less."), are less likely to fail to follow the correct instruction entirely, and make up facts ('hallucinate') less often in closed-domain tasks. These results suggest that InstructGPT models are more reliable and easier to control than GPT-3. We've found that our other metadata categories occur too infrequently in our API to obtain statistically significant differences between our models.
Our models generalize to the preferences of "held-out" labelers that did not produce any training data.
Held-out labelers have similar ranking preferences as workers who we used to produce training data (see Figure 3). In particular, according to held-out workers, all of our InstructGPT models still greatly outperform the GPT-3 baselines. Thus, our InstructGPT models aren't simply overfitting to the preferences of our training labelers.
We see further evidence of this from the generalization capabilities of our reward models. We ran an experiment where we split our labelers into 5 groups, and train 5 RMs (with 3 different seeds) using 5-fold cross validation (training on 4 of the groups, and evaluating on the held-out group). These RMs have an accuracy of 69.6 $\pm$ 0.9% on predicting the preferences of labelers in the held-out group, a small decrease from their 72.4 $\pm$ 0.4% accuracy on predicting the preferences of labelers in their training set.

Public NLP datasets are not reflective of how our language models are used.
In Figure 5, we also compare InstructGPT to our 175B GPT-3 baselines fine-tuned on the FLAN ([23]) and T0 ([24]) datasets (see Appendix C for details). We find that these models perform better than GPT-3, on par with GPT-3 with a well-chosen prompt, and worse than our SFT baseline. This indicates that these datasets are not sufficiently diverse to improve performance on our API prompt distribution. In a head to head comparison, our 175B InstructGPT model outputs were preferred over our FLAN model 78 $\pm$ 4% of the time and over our T0 model 79 $\pm$ 4% of the time. Likert scores for these models are shown in Figure 5.
We believe our InstructGPT model outperforms FLAN and T0 for two reasons. First, public NLP datasets are designed to capture tasks that are easy to evaluate with automatic metrics, such as classification, question answering, and to a certain extent summarization and translation. However, classification and QA are only a small part (about 18%) of what API customers use our language models for, whereas open-ended generation and brainstorming make up about 57% of our prompt dataset according to labelers (see Table 1). Second, it can be difficult for public NLP datasets to obtain a very high diversity of inputs (at least, on the kinds of inputs that real-world users would be interested in using). Of course, tasks found in NLP datasets do represent a kind of instruction that we would like language models to be able to solve, so the broadest type of instruction-following model would combine both types of datasets.
InstructGPT models show improvements in truthfulness over GPT-3.
As measured by human evaluations on the TruthfulQA dataset, our PPO models show small but significant improvements in generating truthful and informative outputs compared to GPT-3 (see Figure 6). This behavior is the default: our models do not have to be specifically instructed to tell the truth to exhibit improved truthfulness. Interestingly, the exception is our 1.3B PPO-ptx model, which performs slightly worse than a GPT-3 model of the same size. When evaluated only on prompts that were not adversarially selected against GPT-3, our PPO models are still significantly more truthful and informative than GPT-3 (although the absolute improvement decreases by a couple of percentage points).

Following [75], we also give a helpful "Instruction+QA" prompt that instructs the model to respond with "I have no comment" when it is not certain of the correct answer. In this case, our PPO models err on the side of being truthful and uninformative rather than confidently saying a falsehood; the baseline GPT-3 models aren't as good at this.
Our improvements in truthfulness are also evidenced by the fact that our PPO models hallucinate (i.e. fabricate information) less often on closed-domain tasks from our API distribution, which we've shown in Figure 4.
InstructGPT shows small improvements in toxicity over GPT-3, but not bias.
We first evaluate our models on the RealToxicityPrompts dataset ([6]). We do this in two ways: we run model samples through the Perspective API to obtain automatic toxicity scores, which is the standard evaluation procedure for this dataset, and we also send these samples to labelers to obtain ratings on absolute toxicity, toxicity relative to the prompt, continuity, and overall output preference. We sample prompts from this dataset uniformly according to prompt toxicity to better assess how our models perform with high input toxicity (see Figure 39 in Appendix E); this differs from the standard prompt sampling for this dataset, and thus our absolute toxicity numbers are inflated.
Our results are in Figure 7. We find that, when instructed to produce a safe and respectful output ("respectful prompt"), InstructGPT models generate less toxic outputs than those from GPT-3 according to the Perspective API. This advantage disappears when the respectful prompt is removed ("no prompt"). Interestingly, when explicitly prompted to produce a toxic output, InstructGPT outputs are much more toxic than those from GPT-3 (see Figure 39).



These results are confirmed in our human evaluations: InstructGPT is less toxic than GPT-3 in the "respectful prompt" setting, but performs similarly in the "no prompt" setting. We provide extended results in Appendix E. To summarize: all of our models are rated as less toxic than expected given the prompt (they get a negative score on a scale from -1 to 1, where 0 is 'about as toxic as expected'). Our SFT baseline is the least toxic out of all of our models, but also has the lowest continuity and is the least preferred in our rankings, which could indicate that the model generates very short or degenerate responses.
To evaluate the model's propensity to generate biased speech (see Appendix E), we also evaluated InstructGPT on modified versions of the Winogender ([17]) and CrowS-Pairs ([18]) datasets. These datasets consist of pairs of sentences which can highlight potential bias. We calculate the relative probabilities of producing the sentences in each pair and the entropy (in bits) of the associated binary probability distributions. Perfectly unbiased models will have no preference between the sentences in each pair and will therefore have maximum entropy. By this metric, our models are not less biased than GPT-3. The PPO-ptx model shows similar bias to GPT-3, but when instructed to act respectfully it exhibits lower entropy and thus higher bias. The pattern of the bias is not clear; it appears that the instructed models are more certain of their outputs regardless of whether or not their outputs exhibit stereotypical behavior.
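The entropy metric above can be computed from the model's total log-probability of each sentence in a pair, roughly as follows; the function name and inputs are illustrative.

```python
import math

def pair_entropy_bits(logprob_a, logprob_b):
    """Entropy (in bits) of the binary distribution over a sentence pair, given the
    model's total log-probabilities of the two sentences. 1.0 bit corresponds to no
    preference between the sentences (maximum entropy); lower entropy means the
    model is more certain of one sentence, which this metric reads as more biased."""
    # Relative probability of sentence A within the pair (softmax over the two).
    p_a = 1.0 / (1.0 + math.exp(logprob_b - logprob_a))
    entropy = 0.0
    for p in (p_a, 1.0 - p_a):
        if p > 0.0:
            entropy -= p * math.log2(p)
    return entropy
```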
We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure.
By default, when we train a PPO model on our API distribution, it suffers from an "alignment tax", as its performance on several public NLP datasets decreases. We want an alignment procedure that avoids an alignment tax, because such a tax incentivizes the use of models that are unaligned but more capable on these tasks.
In Figure 29 we show that adding pretraining updates to our PPO fine-tuning (PPO-ptx) mitigates these performance regressions on all datasets, and even surpasses GPT-3 on HellaSwag. The performance of the PPO-ptx model still lags behind GPT-3 on DROP, SQuADv2, and translation; more work is needed to study and further eliminate these performance regressions.
Mixing in pretraining updates performs better than the simpler solution of increasing the KL coefficient. In Figure 33, we show that there is a value of the pretraining mix coefficient that both reverses the performance regressions on SQuADv2 and DROP (the datasets we used for testing), and has minimal reductions in validation reward. In contrast, increasing the KL coefficient (Figure 34) leads to significant decreases in validation reward and never fully recovers on DROP and SQuAD. Changing the KL model from the PPO init to GPT-3 gives similar results.
InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution.
In particular, we find that InstructGPT shows the ability to follow instructions in non-English languages, and to perform summarization and question-answering for code. This is interesting because non-English languages and code form a tiny minority of our fine-tuning data,[^8] and it suggests that, in some cases, alignment methods could generalize to producing the desired behavior on inputs that humans did not directly supervise.
[^8]: We generally instruct our labelers to skip evaluations where they are missing the required expertise, though sometimes labelers use a translation service to evaluate simple instructions in languages that they do not speak.
We do not track these behaviors quantitatively, but we show some qualitative examples in Figure 8. Our 175B PPO-ptx model is able to reliably answer questions about code, and can also follow instructions in other languages; however, we notice that it often produces an output in English even when the instruction is in another language. In comparison, we find that GPT-3 can perform these tasks but requires more careful prompting, and rarely follows instructions in these domains.
InstructGPT still makes simple mistakes.
In interacting with our 175B PPO-ptx model, we have noticed it can still make simple mistakes, despite its strong performance on many different language tasks. To give a few examples: (1) when given an instruction with a false premise, the model sometimes incorrectly assumes the premise is true, (2) the model can overly hedge; when given a simple question, it can sometimes say that there is no one answer to the question and give multiple possible answers, even when there is one fairly clear answer from the context, and (3) the model's performance degrades when instructions contain multiple explicit constraints (e.g. "list 10 movies made in the 1930's set in France") or when constraints can be challenging for language models (e.g. writing a summary in a specified number of sentences).
We show some examples of these behaviors in Figure 9. We suspect that behavior (2) emerges partly because we instruct labelers to reward epistemic humility; thus, they may tend to reward outputs that hedge, and this gets picked up by our reward model. We suspect that behavior (1) occurs because there are few prompts in the training set that assume false premises, and our models don't generalize well to these examples. We believe both these behaviors could be dramatically reduced with adversarial data collection ([57]).
Section Summary: This discussion explores how the research advances AI alignment by iteratively improving current large language models with techniques like RLHF, which prove more cost-effective and less taxing on performance than simply scaling up models, while also generalizing well to unsupervised tasks. It highlights lessons such as the low cost of alignment relative to pretraining, mitigation of fine-tuning drawbacks, and real-world validation of abstract methods. The section also examines whose preferences shape the alignment—primarily those of hired labelers from specific regions with moderate agreement among them, guided by researchers at OpenAI—raising questions about influences like instructions and demographics on the final model behavior.
This research is part of our broader research program to align AI systems with human intentions ([14, 26, 15]). Even though this work focuses on our current language model systems, we seek general and scalable methods that work for future AI systems ([12]). The systems we work with here are still fairly limited, but they are among the largest language models today and we apply them on a wide range of language tasks, including classification, summarization, question-answering, creative writing, dialogue, and others.
Our approach to alignment research in this work is iterative: we are improving the alignment of current AI systems instead of focusing abstractly on aligning AI systems that don't yet exist. A disadvantage of this approach is that we are not directly facing alignment problems that occur only when aligning superhuman systems ([76]). However, our approach does provide us with a clear empirical feedback loop of what works and what does not. We believe that this feedback loop is essential to refine our alignment techniques, and it forces us to keep pace with progress in machine learning. Moreover, the alignment technique we use here, RLHF, is an important building block in several proposals to align superhuman systems ([12, 77, 78]). For example, RLHF was a central method in recent work on summarizing books, a task that exhibits some of the difficulties of aligning superhuman AI systems as it is difficult for humans to evaluate directly ([28]).
From this work, we can draw lessons for alignment research more generally:

- The cost of increasing model alignment is modest relative to pretraining.
- We have seen some evidence that InstructGPT generalizes instruction following to settings we do not directly supervise.
- We were able to mitigate most of the performance degradations introduced by our fine-tuning.
- We have validated alignment techniques from research in the real world.
[^1]: Note that while fine-tuning models using human data is common practice when deploying ML systems, the purpose of these efforts is to obtain a model that performs well on a company's specific use case, rather than advancing the alignment of general-purpose ML models.
When aligning language models with human intentions, their end behavior is a function of the underlying model (and its training data), the fine-tuning data, and the alignment method used. In this section, we describe a number of factors that influence the fine-tuning data specifically, to ultimately determine what and who we're aligning to. We then consider areas for improvement before a larger discussion of the limitations of our work in Section 5.3.
The literature often frames alignment using such terms as "human preferences" or "human values." In this work, we have aligned to a set of labelers' preferences that were influenced, among other things, by the instructions they were given, the context in which they received them (as a paid job), and who they received them from. Some crucial caveats apply:
First, we are aligning to demonstrations and preferences provided by our training labelers, who directly produce the data that we use to fine-tune our models. We describe our labeler hiring process and demographics in Appendix B; in general, they are mostly English-speaking people living in the United States or Southeast Asia hired via Upwork or Scale AI. They disagree with each other on many examples; we found the inter-labeler agreement to be about 73%.
Second, we are aligning to our preferences, as the researchers designing this study (and thus by proxy to our broader research organization, OpenAI): we write the labeling instructions that labelers use as a guide when writing demonstrations and choosing their preferred output, and we answer their questions about edge cases in a shared chat room. More study is needed on the exact effect of different instruction sets and interface designs on the data collected from labelers and its ultimate effect on model behavior.
Third, our training data is determined by prompts sent by OpenAI customers to models on the OpenAI API Playground, and thus we are implicitly aligning to what customers think is valuable and, in some cases, what their end-users think is valuable to currently use the API for. Customers and their end users may disagree or customers may not be optimizing for end users' well-being; for example, a customer may want a model that maximizes the amount of time a user spends on their platform, which is not necessarily what end-users want. In practice, our labelers don't have visibility into the contexts in which a given prompt or completion will be seen.
Fourth, OpenAI's customers are not representative of all potential or current users of language models—let alone of all individuals and groups impacted by language model use. For most of the duration of this project, users of the OpenAI API were selected off of a waitlist. The initial seeds for this waitlist were OpenAI employees, biasing the ultimate group toward our own networks.
Stepping back, there are many difficulties in designing an alignment process that is fair, transparent, and has suitable accountability mechanisms in place. The goal of this paper is to demonstrate that this alignment technique can align to a specific human reference group for a specific application. We are not claiming that researchers, the labelers we hired, or our API customers are the right source of preferences. There are many stakeholders to consider—the organization training the model, the customers using the model to develop products, the end users of these products, and the broader population who may be directly or indirectly affected. It is not only a matter of making the alignment process more participatory; it is impossible to train a system that is aligned to everyone's preferences at once, or where everyone would endorse the tradeoffs.
One path forward could be to train models that can be conditioned on the preferences of certain groups, or that can be easily fine-tuned or prompted to represent different groups. Different models can then be deployed and used by groups who endorse different values. However, these models might still end up affecting broader society and there are a lot of difficult decisions to be made relating to whose preferences to condition on, and how to ensure that all groups can be represented and can opt out of processes that may be harmful.
Methodology.
The behavior of our InstructGPT models is determined in part by the human feedback obtained from our contractors. Some of the labeling tasks rely on value judgments that may be impacted by the identity of our contractors, their beliefs, cultural backgrounds, and personal history. We hired about 40 contractors, guided by their performance on a screening test meant to judge how well they could identify and respond to sensitive prompts, and their agreement rate with researchers on a labeling task with detailed instructions (see Appendix B). We kept our team of contractors small because this facilitates high-bandwidth communication with a smaller set of contractors who are doing the task full-time. However, this group is clearly not representative of the full spectrum of people who will use and be affected by our deployed models. As a simple example, our labelers are primarily English-speaking and our data consists almost entirely of English instructions.
There are also many ways in which we could improve our data collection set-up. For instance, most comparisons are only labeled by 1 contractor for cost reasons. Having examples labeled multiple times could help identify areas where our contractors disagree, and thus where a single model is unlikely to align to all of them. In cases of disagreement, aligning to the average labeler preference may not be desirable. For example, when generating text that disproportionately affects a minority group, we may want the preferences of labelers belonging to that group to be weighted more heavily.
Models.
Our models are neither fully aligned nor fully safe; they still generate toxic or biased outputs, make up facts, and generate sexual and violent content without explicit prompting. They can also fail to generate reasonable outputs on some inputs; we show some examples of this in Figure 9.
Perhaps the greatest limitation of our models is that, in most cases, they follow the user's instruction, even if that could lead to harm in the real world. For example, when given a prompt instructing the models to be maximally biased, InstructGPT generates more toxic outputs than equivalently-sized GPT-3 models. We discuss potential mitigations in the following sections.
This work is a first step towards using alignment techniques to fine-tune language models to follow a wide range of instructions. There are many open questions to explore to further align language model behavior with what people actually want them to do.
Many methods could be tried to further decrease the models' propensity to generate toxic, biased, or otherwise harmful outputs. For example, one could use an adversarial set-up where labelers find the worst-case behaviors of the model, which are then labeled and added to the dataset ([57]). One could also combine our method with ways of filtering the pretraining data ([63]), either for training the initial pretrained models, or for the data we use for our pretraining mix approach. Similarly, one could combine our approach with methods that improve models' truthfulness, such as WebGPT ([82]).
In this work, if the user requests a potentially harmful or dishonest response, we allow our model to generate these outputs. Training our model to be harmless despite user instructions is important, but is also difficult because whether an output is harmful depends on the context in which it's deployed; for example, it may be beneficial to use language models to generate toxic outputs as part of a data augmentation pipeline. Our techniques can also be applied to making models refuse certain user instructions, and we plan to explore this in subsequent iterations of this research.
Getting models to do what we want is directly related to the steerability and controllability literature ([71, 72]). A promising future path is combining RLHF with other methods of steerability, for example using control codes ([64]), or modifying the sampling procedure at inference time using a smaller model ([71]).
While we mainly focus on RLHF, there are many other algorithms that could be used to train policies on our demonstration and comparison data to get even better results. For example, one could explore expert iteration ([83, 84]), or simpler behavior cloning methods that use a subset of the comparison data. One could also try constrained optimization approaches ([85]) that maximize the score from a reward model conditioned on generating a small number of harmful behaviors.
Comparisons are also not necessarily the most efficient way of providing an alignment signal. For example, we could have labelers edit model responses to make them better, or generate critiques of model responses in natural language. There is also a vast space of options for designing interfaces for labelers to provide feedback to language models; this is an interesting human-computer interaction problem.
Our proposal for mitigating the alignment tax, by incorporating pretraining data into RLHF fine-tuning, does not completely mitigate performance regressions, and may make certain undesirable behaviors more likely for some tasks (if these behaviors are present in the pretraining data). This is an interesting area for further research. Another modification that would likely improve our method is to filter the pretraining mix data for toxic content ([63]), or augment this data with synthetic instructions.
As discussed in detail in [40], there are subtle differences between aligning to instructions, intentions, revealed preferences, ideal preferences, interests, and values. [40] advocate for a principle-based approach to alignment: in other words, for identifying "fair principles for alignment that receive reflective endorsement despite widespread variation in people's moral beliefs." In our paper we align to the inferred user intention for simplicity, but more research is required in this area. Indeed, one of the biggest open questions is how to design an alignment process that is transparent, that meaningfully represents the people impacted by the technology, and that synthesizes peoples' values in a way that achieves broad consensus amongst many groups. We discuss some related considerations in Section 5.2.
This work is motivated by our aim to increase the positive impact of large language models by training them to do what a given set of humans want them to do. By default, language models optimize the next word prediction objective, which is only a proxy for what we want these models to do. Our results indicate that our techniques hold promise for making language models more helpful, truthful, and harmless. In the longer term, alignment failures could lead to more severe consequences, particularly if these models are deployed in safety-critical situations. We expect that as model scaling continues, greater care has to be taken to ensure that they are aligned with human intentions ([76]).
However, making language models better at following user intentions also makes them easier to misuse. It may be easier to use these models to generate convincing misinformation, or hateful or abusive content.
Alignment techniques are not a panacea for resolving safety issues associated with large language models; rather, they should be used as one tool in a broader safety ecosystem. Aside from intentional misuse, there are many domains where large language models should be deployed only with great care, or not at all. Examples include high-stakes domains such as medical diagnoses, classifying people based on protected characteristics, determining eligibility for credit, employment, or housing, generating political advertisements, and law enforcement. If these models are open-sourced, it becomes challenging to limit harmful applications in these and other domains without proper regulation. On the other hand, if large language model access is restricted to a few organizations with the resources required to train them, this excludes most people from access to cutting-edge ML technology. Another option is for an organization to own the end-to-end infrastructure of model deployment, and make it accessible via an API. This allows for the implementation of safety protocols like use case restriction (only allowing the model to be used for certain applications), monitoring for misuse and revoking access to those who misuse the system, and rate limiting to prevent the generation of large-scale misinformation. However, this can come at the cost of reduced transparency and increased centralization of power because it requires the API provider to make decisions on where to draw the line on each of these questions.
Finally, as discussed in Section 5.2, the question of who these models are aligned to is extremely important, and will significantly affect whether the net impact of these models is positive or negative.
Section Summary: The acknowledgements section expresses gratitude to numerous OpenAI colleagues and external experts for their insightful discussions and feedback that guided the project's research and refined its approach. It also thanks contributors who provided detailed comments on the paper, highlighted issues with evaluation metrics, and supported the technical infrastructure for training and deploying models, including diagram design and release communications. Finally, it recognizes the essential work of a large team of labelers whose efforts made the project possible.
First, we would like to thank Lilian Weng, Jason Kwon, Boris Power, Che Chang, Josh Achiam, Steven Adler, Gretchen Krueger, Miles Brundage, Tyna Eloundou, Gillian Hadfield, Irene Soliaman, Christy Dennison, Daniel Ziegler, William Saunders, Beth Barnes, Cathy Yeh, Nick Cammaratta, Jonathan Ward, Matt Knight, Pranav Shyam, Alec Radford, and others at OpenAI for discussions throughout the course of the project that helped shape our research direction. We thank Brian Green, Irina Raicu, Subbu Vincent, Varoon Mathur, Kate Crawford, Su Lin Blodgett, Bertie Vidgen, and Paul Röttger for discussions and feedback on our approach. Finally, we thank Sam Bowman, Matthew Rahtz, Ben Mann, Liam Fedus, Helen Ngo, Josh Achiam, Leo Gao, Jared Kaplan, Cathy Yeh, Miles Brundage, Gillian Hadfield, Cooper Raterink, Gretchen Krueger, Tyna Eloundou, Rafal Jakubanis, and Steven Adler for providing feedback on this paper. We'd also like to thank Owain Evans and Stephanie Lin for pointing out the fact that the automatic TruthfulQA metrics were overstating the gains of our PPO models.
Thanks to those who contributed in various ways to the infrastructure used to train and deploy our models, including: Daniel Ziegler, William Saunders, Brooke Chan, Dave Cummings, Chris Hesse, Shantanu Jain, Michael Petrov, Greg Brockman, Felipe Such, Alethea Power, and the entire OpenAI supercomputing team. We'd also like to thank Suchir Balaji for help with recalibration, to Alper Ercetin and Justin Wang for designing the main diagram in this paper, and to the OpenAI Comms team for helping with the release, including: Steve Dowling, Hannah Wong, Natalie Summers, and Elie Georges.
Finally, we want to thank our labelers, without whom this work would not have been possible: Meave Fryer, Sara Tirmizi, James Carroll, Jian Ouyang, Michelle Brothers, Conor Agnew, Joe Kwon, John Morton, Emma Duncan, Delia Randolph, Kaylee Weeks, Alexej Savreux, Siam Ahsan, Rashed Sorwar, Atresha Singh, Muhaiminul Rukshat, Caroline Oliveira, Juan Pablo Castaño Rendón, Atqiya Abida Anjum, Tinashe Mapolisa, Celeste Fejzo, Caio Oleskovicz, Salahuddin Ahmed, Elena Green, Ben Harmelin, Vladan Djordjevic, Victoria Ebbets, Melissa Mejia, Emill Jayson Caypuno, Rachelle Froyalde, Russell M. Bernandez, Jennifer Brillo, Jacob Bryan, Carla Rodriguez, Evgeniya Rabinovich, Morris Stuttard, Rachelle Froyalde, Roxanne Addison, Sarah Nogly, Chait Singh.
Section Summary: To develop the first InstructGPT model, researchers had contractors create initial prompts from scratch, including simple tasks, few-shot examples with multiple queries and responses, and prompts based on anonymized API use cases, which were used for supervised training and deployed in beta in early 2021. Later, prompts were gathered from users interacting with the model in the OpenAI API Playground, where consent was obtained via alerts, and the data was filtered to remove personal information, deduplicated for variety, and split into training, validation, and test sets by organization. These prompts were grouped into categories like generation, question-answering, brainstorming, classification, extraction, and rewriting, with examples illustrating real-world applications such as rating sarcasm in text or extracting data from tables.
We first give slightly more details on our prompt bootstrapping process. As previously mentioned, for the majority of the project, we obtained prompts directly from external users of the instruct beta models in the OpenAI API. However, this strategy only works once you have a model that accepts instruction-like prompts. In order to train the very first such model, we asked contractors to write prompts themselves. We asked labelers to write three kinds of prompts:

- Plain: simple tasks written from scratch by the labelers.
- Few-shot: an instruction, along with multiple query/response pairs for that instruction.
- Use-case-based: prompts corresponding to use cases stated in applications to the OpenAI API.
In order to preserve the anonymity of the application information, we had a separate labeler create vague high-level tasks based on looking at a list of applications, modifying the task descriptions to eliminate any information that was specific to a given application. This data was used to train the first InstructGPT model via supervised learning, which was deployed in beta in the API in early 2021.
For API prompts, we use prompts submitted by users to the aforementioned earlier version of the InstructGPT model on the OpenAI API Playground. Throughout the paper, we only use data from the Playground, rather than customers using our model in production, as it was easier to get informed consent: every time a user switched to an InstructGPT model, an alert message would pop up stating that prompts submitted to these models could be used to train future versions of our models. We also communicated this in a message on the developer Slack channel upon launching the beta of the InstructGPT models. We filter out prompts from the training split containing personally identifiable information (PII).
To ensure a diversity of use cases, we heuristically deduplicate prompts by checking for prompts that share a long common prefix, and we limit the number of prompts to roughly 200 per organization. In addition, we create train, validation, and test splits based on organization IDs, so that, for example, the validation set contains different use cases than the training set.
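As an illustration, the deduplication and per-organization cap described above could be implemented roughly as follows; the prefix length, the exact cap, and the function name are illustrative assumptions rather than details from the paper.

```python
from collections import defaultdict

def dedup_and_cap(prompts, prefix_len=200, max_per_org=200):
    """Heuristically deduplicate prompts that share a long common prefix,
    then cap the number of prompts kept per organization.

    `prompts` is a list of (org_id, text) pairs; `prefix_len` and
    `max_per_org` are illustrative values, not the ones used in the paper.
    """
    seen_prefixes = set()
    per_org_count = defaultdict(int)
    kept = []
    for org_id, text in prompts:
        prefix = text[:prefix_len]
        if prefix in seen_prefixes:
            continue  # likely a near-duplicate of an earlier prompt
        if per_org_count[org_id] >= max_per_org:
            continue  # this organization already contributed enough prompts
        seen_prefixes.add(prefix)
        per_org_count[org_id] += 1
        kept.append((org_id, text))
    return kept
```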
We conceptualized API requests as belonging to one of ten use cases: generation, open QA, closed QA, brainstorming, chat, rewriting, summarization, classification, extraction, or other. Below, we show fictional but realistic prompts from a variety of use cases:
\begin{longtable}{p{.2\textwidth} p{.8\textwidth}}
\toprule
Use Case & Example \\ \midrule
\endfirsthead
\bottomrule
\endlastfoot
brainstorming & List five ideas for how to regain enthusiasm for my career \\ \midrule
brainstorming & What are some key points I should know when studying Ancient Greece? \\ \midrule
brainstorming & What are 4 questions a user might have after reading the instruction manual for a trash compactor?{\newline}{\newline}\{user manual\}{\newline}{\newline}1. \\ \midrule
brainstorming & What are 10 science fiction books I should read next? \\ \midrule
classification & Take the following text and rate, on a scale from 1-10, how sarcastic the person is being (1 = not at all, 10 = extremely sarcastic). Also give an explanation{\newline} {\newline}{\newline} \{text\}{\newline} {\newline}{\newline} Rating: \\ \midrule
classification & This is a list of tweets and the sentiment categories they fall into.{\newline} {\newline}{\newline} Tweet: \{tweet\_content1\}{\newline} Sentiment: \{sentiment1\}{\newline} {\newline}{\newline} Tweet: \{tweet\_content2\}{\newline} Sentiment: \{sentiment2\} \\ \midrule
classification & \{java code\}{\newline} {\newline}{\newline} What language is the code above written in? \\ \midrule
classification & You are a very serious professor, and you check papers to see if they contain missing citations. Given the text, say whether it is missing an important citation (YES/NO) and which sentence(s) require citing.{\newline}{\newline}\{text of paper\} \\ \midrule
extract & Extract all course titles from the table below:{\newline} {\newline}{\newline} | Title | Lecturer | Room |{\newline} | Calculus 101 | Smith | Hall B |{\newline} | Art History | Paz | Hall A | \\ \midrule
extract & Extract all place names from the article below:{\newline} {\newline}{\newline} \{news article\} \\ \midrule
extract & Given the following list of movie titles, write down any names of cities in the titles.{\newline}{\newline}\{movie titles\} \\ \midrule
generation & Write a creative ad for the following product to run on Facebook aimed at parents:{\newline}{\newline}Product: \{product description\} \\ \midrule
generation & Write a short story where a brown bear to the beach, makes friends with a seal, and then return home. \\ \midrule
generation & Here's a message to me:{\newline} ---{\newline} \{email\}{\newline} ---{\newline} {\newline}{\newline} Here are some bullet points for a reply:{\newline} ---{\newline} \{message\}{\newline} ---{\newline} {\newline}{\newline} Write a detailed reply \\ \midrule
generation & This is an article about how to write a cover letter when applying for jobs:{\newline} ---{\newline} It's important to spend some time \\ \midrule
generation & write rap lyrics on the topics mentioned in this news article:{\newline} {\newline}{\newline} ----{\newline} \{article\}{\newline} ---- \\ \midrule
rewrite & This is the summary of a Broadway play:{\newline} """{\newline} \{summary\}{\newline} """{\newline} This is the outline of the commercial for that play:{\newline} """ \\ \midrule
rewrite & Translate this sentence to Spanish:{\newline} {\newline}{\newline} <English sentence> \\ \midrule
rewrite & Create turn-by-turn navigation given this text:{\newline} {\newline}{\newline} Go west on \{road1\} unto you hit \{road2\}. then take it east to \{road3\}. Desination will be a red barn on the right{\newline} {\newline}{\newline} 1. \\ \midrule
rewrite & Rewrite the following text to be more light-hearted:{\newline} {\newline}{\newline} ---{\newline} \{very formal text\}{\newline} --- \\ \midrule
chat & The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.{\newline}{\newline}Human: Hello, who are you?{\newline}AI: I am an AI created by OpenAI. How can I help you today?{\newline}Human: I'd like to cancel my subscription.{\newline}AI: \\ \midrule
chat & Marv is a chatbot that reluctantly answers questions with sarcastic responses:{\newline}{\newline}You: How many pounds are in a kilogram?{\newline}Marv: This again? There are 2.2 pounds in a kilogram. Please make a note of this.{\newline}You: What does HTML stand for?{\newline}Marv: Was Google too busy? Hypertext Markup Language. The T is for try to ask better questions in the future.{\newline}You: When did the first airplane fly?{\newline}Marv: \\ \midrule
chat & This is a conversation with an enlightened Buddha. Every response is full of wisdom and love.{\newline}{\newline}Me: How can I achieve greater peace and equanimity?{\newline}Buddha: \\ \midrule
closed qa & Help me answer questions about the following short story:{\newline}{\newline}\{story\}{\newline}{\newline}What is the moral of the story? \\ \midrule
closed qa & Answer the following question:{\newline}What shape is the earth?{\newline}{\newline}A) A circle{\newline}B) A sphere{\newline}C) An ellipse{\newline}D) A plane \\ \midrule
closed qa & Tell me how hydrogen and helium are different, using the following facts:{\newline}{\newline}\{list of facts\} \\ \midrule
open qa & I am a highly intelligent question answering bot. If you ask me a question that is rooted in truth, I will give you the answer. If you ask me a question that is nonsense, trickery, or has no clear answer, I will respond with "Unknown".{\newline}{\newline}Q: What is human life expectancy in the United States?{\newline}A: Human life expectancy in the United States is 78 years.{\newline}{\newline}Q: Who was president of the United States in 1955?{\newline}A: \\ \midrule
open qa & Who built the statue of liberty? \\ \midrule
open qa & How do you take the derivative of the sin function? \\ \midrule
open qa & who are the indiginous people of New Zealand? \\ \midrule
summarization & Summarize this for a second-grade student:{\newline}{\newline}\{text\} \\ \midrule
summarization & \{news article\}{\newline}{\newline}Tl;dr: \\ \midrule
summarization & \{chat transcript\}{\newline}{\newline}Summarize the above conversation between a customer and customer assistant. Make sure to state any complaints that the customer has. \\ \midrule
other & start with where \\ \midrule
other & Look up "cowboy" on Google and give me the results. \\ \midrule
other & Johnathan Silver goes to the market every day, and brings back a \\ \midrule
\end{longtable}
Next, we list some schematic examples of API requests for each use-case category, for prompts submitted to GPT-3 models. These are generally less 'instruction-style', and contain more explicit prompting. Note that there are some prompts where the user intent is unclear.
\begin{longtable}{p{.2\textwidth} p{.8\textwidth}}
\toprule
Use Case & Example \\ \midrule
\endfirsthead
\bottomrule
\endlastfoot
brainstorming & indie movie ideas:{\newline}- A guy travels to South America to become a shaman.{\newline}- A documentary about the world of juggling. \\ \midrule
brainstorming & Baby name ideas for a boy:{\newline}1. Alfred{\newline}2. Theo{\newline}3. \\ \midrule
brainstorming & Tell me a list of topics related to:{\newline}- interior design{\newline}- sustainable ecosystems{\newline}- fake plants \\ \midrule
brainstorming & Name some rare gems \\ \midrule
classification & This is a tweet sentiment classifier.{\newline}\{tweet\}{\newline}Sentiment: negative{\newline}==={\newline}\{tweet\}{\newline}Sentiment: neutral{\newline}==={\newline}\{tweet\}{\newline}Sentiment: \\ \midrule
classification & The following is a list of products and the kind of product they are.{\newline}Product: \{product\}. Type: \{type\}{\newline}Product: \{product\}. Type: \{type\}{\newline}Product: \{product\}. Type: \\ \midrule
classification & The following is a list of companies and the categories they fall into:{\newline}{\newline}Apple, Facebook, Fedex{\newline}{\newline}Apple{\newline}Category: Technology{\newline}{\newline}Facebook{\newline}Category: Social Media{\newline}{\newline}Fedex{\newline}Category: \\ \midrule
extract & Text: \{text\}{\newline}Keywords: \\ \midrule
generation & "Hey, what are you doing there?" Casey was startled. He hadn't even begun to \\ \midrule
generation & The name of the next Star Wars movie is \\ \midrule
generation & This is the research for an essay:{\newline}==={\newline}\{description of research\}{\newline}==={\newline}Write a high school essay on these topics:{\newline}=== \\ \midrule
generation & Write an outline for an essay about John von Neumann and his contributions to computing:{\newline}I. Introduction, his life and background{\newline}A: His early life{\newline}B: \\ \midrule
rewrite & Covert my resume into a profile overview.{\newline}\{resume\}{\newline}Profile overview: \\ \midrule
rewrite & Rephrase this for me: "I can't seem to find out how to work this darn thing."{\newline}Alternate phrasing: " \\ \midrule
rewrite & Original: She no go to sleep.{\newline}Standard American English: She didn't go to sleep{\newline}{\newline}Original: It real bad for I to make do of this.{\newline}Standard American English: \\ \midrule
chat & The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.{\newline}{\newline}Human: Hello, who are you?{\newline}AI: I am an AI created by OpenAI. How can I help you today?{\newline}Human: I'm feeling kind of down today.{\newline}AI: \\ \midrule
chat & This is a conversation with Steven. Steven likes to watch Netflix and hasn't left his home in 2 weeks.{\newline}John: Hey man what's up?{\newline}Steven: Exactly the same thing as yesterday. you know.{\newline}John: So we're going to go see a movie on Thursday, want to come?{\newline}Steven: Ummmm don't think so.... \\ \midrule
closed qa & When you drop a heavy stone from a tree, what happens? {\newline}A. The stone falls to the ground.{\newline}B: The stone stays in the tree.{\newline}C: The stone floats.{\newline}D: Nothing happens.{\newline}{\newline}Answer: \\ \midrule
closed qa & Text: {\newline}\{article describing what yoga mats to buy\}{\newline}{\newline}Question: What are the things I should consider when buying a yoga mat?{\newline}Answer: \\ \midrule
open qa & Q: Who is Batman?{\newline}A: Batman is a fictional comic book character.{\newline}{\newline}Q: What is torsalplexity?{\newline}A: ?{\newline}{\newline}Q: What is Devz9?{\newline}A: ?{\newline}{\newline}Q: Who is George Lucas?{\newline}A: George Lucas is American film director and producer famous for creating Star Wars.{\newline}{\newline}Q: What is the capital of California?{\newline}A: \\ \midrule
open qa & Who was the best human who ever lived? \\ \midrule
open qa & Q: Who is Leonardo da Vinci?{\newline}A: \\ \midrule
summarization & My second grader asked me what this passage means.{\newline}"""{\newline}\{text\}{\newline}"""{\newline}I rephrased it for him in plain terms that a second grader could understand:{\newline}""" \\ \midrule
summarization & """{\newline}\{text\}{\newline}"""{\newline}I summarized the above as: \\ \midrule
other & She said, and I quote{\newline}AI: \\ \midrule
other & - I like to play Call of Duty{\newline}- I like to play Call of Duty{\newline}- I like to play Call of Duty{\newline}- I like to play Call of Duty \\ \midrule
\end{longtable}
In Table 6, we report the sizes of datasets used to train and validate the SFT, RM, and RL models, in addition to whether the prompts were written by our labeling contractors or came from our API.
Table 6: Dataset sizes, in terms of number of prompts.
For SFT, note that we have many more labeler-written prompts than customer prompts—this is because, at the start of the project, we had labelers write instructions with a user interface that asked them to give an overarching template instruction as well as few-shot examples for that instruction. We synthetically constructed multiple SFT datapoints from the same instruction by sampling different sets of few-shot examples.
For the RM, recall that for every prompt, we collected rankings for $K$ outputs (ranging from 4 to 9) and trained the model on all ${K \choose 2}$ comparisons, so the number of ranked pairs we trained the model on is an order of magnitude larger than the number of prompts.
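For concreteness, here is a minimal sketch of how one prompt's ranked completions expand into pairwise comparisons; the function name is ours, not from the paper.

```python
from itertools import combinations

def comparison_pairs(ranked_completions):
    """Given one prompt's completions ordered from best to worst (ties dropped),
    yield (preferred, dispreferred) pairs -- K choose 2 pairs per prompt."""
    for better, worse in combinations(ranked_completions, 2):
        yield better, worse

# e.g. K = 4 ranked completions yield 6 comparison pairs:
assert len(list(comparison_pairs(["a", "b", "c", "d"]))) == 6
```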
Table 7: Dataset annotations
:Table 8: Average prompts per customer
| Model | Split | Prompts per customer |
|---|---|---|
| SFT | train | 1.65 |
| SFT | valid | 1.87 |
| RM | train | 5.35 |
| RM | valid | 27.96 |
| PPO | train | 6.01 |
| PPO | valid | 31.55 |
| – | test | 1.81 |
:Table 9: Prompt lengths by dataset
| Model | Split | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|---|
| SFT | train | 12725 | 408 | 433 | 1 | 37 | 283 | 632 | 2048 |
| SFT | valid | 1653 | 401 | 433 | 4 | 41 | 234 | 631 | 2048 |
| RM | train | 33207 | 199 | 334 | 1 | 20 | 64 | 203 | 2032 |
| RM | valid | 17887 | 209 | 327 | 1 | 26 | 77 | 229 | 2039 |
| PPO | train | 31144 | 166 | 278 | 2 | 19 | 62 | 179 | 2044 |
| PPO | valid | 16185 | 186 | 292 | 1 | 24 | 71 | 213 | 2039 |
| – | test | 3196 | 115 | 194 | 1 | 17 | 49 | 127 | 1836 |
:Table 10: Prompt lengths by category
| Category | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|
| Brainstorming | 5245 | 83 | 149 | 4 | 17 | 36 | 85 | 1795 |
| Chat | 3911 | 386 | 376 | 1 | 119 | 240 | 516 | 1985 |
| Classification | 1615 | 223 | 318 | 6 | 68 | 124 | 205 | 2039 |
| Extract | 971 | 304 | 373 | 3 | 74 | 149 | 390 | 1937 |
| Generation | 21684 | 130 | 223 | 1 | 20 | 52 | 130 | 1999 |
| QA, closed | 1398 | 325 | 426 | 5 | 68 | 166 | 346 | 2032 |
| QA, open | 6262 | 89 | 193 | 1 | 10 | 18 | 77 | 1935 |
| Rewrite | 3168 | 183 | 237 | 4 | 52 | 99 | 213 | 1887 |
| Summarization | 1962 | 424 | 395 | 6 | 136 | 284 | 607 | 1954 |
| Other | 1767 | 180 | 286 | 1 | 20 | 72 | 188 | 1937 |
:Table 11: Prompt and demonstration lengths
| Prompt source | Measurement | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|---|
| Contractor | prompt length | 12845 | 437 | 441 | 5 | 42 | 324 | 673 | 2048 |
| Contractor | demo length | 12845 | 38 | 76 | 1 | 9 | 18 | 41 | 2048 |
| Customer | prompt length | 1533 | 153 | 232 | 1 | 19 | 67 | 186 | 1937 |
| Customer | demo length | 1533 | 88 | 179 | 0 | 15 | 39 | 88 | 2048 |
The data that we collect spans a wide range of categories and use cases. Table 1 shows the diversity of categories in our RM training and validation datasets as labeled by our contractors. The distribution of categories for the PPO datasets was similar. We additionally show a subset of our labeled prompt metadata in Table 7. Note that our annotation fields changed over the course of the project, so not every prompt was annotated for every field.
We used a lightweight classifier (langid.py) to classify the language of all instructions in our dataset. Empirically, around 96% of our dataset (110k datapoints) is classified as English, although we estimate that the actual fraction may be 99% or higher, due to classifier inaccuracies.
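As a rough sketch of this language-tagging step, assuming the off-the-shelf langid.py package (the helper name is ours):

```python
import langid  # the langid.py package

def english_fraction(prompts):
    """Tag each prompt with langid.py and return the fraction classified as English."""
    n_english = sum(1 for text in prompts if langid.classify(text)[0] == "en")
    return n_english / max(len(prompts), 1)
```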
Besides English, a small minority of prompts were found in at least 20 other languages: Spanish, French, German, Portuguese, Italian, Dutch, Romanian, Catalan, Chinese, Japanese, Swedish, Polish, Danish, Turkish, Indonesian, Czech, Norwegian, Korean, Finnish, Hungarian, Hebrew, Russian, Lithuanian, Esperanto, Slovak, Croatian, Swahili, Estonian, Slovenian, Arabic, Thai, Vietnamese, Malayalam, Greek, Albanian, and Tibetan.
Table 8 shows the average number of prompts each customer contributed to the dataset. In Table 9, we report descriptive statistics for prompt lengths (in tokens) used to train various models, and in Table 10 we break down token lengths by use case. Finally, in Table 11 we report lengths of the prompts and of the contractor-written demonstrations used for our SFT model, for both contractor-written and customer-submitted prompts.
Section Summary: The researchers hired contractors from platforms like Upwork and Scale AI to label data, carefully screening them through tests on identifying sensitive content such as toxic or political language, ranking model outputs for quality, writing nuanced responses, and self-reporting comfort with various topics to ensure a diverse team capable of handling broad prompts. Labeling instructions started by prioritizing helpfulness but later emphasized truthfulness and harmlessness, with explorations into model refusals for risky queries, while a survey of 19 labelers revealed a young, gender-balanced group mostly from the US and Southeast Asia who generally enjoyed the fair-paying, engaging work despite some repetition. They used a web interface where labelers scored outputs on a 1-7 scale, added metadata, and ranked responses to evaluate model performance.
Our labelers consist of contractors hired either through Upwork, or sourced from Scale AI. Unlike previous work on RLHF that focused mostly on the summarization domain [26, 15, 28], in this work we want humans to label a broad set of natural language prompts submitted to language models, some of which may be sensitive in nature. Thus, we conducted a screening process to select labelers who showed a high propensity to detect and respond to sensitive content.
More specifically, from an initial pool of labeler candidates, we selected our training labelers according to the following criteria: (1) agreement with researchers on flagging sensitive speech (e.g. toxic or political content), (2) agreement with researchers on ranking model outputs, (3) the quality of their written demonstrations on sensitive prompts, and (4) their self-assessed ability to identify sensitive speech across a range of topics and groups.
After collecting this data, we selected the labelers who did well on all of these criteria (we performed selections on an anonymized version of the data). Since the fourth criterion is subjective, we ultimately chose labelers using our judgment against these criteria, though we had soft cutoffs at 75% agreement on sensitive speech flagging and comparisons, and a 6/7 demonstration score.

The instructions we provided to labelers evolved over the course of the project, as we provided feedback, changed our metadata fields, and developed a better understanding of what we wanted to measure. We also amended instructions when they were confusing or inconsistent.
Of particular note, during the labeling of our training data, we had labelers prioritize helpfulness to the user as the most important criterion (above truthfulness and harmlessness), whereas in our final evaluations we had labelers prioritize truthfulness and harmlessness. We are exploring research avenues for having the model sometimes prioritize truthfulness and harmlessness over helpfulness during training, particularly through the use of refusals: having the model refuse to answer certain instructions. This comes with new challenges: different applications have different levels of risk, and thus we likely want what a model refuses to be configurable at inference time. There is also a risk that models could over-generalize and refuse innocuous instructions, which would be undesirable for most applications.
We show excerpts of our instructions for our final evaluations on our prompt distribution in Table 10, and on the RealToxicityPrompts distribution in Table 11.
Table 12: Labeler demographic data
We sent a voluntary, anonymous survey to our labelers to better understand their demographics. We show the results from the 19 respondents in Table 12. Overall, we find that our labelers are quite young (75% less than 35 years old), fairly balanced between male and female genders, and mostly come from the US or Southeast Asia.
Table 13: Labeler satisfaction survey
In combination with our demographics survey, we also sent out a survey to obtain feedback on the task. We show the results from the 19 respondents in Table 13. Overall, our labelers enjoyed the task, thought they were paid fairly for their work, and shared that they appreciated the helpfulness and level of communication from the researchers. Some labelers did find the task repetitive, though others felt there was enough variation to keep things interesting and engaging.

In Figure 12, we show screenshots of the labeling interface that all of our labelers (and researchers) use to label data.
Section Summary: All models in this project are built on the GPT-3 architecture, with adjustments for reward models and value functions to output single numbers, using efficient half-precision computations and a 2,000-token context limit while filtering longer inputs. Supervised fine-tuning involves training for multiple passes with specific learning rates and batch sizes tailored to model sizes, selected based on performance scores that predict human preferences, while the reward model—a single 6-billion-parameter version—is fine-tuned briefly on comparison data to avoid overfitting. For reinforcement learning from human feedback, models start from fine-tuned versions mixed with original training data, then undergo policy optimization over many episodes with fixed parameters, incorporating pretraining data to prevent drops in general task performance.
All models use the GPT-3 architecture ([8]). For the reward models and value functions, the unembedding layer of the original model is replaced with a projection layer that outputs a scalar value. All models use fp16 weights and activations, with fp32 master copies of weights. The same byte pair encodings as in [8] are used for all models. All our language models and RL policies have a context length of 2k tokens. We filter out prompts that are longer than 1k tokens and limit the maximum response length to 1k tokens.
All models are trained with the Adam optimizer, with $\beta_1=0.9$ and $\beta_2=0.95$.
We train our SFT models for 16 epochs with residual dropout of 0.2. We use a cosine LR schedule down to 10% of the original learning rate, with no learning rate warmup. For our 1.3B and 6B models, we use an LR of 9.65e-6 and a batch size of 32. For 175B, we use a LR of 5.03e-6 and a batch size of 8. To select learning rates, we did a geometric search over 7 LRs for 1.3B and 6B, and 5 LRs for 175B. We also tuned the number of epochs using geometric search. Our final models were selected based on the RM score, which we've found to be more predictive of human preference results compared to validation loss.
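As an illustration, the schedule described above (cosine decay to 10% of the peak learning rate, with no warmup) can be sketched as follows; the function name and step-based parameterization are assumptions for the example, not the training code used here.

```python
import math

def cosine_lr(step, total_steps, peak_lr, final_frac=0.1):
    """Cosine learning-rate schedule with no warmup, decaying from peak_lr
    to final_frac * peak_lr over total_steps."""
    progress = min(step / max(total_steps, 1), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return peak_lr * (final_frac + (1.0 - final_frac) * cosine)

# e.g. for the 1.3B SFT run: cosine_lr(step, total_steps, peak_lr=9.65e-6)
```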
We trained a single 6B reward model which we used for all PPO models of all sizes. Larger 175B RMs had the potential to achieve lower validation loss, but (1) their training was more unstable, which made them less suitable for use as initializations for the PPO value functions, and (2) using a 175B RM and value function greatly increases the compute requirements of PPO. In preliminary experiments, we found that 6B RMs were stable across a wide range of learning rates, and led to equally strong PPO models.
The final reward model was initialized from a 6B GPT-3 model that was fine-tuned on a variety of public NLP datasets (ARC, BoolQ, CoQA, DROP, MultiNLI, OpenBookQA, QuAC, RACE, and Winogrande). This was mostly for historical reasons; we find similar results when initializing the RM from the GPT-3 or SFT models. We trained for a single epoch over the full reward model training set (see Table 6) at a learning rate of 9e-6, with a cosine learning rate schedule (dropping to 10% of its initial value by the end of training) and a batch size of 64. Training did not appear to be very sensitive to the learning rate or schedule; changes of up to 50% in the learning rate resulted in similar performance. Training was quite sensitive to the number of epochs: multiple epochs quickly overfit the model to the training data with obvious deterioration in the validation loss. The batch size here represents the distinct number of prompts per batch. Each prompt had between $K=4$ and $K=9$ labeled completions, from which there were up to ${K \choose 2}$ possible comparisons. Ties were dropped. Therefore, a single batch could contain up to $64 \times {K \choose 2} \leq 2{,}304$ comparisons.
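To make the comparison training concrete, below is a minimal sketch of a pairwise loss over such comparisons, assuming the standard cross-entropy-on-reward-differences formulation and PyTorch; the function name and inputs are illustrative.

```python
import torch
import torch.nn.functional as F

def rm_pairwise_loss(rewards_preferred, rewards_dispreferred):
    """Pairwise reward-model loss: for each (preferred, dispreferred) comparison,
    maximize log sigmoid of the scalar reward difference."""
    return -F.logsigmoid(rewards_preferred - rewards_dispreferred).mean()

# Toy usage with RM scores for two comparisons:
loss = rm_pairwise_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.7, -0.1]))
```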
We initialize the RLHF models from a pretrained GPT-3 model and apply supervised fine-tuning for 2 epochs on the demonstration dataset. We also mix in 10% pretraining data during fine-tuning, since we find it helpful for PPO training (see Appendix E.11 for details). A cosine learning rate schedule is used, and the learning rate eventually decays to 10% of the peak learning rate. We use a batch size of 32 for the 1.3B and 6B models and 8 for the 175B model. We compare a few different peak learning rates for each model and pick the one with low losses on both the demonstration and the pretraining validation datasets. A log-linear sweep of 5 learning rates is compared for the 1.3B and 6B models, and 3 values are compared for the 175B model. The resultant learning rates for the 1.3B, 6B, and 175B models are 5e-6, 1.04e-5, and 2.45e-6, respectively.
We then initialize the RL policies from the above supervised fine-tuned models with pretraining mix. These models are also used to compute the KL reward, in the same way as [15], with $\beta=0.02$ (see Equation 2). We train all the RL models for 256k episodes. These episodes include about 31k unique prompts, after filtering out prompts with PII and deduplication based on common prefixes. The batch size for each iteration is 512, with a minibatch size of 64. In other words, each batch is randomly split into 8 minibatches and is trained on for only a single inner epoch ([16]). A constant learning rate is applied with a warmup over the first 10 iterations, starting with one tenth of the peak learning rate. Exponential moving averages of the weights are applied, with a decay rate of 0.992. No discount is applied when estimating the generalized advantage ([86]). The PPO clip ratio is set to 0.2, and the sampling temperature is 1 for rollouts.
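As a sketch of the KL-shaped reward used during PPO (with $\beta=0.02$ as above), the per-sample reward can be written as follows; the function name and the summed-log-probability inputs are simplifying assumptions for illustration.

```python
def rl_reward(rm_score, logprob_policy, logprob_sft, beta=0.02):
    """Reward for one sampled completion: the reward-model score minus a KL
    penalty toward the SFT policy. Log-probabilities are summed over the
    completion's tokens under each model."""
    kl_estimate = logprob_policy - logprob_sft  # per-sample estimate of KL(pi_RL || pi_SFT)
    return rm_score - beta * kl_estimate
```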
As previously mentioned, for all PPO models we use a 6B RM and a 6B value function, and the latter is initialized from the former. By using the same 6B reward model and value function on policies of all model sizes, it's easier to compare the effect of policy model size on policy performance. A fixed learning rate of 9e-6 is used for the value function for the 1.3B and 6B policies, and 5e-6 for the 175B policy.
Our initial RLHF experiments showed regressions on public NLP datasets, such as SQuADv2 and DROP, and we mitigate the regressions by mixing in pretraining gradients during PPO training. We use 8 times more pretraining examples than the number of the RL training episodes. The pretraining data is randomly drawn from the dataset used to train the GPT-3 models. For each minibatch, we compute the PPO gradients and pretraining gradients in consecutive steps and accumulate them both into the gradient buffers. We multiply the pretraining gradients by a coefficient, $\gamma=27.8$ (see Equation 2), to control the relative strength of gradients from PPO and pretraining distributions.
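A minimal sketch of this PPO-ptx update, assuming PyTorch-style autograd; `ppo_loss_fn` and `lm_loss_fn` are placeholders for the actual PPO and pretraining language-modeling losses, not functions defined in the paper.

```python
def ppo_ptx_step(optimizer, ppo_loss_fn, lm_loss_fn,
                 ppo_minibatch, pretrain_minibatch, gamma=27.8):
    """One PPO-ptx update: accumulate PPO gradients and pretraining gradients
    (scaled by gamma) in consecutive backward passes, then take one optimizer step."""
    optimizer.zero_grad()
    ppo_loss_fn(ppo_minibatch).backward()                 # gradients from the RL objective
    (gamma * lm_loss_fn(pretrain_minibatch)).backward()   # gradients from the pretraining distribution
    optimizer.step()
```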

We obtain our FLAN and T0 baselines by fine-tuning a 175B GPT-3 model on the FLAN and T0 datasets. For T0, note that we trained on the T0++ version of the dataset. Because T0 contains much more data (96M datapoints) than FLAN (1.2M datapoints), we subsampled T0 to 1 million datapoints to make the amount of training data comparable for each model. Note that the original models train on epochs where datapoints can be repeated, but in our epochs we go through every datapoint without repeats (to better match the way we trained our SFT baselines). We applied a cosine learning rate schedule, and tried initial learning rates of 4e-6 and 6e-6 for each dataset. The learning rate decays to 10% of its peak at the end of training, and we use a batch size of 64 for both experiments.
To choose the best FLAN checkpoint, we use our 6B reward model to score the completions on the validation set of prompts. As shown in Figure 13, the reward saturates after the initial 400k examples of training. This indicates that training for even longer is unlikely to improve the human evaluation performance. We picked the checkpoint with the highest RM score for our human evaluation, which is the one trained with a learning rate of 4e-6 for 896k examples.
We perform two similar experiments to find the best T0 checkpoint. In one experiment, we used a batch size of 128, a learning rate of 4e-6 and 1.28 million examples. The other experiment used a batch size of 64, a learning rate of 6e-6 and 1 million examples. Once again using the reward model score, we picked the checkpoint from the former experiment after 896k examples of training.
Section Summary: This section outlines automatic evaluations of language models on a range of benchmark datasets, including those testing bias, toxicity, truthfulness, reading comprehension, and summarization tasks like Winogender, CrowS-Pairs, and CNN/Daily Mail. Evaluations typically involve prompts with optional instructions, context, and either generated completions at zero temperature or multiple-choice selections based on probability scores, ensuring consistent formatting even in few-shot examples. For bias and toxicity, it uses basic, respectful, or biased prompts and measures model certainty through entropy, with detailed prompting structures and features illustrated in Figures 14 through 27 for each dataset.
We perform automatic evaluations on the following benchmark datasets: Winogender ([17]), CrowS-Pairs ([18]), RealToxicityPrompts ([6]), TruthfulQA ([75]), DROP ([20]), QuAC ([87]), SquadV2 ([19]), Hellaswag ([21]), SST ([88]), RTE and WSC (both part of SuperGLUE ([89])), WMT 15 Fr $\rightarrow$ En ([22]), CNN/Daily Mail Summarization ([90]), and Reddit TLDR Summarization ([91]).
All tasks take a similar form: they (optionally) begin with an instruction that is common to all queries in the task; they then contain context for each query; and they end with a completion that is either sampled from the model or chosen from one of multiple choices. All model sampling is performed at temperature $T=0$ and truncated at the first newline. Multiple choice answers are picked according to which one has the lowest average per-token log probability at $T=1$. When a prompt contains few-shot examples, it maintains the same format for the few-shot examples as it does for the final query.
For bias and toxicity measurements, we show results with three different choices of instruction, corresponding to three different prompt types: a basic prompt (no instruction), a respectful prompt, and a biased prompt. The respectful and biased instructions are as follows:
Respectful: "Complete the following sentence in a polite, respectful, and unbiased manner:"

Biased: "Complete the following sentence using maximally biased and offensive language:"

For bias tasks, we report the average entropy of each choice. The entropy, measured in bits, shows the certainty with which a model makes its choices. It is given by $H = -\sum_{i \in \rm choices} P_i \log_2 P_i$, where $\sum_i P_i = 1$ and each $P_i$ is proportional to the total probability the model assigns to that completion. Maximum entropy for binary choices is 1. High entropy indicates that the model is unsure of which choice to take or, equivalently, that it is unbiased and has no preference among the available options.
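A small sketch of this entropy computation over answer choices (the helper name is ours; inputs are unnormalized choice probabilities):

```python
import math

def choice_entropy(choice_probs):
    """Entropy in bits over a model's answer choices, where each input value is
    proportional to the total probability the model assigns to that completion."""
    total = sum(choice_probs)
    normalized = [p / total for p in choice_probs]
    return -sum(p * math.log2(p) for p in normalized if p > 0)

# A binary choice with equal probability mass has the maximum entropy of 1 bit:
assert abs(choice_entropy([0.5, 0.5]) - 1.0) < 1e-9
```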
In this section we describe the prompting structure, as well as other dataset features such as number of validation examples and performance metric, for each automatic evaluation task. These are shown in Figure 14–Figure 25.

Section Summary: This section presents extra evaluations of AI language models on public datasets, revealing that a training method called PPO often leads to performance drops in tasks like natural language understanding, especially with few examples, but these issues improve when pretraining data is mixed in during fine-tuning. It also shows the models' reward systems generalize well to new human labelers, exhibit modest bias reductions similar to base models, and maintain consistent user preference ratings across sizes. Experiments further demonstrate that adjusting pretraining influences fixes these drops more effectively than other tweaks, though extended training can still cause regressions in specific tasks.


We run automatic evaluation tasks on our models that collectively measure bias, toxicity, truthfulness, and a variety of natural language capabilities. The results of these evaluations are in Table 14. We show zero-shot performance of our models in Figure 28, and few-shot performance in Figure 29. We can see that the PPO model without pretraining mix has performance regressions on many datasets, particularly in the few-shot setting, and that these regressions are mitigated by our PPO-ptx model.
To measure how much our procedure overfits to our training labelers, we conduct an experiment where we train multiple RMs on subsets of labelers, and test their generalization to held-out labelers. We split the comparison data into five groups of labelers, so that each group has roughly the same amount of training data. We then apply five fold cross validation, by training the 6B reward model on four groups and validating on the other group. We use the same hyperparameters as defined in Appendix C.2. We find that the inter- and intra-group validation accuracies for predicting the human-preferred output are 72.4 $\pm$ 0.4%, and 69.6 $\pm$ 0.9% respectively, suggesting our RMs can generalize well to held-out labelers drawn from the same set as the training labelers.
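A sketch of this held-out-labeler experiment follows; groups here are formed by simple round-robin assignment rather than balanced by data volume as in our setup, and `train_rm` and `evaluate_rm` stand in for the actual RM training and accuracy evaluation.

```python
def heldout_labeler_accuracies(comparisons_by_labeler, train_rm, evaluate_rm, n_folds=5):
    """Five-fold cross-validation over labeler groups: train an RM on four groups
    of labelers' comparisons and measure preference-prediction accuracy on the
    held-out group."""
    labelers = sorted(comparisons_by_labeler)
    folds = [set(labelers[i::n_folds]) for i in range(n_folds)]
    accuracies = []
    for held_out in folds:
        train_data = [c for l in labelers if l not in held_out
                      for c in comparisons_by_labeler[l]]
        valid_data = [c for l in held_out for c in comparisons_by_labeler[l]]
        reward_model = train_rm(train_data)
        accuracies.append(evaluate_rm(reward_model, valid_data))
    return accuracies
```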

In Figure 30, we show metadata results as a function of model size.
In Figure 31, we show Likert scores for each of our models on our prompt distribution. The results largely track with our preference results in Section 4.1.


Our results on the Winogender and CrowS-Pairs dataset are shown in Figure 32. InstructGPT doesn't significantly improve over GPT-3 on these datasets.
Table 14: Automatic evaluations
We sweep a range of pretraining loss coefficients ($\gamma$ in Equation 2) to see the effect on performance on public NLP datasets and on validation reward. The results are shown in Figure 33. On the 1.3B model, setting the pretraining loss coefficient to 20 or greater recovers the regressions on these tasks. We also noticed that the sensitivity to the pretraining loss coefficient varies across tasks. Although increasing the pretraining loss coefficient causes the validation reward to drop, a single value of 27.8 seems to work well across model sizes, from 1.3B to 175B parameters. The human Likert score appeared to be insensitive to the exact value of the pretraining loss coefficient in our ablation studies.

We further investigate whether increasing the coefficient of the KL reward ($\beta$ in Equation 2) is sufficient to fix the regressions on public NLP datasets, using the 1.3B model. We set the pretraining loss coefficient to 0 and sweep a range of KL reward coefficients uniformly in log-linear space. The pretrained GPT model is used as the KL reward model in these experiments. The results are shown in Figure 34. We find that even increasing the KL reward coefficient to 2.0, which is 100 times the default value, does not fix the regressions. As expected, a KL reward coefficient that is too large causes a significant drop in the validation reward. This result demonstrates that the pretraining data distribution is critical for fixing the regressions on public NLP datasets and maintaining the capabilities of the pretrained model.

In Figure 35, we show that training for longer results in regressions on public NLP datasets, on the 1.3B model. We apply our default training method for PPO with pretraining mix, with three different random seeds. Instead of training for 256k episodes, we train for 512k episodes. As can be seen, on DROP and SquadV2, the model starts out with better performance than the GPT-3 model. As training goes on, the performance on both tasks drops slightly below the GPT-3 baseline.

Even with the pretraining data mix for PPO training, it's still important to tune the KL reward coefficient properly. In Figure 36, we show the human Likert score as a function of the KL reward coefficient. KL reward coefficients of both 0 and 2 result in poor performance; the optimal value is between 0.01 and 0.02.

We experimented with a few variants of the SFT models as PPO's init model, including training on the human demonstration data for one and two epochs, with 0%, 10%, and 50% pretraining data mix. As shown in Figure 37, the only setting that stands out is the 10% pretraining data mix. We chose to train PPO's init models on the human demonstration dataset for two epochs, with 10% pretraining data mix, although PPO performance does not appear to be sensitive to this particular choice.

For both the 1.3B and 6B models, we scan the learning rate in log-linear space, from 2.55e-6 to 2.55e-5, for PPO both with and without the pretraining data mix. All runs with a learning rate greater than 8.05e-6 diverged for PPO models without the pretraining data mix. For the 175B models, we did similar experiments with two learning rates, 2.55e-6 and 3.74e-6, due to compute constraints. Figure 38 shows the human evaluation results. PPO with pretraining data mix appears to be less sensitive to changes in the learning rate. Based on these results, we picked the checkpoints with the highest Likert scores as our final models.

In the RealToxicityPrompts task, we measure toxicity via the Perspective API and find that the toxicity of our model outputs is highly correlated with the toxicity of the input prompt, as shown in Figure 39. In order to better capture our models' behavior in unsafe regimes, we draw 5000 examples from the RealToxicityPrompts dataset with an approximately uniform distribution over prompt toxicity and report average toxicity over this sample.
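As an illustrative sketch of drawing a sample that is approximately uniform over prompt toxicity (the bin count and helper name are assumptions, not details from the paper):

```python
import random

def sample_uniform_toxicity(prompts_with_toxicity, n_samples=5000, n_bins=10):
    """Bucket prompts by toxicity score (assumed to lie in [0, 1]) and sample
    roughly evenly from each bucket, yielding an approximately uniform
    distribution over prompt toxicity."""
    bins = [[] for _ in range(n_bins)]
    for prompt, toxicity in prompts_with_toxicity:
        bins[min(int(toxicity * n_bins), n_bins - 1)].append(prompt)
    per_bin = n_samples // n_bins
    sample = []
    for bucket in bins:
        sample.extend(random.sample(bucket, min(per_bin, len(bucket))))
    return sample
```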



We compared using different amounts of pretraining data, while keeping the pretraining loss coefficient constant. Increasing the amount of pretraining data improves the quality of the gradient estimates from the pretraining objective. With a pretraining data ratio of 4, we found that the log probability loss on the pretraining distribution would often increase over the course of training. Some preliminary experiments show that better human Likert scores can be achieved with a pretraining data ratio of 32, but this also increases training time several-fold. With a pretraining data ratio of 8, training time is double that of the corresponding experiment without the pretraining mix; we chose this as a middle ground between training speed and pretraining loss performance.
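To make the ratio concrete, the following hypothetical sketch shows how a pretraining data ratio translates into the amount of pretraining data paired with each PPO batch; the names and data layout are assumptions for illustration.

```python
import itertools

# Pair each PPO batch of RL episodes with ratio-times as many pretraining
# sequences for the auxiliary log-likelihood term (illustrative only).
def mixed_batches(rl_batches, pretrain_stream, ratio=8):
    """rl_batches: iterable of lists of RL episodes.
    pretrain_stream: iterator over pretraining sequences.
    ratio: pretraining data ratio; larger values mean more compute per step."""
    for rl_batch in rl_batches:
        pretrain_batch = list(itertools.islice(pretrain_stream, ratio * len(rl_batch)))
        yield rl_batch, pretrain_batch
```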
Using the 1.3B model, we did not find it helpful to train for more than 256k episodes for PPO with pretraining data mix. We leave it to future work whether increasing the number of unique prompts and using larger models might change this conclusion.
We experimented with batch sizes of 64, 128, 256, 512, and 1024 for PPO with pretraining data mix on the 1.3B model. A batch size of 512 was found to be best through human evaluations. After fixing the batch size at 512, we further experimented with minibatch sizes of 8, 16, 32, and 64. We found a minibatch size of 32 to be optimal, slightly better than 64; however, our final models used a minibatch size of 64, since it has better GPU utilization than 32.
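For clarity, a hypothetical helper for this setting simply splits each PPO batch of 512 episodes into gradient minibatches of 64:

```python
# Split a PPO batch into gradient minibatches (illustrative helper).
def minibatches(batch, minibatch_size=64):
    for i in range(0, len(batch), minibatch_size):
        yield batch[i:i + minibatch_size]
```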
Section Summary: This section showcases samples from the 175B GPT-3 and 175B InstructGPT models to highlight their instruction-following capabilities. It includes instances where InstructGPT handles prompts in languages like French and Swedish, despite being trained almost exclusively on English, and cases where it responds to potentially harmful requests, a consequence of training that prioritizes helpfulness to the user. The section also features labeler-written prompts from the training dataset, paired with human-written demonstrations and model outputs, to illustrate diverse tasks such as describing code.
In this section, we provide some additional samples from both the 175B GPT-3 and 175B InstructGPT (PPO-ptx) models. We sample at $T=1$ for InstructGPT, and use $T=0.7$ for GPT-3, since GPT-3 performs poorly at high temperatures (this slightly disadvantages InstructGPT).
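As a reminder of what the temperature controls, the sketch below (with made-up logits and illustrative names) divides the logits by $T$ before the softmax, so $T=0.7$ sharpens GPT-3's output distribution while $T=1$ samples from InstructGPT's distribution unchanged.

```python
import numpy as np

# Minimal temperature-sampling sketch (not our decoding code).
def sample_token(logits, temperature=1.0):
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)
```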
In Figure 42, we show the full French sample from Figure 8, illustrating that our model is sometimes able to follow instructions in other languages, despite our dataset containing almost exclusively English. In Figure 44, we show our model's propensity to answer instructions that may be harmful, a result of us prioritizing helpfulness to the user in our training data. In Figure 45, we show another example of our model describing code, though it is still far from perfect.
In Figure 46–Figure 50, we show labeler-written prompts from our dataset, along with model samples and the human-written demonstration. These 5 prompts were selected from 15 to show a range of different tasks.

Section Summary: This references section lists dozens of academic papers and preprints published between 2015 and 2022, focusing on the development, risks, and ethical challenges of large language models used in AI systems like chatbots. Key topics include the potential dangers of overly large models, such as biases, toxicity, and societal impacts, as well as methods for aligning AI with human values through techniques like reinforcement learning and human feedback. It also covers benchmarks for testing model performance and innovations in scaling and training these systems.
[1] Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623.
[2] Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
[3] Kenton, Z., Everitt, T., Weidinger, L., Gabriel, I., Mikulik, V., and Irving, G. (2021). Alignment of language agents. arXiv preprint arXiv:2103.14659.
[4] Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., et al. (2021). Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359.
[5] Tamkin, A., Brundage, M., Clark, J., and Ganguli, D. (2021). Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503.
[6] Gehman, S., Gururangan, S., Sap, M., Choi, Y., and Smith, N. A. (2020). Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462.
[7] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
[8] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
[9] Fedus, W., Zoph, B., and Shazeer, N. (2021). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961.
[10] Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. (2021). Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.
[11] Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al. (2022). Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
[12] Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., and Legg, S. (2018). Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871.
[13] Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., et al. (2021). A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
[14] Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307.
[15] Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. (2020). Learning to summarize from human feedback. arXiv preprint arXiv:2009.01325.
[16] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[17] Rudinger, R., Naradowsky, J., Leonard, B., and Van Durme, B. (2018). Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana. Association for Computational Linguistics.
[18] Nangia, N., Vania, C., Bhalerao, R., and Bowman, S. R. (2020). CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online. Association for Computational Linguistics.
[19] Rajpurkar, P., Jia, R., and Liang, P. (2018). Know what you don't know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822.
[20] Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. (2019). Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161.
[21] Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. (2019). Hellaswag: Can a machine really finish your sentence? In Association for Computational Linguistics, pages 4791–4800.
[22] Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Huck, M., Hokamp, C., Koehn, P., Logacheva, V., Monz, C., Negri, M., Post, M., Scarton, C., Specia, L., and Turchi, M. (2015). Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal. Association for Computational Linguistics.
[23] Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. (2021). Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
[24] Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., et al. (2021). Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.
[25] Ibarz, B., Leike, J., Pohlen, T., Irving, G., Legg, S., and Amodei, D. (2018). Reward learning from human preferences and demonstrations in atari. In Advances in neural information processing systems, pages 8011–8023.
[26] Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. (2019). Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
[27] Böhm, F., Gao, Y., Meyer, C. M., Shapira, O., Dagan, I., and Gurevych, I. (2019). Better rewards yield better summaries: Learning to summarise without references. arXiv preprint arXiv:1909.01214.
[28] Wu, J., Ouyang, L., Ziegler, D. M., Stiennon, N., Lowe, R., Leike, J., and Christiano, P. (2021). Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.
[29] Jaques, N., Ghandeharioun, A., Shen, J. H., Ferguson, C., Lapedriza, A., Jones, N., Gu, S., and Picard, R. (2019). Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456.
[30] Yi, S., Goel, R., Khatri, C., Cervone, A., Chung, T., Hedayatnia, B., Venkatesh, A., Gabriel, R., and Hakkani-Tur, D. (2019). Towards coherent and engaging spoken dialog response generation using automatic conversation evaluators. arXiv preprint arXiv:1904.13015.
[31] Hancock, B., Bordes, A., Mazare, P.-E., and Weston, J. (2019). Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415.
[32] Kreutzer, J., Khadivi, S., Matusov, E., and Riezler, S. (2018). Can neural machine translation be improved with user feedback? arXiv preprint arXiv:1804.05958.
[33] Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. (2016). An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.
[34] Lawrence, C. and Riezler, S. (2018). Improving a neural semantic parser by counterfactual learning from human bandit feedback. arXiv preprint arXiv:1805.01252.
[35] Zhou, W. and Xu, K. (2020). Learning to compare for better training and evaluation of open domain natural language generation models. arXiv preprint arXiv:2002.05058.
[36] Cho, W. S., Zhang, P., Zhang, Y., Li, X., Galley, M., Brockett, C., Wang, M., and Gao, J. (2018). Towards coherent and cohesive long-form text generation. arXiv preprint arXiv:1811.00511.
[37] Perez, E., Karamcheti, S., Fergus, R., Weston, J., Kiela, D., and Cho, K. (2019). Finding generalizable evidence by learning to convince q&a models. arXiv preprint arXiv:1909.05863.
[38] Madaan, A., Tandon, N., Clark, P., and Yang, Y. (2022). Memory-assisted prompt editing to improve gpt-3 after deployment. arXiv preprint arXiv:2201.06009.
[39] Nahian, M. S. A., Frazier, S., Harrison, B., and Riedl, M. (2021). Training value-aligned reinforcement learning agents using a normative prior. arXiv preprint arXiv:2104.09469.
[40] Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and machines, 30(3):411–437.
[41] Mishra, S., Khashabi, D., Baral, C., and Hajishirzi, H. (2021). Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773.
[42] Khashabi, D., Min, S., Khot, T., Sabharwal, A., Tafjord, O., Clark, P., and Hajishirzi, H. (2020). Unifiedqa: Crossing format boundaries with a single qa system. arXiv preprint arXiv:2005.00700.
[43] Aribandi, V., Tay, Y., Schuster, T., Rao, J., Zheng, H. S., Mehta, S. V., Zhuang, H., Tran, V. Q., Bahri, D., Ni, J., et al. (2021). Ext5: Towards extreme multi-task scaling for transfer learning. arXiv preprint arXiv:2111.10952.
[44] Bahdanau, D., Hill, F., Leike, J., Hughes, E., Hosseini, A., Kohli, P., and Grefenstette, E. (2018). Learning to understand goal specifications by modelling reward. arXiv preprint arXiv:1806.01946.
[45] Abramson, J., Ahuja, A., Barr, I., Brussee, A., Carnevale, F., Cassin, M., Chhaparia, R., Clark, S., Damoc, B., Dudzik, A., et al. (2020). Imitating interactive intelligence. arXiv preprint arXiv:2012.05672.
[46] Zhao, M., Anderson, P., Jain, V., Wang, S., Ku, A., Baldridge, J., and Ie, E. (2021). On the evaluation of vision-and-language navigation instructions. arXiv preprint arXiv:2101.10504.
[47] Dhamala, J., Sun, T., Kumar, V., Krishna, S., Pruksachatkun, Y., Chang, K.-W., and Gupta, R. (2021). Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 862–872.
[48] Liang, P. P., Wu, C., Morency, L.-P., and Salakhutdinov, R. (2021). Towards understanding and mitigating social biases in language models. In International Conference on Machine Learning, pages 6565–6576. PMLR.
[49] Manela, D. d. V., Errington, D., Fisher, T., van Breugel, B., and Minervini, P. (2021). Stereotype and skew: Quantifying gender bias in pre-trained and fine-tuned language models. arXiv preprint arXiv:2101.09688.
[50] Caliskan, A., Bryson, J. J., and Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.
[51] Kirk, H., Jun, Y., Iqbal, H., Benussi, E., Volpin, F., Dreyer, F. A., Shtedritski, A., and Asano, Y. M. (2021). How true is gpt-2? an empirical analysis of intersectional occupational biases. arXiv preprint arXiv:2102.04130.
[52] Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., et al. (2021). Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650.
[53] Solaiman, I., Brundage, M., Clark, J., Askell, A., Herbert-Voss, A., Wu, J., Radford, A., Krueger, G., Kim, J. W., Kreps, S., et al. (2019). Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203.
[54] Buchanan, B., Lohn, A., Musser, M., and Sedova, K. (2021). Truth, lies, and automation. Technical report, Center for the Study of Emerging Technology.
[55] Henderson, P., Sinha, K., Angelard-Gontier, N., Ke, N. R., Fried, G., Lowe, R., and Pineau, J. (2018). Ethical challenges in data-driven dialogue systems. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 123–129.
[56] Xu, J., Ju, D., Li, M., Boureau, Y.-L., Weston, J., and Dinan, E. (2020). Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079.
[57] Dinan, E., Humeau, S., Chintagunta, B., and Weston, J. (2019b). Build it break it fix it for dialogue safety: Robustness from adversarial human attack. arXiv preprint arXiv:1908.06083.
[58] Nadeem, M., Bethke, A., and Reddy, S. (2020). Stereoset: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456.
[59] Welbl, J., Glaese, A., Uesato, J., Dathathri, S., Mellor, J., Hendricks, L. A., Anderson, K., Kohli, P., Coppin, B., and Huang, P.-S. (2021). Challenges in detoxifying language models. arXiv preprint arXiv:2109.07445.
[60] Blodgett, S. L., Barocas, S., Daumé III, H., and Wallach, H. (2020). Language (technology) is power: A critical survey of "bias" in NLP. arXiv preprint arXiv:2005.14050.
[61] Xu, A., Pathak, E., Wallace, E., Gururangan, S., Sap, M., and Klein, D. (2021). Detoxifying language models risks marginalizing minority voices. arXiv preprint arXiv:2104.06390.
[62] Solaiman, I. and Dennison, C. (2021). Process for adapting language models to society (palms) with values-targeted datasets. arXiv preprint arXiv:2106.10328.
[63] Ngo, H., Raterink, C., Araújo, J. G., Zhang, I., Chen, C., Morisot, A., and Frosst, N. (2021). Mitigating harm in language models with conditional-likelihood filtration. arXiv preprint arXiv:2108.07790.
[64] Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., and Socher, R. (2019). Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
[65] Dinan, E., Fan, A., Williams, A., Urbanek, J., Kiela, D., and Weston, J. (2019a). Queens are powerful too: Mitigating gender bias in dialogue generation. arXiv preprint arXiv:1911.03842.
[66] Liu, H., Dacon, J., Fan, W., Liu, H., Liu, Z., and Tang, J. (2019). Does gender matter? towards fairness in dialogue systems. arXiv preprint arXiv:1910.10486.
[67] Huang, P.-S., Zhang, H., Jiang, R., Stanforth, R., Welbl, J., Rae, J., Maini, V., Yogatama, D., and Kohli, P. (2019). Reducing sentiment bias in language models via counterfactual evaluation. arXiv preprint arXiv:1911.03064.
[68] Sheng, E., Chang, K.-W., Natarajan, P., and Peng, N. (2019). The woman worked as a babysitter: On biases in language generation. arXiv preprint arXiv:1909.01326.
[69] Qian, Y., Muaz, U., Zhang, B., and Hyun, J. W. (2019). Reducing gender bias in word-level language models with a gender-equalizing loss function. arXiv preprint arXiv:1905.12801.
[70] Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. M. (2020). Investigating gender bias in language models using causal mediation analysis. In NeurIPS.
[71] Dathathri, S., Madotto, A., Lan, J., Hung, J., Frank, E., Molino, P., Yosinski, J., and Liu, R. (2019). Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.
[72] Krause, B., Gotmare, A. D., McCann, B., Keskar, N. S., Joty, S., Socher, R., and Rajani, N. F. (2020). Gedi: Generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367.
[73] Schick, T., Udupa, S., and Schütze, H. (2021). Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp. arXiv preprint arXiv:2103.00453.
[74] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
[75] Lin, S., Hilton, J., and Evans, O. (2021). Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
[76] Bostrom, N. (2014). Superintelligence. Dunod.
[77] Irving, G., Christiano, P., and Amodei, D. (2018). AI safety via debate. arXiv preprint arXiv:1805.00899.
[78] Christiano, P., Shlegeris, B., and Amodei, D. (2018). Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575.
[79] Christiano, P., Cotra, A., and Xu, M. (2021). Eliciting latent knowledge: How to tell if your eyes deceive you. https://www.alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge.
[80] Soares, N., Fallenstein, B., Armstrong, S., and Yudkowsky, E. (2015). Corrigibility. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence.
[81] Leike, J., Martic, M., Krakovna, V., Ortega, P. A., Everitt, T., Lefrancq, A., Orseau, L., and Legg, S. (2017). AI safety gridworlds. arXiv preprint arXiv:1711.09883.
[82] Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. (2021). Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
[83] Anthony, T., Tian, Z., and Barber, D. (2017). Thinking fast and slow with deep learning and tree search. arXiv preprint arXiv:1705.08439.
[84] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2017). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.
[85] Achiam, J., Held, D., Tamar, A., and Abbeel, P. (2017). Constrained policy optimization. In International Conference on Machine Learning, pages 22–31. PMLR.
[86] Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR).
[87] Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W.-t., Choi, Y., Liang, P., and Zettlemoyer, L. (2018). Quac: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184.
[88] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
[89] Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2019). Superglue: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.
[90] Nallapati, R., Zhou, B., Gulcehre, C., Xiang, B., et al. (2016). Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023.
[91] Völske, M., Potthast, M., Syed, S., and Stein, B. (2017). TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63.