Atlas: Few-shot Learning with Retrieval Augmented Language Models

Atlas: Few-shot Learning with Retrieval Augmented Language Models

Gautier Izacard$^{*}$ $^{\diamondsuit}$ $^{\clubsuit}$ $^{\heartsuit}$
[email protected]

Patrick Lewis$^{*}$ $^{\diamondsuit}$
[email protected]

Maria Lomeli$^{\diamondsuit}$
[email protected]

Lucas Hosseini$^{\diamondsuit}$
[email protected]

Fabio Petroni$^{\diamondsuit}$
[email protected]

Timo Schick$^{\diamondsuit}$
[email protected]

Jane Dwivedi-Yu$^{\diamondsuit}$
[email protected]

Armand Joulin$^{\diamondsuit}$
[email protected]

Sebastian Riedel$^{\diamondsuit}$ $^{\spadesuit}$
[email protected]

Edouard Grave$^{\diamondsuit}$
[email protected]

$^{\diamondsuit}$ Meta AI Research, $^{\clubsuit}$ ENS, PSL University, $^{\heartsuit}$ Inria, $^{\spadesuit}$ University College London

$^{*}$ equal contribution

Abstract

Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key for such results, as is the case for tasks such as question answering and fact checking, massive parameter counts to store knowledge seem to be needed. Retrieval augmented models are known to excel at knowledge intensive tasks without the need for as many parameters, but it is unclear whether they work in few-shot settings. In this work we present Atlas, a carefully designed and pre-trained retrieval augmented language model able to learn knowledge intensive tasks with very few training examples. We perform evaluations on a wide range of tasks, including MMLU, KILT and NaturalQuestions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B parameters model by 3% despite having 50x fewer parameters.

Executive Summary: Atlas is a retrieval-augmented language model designed for strong few-shot performance on knowledge-intensive tasks such as question answering and fact checking. Large language models currently achieve few-shot learning by scaling parameters to store world knowledge internally, which is inefficient and inflexible. Retrieval-augmented approaches can offload memory to an external index, yet prior systems had not demonstrated reliable few-shot results.

The work therefore set out to design, pre-train, and evaluate a retrieval-augmented model that matches or exceeds the few-shot accuracy of far larger parametric models while remaining smaller, more updatable, and more interpretable. The authors jointly pre-train a dense retriever (based on Contriever) and a T5-based Fusion-in-Decoder reader on large unlabeled corpora, using masked-language-modeling objectives together with several novel or adapted losses that let the reader supervise the retriever. They then fine-tune the resulting model—called Atlas—on standard benchmarks, testing both 64-shot regimes and full training sets, and they experiment with different index contents, update schedules, and fine-tuning strategies such as query-side updates.

The central results are that Atlas-11B reaches 42.4 percent exact match on Natural Questions and 74.5 percent on TriviaQA after seeing only 64 examples, surpassing the 540-billion-parameter PaLM model by roughly three points despite using fifty times fewer parameters. On MMLU it achieves 47.9 percent in the de-biased 5-shot setting and 56.6 percent when the same small data pool is used for multi-task training. In full-data regimes it sets new state-of-the-art numbers on Natural Questions, TriviaQA, FEVER, and five KILT tasks. Ablations confirm that joint pre-training is essential for few-shot success, that index content and temporal alignment measurably affect accuracy, and that the model can be updated at test time simply by swapping the index without retraining.

These outcomes show that retrieval can effectively separate memorization from generalization, yielding smaller, cheaper-to-run systems whose knowledge remains current and inspectable. Practitioners can therefore obtain high performance on knowledge tasks with far less compute and with the practical advantage of being able to refresh facts simply by rebuilding or replacing the index. For near-term deployment the authors recommend query-side fine-tuning when data are scarce and full retriever updates or re-ranking when larger training sets are available; they also demonstrate that product quantization can reduce index memory by an order of magnitude with negligible loss.

The main limitations are dependence on index quality and coverage, residual risk of leakage from the retrieval corpus into evaluation sets, and the modest computational overhead of periodic index refreshes during training. Few-shot results also exhibit domain-level variance, although aggregate scores across fifty-seven MMLU tasks remain stable. Overall, the evidence indicates that retrieval-augmented models merit serious consideration whenever tasks require up-to-date or verifiable factual knowledge.

1. Introduction

Section Summary: Large language models perform well at learning new tasks from just a handful of examples, but it remains unclear how much of that ability depends on storing vast amounts of knowledge inside their parameters. This paper examines whether that internal memorization can be replaced by an external, searchable knowledge source and introduces Atlas, a retrieval-augmented model that pairs a dense retriever with a comparatively modest sequence-to-sequence network. Trained with joint pre-training and task-specific fine-tuning, Atlas reaches or exceeds the few-shot accuracy of models many times its size on question-answering and fact-checking benchmarks while also offering easier knowledge updates.

Large language models (LLMs) are impressive few-shot learners ([1, 2, 3, 4]). They are able to learn new tasks with very few examples or even from instructions alone. For this generalisation ability to emerge, the key ingredients are scaling both the parameter count of the model, and the size of the training data. Large language models owe this improvement to both a larger computational budget, enabling more complex reasoning, and the ability to memorize more information related to downstream tasks from the larger training data. While it is intuitive to assume that increased reasoning abilities lead to better generalisation, and hence few-shot learning, the same is not true for in-parameter memorisation. Specifically, it is unclear to what extent effective few-shot learning requires vast knowledge in the parameters of the model.

In this paper, we investigate whether few-shot learning requires models to store a large amount of information in their parameters, and if memorisation can be decoupled from generalisation. To do so, we leverage the fact that memory can be outsourced and replaced by an external non-parametric knowledge source by employing a retrieval-augmented architecture. These models employ a non-parametric memory, e.g. a neural retriever over a large, external, potentially non-static knowledge source to enhance a parametric language model. In addition to their memorisation abilities, such architectures are attractive due to a number of other established advantages in terms of adaptability, interpretability and efficiency ([5, 6, 7, 8] inter alia). However, retrieval-augmented models have yet to demonstrate compelling few-shot learning capabilities. In this work we address this gap, and present $\textsc{Atlas}$, a retrieval-augmented language model capable of strong few-shot learning, despite having lower parameter counts than other powerful recent few-shot learners.

$\textsc{Atlas}$ retrieves relevant documents based on the current context by using a general-purpose dense retriever using a dual-encoder architecture, based on the Contriever ([9]). The retrieved documents are processed, along with the current context, by a sequence-to-sequence model using the Fusion-in-Decoder architecture ([10]) that generates the corresponding output. We study the impact of different techniques to train $\textsc{Atlas}$ on its few-shot performance on a range of downstream tasks, including question answering and fact checking. We find that jointly pre-training the components is crucial for few-shot performance, and we carefully evaluate a number of existing and novel pre-training tasks and schemes for this purpose. $\textsc{Atlas}$ achieves strong downstream performance in both few-shot and resource-rich settings. For example, with only 11B parameters, $\textsc{Atlas}$ achieves an accuracy of 42.4% on NaturalQuestions using 64 training examples (45.1% with a Wikipedia-only index), outperforming PaLM ([4]), a 540B parameter model by almost 3 points, and 64.0% in a full-dataset setting with a Wikipedia index, establishing a new state of the art by 8 points.

**Figure 1:** We introduce $\textsc{Atlas}$, a retrieval-augmented language model that exhibits strong few-shot performance on knowledge tasks, and uses retrieval during both pre-training and fine-tuning.

In summary we make the following contributions:

  • A thorough study on how to design and train retrieval-augmented language models, with a focus on downstream few-shot learning and sample efficiency.
  • The findings of this study lead to a retrieval-augmented language model, called $\textsc{Atlas}$, that exhibits few-shot abilities that emerge at lower scale than standard LLM.
  • We provide an exploration of fine-tuning strategies to efficiently adapt both the retriever and the language model to the task at hand.
  • Thorough downstream experiments in few-shot settings, demonstrating state-of-the-art results on few-shot NaturalQuestions (+2.8%), TriviaQA (+3.3%), FEVER (+5.1%), and results on par or stronger than models with 15 $\times{}$ more parameters on MMLU.
  • Experiments investigating full-dataset finetuning, setting new state-of-the-art results in NaturalQuestions (+8.1%), TriviaQA (+9.3%) and 5 KILT Tasks.
  • Experiments demonstrating the updatability and interpretability characteristics of $\textsc{Atlas}$.
  • Experiments demonstrating that a compressed index using product quantisation achieves comparable performance as an uncompressed index while resulting in a 5x memory reduction.

Our code, pretrained $\textsc{Atlas}$ checkpoints, and various supporting data are available at https://github.com/facebookresearch/atlas

2. Method

Section Summary: The method follows a text-to-text setup in which a model receives a plain-text query and produces a text response, covering tasks such as question answering or classification. To supply external knowledge, it adds a retrieval step that pulls the top relevant passages from a large text collection and feeds them to a language model, which then generates the final output; both the retriever and the language model are transformer-based networks. The retriever is trained jointly with the language model using loss functions that turn the model’s own attention patterns into training signals, allowing the system to learn from ordinary query–answer pairs without any labeled documents.

Our approach follows the text-to-text framework ([11]). This means that all the tasks are framed as follows: the system gets a text query as input, and generates a text output. For example, in the case of question answering, the query corresponds to the question and the model needs to generate the answer. In the case of classification tasks, the query corresponds to the textual input, and the model generates the lexicalized class label, i.e. the word corresponding to the label. We give more examples of downstream tasks, from the KILT benchmark in Figure 2. As many natural language processing tasks require knowledge, our goal is to enhance standard text-to-text models with retrieval, which, as we hypothesise in the introduction, may be crucial to endow models with few-shot capabilities.

2.1 Architecture

Our model is based on two sub-models: the retriever and the language model. When performing a task, from question answering to generating Wikipedia articles, our model starts by retrieving the top-k relevant documents from a large corpus of text with the retriever. Then, these documents are fed to the language model, along with the query, which in turns generates the output. Both the retriever and the language model are based on pre-trained transformer networks, which we describe in more detail below.

Retriever.

Our retriever module is based on the Contriever ([9]), an information retrieval technique based on continuous dense embeddings. The Contriever uses a dual-encoder architecture, where the query and documents are embedded independently by a transformer encoder ([12, 13]). Average pooling is applied over the outputs of the last layer to obtain one vector representation per query or document. A similarity score between the query and each document is then obtained by computing the dot product between their corresponding embeddings. The Contriever model is pre-trained using the MoCo contrastive loss ([14]), and uses unsupervised data only. As shown in the following section, an advantage of dense retrievers is that both query and document encoders can be trained without document annotation, using standard techniques such as gradient descent and distillation.

**Figure 2:** Examples of query and output pairs for different tasks from KILT.

Language model.

For the language model, we rely on the T5 sequence-to-sequence architecture ([11]). We rely on the Fusion-in-Decoder modification of sequence-to-sequence models, and process each document independently in the encoder ([10]). We then concatenate the outputs of the encoder corresponding to the different documents, and perform cross-attention over this single sequence in the decoder. Following [10], we concatenate the query to each document in the encoder. Another way to process the retrieved documents in the language model would be to concatenate the query and all the documents, and to use this long sequence as input of the model. Unfortunately, this approach does not scale with the number of documents, since the self-attention in the encoder results in a quadratic complexity with respect to the number of documents.

2.2 Training objectives for the retriever

In this section, we discuss four different loss functions to train the retriever jointly with the language model. We consider loss functions that leverage the language model to provide supervisory signal to train the retriever. In other words, if the language model finds a document useful when generating the output, the retriever objective should encourage the retriever to rank said document higher. This allows us to train models using only query and output pairs from the task of interest, without relying on document annotations. For example, in the case of fact checking, a model only requires pairs of claims and corresponding verdicts but no documents containing the evidence to back up the verdict. In practice, we can apply this approach on any task, including self-supervised pre-training. As shown in the experimental section, pre-training is critical for obtaining models that exhibit few-shot learning abilities.

Attention Distillation (ADist).

The first loss that we consider is based on the attention scores of the language model, and is heavily inspired by [15]. The main idea is that the cross-attention scores between the input documents and the output, can be used as a proxy of the importance of each input document when generating the output. In particular, [15] showed that these scores can be aggregated across attention heads, layers and tokens for a given document to obtain a single score for each document. Then, these scores can be distilled into the retriever by minimizing the KL-divergence with the probability distribution $p_{\textsc{retr}}$ over the top-K documents ${\mathbf{d}k}{1, ..., K}$ obtained from the retriever:

$ p_{\textsc{retr}}\left(\mathbf{d} \ | \ \mathbf{q}\right) = \frac{\exp(s(\mathbf{d}, \mathbf{q}) / \theta)}{\sum_{k=1}^K \exp(s(\mathbf{d}_k, \mathbf{q}) / \theta)},\tag{1} $

where $s$ is the dot-product between the query and documents vectors and $\theta$ is a temperature hyper-parameter.

In the original paper, it was proposed to use the pre-softmax scores from the decoder cross-attentions, and average across heads, layers and tokens. Here, we propose an alternative which gives slightly stronger results, which relies on the following observation. In the attention mechanism, as defined by

$ \mathbf{y} = \sum_{n=1}^N \alpha_n \mathbf{v}_n, $

the contribution to the output $\mathbf{y}$ of a particular token $n$ cannot be evaluated from the attention score $\alpha_n$ alone, but should also take the norm of the value $\mathbf{v}_n$ into account. Hence, we use the quantity $\alpha_n | \mathbf{v}_n |2$ as the measure of relevance for token $n$. Following [15], we average these scores over all attention heads, layers, and tokens to obtain a score for each document. We apply the $\textsc{Softmax}$ operator over the resulting scores, to obtain a distribution $p{\textsc{attn}}(\mathbf{d}k)$ over the top-K retrieved documents. We then minimize the KL-divergence between $p{\textsc{attn}}(\mathbf{d}k)$, and the distribution $p{\textsc{retr}}$ from the retriever defined in Equation 1:

$ \textsc{KL}(p_{\textsc{attn}} \ | \ p_{\textsc{retr}}) = \sum_{k=1}^K p_{\textsc{attn}} (\mathbf{d}k) \log \left(\frac{p{\textsc{attn}} (\mathbf{d}k)}{p{\textsc{retr}} (\mathbf{d}_k)} \right). $

Here, this loss is only used to optimize the parameters of the retriever, and not the language model. When using recent deep learning frameworks, this is achieved by applying a $\textsc{StopGradient}$ operator on $p_{\textsc{attn}}$.

End-to-end training of Multi-Document Reader and Retriever (EMDR$^2$).

Next, we consider the method introduced by [16], which is inspired by the expectation-maximization algorithm, treating retrieved documents as latent variables. Given a query $\mathbf{q}$, the corresponding output $\mathbf{a}$ and the set $\mathcal{D}_K$ of top-K retrieved documents with the current retriever, the EMDR$^2$ loss to train the retriever is

$ \log \left[\sum_{k=1}^K p_{\textsc{lm}}(\mathbf{a} \ | \ \mathbf{q}, \mathbf{d}k) p{\textsc{retr}}(\mathbf{d}_k \ | \ \mathbf{q}) \right], $

where $p_{\textsc{retr}}$ is again the probability over the top-K documents obtained with the retriever, as defined by Equation 1. Again, only the parameters of the retriever are updated by applying a $\textsc{StopGradient}$ operator around $ p_{\textsc{lm}}$. One should note that the probability distribution over documents that maximizes this loss function is an indicator of the document corresponding to the highest probability of the output according to the language model. Finally, in practice, the EMDR$^2$ loss function is applied at the token level, and not at the sequence level.

Perplexity Distillation (PDist).

Third, we discuss a simpler loss function which is loosely inspired by the objectives from the attention distillation and EMDR$^2$ methods ([15, 16]). More precisely, we want to train the retriever to predict how much each document would improve the language model perplexity of the output, given the query. To this end, we minimize the KL-divergence between the documents distribution of the retriever (Eqn. 1), and the documents posterior distribution according to the language model, using a uniform prior:

$ p_k \propto p_{LM} (\mathbf{a} \ | \ \mathbf{d}_k, \mathbf{q}). $

Using the $\textsc{Softmax}$ operator, we have that

$ p_k = \frac{\exp(\log p_{LM} (\mathbf{a} \ | \ \mathbf{d}k, \mathbf{q}))}{\sum{i=1}^K \exp (\log p_{LM} (\mathbf{a} \ | \ \mathbf{d}_i, \mathbf{q}))}. $

Leave-one-out Perplexity Distillation (LOOP).

Finally, we propose an objective based on how much worse the prediction of the language model gets, when removing one of the top-k retrieved documents. To do so, we compute the log probability of the output for each subset of k-1 documents, and use the negative value as relevance score for each document. Following the previous loss function, we use the softmax operator to obtain a probability distribution over documents:

$ p_{\textsc{loop}}(\mathbf{d}k) = \frac{\exp(- \log p{LM} (\mathbf{a} \ | \ \mathcal{D}K \setminus { \mathbf{d}k }, \mathbf{q}))}{\sum{i=1}^K \exp (- \log p{LM} (\mathbf{a} \ | \ \mathcal{D}_K \setminus { \mathbf{d}_i }, \mathbf{q}))}. $

As before, we then minimize the KL-divergence between this distribution, and the one obtained with retriever. This loss is more expensive to compute than PDist and EMDR, but, like ADist, employs the language model more closely to the way it is trained i.e. the LM is trained to be conditioned on a set of $K$ documents. For LOOP, the language model is conditioned on $(K-1)$ documents, rather than a single document as in EMDR$^2$ and PDist.

For all losses, we can also use a temperature hyper-parameter when computing the target or retriever distributions to control the distribution's peakiness of, which might be important for some tasks or losses. Indeed, for PDist and LOOP, the perplexity of the output may not vary much when conditioning on different documents, especially in the case of long outputs.

2.3 Pretext tasks

In this section, we describe pretext tasks that can be used to jointly pre-train the retriever and the language model using only unsupervised data.

Prefix language modeling.

First, we consider a standard language modeling task as potential pre-training objective. To cast language modeling in the text-to-text framework, we consider a chunk of $N$ words, and split this chunk in two sub-sequences of equal length $N/2$. Then, the first sub-sequence is used as the query, and the second corresponds to the output. We thus retrieve relevant documents by using the first sub-sequence of $N/2$ tokens, to generate the output.

Masked language modeling.

Second, we consider masked language modeling, as formulated by [11]. Again, starting from a chunk of $N$ words, we sample $k$ spans of average length 3 tokens, leading to a masking ratio of $15%$. We then replace each span by a different special token. The model is then trained to generate the masked spans, each span beginning with the special sentinel mask token that was inserted in the input sequence. We retrieve documents using the masked query, but replace the special mask tokens with a mask token supported by the retriever vocabulary.

Title to section generation.

Finally, we consider a more abstractive generation task, generating sections from Wikipedia articles, given the article and section title. Here, the query corresponds to the title of the article, together with the title of the section, and the output corresponds to the text of the section. We exclude sections "See also", "References", "Further reading" and "External links".

2.4 Efficient retriever fine-tuning

Retrieval is facilitated by using a document index, which is a pre-computed collection of the document embeddings for all the documents in the retrieval corpus. When jointly training the retriever and language model, the index needs to be updated regularly, otherwise, the embeddings of the documents stored in the index become stale relative to the updated retriever. This means that we need to recompute the embeddings for the full collection of documents regularly during training to keep the index fresh, which can be computationally expensive for large indices. This is particularly true at fine-tuning time, where the number of training examples could be small relative to the number of documents in the index. Training the retriever could thus add an important computational overhead compared to standard language model finetuning. In this section, we analyse strategies that might make this process more efficient, alleviating the need to re-compute the embeddings of all the documents too often.

Full index update.

Let us start by analysing the overhead due to updating the index, compared to using a fixed retriever. To compare the computation time of different models, we will make the following assumption: the time required to perform a forward pass on a document with a model of $P$ parameters is $O(P)$. While this computation model may seem naive, the main assumption is that document sizes are constant.[^1] Since we split long documents into passages with similar number of words, and use padding when processing documents of different sizes, this assumption is reasonable in practice. Let $K$ be the number of documents that are retrieved and processed by the language model, $P_{\textsc{lm}}$ be the number of parameters of the language model and $B$ the batch size. Each training step has a complexity of $4 \times B \times K \times P_{\textsc{lm}}$.[^2]

[^1]: See [3] for more details about the computation of the FLOPS corresponding to the forward and backward passes of transformer networks.

[^2]: There is a factor 4 to account for the backward pass and activation checkpointing.

Next, let $N$ be the number of documents in the index, and $P_{\textsc{retr}}$ be the number of parameters of the retriever. Then, re-computing the full index has a complexity of $N \times P_{\textsc{retr}}$. If we refresh the index every $R$ training steps, we obtain the following overhead:

$ \frac{N \times P_{\textsc{retr}}}{4 \times B \times K \times P_{\textsc{lm}} \times R}. $

If we use the BERT-base architecture for our retriever and T5-XL for our language model, we get $\frac{P_{\textsc{retr}}}{P_{\textsc{lm}}} \approx \frac{1}{25}$, lading to the overhead:

$ \frac{N}{100 \times B \times K \times R}. $

If we use an index containing $37M$ documents (the size of our Wikipedia index), train with a batch size of $64$ with $20$ retrieved documents and refresh the index every 1000 steps, this results in an overhead of $\sim 30%$.

Re-ranking.

A second strategy is to retrieve a larger number of documents $L$ with the retriever, and to re-embed and rerank these documents with the up-to-date retriever, and pass the resulting top- $K$ to the language model. In that case, the overhead of reranking the top- $L$ documents is equal to $B \times L \times P_{\textsc{retr}}$. Since we perform this operation at every time step, the overhead is equal to

$ \frac{L \times P_{\textsc{retr}}}{4 \times K \times P_{\textsc{lm}}}. $

Using the same assumption as before, we finally get that the overhead is of the order of $\frac{L}{100 \times K}$. If we re-rank 10x more documents than what the language model processes (i.e., $L = 10 \times K$), we get an overhead of $10%$. However, note that if many updates are performed on the retriever, the index might still need to be fully updated, as the true top-k documents may not be retrieved in the top-L results from the stale index. In practice, it is possible to track the positions of the top-K re-ranked documents in the top-L, and estimate when the index needs to be updated.

Query-side fine-tuning.

Finally, the last strategy is to decouple the encoding of the queries and documents. In this case, we fix the parameters corresponding to the document encoder, and only train the parameters corresponding to the query encoder. Thus, the embeddings of documents are fixed, and we do not need to refresh the index, and thus there is no computational overhead. As we will see in practice, the impact of fixing the documents encoder varies greatly for different tasks when a large training dataset is available. For most of the few-shot settings that we consider, query-side finetuning does not have large performance impact, and sometimes even slightly improves performance.

3. Related work

Section Summary: Researchers have explored retrieval methods to enhance language models on knowledge-intensive tasks like question answering, moving from traditional term-matching approaches such as TF-IDF to modern dense neural retrievers like DPR that are often trained jointly with the model, as seen in systems like REALM and RAG, along with memory-augmented variants such as kNN-LM and RETRO. In parallel, few-shot learning has advanced through large pretrained models that enable in-context learning, where GPT-3-style systems perform tasks using natural language prompts and examples without updating parameters. Additional work combines these ideas with prompt optimization or limited fine-tuning to improve performance on new tasks with minimal data.

3.1 Retrieval in natural language processing

Retrieval for knowledge intensive tasks.

Previous work has shown that retrieval improves performance across a variety of tasks such as question answering ([17, 18, 19]), fact checking ([20]), dialogue ([21]) or citation recommendation ([22]). Historically, this information retrieval step was implemented using term-matching methods, such as TF-IDF or BM25 ([23, 24]). For open-domain question answering ([17]), documents are often retrieved from Wikipedia ([18]). Recently, dense retrievers based on neural networks have become popular. These usually follow a dual-encoder architecture ([25, 12, 26]), where queries and passages are encoded independently as vectors, and relevance is computed using the inner product or Euclidean distance. Popular supervised retrievers include DPR ([13]), which is trained to discriminate the relevant passage among negative passages, and extensions such as ANCE ([27]) which improved the hard negatives mining process. We refer the reader to [28] for a survey of dense retrieval techniques.

After retrieval, the relevant documents are processed to produce the final output. In open-domain QA, models can extract a span of text from retrieved documents as the answer ([18, 29, 30, 13]), a method inspired by reading comprehension ([31, 32]). Recently, generating the answer as free-form text, using a seq2seq model conditioned on retrieved documents have become prevalent ([6, 10, 33]). These architectures have also been shown to reduce hallucination in dialogue agents ([34]).

Retriever training.

The need for expensive query-document annotations for training the retriever can be bypassed, by leveraging signals from the language model, or using unsupervised learning. REALM ([5]) and RAG ([6]) jointly train the retriever and language model by modelling documents as latent variable, and minimizing the objective with gradient descent. REALM pre-trains end-to-end with an MLM approach but uses an extractive BERT-style model ([35]). [5] also explore a query-side finetuning at finetuning time to avoid index refreshes, which is also explored in the context of phrase-based retrieval by [36]. [10] proposed to use cross-attention scores as supervision with knowledge distillation. [16] perform joint training of the reader and the retriever by leveraging the perplexity of the output generated by the reader. [16] and [37] both employ salient span masking to pre-train retrievers, leveraging the perplexity and attention scores from the language model. The inverse cloze task was proposed by [38] to pre-train dense retrievers in an unsupervised way. [39] propose a method to train retrieval-augmented generators using a second "informed" retriever with access to the output, which the test-time retriever can be distilled from, and [40] recently proposed a training set filtering/weighting approach to train stronger retrieval-augmented generators. [9] explored different contrastive learning methods to train retrievers, while [41] used recurring spans within a document to create pseudo-positive query-document pairs.

Retrieval-augmented language models.

Continuous cache models ([42]) defines a probability distribution over recent tokens, by computing the similarity between previous and current representations of tokens. This distribution is then interpolated with the distribution of the language model, to improve predictions. Later, the amount of tokens used to compute this distribution was extended to a much larger memory by leveraging approximate nearest neighbors search ([43]). The related kNN-LM model ([44]) replaced LSTMs by transformer networks, and scaled the memory to billions of tokens, leading to strong performance improvements. More recently, RETRO ([8]) extended these by scaling the retrieval memory to trillions of tokens, and changing the model architecture to take retrieved documents as input.

Retrieval-Augmentation with Search Engines.

Recently, different works have proposed to train large language models to interact with a search engine, by generating text queries, and using the retrieved documents as additional context ([45, 46, 47]). In the context of few-shot question answering, [48] used the question to perform a search query, and retrieved documents are added to the prompt of a large language model performing in-context learning.

3.2 Few-shot learning

Few-shot learning, the task of learning from very few examples, has been studied for decades ([49, 50, 51]), but has recently seen an explosion of interest in NLP with the arrival of large pre-trained models, which exhibit emergent few-shot learning abilities ([52]).

In-context Learning with large Language models.

Providing language models with natural language descriptions of tasks, as proposed by [53] has led to significant developments in few-shot learning. GPT-3 ([1]) demonstrated the ability of large language models to perform few-shot predictions, where the model is given a description of the task in natural language with few examples. Scaling model size, data and compute is crucial to enable this learning ability, leading to the further development of large models ([54, 2, 55, 4, 55]). [3] revisited the scaling law from [56], suggesting that training on more data with a smaller model may be more effective, resulting in Chinchilla, a 70B parameter model with improved parameter efficiency.

Few-shot finetuning and prompt-based learning.

The above models perform few-shot learning with in-context instructions without training the parameters of the language model. Few-shot learning can also be accomplished by combining textual templates ("prompts") and various forms of model finetuning, either fully updating a model's parameters, e.g. for classification ([57, 58, 59, 60]) or generation ([61]). Prompts themselves can be optimized, for example by search ([62, 63]) or by only updating parts of the model ([64]), or learning "soft-prompts" ([65, 66]). Due to its simplicity, in this work we either employ simple prompts or simply feed in inputs without preprocessing, and perform full-model finetuning, a method similar to [67].

4. Experiments

Section Summary: In this section, the authors describe experiments evaluating their retrieval-augmented language models on few-shot learning tasks. They outline benchmarks such as KILT for knowledge-intensive activities like question answering and fact-checking, along with MMLU for broad multiple-choice understanding across many subjects, plus other tests for open-domain performance and temporal accuracy. The section also covers pre-training and fine-tuning procedures for their models, leading into ablation studies and final assessments of the main Atlas model on natural language tasks.

In this section, we report empirical evaluations of our language models on few-shot learning. We start by introducing our experimental setup, describing our evaluation benchmarks in Section 4.1, and giving the training details of our models in Section 4.2. Then, we perform an ablation study to compare the different technical choices leading to our main model. We finally evaluate this model, called $\textsc{Atlas}$, on different natural language understanding tasks in few-shot and full dataset settings.

4.1 Benchmarks

To evaluate our retrieval-augmented language models we consider the following benchmarks, which include different tasks.

Knowledge-Intensive Language Tasks (KILT).

First, we use the KILT evaluation suite ([68]), containing 11 datasets corresponding to 5 tasks: fact checking, question answering, dialog generation, entity linking and slot-filling. These different tasks require knowledge about the world to be solved, which can be found in Wikipedia. We evaluate our model on the following tasks and datasets included in KILT: question answering: NaturalQuestions ([19]), TriviaQA ([69]) and HotpotQA ([70]); slot filling: Zero Shot RE ([71]) and T-REx ([72]); entity linking: AIDA CoNLL-YAGO ([73]); dialogue: Wizard of Wikipedia ([21]); and fact checking: FEVER ([20]). The KILT versions of these datasets differ from their original versions, as instances requiring knowledge not present in the August 2019 Wikipedia dump have been removed.

Massively-Multitask Language Understanding (MMLU).

Our second main evaluation benchmark is MMLU ([74]), which contains 57 multi-choice question answering datasets (referred to as domains), sourced from real examinations designed for humans. These cover a very broad range of topics, e.g. high school mathematics, professional law, logical fallacies and clinical knowledge and can be broadly categorized in four subsets: humanities, social sciences, STEM and "other". We focus on few-shot learning, and the authors of the benchmarks suggest to use 5 training examples per domain. Beyond the 5-shot setting, We also consider three additional settings. The first is a zero-shot setting, with no training data at all. The second, which we call multi-task few-shot, is where we train a single model on the 6-shot data from all tasks, hence leading to a training set of 285 examples. The last, which we call transfer learning, leverages additional training examples from other multiple-choice QA tasks provided by the MMLU authors, namely MCTest ([31]), RACE ([75]), ARC ([76]) and OBQA ([77]) leading to a training set of 95k examples.

Additional benchmarks.

Additionally, we report results on the original open-domain versions of the popular NaturalQuestions ([19]), and TriviaQA ([69]) datasets. We also evaluate our model on the original version of FEVER ([20]), which presents fact checking as a three-way classification problem for textual claims (either "Supported": the text is supported by evidence in Wikipedia, "refuted": the claim is not consistent with evidence in Wikipedia, or "not enough info", where there is insufficient evidence to make a judgement). We also perform experiments to assess temporal sensitivity of our models. Here, we construct a dataset from TempLAMA ([78]), consisting of a set of time-sensitive cloze questions on a range of topics, where the answer changes from 2017 to 2020. We assess the accuracy of our models when supplied with a index from 2017 vs 2020 to assess to what degree models faithfully reflect the content of the index supplied to them at test time, and how effective updating the index is as a continual learning or model updateability method.

4.2 Technical details

We now describe the procedure for pre-training and fine-tuning our models. We focus on the setting used for the ablation studies performed in Section 4.3 and Section 4.4. We give more details about the hyperparameters used for our final model later.

Pre-training.

For the pre-training, we initialize the retriever module using the unsupervised Contriever model, which uses the BERT-base architecture. We initialize the language model with the T5 pre-trained weight. As the original T5 pre-trained model included supervised data in the training set, we use the version 1.1 models which were trained on unlabeled text only. Specifically, we initialize from the T5-lm-adapt variants due to their improved stability.

For the ablation studies performed in Section 4.3 and Section 4.4, we use T5-XL which contains 3B weights. We pre-train all our models for 10, 000 iterations, using AdamW with a batch size of 64 and a learning rate of $10^{-4}$ for the reader and $10^{-5}$ for the retriever with linear decay and 1, 000 warmup steps. We refresh the index every 1, 000 steps. This means that the index is recomputed 10 times during the pre-training, leading to an overhead of around 30%, compared to training with a fixed retriever. We set the number of retrieved documents to 20. We detail the hyperparameters used for the training of our final model at the beginning of Section 4.5.

Fine-tuning.

When performing a downstream task, either in a few-shot setting or with a large training set, we employ fine-tuning to adapt our models to these tasks. For the few-shot KILT ablation experiments, we perform a fixed number of fine-tuning iterations, instead of using early-stopping. More precisely, we decided to use 50 iterations for the 64-shot setting and 200 iterations in the 1024-shot setting. In both cases, we use a batch size of $32$ examples, a learning rate of 4 x 10^-5 with linear decay and 5 warmup steps for both the reader and the retriever.

\begin{tabular}{l c cccc cccc}
  \toprule
  && \multicolumn{4}{c}{64-shot} & \multicolumn{4}{c}{1024-shot} \\
  \cmidrule(lr){3-6} \cmidrule(lr){7-10}
  & MLM & NQ & WoW & FEVER & Avg. & NQ & WoW & FEVER & Avg. \\
  \midrule
  Closed-book & 1.083& 6.5 &14.1 & 59.0 &26.5 & 10.7 & 16.5 & 75.3 & 34.2 \\ 
  No Joint pre-training & - & 9.0 & 14.1 & 67.0 & 30.0 & 9.9 & 16.6 & 78.3 & 34.9 \\
  Fixed retriever & 0.823 & 39.9 & 14.3 & 72.4 & 42.2 & 45.3 & $\underline{17.9}$ & 90.0 & $\underline{51.1}$ \\
  ADist & $\underline{0.780}$ & 40.9 & 14.4 & 73.8 & 43.0 & $\underline{46.2}$ & 17.2 & \textbf{90.9} & \textbf{51.4} \\
  EMDR$^2$ & 0.783 & $\underline{43.3}$ & $\underline{14.6}$ & 72.1 & 43.3 & 44.9 & \textbf{18.3} & 85.7 & 49.6 \\
  PDist & 0.783 & \textbf{45.0} & \textbf{15.0} & \textbf{77.0} & \textbf{45.7} & 44.9 & $\underline{17.9}$ & $\underline{90.2}$ & 51.0 \\
  LOOP & \textbf{0.766} & 41.8 & \textbf{15.0} & $\underline{74.4}$ & $\underline{43.7}$ & \textbf{47.1} & $\underline{17.9}$ & 87.5 & 50.8 \\
  \bottomrule
  \end{tabular}

Unlabeled datasets.

Finally, we discuss the unlabeled text datasets that we use to train our models, which form the retrieval index. First, we consider the Dec. 20, 2021 Wikipedia dump, for which we keep the lists and infoboxes, which are linearized by adding a semi-colon separator between the entries. We split articles by section, and split long sections into passages of equal sizes and containing less than 200 words. This leads to a total of 37M passages, containing 78 words in average. We also use documents from the 2020-10 common crawl dump, preprocessed with the CCNet pipeline ([79]). We perform additional document filtering, in a similar fashion to Gopher ([2]). More precisely, we filter documents based on document length, average word length, ratio of alphanumeric characters and number of repeated tokens. This leads to a total of 350M passages. The same passages are used for the index and model pre-training. During pre-training, we ensure the passage we are training on is filtered out from the retrieved documents, to prevent the model from simply retrieving the passage it is de-nosing/generating, and trivially using it to solve the pre-training task.

4.3 Pre-training loss and tasks

We start our ablation study by comparing different pre-training tasks, and objective functions to jointly train the retriever and the language model. Our goal here is to answer the following research questions:

We start by comparing the training objectives of the retriever, introduced in Section 2.2, by pre-training models using the masked language modelling task. We evaluate these models on a subset of the 64-shot and 1024-shot KILT benchmark: NaturalQuestions, FEVER and Wizard of Wikipedia, along with two baselines: a 'closed-book'' (i.e. non-augmented T5) baseline, pre-trained on the same data, and initialized from Contriever and T5-lm-adapt. We report results in Table 1. First, we note the poor performance of the closed-book baseline, indicating the importance of augmentation. Next, we observe that pre-training our model with retrieval is important to obtain good performance on few-shot tasks. Indeed, all models that include retrieval during pre-training strongly outperform the baseline without joint pre-training. Next, we compare a model that was pre-trained with a fixed retriever, and models using the various retriever training objectives. On the MLM validation metric corresponding to the pre-training objective, we observe that jointly training the retriever leads to strong improvements. This effect tends to be less marked on 64-shot downstream tasks, and almost non-existent for 1024-shot. We believe that this is evidence that the biggest impact of pre-training is on the language model, which learns to use and aggregate information from the retrieved documents. Lastly, we do not observe significant systematic differences between the different retriever training objectives. We thus decide adopt use Perplexity Distillation for subsequent experiments, as it tends to be more stable than EMDR$^2$ or ADist, and more computationally efficient than LOOP.

\begin{tabular}{l cccc cccc}
  \toprule
  & \multicolumn{4}{c}{64-shot} & \multicolumn{4}{c}{1024-shot} \\
  \cmidrule(lr){2-5} \cmidrule(lr){6-9}
  & NQ & WoW & FEVER & Avg. & NQ & WoW & FEVER & Avg. \\
  \midrule
  Prefix Language Modelling & 41.0 & 14.5 & 64.9 & 40.1 & \textbf{44.7} & 17.9 & 86.0 & 49.5 \\
  Masked Language Modelling & \textbf{42.7} & \textbf{14.9} & \textbf{69.7} & \textbf{42.4} & \textbf{44.7} & \textbf{18.3} & \textbf{88.8} & \textbf{50.6} \\
  Title-to-section generation & 41.1 & 15.2 & 66.1 & 40.8 & 45.4 & 17.9 & 84.6 & 49.3 \\
  \bottomrule
  \end{tabular}

Next, we compare the different self-supervised pretext tasks introduced in Section 2.3 in Table 2. Here we observe similar results for all three tasks, with a small advantage for masked language modelling. Thus, in what follows, we adopt masked language modelling for pre-training.

\begin{tabular}{ll cccc cccc}
  \toprule
  & & \multicolumn{4}{c}{64-shot} & \multicolumn{4}{c}{1024-shot} \\
  \cmidrule(lr){3-6} \cmidrule(lr){7-10}
  Index & Training data & NQ & WoW & FEVER & Avg. & NQ & WoW & FEVER & Avg. \\
  \midrule
  Wiki & Wiki & \textbf{42.7} & 14.9 & 69.7 & \textbf{42.4} & 44.7 & 18.3 & 88.8 & \textbf{50.6} \\
  Wiki & CC & 40.9 & \textbf{15.3} & 67.3 & 41.2 & \textbf{44.8} & \textbf{18.4} & 88.1 & 50.4 \\
  CC & Wiki & 32.9 & 14.5 & \textbf{72.1} & 39.8 & 37.8 & 17.1 & 85.8 & 46.9 \\
  CC & CC & 38.4 & 14.9 & 70.1 & 41.1 & 42.0 & 17.3 & \textbf{88.9} & 49.4 \\
  \bottomrule
  \end{tabular}

Finally, we consider different combinations of data sources—Wikipedia and common crawl—for the index and training data during pre-training. In all cases, we use the Wikipedia 2021 dump as the index when performing few-shot fine-tuning. We report results in Table 3. First, we observe that using a Wikipedia-based index leads to better downstream performance. There could be two explanations for this: first, as we use Wikipedia for the few-shot tasks, the model might be better adapted when trained using the same data. Another explanation might be that Wikipedia is a higher-quality and denser source of knowledge than common crawl. Second, when using a common crawl index, we observe that pre-training on Wikipedia data leads to lower performance than using common crawl data. We believe that the primary reason is that the distribution mismatch between the two domains leads to generally-less relevant retrieved documents. In turn, this probably means that the pre-training is less efficient, because the language model does not leverage as much information from the documents. In the following, we thus decide to combine the data from both domains for both the index and the pre-training data.

4.4 Fine-tuning

In this section, we perform an ablation study on how to apply our models on downstream tasks, which relies on fine-tuning. In particular, we want to investigate the following research question:

To answer this question, we compare the different strategies to fine-tune the retriever module, described in Section 2.4. We report results in Table 4. First, as for pre-training, we observe that keeping the retriever fixed during fine-tuning leads to a significant performance drops, for both 64- and 1024-shot settings. Second, the re-ranking strategy (row 2) leads to very similar results to fully updating the index (row 1), while being significantly more efficient. Lastly, fine-tuning only the query encoder also leads to strong results: in particular, in the 64-shot setup, this is slightly stronger than performing full fine-tuning, which we attribute to there being less opportunity for over-fitting. On the other hand, on the 1024-shot setting, performing a full fine-tuning leads to stronger results, especially on NaturalQuestions. In the following, we thus use query-side fine-tuning for experiments with small numbers of examples, and standard fine-tuning for larger datasets.

\begin{tabular}{l cccc cccc}
  \toprule
  & \multicolumn{4}{c}{64-shot} & \multicolumn{4}{c}{1024-shot} \\
  \cmidrule(lr){2-5} \cmidrule(lr){6-9}
  & NQ & WoW & FEVER & Avg. & NQ & WoW & FEVER & Avg. \\
  \midrule
  Standard fine-tuning & 44.3 & 14.9 & 73.2 & 44.1 & 47.0 & 18.4 & 89.7 & \textbf{51.7} \\
  Top-100 re-ranking & 44.2 & 14.6 & 75.4 & \textbf{44.7} & \textbf{47.1} & \textbf{18.7} & 88.9 & 51.6 \\
  Query-side fine-tuning & \textbf{45.0} & \textbf{15.0} & \textbf{77.0} & \textbf{45.7} & 44.9 & 17.9 & \textbf{90.2} & 51.0 \\
  Fixed retriever & 36.8 & 14.5 & 72.0 & 41.1 & 38.0 & 17.7 & 89.3 & 48.3 \\
  \bottomrule
  \end{tabular}

4.5 Training and evaluating $\textsc{Atlas}$

In this section, we apply the findings from the ablations of the previous sections to train a family of $\textsc{Atlas}$ models, ranging from 770M to 11B parameters. More specifically, we use the Perplexity Distillation objective function, along with the masked language modelling pretext task. We pre-train these models using a mix of Wikipedia and Common Crawl data, for both the training data and content of the index. We retrieve 20 documents, and update the index every 2, 500 steps and perform re-ranking of the top-100 documents. We pre-train models for 10, 000 iterations using AdamW with a batch size of 128.

\begin{tabular}{l ccc ccc ccc}
  \toprule
  & \multicolumn{3}{c}{5-shot} & \multicolumn{3}{c}{5-shot (multi-task)} & \multicolumn{3}{c}{Full / Transfer} \\
  \cmidrule(lr){2-4} \cmidrule(lr){5-7} \cmidrule(lr){8-10}
  & 770M & 3B & 11B & 770M & 3B & 11B & 770M & 3B & 11B \\
  \midrule
  Closed-book T5 & 29.2 & 35.7 & 36.1 & 26.5 & 40.0 & 43.5 & 42.4 & 50.4 & 54.0\\
  \textsc{Atlas} & 38.9	&	42.3	&	43.4	&	42.1	&	48.7	&	56.4	&	56.3	&	59.9	&	65.8 \\
\midrule 
 $\Delta$ & +9.8	&	+6.6	&	+7.3	&	+15.6	&	+8.7	&	+12.9	&	+13.9	&	+9.5	&	+11.8\\
  \bottomrule
  \end{tabular}

4.5.1 MMLU

Results

As mentioned in Section 4.1, we consider four setting for MMLU: 1) a zero-shot setting where we directly apply the pretrained model with no few-shot finetuning 2) a 5-shot setting, where we finetune a model using 5 training examples for each of the 57 domains 3) a 5-shot multitask setting, where, rather than finetuning a model independently for each domain, we train a single model to perform all tasks and 4) a setting with access to a number of auxiliary datasets, with 95K total training examples. We train the models to generate the letter corresponding to the correct answer option (' $A'$, $B'$, 'C' or $D')$, and pick the answer with the most likely of the 4 letters at test time. Full technical details can be found in Appendix A.1.

Performance vs Parameters.

We start by comparing $\textsc{Atlas}$ to closed-book models of different sizes for 5-shot, 5-shot multitask and the full setting, and report results in Table 5. Across these settings, $\textsc{Atlas}$ outperforms the closed-book baselines by between 6.6 and 15.6 points, demonstrating consistent utility of retrieval for few-shot language understanding across 57 domains. The closed-book T5 struggles to perform significantly better than random (25%) in few-shot settings with 770M parameters, whereas the equivalent $\textsc{Atlas}$ achieves around 40%, significantly better than random, despite its small size. All models improve with more data, but interestingly, the 770M models do not benefit as much from few-shot multitask learning compared to larger models (for closed-book, it actually loses 3 points) suggesting smaller models struggle to grasp the synergies between the tasks in the few-shot setting. Larger models exploit the multi-task setting well, with $\textsc{Atlas}$ improving more than closed-book. For example, $\textsc{Atlas}$-11B improves by 13 points (43.4 $\rightarrow$ 56.4), but equivalent closed-book only improves by 7 (36.1 $\rightarrow$ 43.5). Finally, on the transfer learning setting, all models improve, but the relative gaps between closed-book at $\textsc{Atlas}$ models remain similar.

De-biasing.

When finetuning, we permute which answer option appears with which answer letter to reduce over-fitting and encourage a uniform prior over answer letters. However, the model may still exhibit a bias towards some letters, especially in few-shot settings, so we also include a second 'de-biased'inference mode in addition the standard inference used above. Here, we run 4 forward passes, one for each cyclic permutation of the answer letter-answer option assignment in the question, e.g. the answer option assigned to letter $A'$ becomes 'B', what was $B'$ becomes ` $C'$ etc.[^3] We then sum the 4 probabilities to obtain the final prediction, which reduces spurious bias towards one of the answer letters (further details in Appendix A.1). The results are shown in Table 6. We find that in zero-shot and 5-shot settings, de-biasing is very effective, improving results by 10.3 and 4.5 points respectively. When more training data is available, the need for de-biasing decreases, leading to only 0.2 point improvement in the multi-task and full data settings.

[^3]: Exploring all answer option permutations would involve 24 forward passes, which improves results by an additional $\sim$ 1% over the 4 cyclic permutations, but requires much more compute, so we exclude it here, see Appendix A.1

Comparison to published works

Next, we compare our $\textsc{Atlas}$-11B results with de-biasing to recently reported results with state-of-the-art large language models such as GPT-3 or Chinchilla, which required significantly more amount of computation to train. We report results in Table 7. We find that $\textsc{Atlas}$ is able to perform significantly better than random in zero-shot, and in conjunction with de-biased inference, achieves zero-shot scores that exceed 5-shot results reported with GPT3 in the literature (47.1% vs 43.9%) ([74]). For the 5-shot setting, $\textsc{Atlas}$ outperforms GPT-3 by 4%, while using 15 $\times{}$ less parameters, and 10 $\times{}$ less pre-training compute.[^4] When multitask-training on the combined 5-shot data, $\textsc{Atlas}$ improves to 56.6% close to the 5-shot performance of Gopher (60.0%). Finally, on the full data setting, where we train on auxiliary data recommended by the MMLU authors, $\textsc{Atlas}$ reaches an overall accuracy of 65.6%, close to the state-of-the-art. Interestingly, in this setup, $\textsc{Atlas}$ significantly outperforms GPT-3, while on the 5-shot setting, their performance is similar.

[^4]: $\textsc{Atlas}$ 's pre-training compute is dominated by the T5 pre-training. The computational requirements for the retrieval-augmented pre-train is orders of magnitude lower

: Table 6: Standard vs de-biased inference for MMLU These results are reported for $\textsc{Atlas}$-11B, using cyclic permutations for de-biasing, which increases inference costs by a factor of 4 ×.

Zero-shot 5-shot 5-shot (multi-task) Full / Transfer
Standard Inference 36.8 43.4 56.4 65.8
De-biased Inference 47.1 47.9 56.6 66.0
\begin{tabular}{ll cc c cccc}
  \toprule
  Setting & Model & Params & Train \textsc{FLOPS} & All & Hum. & Soc. Sci. & STEM & Other \\
  \midrule
  \multirow{1}{*}{zero-shot} & \textsc{Atlas} & 11B & 3.5e22 & 47.1 & 43.6&	54.1&	38.0&	54.4 \\
  \midrule
  \multirow{4}{*}{5-shot} & GPT-3 & 175B & 3.1e23 & 43.9 & 40.8 & 50.4 & 36.7 & 48.8 \\
  & Gopher & 280B & 5.0e23 & 60.0 & 56.2 & 71.9 & 47.4 & 66.1 \\
  & Chinchilla & 70B & 5.0e23 & \textbf{67.5} & \textbf{63.6} & \textbf{79.3} & \textbf{55.0} & 73.9 \\
  & \textsc{Atlas}$^*$ & 11B & 3.5e22 & 47.9 & 46.1	&54.6&	38.8&	52.8 \\
  \midrule
  \multirow{1}{*}{5-shot (multi-task)} & \textsc{Atlas} & 11B & 3.5e22 & 56.6 & 50.1&	66.4&	46.4&	66.2 \\
  \midrule
  \multirow{3}{*}{Full / Transfer} & UnifiedQA & 11B & 3.3e22 & 48.9 & 45.6 & 56.6 & 40.2 & 54.6 \\
  & GPT-3 & 175B & 3.1e23 & 53.9 & 52.5 & 63.9 & 41.4 & 57.9 \\
  & \textsc{Atlas} & 11B & 3.5e22 & 66.0 & 61.1&	77.2&	53.2&	\textbf{74.4}\\
  \bottomrule
  \end{tabular}

4.5.2 Open-domain Question Answering Results

Next we evaluate $\textsc{Atlas}$ on two open-domain question answering benchmarks: NaturalQuestions and TriviaQA. We compare to prior work, both in a few-shot setting using 64 examples, and using the full training set, and report results in Table 8. On these benchmarks, which require high-degree of memorisation, we clearly see the benefits of retrieval-augmentation. $\textsc{Atlas}$-11B obtains state-of-the-art results on 64-shot question answering, for both NaturalQuestions and TriviaQA. In particular, it outperforms significantly larger models, such as PaLM, or models that required significantly more training compute such as Chinchilla. When using the full training set, $\textsc{Atlas}$ also obtains state-of-the-art results, for example improving the accuracy on NaturalQuestions from 55.9% to 60.4%. This result is obtained using an index comprised of CCNet and the December 2021 Wikipedia corpora, our default setting for the index. In Section 5.2 we consider using indexes composed of Wikipedia corpus archived at different dates, and demonstrate an additional +3.6% on NaturalQuestions when using an index which is temporally matched to NaturalQuestions. We report performance as a function of model size as well as detailed hyperparameters in Appendix A.2.

$\textsc{Atlas}$ also compares favorably to recent work exploring retrieval-augmented few-shot question answering with very large models. [48] explore NaturalQuestions in a 15-shot setup using Gopher, augmenting questions with 50 passages retrieved using Google Search. This method consists of generating 4 candidate answers from each retrieved passages, and then re-ranking using either a score inspired by RAG ([6]) or a more expensive approach. This method (not shown in our tables) achieves exact match scores of 32.7% (RAG) and 38.4% (Ensemble), requiring 50 (RAG) or 450 (Ensemble) forward passes of Gopher-280B per test-time question. $\textsc{Atlas}$, using the same 15 training examples and 50 passages achieves 38.7 EM, despite having 25 $\times{}$ fewer parameters, and requiring comparatively negligible compute.

\begin{tabular}{l cc cc cc}
  \toprule
  & \multicolumn{2}{c}{NQ} & \multicolumn{2}{c}{TriviaQA filtered} & \multicolumn{2}{c}{TriviaQA unfiltered}\\
  \cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7} 
  Model & 64-shot & Full & 64-shot & Full & 64-shot & Full \\
  \midrule
  GPT-3 ([1]) & 29.9 & - & - & - & 71.2 & - \\
  Gopher ([2]) & 28.2 & - & 57.2 & - & 61.3 & - \\
  Chinchilla ([3]) & 35.5 & - & 64.6 & - & 72.3 & - \\
  PaLM ([4]) & 39.6 & - & - & - & 81.4 & -\\
  RETRO ([8]) & - & 45.5 & - & - & - & -\\
  FiD ([10]) & - & 51.4 & - & 67.6 & - & 80.1 \\
  FiD-KD ([15]) & - & 54.7 & - & 73.3 & - & - \\
  R2-D2 ([80]) & - & 55.9 & - & 69.9 & - & - \\
  \textsc{Atlas} & \textbf{42.4} & \textbf{60.4} & \textbf{74.5} & \textbf{79.8} & \textbf{84.7} & \textbf{89.4} \\
  \bottomrule
  \end{tabular}

4.5.3 FEVER Results

We report results on the original 3-class FEVER fact checking test set in Table 9. We consider a 64-shot setting, with training examples uniformly sampled from the full training set. Unlike the development and test sets, the train set is imbalanced, with more positive labels than negative, posing a challenge for few-shot learning. In this setting, we achieve an accuracy of 64.3%. We also report a 15-shot setting, with 5 examples uniformly sampled from each class to compare with published results from Gopher ([2]), where $\textsc{Atlas}$ scores 56.2%, outperforming Gopher by 5.1 points. Lastly we fine-tune our model on the full training set, and achieve a score of 78%, within 1.5% of the ProoFVer, which uses a specialized architecture, a retriever trained with sentence-level annotations, and is supplied with the Wikipedia corpus released with FEVER, whereas $\textsc{Atlas}$ retrieves from CCNet and the December 2021 Wikipedia dump. If we give $\textsc{Atlas}$ an index comprised of the FEVER Wikipedia corpus, we set a new state-of-the-art of 80.1%

: Table 9: Comparison to state-of-the-art on FEVER. We report accuracy on FEVER test set, for which evaluation is available here: https://competitions.codalab.org/competitions/18814. For the few-shot settings, our model uses fine-tuning while other models use prompting. †uses an index composed of the FEVER Wikipedia corpus.

15-shot 65-shot Full dataset
Gopher([2]) 51.1 - -
ProoFVer([81]) - - 79.5
Atlas 56.2 64.3 78.0 / 80.1

4.5.4 KILT

Results

Finally we evaluate $\textsc{Atlas}$ on KILT, a benchmark composed of several different knowledge intensive tasks, which was described in Section 4.1. We report results on test sets in Table 10 for which evaluation is available online^5. The KILT versions of datasets are filtered, and thus results for datasets we have evaluated elsewhere are not directly comparable on KILT (i.e. FEVER, NQ and TQA). We consider both a 64-shot setting and a full fine-tuning setting, in both cases we train $\textsc{Atlas}$ individually on each dataset. More details on the hyperparameters and development set results are reported in Appendix A.3. For 64-shot, we greatly exceed random performance, and are even competitive with some fully-finetuned models on the leaderboard, such as for FEVER, where our 64-shot $\textsc{Atlas}$ is only 2-2.5 points behind Sphere, SEAL and Re2G, and outperforms Sphere and SEAL on zero-shot RE. In the full dataset setting, $\textsc{Atlas}$ is within 3% to the state-of-the-art for 3 datasets, and sets the state-of-the-art in the remaining five datasets.

::: {caption="Table 10: Downstream results on the KILT hidden test sets Downstream metrics are accuracy (AIDA CoNLL-YAGO, FEVER, T-REx, zero-shot RE), exact match (Natural Questions, HotpotQA, TriviaQA), or F1 (Wizard of Wikipedia)."}

:::

5. Analysis

Section Summary: Atlas retrieves most of its passages from web data rather than Wikipedia, and these passages often supply background details or exact facts that help the model reach the right answer on benchmark questions. The analysis also checks for test-question leakage from the training corpus and finds only a small amount, which has little impact on overall scores once removed. In addition, swapping the retrieval index lets Atlas update its knowledge to handle time-sensitive facts without any retraining, giving it an advantage over models whose knowledge is frozen after training.

5.1 Interpretability and Leakage

An advantage of semi-parametric models like $\textsc{Atlas}$ is the ability to inspect retrieved items to aid interpretability. To better understand how well $\textsc{Atlas}$ retrieves, and how it uses retrieved passages, we examine the retrieved passages for multi-task few-shot MMLU. As shown in the left panel of Figure 3, the model retrieves the majority of its passages from CCNet (85% on average). Wikipedia makes up about 15% of retrieved passages, which is higher than we would expect under a uniform prior, given Wikipedia only makes up about 10% of the index. The fraction of Wikipedia retrieval varies between MMLU domains, with the model using Wikipedia to a greater extent for STEM domains, and least for social sciences. The domain making the greatest use of Wikipedia is "abstract algebra" (73%), and the least is "moral scenarios" (3%). We also note that the MMLU-finetuned $\textsc{Atlas}$ does not make significant use of Wikipedia infobox passages.

We can also analyse the content of passages to assess how they may useful for accomplishing the downstream task. The middle panel of Figure 3 shows how often retrieved documents contain the text of the correct answer option. There being at least one mention of the correct answer choice in 30% of test questions in the top 25 passages. [^6] The right panel shows that the accuracy on MMLU increases when the correct answer option text occurs more frequently in retrieved passages, rising from 55% for questions when the answer option does not appear, to 77% for questions mentioned more than 15 times.

[^6]: Note: Depending on the question, it may not be important or useful to retrieve the exact text of the answer in MMLU, and as such, a hits@k value of 30% does not imply that retrieval fails to surface useful information in 70% of cases

A human analysis of retrieved documents revealed that documents are helpful for answering questions in a number of different ways. Manual inspection of a sample of 50 correctly-answered questions revealed that 44% contained at least partially useful background information. These are documents that would improve the likelihood of a non-expert human answering correctly, such as contextual clues surrounding a quotation from a question, or helpful numerical figures for quantity-based questions, which help to narrow down the answer options to a smaller range. In a further 26% of cases, a passage contained all the necessary information to answer the question, stated in a straightforward way. If read competently, such passages make the question simple to answer, and often include information such as canonical definitions, or the exact numerical answer requested in the question. 28% of retrieval sets did not contain obvious information which would make the question easier. Finally, 2% contained the verbatim question in a passage, together with its answer.

**Figure 3:** **MMLU Retrieval Analysis. Left**: Fraction of sources of top 30 retrieved passages for MMLU from CCNet, Wikipedia passages and info boxes for the 5-shot multitask $\textsc{Atlas}$. **Center**: How often the text of the correct MMLU answer option appears in retrieved passages, as a function of the number of retrieved passages. **Right**: MMLU accuracy as a function of answer occurrence frequency in retrieved passages set

Given that MMLU has been created from pre-existing exams, it is possible that these questions appear on the open web. Models trained on web data (or, in our case, retrieving from it) run the risk of answering correctly not through generalisation, but by verbatim memorisation, which could lead to misleadingly high scores. In some very large language models, which can verbatim memorize and recall large parts of their pre-training data ([86]), efforts have sometimes been made to filter occurrences of downstream instances from pre-training data, but this has not been performed for MMLU in the literature. In order to assess the prevalence of MMLU leakage in our index, we manually checked retrieval results for questions where the longest n-gram overlap between the question (without answer options) and a passage was at least 75% the length of the question. This resulted in an estimate of leakage of 2.8% of questions from our CC-Net corpus.

A benefit of retrieval-augmented models such as $\textsc{Atlas}$ is the editability of its knowledge (see Section 5.2 for additional analysis). To estimate pure, non-leaked performance, we can filter out any potentially-leaked passages from retrieved results and rerun the language model. The MMLU score drops slightly when controlling for this leakage from 56.4 to 55.8% (-.5%). We note that our CC-net corpus is relatively small compared to the pre-trained corpora of recent very large models, which are trained on up to 1.4 trillion tokens ([3]), 35x the size of our index, making it likely that models trained on corpora of that size would observe more MMLU leaked examples, but detecting such leakage is challenging in non-retrieval augmented models.

5.2 Temporal Sensitivity and Updateability

A benefit of retrieval-augmented models is that they can be kept up-to-date without retraining, by updating or swapping their index at test time. To assess the effectiveness of this mechanism in $\textsc{Atlas}$, we first construct a dataset of time-sensitive questions derived from TempLAMA ([78]). TempLAMA is a collection of templated cloze questions derived from Wikidata and Wikidata where the correct answer changes over time. We select a subset of questions from this dataset which have a different answer in 2017 and 2020, for example, Question: Theo Walcott plays for ___ Answer: Arsenal F.C. (2017), Everton F.C. (2020), and form a small training set of 248 training, 112 development and 806 test questions.

Using this dataset, we finetune closed-book T5-XXL and $\textsc{Atlas}$ using the questions and the 2017 answers, supplying $\textsc{Atlas}$ with a 2017 Wikipedia index, and then measure exact match accuracy on the 2017 test set. The results can be found in the first row and first two columns of Table 11. We first observe that, as expected, $\textsc{Atlas}$ greatly outperforms T5 (57.7% c.f. 12.1%). We also note that, as desired, both T5 and $\textsc{Atlas}$ almost never generate an answer from 2020 when trained with the 2017 answers, scoring 2.8% and 1.5% respectively (first row, second two columns of Table 11). However, as shown in row 2, we can swap the $\textsc{Atlas}$ index to a 2020 Wikipedia index, without retraining, and find that $\textsc{Atlas}$ updates its predictions accordingly, with 2020 accuracy rising to a similar level to its 2017 performance (53.1%), whereas the purely parametric T5 has no such updateability mechanism.

This demonstrates that $\textsc{Atlas}$ can be faithful and condition strongly on its supplied index. Furthermore, this zero-shot updateability mechanism has the useful property of staying up-to-date without requiring up-to-date annotated data, or continuous, lifelong pre-training, as would be may required for a large parametric-only model. Rows 3 and 4 of Table 11 complete the picture, where this time we train with 2020 answers, and demonstrate $\textsc{Atlas}$ can zero-shot transfer backwards in time to 2017 effectively too (50.1%). Interestingly, T5 is unable answer questions from 2020 well, even when trained with 2020 answers (3.6%), likely because it was pre-trained on data pre-dating 2020 ([87]).

\begin{tabular}{l c cc cc}
  \toprule
  && \multicolumn{2}{c}{2017 Test Set Acc.} & \multicolumn{2}{c}{2020 Test Set Acc.} \\
  \cmidrule(lr){3-4} \cmidrule(lr){5-6} 
  Train Set & Test-time Index & Closed-book & \textsc{Atlas} &Closed-book & \textsc{Atlas} \\
  \midrule
\multirow{2}{*}{2017 answers} & 2017 & 12.1 & 57.7 & 2.9 & 1.5\\
& 2020 & 12.1 & 10.2 & 2.9 & 53.1 \\
  \midrule
\multirow{2}{*}{2020 answers} & 2017 & 4.8

& 50.1 & 3.6& 4.2\\
& 2020 & 4.8
& 3.5 & 3.6 & 60.5\\
  \bottomrule
  \end{tabular}

We also examine temporal effects for NaturalQuestions. NaturalQuestions is a dataset composed of search queries collected via the Google search engine in a short period of time. Thus data have a strong temporal bias, with a lot of questions about the 2018 World Cup for example. Moreover some questions are ambiguous without specification of the temporal context. For instance, for the question "when did ireland last beat england at twickenham", the expected answer is 2018 in NaturalQuestions, while Ireland also beat England at Twickenham in 2022 as well as many other times before. In Table 12, we report results obtained by finetuning $\textsc{Atlas}$ using different Wikipedia dumps for the index. We observe that the 2018 December Wikipedia dump, which is close to the date of data collection, leads to the best results for both few-shot and full fine-tuning. In particular, it leads to a new state-of-the-art of 64 EM on NaturalQuestions.

: Table 12: Impact of index data temporality on NaturalQuestions. We report exact match performance on NaturalQuestions using different Wikipedia dumps in the index. We observe that the dump from December 2018, commonly used for NaturalQuestions, leads to the best result.

Dec. 2017 Dec. 2018 Aug. 2019 Dec. 2020 Dec. 2021
64-shot 44.7 45.1 44.1 44.0 41.3
Full 63.2 64.0 62.4 61.1 59.6

5.2.1 Index Compression

**Figure 4:** **Index Compression:** $\textsc{Atlas}$-3B 64-shot NQ performance (left column: Retrieval Recall@50, right column: QA Exact Match score), as a function of index size, for different levels of quantisation. The right-most point in each plot represents the uncompressed index. **Top Row:** Wikipedia + CC Index. **Bottom Row:** Wikipedia Index.

Maintaining dense retrieval indices can be memory-intensive, especially as the number of indexed items is scaled. In this section, we briefly analyse the memory requirements of $\textsc{Atlas}$ 's index in the case of a) a Wikipedia index and b) the combined CC and Wikipedia index used in most of the experiments above.

There are two sources of memory pressure for $\textsc{Atlas}$ 's retrieval component – the passages themselves, and the document embedding index. The tokenized passages, once binarized, require 11GB and 130GB of storage for the Wikipedia and combined indices respectively. These passages do not need to be stored in expensive GPU RAM, and could even be memory-mapped to disk, sharded across nodes or compressed if required, and thus do not represent a limiting hardware challenge in this context. The embedding index itself, however, must be stored in GPU RAM for fast search, and thus its size is more sensitive. In the above experiments, we perform exact search over our index, which is achieved by sharding the index over all the the available GPUs, and computing the search in parallel. The index is stored at fp16 precision, resulting in a total GPU memory requirement of 49 GB and 587 GB for the Wikipedia and combined indices, respectively.

This large GPU memory requirement for the index limits accessibility and ease of deployment. However, many index compression techniques are available for nearest neighbour search, which can often dramatically reduce memory requirements at the cost of some retrieval accuracy. Following [88], we explore the effect of Product Quantization [PQ, 89], a popular lossy compression technique on $\textsc{Atlas}$-3B's accuracy for the 64-shot NQ task at different compression levels.

The results are shown in Figure 4. We find that substantial compression is possible before the onset of significant performance degradation. Namely, the Wikipedia index can be compressed from 49GB to 4GB with negligible drop in retrieval precision and exact match. Likewise, the combined index can be compressed from 587GB to 50GB without serious degradation, indicating that the combined index could be loaded onto a single 80GB GPU.

6. Discussion

Section Summary: The paper presents Atlas, a retrieval-augmented language model created by jointly training its retriever and language components. This design delivers strong performance on knowledge-intensive tasks using only a small number of examples, sometimes surpassing far larger models that require much more computing power. The authors also examine key training choices, show that the system can be updated and interpreted more easily, and report new state-of-the-art results when trained on full datasets.

In this paper, we introduce $\textsc{Atlas}$, a large retrieval-augmented language model. By jointly pre-training the retriever module and the language model, we show that $\textsc{Atlas}$ has strong few-shot learning capabilities on a wide range of knowledge intensive tasks, including NaturalQuestions, TriviaQA, FEVER, 8 KILT tasks and 57 MMLU tasks. For example, $\textsc{Atlas}$-11B reaches more than 42% accuracy on NaturalQuestions and 84.7% on TriviaQA when training on 64 examples, which is an improvement of almost 3 points compared to PaLM, a 540B parameters model, which required 50x more pre-training compute. We also provided detailed ablations and analyses for what factors are important when training such retrieval-augmented models, and demonstrated $\textsc{Atlas}$ 's updateability, interpretability and controlability capabilities. Lastly, we demonstrated that $\textsc{Atlas}$ is also powerful in full-dataset settings obtaining a new state-of-the-art results on NaturalQuestions, TriviaQA, FEVER, and 5 KILT tasks.

Appendix

Section Summary: The appendix section details the training setup, inference procedures, and supporting experiments for models evaluated on the MMLU benchmark, which involves multiple-choice questions across many subjects. It explains how questions and answer choices are formatted during training to match pre-training patterns, how answer-letter bias is mitigated through exhaustive or cyclic permutations of options at inference time, and how performance gains from these steps were measured. Additional notes cover data sources for retrieval, hyperparameter selection via related datasets, and comparisons showing the value of broader document indices.

A. Training details and additional results

A.1 MMLU

A.1.1 Training Details

Featurization

MMLU consists of multiple choice questions with four possible lexicalized answer options. We represent the input using the following template:

`question: question text

options: (A) answer 1 text (B) answer 2 text (C) answer 3 text (D) answer 4 text answer: [MASK_0] `

and train the model to generate the mask token followed by the letter of the correct answer:

[MASK_0] correct answer option letter

This format closely matches the format of MLM pre-training objective, aiding few-shot learning. When training, we permute the order of the answer options, i.e. shuffling which answer option appears as letter A etc. This helps reduce overfitting, and encourages a uniform prior on the letters.

Standard inference

Once trained we obtain predictions from the model by selecting the pre-softmax logits for the tokens A, B, C and D, and performing a softmax over them to obtain a distribution over the 4 answer options. For standard inference, we then simply return the answer corresponding to the argmax of this distribution.

De-biased Inference

As mentioned in the main text, even though our model is finetuned with data that encourages a uniform prior over answer letters (by permuting which answer option letter is used with which lexical answer option text in training data), this may not be enough to ensure the model has no residual bias towards specific letters. Consider answers $a$, questions $q$ and a nuisance variable $z \in \mathcal{Z}$, which represents the ordering of the answer options or, equivalently, which answer letter gets assigned to which answer option text. There are 4 answer options in MMLU, and thus $|\mathcal{Z}| = 24$ unique ways they can be ordered, or assigned to given letters. Running our model with our standard inference for a question $q$, corresponds to calculating $p(a|q=q, z=z)$ for the answer ordering $z$ that happens to appear in the dataset. We can control for $z$ by running the model with all possible answer orderings in the input, and marginalizing: $p(a|q=q) = \sum_{z' \in \mathcal{Z}} p(a|q=q, z=z') p(z=z'|q=q)$, and assuming $p(z=z'|q=q)$ is uniform (no answer ordering is more likely than another), this reduces to simply $p(a|q=q) \propto \sum_{z' \in \mathcal{Z}} p(a|q=q, z=z')$. This procedure requires 24 forward passes, one for each answer ordering, so is 24 $\times$ slower than standard inference. Table 13 shows the result of applying the full permutation de-biasing, which leads to an 12% improvement zero-shot and 6% in 5-shot performance overall. Empirically, using only the cyclic permutations of the answer order provided in the original dataset (of which there are 4) works nearly as well, which is what we report in the main paper, and only increases inference compute by a factor of 4, rather than 24. Cyclic permutation de-biasing improves over standard inference by 10% in zero-shot and 5% in 5-shot. Empirically, de-biased inference is largely unnecessary when training in the 5-shot multitask or full dataset setting, as there is enough data for the model to learn a more uniform prior over the letters.

\begin{tabular}{ll c cccc}
  \toprule
  Setting& Model & All & Hum. & Soc. Sci. & STEM & Other \\
  \midrule
\multirow{3}{*}{zero-shot}
& Standard &36.8	&37.5&	39.0&	30.2&	39.7\\
&All permutations& 48.5&	45.7&	55.2&	39.4&	54.4\\
& Cyclic Permutations& 47.1&	43.6&	54.1&	38.0&	54.9\\
  \midrule
  \multirow{3}{*}{5-shot}
& Standard & 43.4 &	41.8&	49.3&	33.9&	48.8 \\
& All permutations & 49.0&	46.0&	56.1&	40.5&	54.6 \\
& Cyclic Permutations & 47.9&	46.1&	54.6&	38.8&	52.8 \\

  \bottomrule
  \end{tabular}

Evaluation

We evaluate by following the method of [74], namely, micro-averaging across all 57 domains to obtain overall accuracy. We quote the results of GPT3 ([1]) and UnifiedQA ([90]) from the MMLU leaderboard at https://github.com/hendrycks/test. For Chinchilla and Gopher, we calculate the scores on the categories using the full MMLU results from [3].

Index

The index used for MMLU for all MMLU experiments in the main paper comprised of concatenation of the Wikipedia passages, Wikipedia info boxes and Common Crawl indices, for a total of 387M passages. We can assess the importance of the index by running a model without the common crawl data, leading to a 5-shot multitask result of 52.8%, compared to 56.4% for the full model, a drop of 3.6%. This indicates that whilst the Wikipedia data is sufficient do well on the task, the addition of the CC data improves results further.

Hyperparameters and development data

Selecting hyperparameters is challenging in few-shot settings. We do not assume access to an in-domain development set for the 5-shot task. Instead, we determine a set of hyperparameters for the 5-shot task using data from RACE, one of the auxiliary datasets provided by MMLU. Here, we sample 5 sets of 5-shot training data, and for each model size, we explore batch size ${32, 64}$, learning rates for the language model and retriever (5e-5, 1e-5), (4e-5, 4e-5), retriever temperature ${0.1, 0.01}$ and a fixed number of training steps ${16, 32, 64, 128}$, picking the setting that achieves strongest RACE validation scores. Having determined these hyperparameters, we apply them directly to the 5-shot MMLU task. For the 5-shot multi-task and full/transfer settings, we use the same batch size, temperatures and learning rates as the 5-shot task, but use a set of 285 MMLU validation examples (5 per domain) in order to determine the total number of training steps and for early stopping. The hyperparameters selected in the MMLU experiments can be found in Table 14. We use query-side finetuning for the 5-shot and 5-shot multitask settings, and top-128 reranking for the full setting. For all MMLU runs we retrieve 30 documents

: Table 14: Hyperparameters for MMLU

770M 3B 11B
Batch size 64 64 64
Learning rate (5e-5, 1e-5) (5e-5, 1e-5) (5e-5, 1e-5)
Retriever Temperature 0.1 0.1 0.1
5-shot train steps 64 32 16
5-shot (multitask) max train steps 2000 500 250
Full / transfer max train steps 5000 2000 2000

Inter-run Variance

few-shot learning is well-known to suffer from high variance. In the main paper, we quote the result obtained with our first run. In order to assess the effect of noise and variance, we ran the 5-shot experiment with $\textsc{Atlas}$ 5 times.[^7] We observe high variance for individual domains, sometimes as high as 20%, however, once aggregated across all 57 domains, the inter-run variance is low. The overall scores for these different runs, when using the same hyperparameters are shown in Table 15. Due the effects of averaging over the many domains that comprise MMLU, the inter-run variance is quite modest on the aggregated metrics, with a std deviation of 0.5 in this experiment.

[^7]: This experiment was performed with a slightly different index to the main experiments, which achieves a stronger result

\begin{tabular}{c ccccc}
  \toprule
Run # & All & Hum. & Soc. Sci. & STEM & Other \\
  \midrule
1 & 45.2&	40.6&	54.1&	37.1&	51.1\\
2 & 45.1&	39.8&	54.4&	37.1&	52.0\\
3 & 45.0&	40.0&	54.1&	37.7&	51.1\\
4 & 45.6&	41.3&	54.7&	37.0&	51.6\\
5 & 44.3&	40.6&	50.7&	38.1&	49.8\\
\midrule
Ave: & $45.0\pm0.5$ & $40.5\pm0.6$ & $53.6\pm1.6$ &	$37.4\pm0.5$ &	$51.1\pm0.8$ \\
  \bottomrule
  \end{tabular}

Closed-Book Baselines

The closed book baselines we compare $\textsc{Atlas}$ to in Table 5 are initialized from the same T5 model as their respective $\textsc{Atlas}$, and then pre-trained with MLM for the same number of steps (10K) using the same pre-training data as $\textsc{Atlas}$, for fairer comparison. The same procedure as for $\textsc{Atlas}$ was used to determine hyperparameters for MMLU for the closed-book model.s

A.1.2 Full results

Table 16 and Table 17 shows the full MMLU scores for each domain for $\textsc{Atlas}$ and the closed book T5 respectively. The full results for the cyclic-permutation-de-biased $\textsc{Atlas}$-XXL can be found in Table 18.

::: caption="Table 16: MMLU Test set scores for Atlas for each model size and each of the 57 domains."

:::

::: caption="Table 17: MMLU Test scores for the T5 closed book baseline for each model size and each of the 57 domains."

:::

::: {caption="Table 18: MMLU Test set scores for the de-biased Atlas-XXL using cyclic permutations for each of the 57 domains for zero-shot, 5 shot, 5-shot-multitask and the transfer setting."}

:::

A.2 Question answering

A.2.1 Training Details

For question answering, similarly to the MMLU experiments, we format the input using the following template:

question: question text answer: [MASK_0]

and train the model to generate the mask token followed by the answer:

[MASK_0] answer.

We generate answers using greedy decoding. For both training and testing, we retrieve 40 passages, and truncate the result of the concatenation between the query and the passages to 384 tokens.

For few-shot fine-tuning we train $\textsc{Atlas}$ for 30 steps using 64 random samples from the train sets. The retriever is trained using query-side fine-tuning. We select the model after 30 training steps. We use AdamW with a batch size of 32 and a learning rate of 4 x 10^-5 with linear decay and 5 iterations of warmup for both the language model and the retriever.

For the fine-tuning on the full datasets, we train the model for 5k gradient steps and refresh the index every 500 steps for the first 1, 000 training steps and every 2k training steps afterwards. We use AdamW with a batch size of 64 and a learning rate of 4 x 10^-5 with linear decay and 5 iterations of warmup for both the language model and the retriever. We evaluate models every 500 steps and select the best one on the validation set based on the exact match score.

A.2.2 Impact of scaling

In Table 19, we report performance on NaturalQuestions and TriviaQA as a function of the number of parameters in the reader module. Both for few-shot learning and full fine-tuning we observe strong improvements by scaling the size of the reader module. However we can notice sign of saturation when finetuning on full datasets, with limited gains when scaling from 3B to 11B parameters (+0.6% on NaturalQuestions, +0.5% on TriviaQA). While performance improves substantially when scaling from 3B to 11B parameters with 64 training samples, with +3.7% and +1.2% improvement on NaturalQuestions and TriviaQA respectively. For these experiments we use a setup similar to the one use in Table 8, except that we use an index composed of the December 2018 Wikipedia dump processed as described in Section 4.2.

\begin{tabular}{lcccc}
  \toprule
  Number of parameters & 220M & 770M & 3B & 11B \\
  \midrule
  NaturalQuestions 64-shot & 27.0 & 35.4 & 41.3 & 45.1 \\
  NaturalQuestions full & 54.1 & 60.8 & 63.4 & 64.0 \\ 
  \midrule
  TriviaQA 64-shot & 55.3 & 65.0 & 70.2 & 71.4 \\
  TriviaQA full & 71.8 & 74.9 & 77.5 & 78.0 \\
  \bottomrule
  \end{tabular}

A.3 KILT

For the results on KILT reported in Table 10 we fine-tune $\textsc{Atlas}$ individually on each dataset. We format the input using a template similar to the one used for question answering:

question: query text answer: [MASK_0]

and train the model to generate the mask token followed by the expected output:

[MASK_0] output.

We retrieve 20 passages and generate answer using greedy decoding. In KILT, FEVER is a two-way classification task of claims. We lexicalize the "SUPPORTS" (resp. 'REFUTES'') label into "true" (respectively "false").

For few-shot fine-tuning we train $\textsc{Atlas}$ for 30 steps using 64 random samples from the train sets. The retriever is trained using query-side fine-tuning. We evaluate models every 5 steps and select the best one on the development set based on the reported metric. We use AdamW with a batch size of 32 and a learning rate of 4 x 10^-5 with linear decay and 5 iterations of warmup for both the language model and the retriever.

For the fine-tuning on the full datasets, the model is trained for 5k gradient steps. We evaluate models every 500 steps and select the best one on the development set based on the reported metric. The index is refreshed every 500 step for the first 1000 iterations, and every 2k steps afterwards. We use AdamW with a batch size of 64 and a learning rate of 4 x 10^-5 with linear decay and 500 iterations of warmup for both the language model and the retriever.

We report results on the development sets in Table 20.

\begin{tabular}{lcccccccccc}
  \toprule
  \multirow{2}{*}{Model} & AIDA & FEV & T-REx & zsRE & NQ & HoPo & TQA & WoW \\
  & {\textsc{acc}} & {\textsc{acc}} & {\textsc{acc}} & {\textsc{acc}} & {\textsc{em}} & {\textsc{em}} & {\textsc{em}} & {\textsc{f1}} \\ \midrule
  \textsc{Atlas} 64-shot & 69.0 & 88.1 & 58.5 & 60.2 & 44.2 & 34.1 & 77.1 & 15.4 \\
  \textsc{Atlas} full dataset & 92.7 & 94.4 & 84.8 & 80.9 & 63.4 & 51.4 & 84.4 & 21.0 \\
  \bottomrule
  \end{tabular}

References

Section Summary: This references section compiles a list of academic papers focused on the development of large language models, their scaling techniques, and methods for improving performance through retrieval and training optimizations. The works cited, primarily from 2019 to 2022, explore topics like few-shot learning, compute efficiency, and knowledge augmentation for AI systems. Authors range from prominent research teams at organizations advancing natural language processing.

[1] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165.

[2] Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d'Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. Scaling language models: Methods, analysis & insights from training gopher, 2021. URL https://arxiv.org/abs/2112.11446.

[3] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556.

[4] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways, 2022. URL https://arxiv.org/abs/2204.02311.

[5] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020.

[6] Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. arXiv preprint arXiv:2005.11401, 2020.

[7] Dani Yogatama, Cyprien de Masson d'Autume, and Lingpeng Kong. Adaptive semiparametric language models. Transactions of the Association for Computational Linguistics, 9:362–373, 2021. doi:10.1162/tacl_a_00371. URL https://aclanthology.org/2021.tacl-1.22.

[8] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. Improving language models by retrieving from trillions of tokens, 2021. URL https://arxiv.org/abs/2112.04426.

[9] Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning, 2022. URL https://arxiv.org/abs/2112.09118.

[10] Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282, 2020.

[11] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

[12] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pp. 2333–2338, 2013.

[13] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020.

[14] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738, 2020.

[15] Gautier Izacard and Edouard Grave. Distilling knowledge from reader to retriever for question answering. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=NTEz-6wysdb.

[16] Devendra Singh Sachan, Siva Reddy, William Hamilton, Chris Dyer, and Dani Yogatama. End-to-end training of multi-document reader and retriever for open-domain question answering, 2021. URL https://arxiv.org/abs/2106.05346.

[17] Ellen M Voorhees et al. The TREC-8 question answering track report. In TREC, 1999.

[18] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In Proc. ACL, 2017.

[19] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019. doi:10.1162/tacl_a_00276. URL https://aclanthology.org/Q19-1026.

[20] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. Fever: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355, 2018.

[21] Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. Wizard of wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1l73iRqKm.

[22] Fabio Petroni, Samuel Broscheit, Aleksandra Piktus, Patrick Lewis, Gautier Izacard, Lucas Hosseini, Jane Dwivedi-Yu, Maria Lomeli, Timo Schick, Pierre-Emmanuel Mazaré, Armand Joulin, Edouard Grave, and Sebastian Riedel. Improving wikipedia verifiability with ai, 2022. URL https://arxiv.org/abs/2207.06220.

[23] Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 1972.

[24] Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. Okapi at TREC-3. NIST Special Publication Sp, 1995.

[25] Wen-tau Yih, Kristina Toutanova, John C. Platt, and Christopher Meek. Learning Discriminative Projections for Text Similarity Measures. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, CoNLL '11, pp. 247–256, USA, 2011. Association for Computational Linguistics. ISBN 978-1-932432-92-3. event-place: Portland, Oregon.

[26] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. Learning semantic representations using convolutional neural networks for web search. In Proceedings of the 23rd International Conference on World Wide Web, pp. 373–374, 2014. doi:10.1145/2567948.2577348. URL https://doi.org/10.1145/2567948.2577348.

[27] Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808, 2020.

[28] Andrew Yates, Rodrigo Nogueira, and Jimmy Lin. Pretrained Transformers for Text Ranking: BERT and Beyond. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, WSDM '21, pp. 1154–1156, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 978-1-4503-8297-7. doi:10.1145/3437963.3441667. URL https://doi.org/10.1145/3437963.3441667. event-place: Virtual Event, Israel.

[29] Christopher Clark and Matt Gardner. Simple and effective multi-paragraph reading comprehension. In Proc. ACL, 2018.

[30] Zhiguo Wang, Patrick Ng, Xiaofei Ma, Ramesh Nallapati, and Bing Xiang. Multi-passage BERT: A globally normalized BERT model for open-domain question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5878–5882, 2019. doi:10.18653/v1/D19-1599. URL https://aclanthology.org/D19-1599.

[31] Matthew Richardson. MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text. In Proceedings of the 2013 Conference on Emprical Methods in Natural Language Processing (EMNLP 2013), October 2013. URL https://www.microsoft.com/en-us/research/publication/mctest-challenge-dataset-open-domain-machine-comprehension-text/.

[32] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. EMNLP, 2016.

[33] Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. Ambigqa: Answering ambiguous open-domain questions. arXiv preprint arXiv:2004.10645, 2020.

[34] Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. Retrieval augmentation reduces hallucination in conversation, 2021. URL https://arxiv.org/abs/2104.07567.

[35] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL, 2019.

[36] Jinhyuk Lee, Mujeen Sung, Jaewoo Kang, and Danqi Chen. Learning dense representations of phrases at scale. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6634–6647, Online, August 2021b. Association for Computational Linguistics. doi:10.18653/v1/2021.acl-long.518. URL https://aclanthology.org/2021.acl-long.518.

[37] Haejun Lee, Akhil Kedia, Jongwon Lee, Ashwin Paranjape, Christopher D. Manning, and Kyoung-Gu Woo. You only need one model for open-domain question answering, 2021a. URL https://arxiv.org/abs/2112.07381.

[38] Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval for weakly supervised open domain question answering. In Proc. ACL, 2019.

[39] Ashwin Paranjape, Omar Khattab, Christopher Potts, Matei Zaharia, and Christopher D. Manning. Hindsight: Posterior-guided training of retrievers for improved open-ended generation, 2021. URL https://arxiv.org/abs/2110.07752.

[40] Sebastian Hofstätter, Jiecao Chen, Karthik Raman, and Hamed Zamani. Multi-task retrieval-augmented text generation with relevance sampling. ArXiv, abs/2207.03030, 2022.

[41] Ori Ram, Gal Shachaf, Omer Levy, Jonathan Berant, and Amir Globerson. Learning to retrieve passages without supervision. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2687–2700, Seattle, United States, July 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.naacl-main.193.

[42] Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. In International Conference on Learning Representations, 2017b. URL https://openreview.net/forum?id=B184E5qee.

[43] Edouard Grave, Moustapha Cisse, and Armand Joulin. Unbounded cache model for online language modeling with open vocabulary, 2017a. URL https://arxiv.org/abs/1711.02604.

[44] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HklBjCEKvH.

[45] Reiichiro Nakano, Jacob Hilton, S. Arun Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback. ArXiv, abs/2112.09332, 2021.

[46] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam M. Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, Yaguang Li, Hongrae Lee, Huaixiu Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Yanqi Zhou, Chung-Ching Chang, I. A. Krivokon, Willard James Rusch, Marc Pickett, Kathleen S. Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Hartz Søraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Díaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravindran Rajakumar, Alena Butryna, Matthew Lamm, V. O. Kuzmina, Joseph Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, and Quoc Le. Lamda: Language models for dialog applications. ArXiv, abs/2201.08239, 2022.

[47] Kurt Shuster, Mojtaba Komeili, Leonard Adolphs, Stephen Roller, Arthur D. Szlam, and Jason Weston. Language models that seek for knowledge: Modular search & generation for dialogue and prompt completion. ArXiv, abs/2203.13224, 2022.

[48] Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. Internet-augmented language models through few-shot prompting for open-domain question answering. ArXiv, abs/2203.05115, 2022.

[49] Sebastian Thrun and Lorien Pratt. Learning to Learn: Introduction and Overview, pp. 3–17. Kluwer Academic Publishers, USA, 1998. ISBN 0792380479.

[50] Michael Fink. Object classification from a single example utilizing class relevance metrics. In L. Saul, Y. Weiss, and L. Bottou (eds.), Advances in Neural Information Processing Systems, volume 17. MIT Press, 2005. URL https://proceedings.neurips.cc/paper/2004/file/ef1e491a766ce3127556063d49bc2f98-Paper.pdf.

[51] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, koray kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/90e1357833654983612fb05e3ec9148c-Paper.pdf.

[52] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models, 2022. URL https://arxiv.org/abs/2206.07682.

[53] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Technical Report, 2019.

[54] Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. Jurassic-1: Technical details and evaluation. Technical report, AI21 Labs, August 2021.

[55] Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model, 2022. URL https://arxiv.org/abs/2201.11990.

[56] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361.

[57] Timo Schick and Hinrich Schütze. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 255–269. Association for Computational Linguistics, 2021a. doi:10.18653/v1/2021.eacl-main.20. URL https://aclanthology.org/2021.eacl-main.20.

[58] Timo Schick and H. Schutze. It’s not just size that matters: Small language models are also few-shot learners. ArXiv, abs/2009.07118, 2021.

[59] Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3816–3830, Online, August 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.acl-long.295. URL https://aclanthology.org/2021.acl-long.295.

[60] Derek Tam, Rakesh R Menon, Mohit Bansal, Shashank Srivastava, and Colin Raffel. Improving and simplifying pattern exploiting training. ArXiv, abs/2103.11955, 2021.

[61] Timo Schick and Hinrich Schütze. Few-shot text generation with natural language instructions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 390–402. Association for Computational Linguistics, November 2021b. doi:10.18653/v1/2021.emnlp-main.32. URL https://aclanthology.org/2021.emnlp-main.32.

[62] Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438, 2020. doi:10.1162/tacl_a_00324. URL https://aclanthology.org/2020.tacl-1.28.

[63] Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4222–4235, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-main.346. URL https://aclanthology.org/2020.emnlp-main.346.

[64] IV Robert L. Logan, Ivana Balavzevi'c, Eric Wallace, Fabio Petroni, Sameer Singh, and Sebastian Riedel. Cutting down on prompts and parameters: Simple few-shot learning with language models. ArXiv, abs/2106.13353, 2021.

[65] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. CoRR, abs/2104.08691, 2021. URL https://arxiv.org/abs/2104.08691.

[66] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4582–4597, Online, August 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.acl-long.353. URL https://aclanthology.org/2021.acl-long.353.

[67] Teven Le Scao and Alexander Rush. How many data points is a prompt worth? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2627–2636, Online, June 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.naacl-main.208. URL https://aclanthology.org/2021.naacl-main.208.

[68] Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. How context affects language models' factual predictions. arXiv preprint arXiv:2005.04611, 2020.

[69] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proc. ACL, 2017.

[70] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering, 2018. URL https://arxiv.org/abs/1809.09600.

[71] Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp. 333–342, 2017. doi:10.18653/v1/K17-1034. URL https://aclanthology.org/K17-1034.

[72] Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique Laforest, and Elena Simperl. T-REx: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018. URL https://aclanthology.org/L18-1544.

[73] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. Robust disambiguation of named entities in text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 782–792, 2011. URL https://aclanthology.org/D11-1072.

[74] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.

[75] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 785–794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi:10.18653/v1/D17-1082. URL https://aclanthology.org/D17-1082.

[76] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457, 2018.

[77] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi:10.18653/v1/D18-1260. URL https://aclanthology.org/D18-1260.

[78] Bhuwan Dhingra, Jeremy R. Cole, Julian Martin Eisenschlos, Daniel Gillick, Jacob Eisenstein, and William W. Cohen. Time-aware language models as temporal knowledge bases. Transactions of the Association for Computational Linguistics, 10:257–273, 2022. doi:10.1162/tacl_a_00459. URL https://aclanthology.org/2022.tacl-1.15.

[79] Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the 12th Language Resources and Evaluation Conference, 2020.

[80] Martin Fajcik, Martin Docekal, Karel Ondrej, and Pavel Smrz. R2-D2: A modular baseline for open-domain question answering. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 854–870, November 2021. doi:10.18653/v1/2021.findings-emnlp.73. URL https://aclanthology.org/2021.findings-emnlp.73.

[81] Amrith Krishna, Sebastian Riedel, and Andreas Vlachos. Proofver: Natural logic theorem proving for fact verification, 2021. URL https://arxiv.org/abs/2108.11357.

[82] Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. Autoregressive entity retrieval. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=5k8F6UU39V.

[83] Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Dmytro Okhonko, Samuel Broscheit, Gautier Izacard, Patrick Lewis, Barlas Oğuz, Edouard Grave, Wen-tau Yih, and Sebastian Riedel. The web is your oyster - knowledge-intensive nlp against a very large web corpus, 2021. URL https://arxiv.org/abs/2112.09924.

[84] Michele Bevilacqua, Giuseppe Ottaviano, Patrick Lewis, Wen-tau Yih, Sebastian Riedel, and Fabio Petroni. Autoregressive search engines: Generating substrings as document identifiers, 2022. URL https://arxiv.org/abs/2204.10628.

[85] Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, Ankita Rajaram Naik, Pengshan Cai, and Alfio Gliozzo. Re2g: Retrieve, rerank, generate, 2022. URL https://arxiv.org/abs/2207.06300.

[86] Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models. In USENIX Security Symposium, pp. 2633–2650, 2021. URL https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting.

[87] Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1286–1305, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.emnlp-main.98. URL https://aclanthology.org/2021.emnlp-main.98.

[88] Gautier Izacard, Fabio Petroni, Lucas Hosseini, Nicola De Cao, Sebastian Riedel, and Edouard Grave. A memory efficient baseline for open domain question answering. arXiv preprint arXiv:2012.15156, 2020.

[89] Herve Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011. doi:10.1109/TPAMI.2010.57.

[90] Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. UNIFIEDQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1896–1907, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.findings-emnlp.171. URL https://aclanthology.org/2020.findings-emnlp.171.