Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Purpose and context
Organizations developing multi-modal AI systems face a fundamental architectural choice: how to build a single model that can both understand and generate different types of data—specifically text and images. Current approaches either bolt separate specialized models together (one for text, one for images) or convert everything to discrete tokens and use language modeling alone. The first approach is complex and costly; the second loses information when converting images to tokens. This work introduces Transfusion, a method that trains one unified model using different learning objectives matched to each data type: standard next-token prediction for text and diffusion (a continuous modeling technique) for images.
What was done
The research team built and tested Transfusion models ranging from 160 million to 7 billion parameters. The core innovation is training a single transformer network on mixed sequences of text tokens and image patches, applying language modeling loss to text and diffusion loss to images within the same training step. Images are represented as continuous vectors (not quantized to discrete tokens), and the model uses causal attention for text but bidirectional attention within each image. The team trained models on up to 2 trillion tokens (half text, half image data) and compared performance against Chameleon, a published baseline that discretizes images and models everything as language.
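To make the mixed-objective training step concrete, here is a minimal sketch (not the authors' released code) of how the two losses could be combined over one interleaved sequence. It assumes a transformer that returns per-position hidden states, plus hypothetical `text_head` (hidden-to-vocabulary) and `noise_head` (hidden-to-patch) projections; the balancing coefficient is illustrative.

```python
import torch.nn.functional as F

def transfusion_loss(hidden, tokens, text_mask, noise, text_head, noise_head,
                     lambda_img=5.0):
    """Combined loss over one mixed text/image sequence (sketch).

    hidden:    [B, L, D] transformer outputs over interleaved positions
    tokens:    [B, L]    token ids (values at image positions are ignored)
    text_mask: [B, L]    True at text positions, False at image-patch positions
    noise:     [B, L, P] Gaussian noise added to each image patch before it was
                         fed to the model (ignored at text positions)
    """
    # Language modeling: the output at position i predicts the token at i+1,
    # counted only where that next position holds a text token.
    logits = text_head(hidden[:, :-1])                 # [B, L-1, V]
    next_is_text = text_mask[:, 1:]
    lm_loss = F.cross_entropy(logits[next_is_text], tokens[:, 1:][next_is_text])

    # Diffusion: image-patch positions regress the noise mixed into the patch.
    eps_hat = noise_head(hidden[~text_mask])           # [M, P]
    ddpm_loss = F.mse_loss(eps_hat, noise[~text_mask])

    # Single objective: L = L_LM + lambda * L_DDPM (the lambda default here
    # is only illustrative, not the authors' tuned value).
    return lm_loss + lambda_img * ddpm_loss
```

The key point is that both losses are computed in one forward pass, each applied only to the positions belonging to its own modality.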
Key findings
Transfusion consistently outperforms the discretization approach across all tested tasks. For image generation, Transfusion achieves the same quality as Chameleon using only about 3% of the compute (34 times more efficient). For image captioning, Transfusion matches Chameleon's performance at 22% of the compute. Unexpectedly, Transfusion also improves text-only tasks, reaching the same text perplexity as Chameleon at 49-60% of the compute, even though both methods model text identically. The analysis attributes this advantage partly to the architectural modifications Chameleon needs for training stability and partly to the cost of modeling discrete image tokens, which compete with text for model capacity.
When scaled to 7 billion parameters and trained on 2 trillion tokens, the Transfusion model generates images comparable to specialized diffusion models like DALL-E 2 and SDXL, while simultaneously matching the text generation performance of Llama language models trained on the same text data. An additional experiment showed that fine-tuning on just 8,000 image editing examples enables the model to perform image-to-image transformations, suggesting the approach generalizes to modality combinations not seen during initial training.
What this means
These results demonstrate that organizations can build single unified models for multi-modal AI without sacrificing performance or efficiency. Transfusion eliminates the need to either maintain multiple specialized models or accept information loss from quantization. The compute savings are substantial: the training cost for reaching a given level of image generation quality drops by roughly 34-fold compared to the discretization approach. The method also reduces serving costs, because images can be compressed to as few as 16 patches (versus 1,024 discrete tokens) with minimal quality loss when an appropriate encoder/decoder architecture (U-Net layers) is used.
The unexpected improvement on text tasks suggests that keeping images continuous frees up model capacity that would otherwise be spent learning to predict image tokens. This means the same parameter budget delivers better performance across all modalities.
Recommendations and next steps
For organizations building multi-modal AI systems, Transfusion should be considered as the baseline architecture. Specifically:
- Adopt the Transfusion training recipe (combined language modeling and diffusion objectives) rather than discretizing all modalities to tokens. This delivers better performance at lower training cost.
- When compute efficiency is critical, incorporate U-Net encoding/decoding layers to compress images to larger patches (4×4 or 8×8 latent pixels). This cuts inference costs by up to 64 times with only small quality reductions; the sketch after this list works through the patch-count arithmetic.
- For models that will process images before generating text (e.g., captioning), implement noise limiting during training (cap diffusion noise at t=500 when images precede captions). This improves captioning performance by over 15% with negligible impact on other tasks.
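The sequence-length arithmetic behind the patch-size recommendation is easy to check. The sketch below assumes 256×256 images, a VAE with 8× spatial downsampling (a 32×32 latent grid), and square patches measured in latent pixels; the 1,024-token figure is the discrete-tokenization baseline mentioned above.

```python
def image_positions(image_px=256, vae_downsample=8, patch_size=2):
    """Transformer positions occupied by one image, assuming a VAE that
    downsamples 8x spatially and square patches of latent pixels."""
    latent_side = image_px // vae_downsample          # 256 / 8 = 32
    return (latent_side // patch_size) ** 2           # patches per image

for k in (2, 4, 8):
    n = image_positions(patch_size=k)
    print(f"{k}x{k} latent patches -> {n:3d} positions "
          f"({1024 // n}x fewer than 1,024 discrete tokens)")
# 2x2 -> 256 positions (4x fewer)
# 4x4 ->  64 positions (16x fewer)
# 8x8 ->  16 positions (64x fewer)
```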
Future work should explore: (1) scaling beyond 7 billion parameters to establish whether advantages hold at frontier model sizes, (2) extending to video and audio modalities using the same principle of modality-specific objectives, and (3) investigating whether synthetic caption data (which benefits competing approaches) provides similar gains for Transfusion.
Limitations and confidence
The results are based on one image resolution (256×256 pixels) and one data mixture (50% text, 50% images by token count). Performance at higher resolutions, different data ratios, or with other modalities (video, audio) remains untested. The largest model trained (7 billion parameters) is well below frontier scale, so whether advantages persist at 70 billion or 400 billion parameters is unknown.
The comparison focuses primarily on one baseline (Chameleon). While results on standard benchmarks show Transfusion matching specialized image models, head-to-head comparisons with the latest diffusion models using identical training data would strengthen confidence. The image editing experiment used only 8,000 examples, so robustness of this capability needs validation on larger datasets.
Confidence is high that Transfusion delivers substantial efficiency gains over discretization approaches at the tested scales and conditions. Confidence is moderate that advantages will persist at much larger scales or with significantly different architectural choices.
Abstract
1. Introduction
In this section, the authors introduce Transfusion as a solution to the challenge of building multi-modal generative models that handle both discrete data like text and continuous data like images. While language models excel at discrete modalities through next token prediction and diffusion models dominate continuous generation, existing approaches to combine them either use diffusion as an external tool, graft pretrained models together, or quantize continuous data into discrete tokens at the cost of information loss. Transfusion instead trains a single transformer to seamlessly generate both modalities by applying next token prediction for text and diffusion for images within the same model. Controlled experiments demonstrate that Transfusion scales more efficiently than quantization-based approaches like Chameleon, achieving comparable text-to-image quality at one-third the compute and similar image-to-text performance at 22% of the FLOPs. The largest 7B parameter model trained on 2T multi-modal tokens generates images rivaling dedicated diffusion models while maintaining text generation capabilities comparable to Llama 1, establishing Transfusion as a promising unified multi-modal architecture.
2. Background
In this section, the foundational techniques underlying Transfusion are established by examining how discrete and continuous data are modeled using different state-of-the-art approaches. Language modeling handles discrete tokens through autoregressive next-token prediction, decomposing sequence probability into conditional distributions optimized via cross-entropy loss, enabling token-by-token text generation. Diffusion models address continuous data by learning to reverse a gradual noise-addition process: a forward Markov chain progressively corrupts data with Gaussian noise over multiple timesteps, while a trained reverse process iteratively denoises by predicting the accumulated noise at each step, optimized through mean squared error. To make diffusion computationally tractable for images, variational autoencoders compress high-dimensional pixel data into compact latent representations, allowing efficient processing of image patches as low-dimensional vectors. These complementary approaches—autoregressive modeling for discrete modalities and diffusion for continuous ones—form the basis for integrating both paradigms into a unified multi-modal framework.
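In symbols, the two objectives summarized above take their standard forms; the notation below follows common usage (DDPM-style noise prediction) rather than reproducing the paper's exact equations.

```latex
% Autoregressive language modeling over discrete tokens y_1, ..., y_n
\mathcal{L}_{\mathrm{LM}} = -\sum_{i=1}^{n} \log P_\theta\!\left(y_i \mid y_{<i}\right)

% Forward diffusion: corrupt a clean latent x_0 at timestep t
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Reverse process trained by noise prediction, optimized with MSE
\mathcal{L}_{\mathrm{DDPM}} =
\mathbb{E}_{x_0,\, t,\, \epsilon}\left[\, \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \,\right]
```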
2.1 Language Modeling
2.2 Diffusion
2.3 Latent Image Representation
3. Transfusion
In this section, Transfusion addresses the challenge of training a single model to generate both discrete text and continuous images by combining language modeling and diffusion objectives within one transformer architecture. The method represents text as discrete tokens and images as continuous patch vectors from a VAE, separating modalities with special BOI and EOI markers. The model employs a hybrid attention pattern: causal attention for text tokens and bidirectional attention within each image while maintaining causality across the sequence. Training optimizes a combined loss that applies next-token prediction to text and diffusion loss to images, weighted by a balancing coefficient. Modality-specific encoding layers (embeddings for text, linear or U-Net layers for images) convert inputs to a shared vector space that the transformer processes. During inference, the decoder switches between autoregressive text sampling and iterative diffusion denoising when encountering BOI tokens, enabling seamless generation of mixed-modality sequences without information loss from quantization.
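The hybrid attention pattern can be illustrated with a small mask-building sketch (an illustration of the stated design, not the paper's implementation): every position attends causally to earlier positions, and patch positions inside the same image may additionally attend to each other.

```python
import torch

def transfusion_attention_mask(image_spans, seq_len):
    """Boolean [seq_len, seq_len] mask where True means 'may attend'.
    image_spans: list of (start, end) index pairs (end exclusive) marking the
    patch positions of each image in the mixed sequence. Text positions use
    standard causal attention; patches within one image also see each other."""
    allow = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal
    for start, end in image_spans:
        allow[start:end, start:end] = True    # bidirectional inside the image
    return allow

# Example: 4 text tokens, one image of 6 patches, then 3 more text tokens.
mask = transfusion_attention_mask(image_spans=[(4, 10)], seq_len=13)
assert mask[4, 9]       # an early patch attends to a later patch of the same image
assert not mask[2, 5]   # text before the image still cannot attend forward
```

Attention across images and from any position to future text stays causal, so the sequence as a whole remains generable left to right.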
4. Experiments
In this section, the authors demonstrate Transfusion's viability as a scalable multi-modal modeling approach through controlled experiments comparing it against Chameleon, a discrete tokenization baseline. Training models from 0.16B to 7B parameters on 0.5T tokens across text and image modalities, they find Transfusion consistently outperforms Chameleon across all benchmarks, achieving comparable image generation quality with 34× less compute and surprisingly better text performance despite both methods modeling text identically. Ablation studies reveal that bidirectional attention within images significantly improves generation quality, U-Net encoding layers outperform simple linear projections especially at larger patch sizes, and the approach scales effectively up to 7B parameters trained on 2T tokens. The largest model matches or exceeds established diffusion models like DALL-E 2 and SDXL on GenEval while maintaining Llama-level text capabilities, and preliminary fine-tuning experiments on image editing tasks suggest the framework generalizes to new modality combinations not seen during pretraining.
4.1 Setup
4.2 Controlled Comparison with Chameleon
4.3 Architecture Ablations
4.3.1 Attention Masking
4.3.2 Patch Size
4.3.3 Patch Encoding/Decoding Architecture
4.3.4 Image Noising
4.4 Comparison with Image Generation Literature
4.5 Image Editing
5. Related Work
In this section, the landscape of multi-modal model architectures is examined, revealing that most existing approaches combine separate modality-specific components, typically using pretrained encoders and decoders connected through projection layers, as seen in vision-language models like Flamingo and LLaVA for understanding, and GILL and DreamLLM for generation. State-of-the-art image generation models similarly rely on large pretrained text encoders to condition diffusion models, with recent work even fusing multiple off-the-shelf encoders to boost performance. End-to-end alternatives like Fuyu and Chameleon exist but face limitations: Fuyu handles only input-level tasks, while Chameleon's discrete tokenization of images underperforms compared to diffusion models for continuous generation. Meanwhile, applying diffusion to discrete text generation remains an active research area that has yet to match autoregressive language model performance. Against this backdrop, Transfusion emerges as a unified end-to-end architecture that successfully bridges discrete and continuous modalities without compromising quality.
Appendix
In this section, the technical implementation details and additional experimental results of the Transfusion model are presented. The appendix describes the training objectives for both the VAE and VQ-GAN autoencoders: the VAE combines an L1 reconstruction loss, perceptual losses based on LPIPS and MoCo features, a GAN loss, and KL regularization, while the VQ-GAN replaces the KL regularization with a codebook commitment loss that encourages alignment between encoder outputs and codebook vectors. The section showcases visual examples of the model's capabilities, including randomly generated images from a 7B parameter Transfusion model trained on 2 trillion multi-modal tokens, demonstrating the quality of image generation across diverse prompts. Additionally, image editing examples are provided from a fine-tuned version of the same model, illustrating that despite training on only 8,000 editing examples, the model successfully generalizes to perform instructed image modifications on the EmuEdit test set.
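A compact way to write the autoencoder objectives described above is sketched below; the loss weights are placeholders rather than the paper's tuned values, and the commitment term follows the standard VQ formulation with stop-gradient sg[.].

```latex
% VAE objective: L1 reconstruction + perceptual terms + adversarial + KL regularization
\mathcal{L}_{\mathrm{VAE}} =
  \lVert x - \hat{x} \rVert_1
  + \lambda_{\mathrm{LPIPS}}\, \mathcal{L}_{\mathrm{LPIPS}}(x, \hat{x})
  + \lambda_{\mathrm{MoCo}}\, \mathcal{L}_{\mathrm{MoCo}}(x, \hat{x})
  + \lambda_{\mathrm{GAN}}\, \mathcal{L}_{\mathrm{GAN}}(\hat{x})
  + \lambda_{\mathrm{KL}}\, D_{\mathrm{KL}}\!\left( q(z \mid x) \,\Vert\, \mathcal{N}(0, I) \right)

% VQ-GAN variant: the KL term is replaced by a codebook commitment loss,
% where z_e(x) is the encoder output and e the nearest codebook vector
\mathcal{L}_{\mathrm{commit}} =
  \lVert \mathrm{sg}[z_e(x)] - e \rVert_2^2
  + \beta\, \lVert z_e(x) - \mathrm{sg}[e] \rVert_2^2
```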
6. Conclusion
In this section, the authors conclude their exploration of bridging discrete sequence modeling through next token prediction with continuous media generation via diffusion methods. They propose Transfusion, a unified model architecture that trains on dual objectives simultaneously, matching each modality—text or images—to its most effective learning paradigm rather than forcing both through a single approach. The key insight is that this modality-specific objective pairing, though simple and previously unexplored, eliminates the need to compromise between discrete and continuous data generation methods. Experimental results demonstrate that Transfusion scales efficiently across model sizes, incurring minimal to no performance penalty from parameter sharing between modalities, while successfully enabling generation of both text and images within a single end-to-end framework. This approach offers a practical solution to multi-modal learning that respects the distinct characteristics of different data types while maintaining computational efficiency.