UNIFIED-IO: A UNIFIED MODEL FOR VISION, LANGUAGE, AND MULTI-MODAL TASKS

Jiasen Lu†*, Christopher Clark†*, Rowan Zellers†⋄, Roozbeh Mottaghi†⋄, Aniruddha Kembhavi†⋄
†Allen Institute for AI, ⋄University of Washington, Seattle
*Equal contribution. Correspondence to [email protected]

Abstract

We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, vision-and-language tasks such as region captioning and referring expression, and natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for such a large variety of tasks poses unique challenges due to the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture, jointly on over 90 diverse datasets in the vision and language fields. Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark and produces strong results across 16 diverse benchmarks like NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWizGround, BoolQ, and SciTail, with no task-specific fine-tuning. Code and demos for Unified-IO are available at: unified-io.allenai.org

1. Introduction

We present Unified-IO, the first neural model to jointly perform a large and diverse set of AI tasks spanning classical computer vision (such as object detection, segmentation, and depth estimation), image synthesis (such as image generation and image in-painting), vision-and-language (such as visual question answering, image captioning, and referring expression), and NLP (such as question answering and paraphrasing). Unified general-purpose models avoid the need for task-specific design, learn and perform a wide range of tasks with a single architecture, can utilize large, diverse data corpora, can effectively transfer concept knowledge across tasks, and can even perform tasks unknown and unobserved at design and training time.
Figure 1: Unified-IO is a single sequence-to-sequence model that performs a variety of tasks in computer vision and NLP using a unified architecture without a need for either task- or modality-specific branches. This broad unification is achieved by homogenizing every task's input and output into a sequence of discrete vocabulary tokens. Unified-IO supports modalities as diverse as images, masks, keypoints, boxes, and text, and tasks as varied as depth estimation, inpainting, semantic segmentation, captioning, and reading comprehension.

Building unified models for computer vision has proven to be quite challenging since vision tasks have incredibly diverse input and output representations. For instance, object detection produces bounding boxes around objects in an image, segmentation produces binary masks outlining regions in an image, visual question answering produces an answer as text, and depth estimation produces a map detailing the distance of each pixel from the camera. This heterogeneity makes it very challenging to architect a single model for all these tasks. In contrast, while the landscape of natural language processing (NLP) tasks, datasets, and benchmarks is large and diverse, their inputs and desired outputs can often be uniformly represented as sequences of tokens. Sequence to sequence (Seq2Seq) architectures ([1, 2]), specifically designed to accept and produce such sequences of tokens, are thus widely applicable to many tasks. Unified models employing such architectures have been central to much recent progress in NLP.
Unified models for computer vision typically use a shared visual backbone to produce visual embeddings but then employ individual branches for each of the desired tasks. These include models like Mask R-CNN ([3]) for classical visual tasks that use an ImageNet pre-trained encoder followed by branches for detection and segmentation, trained in a fully supervised manner. In the vision and language (V&L) domain, CNN backbones feed visual features to transformer architectures that also combine language, followed by task-specific heads for visual question answering, referring expression, visual commonsense reasoning, etc. ([4, 5, 6]). A more recent trend has seen the emergence of unified architectures that do away with task-specific heads and instead introduce modality-specific heads ([7, 8, 9, 10]) – for instance, a single language decoder that serves multiple tasks requiring language output like captioning and classification. However, most progress in unified models continues to be centered around V&L tasks, owing to the simplicity of building shared language decoders, and is often limited to supporting just a handful of tasks.
Unified-IO is a Seq2Seq model capable of performing a variety of tasks using a unified architecture without a need for either task- or even modality-specific branches. This broad unification is achieved by homogenizing every task's output into a sequence of discrete tokens. Dense structured outputs such as images, segmentation masks, and depth maps are converted to sequences using a vector quantization variational auto-encoder (VQ-VAE) ([11]); sparse structured outputs such as bounding boxes and human joint locations are transcribed into sequences of coordinate tokens; and language outputs are converted to sequences using byte-pair encoding. This unification enables Unified-IO to jointly train on over 90 datasets spanning computer vision, V&L, and NLP tasks with a single streamlined transformer encoder-decoder architecture ([1]).
Our jointly trained Unified-IO is the first model to support all 7 tasks in the General Robust Image Task (GRIT) Benchmark ([12]) and obtains the top overall score of 64.3 when averaging across all tasks, handily beating the second-best model by 32.0. We further evaluate Unified-IO on 16 diverse benchmarks across computer vision and NLP, without any fine-tuning towards any individual benchmark, and find that it performs remarkably well compared to specialized (or fine-tuned) state-of-the-art models.

2. Vision, Language and Multi-Modal Tasks

Table 1: Tasks Unified-IO learns to complete. From left to right, columns show an example of one of the sources used for the task, the number of datasets, the total number and percent of examples relative to the entire training corpus, and the sample rate during multi-task training. Subsequent columns show which modalities are required for the tasks, and highlighted rows show aggregated statistics for groups of similar tasks.

| Task | Example Source | Datasets | Size | Percent | Rate | Input (Text / Image / Sparse / Dense) | Output (Text / Image / Sparse / Dense) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Image Synthesis (group) | – | 14 | 56m | 43.0 | 18.7 | ✓ / ✓ / ✓ / ✓ | - / ✓ / - / - |
| Image Synthesis from Text | RedCaps | 9 | 55m | 41.9 | 16.7 | ✓ / - / - / - | - / ✓ / - / - |
| Image Inpainting | VG | 3 | 1.2m | 0.9 | 1.5 | ✓ / ✓ / ✓ / - | - / ✓ / - / - |

Unified-IO is designed to handle a wide range of language, vision and language, and classic vision tasks in a unified way. To fully test this capability, we gather 95 vision, language, and multi-modal datasets from 62 publicly available data sources as targets for our model to learn during multi-task training. These datasets cover a wide range of tasks, skills, and modalities.
We categorize the input and output modalities of each task into 4 different types: Text – natural language tokens; Image – RGB images; Sparse – a small number of location coordinates within the image; Dense – per-pixel labels such as depth maps, surface normal maps, etc. We group related datasets into 8 groups and 22 tasks to facilitate our training and analysis:
Image Synthesis. Given a text description, partially occluded image and inpainting target, or segmentation map containing a semantic class for some pixels, generate a matching image. Data sources with image and text pairs ([13]), bounding boxes ([14]) or semantic segmentation ([15]) can be used to build these tasks.
Sparse Labelling. Given an image and a natural language query, identify the target regions or keypoint locations that are being referred to. Tasks include object detection ([16]), object localization ([17]), human pose estimation ([18]) and referring expression ([19]).
Dense Labelling. Given an image, produce per-pixel labels for that image. Labels include the distance of that pixel to the camera ([20]), surface orientation ([21]) or semantic class ([18]).
Image Classification. Given an image and optionally a target bounding box, generate a class name or tag of that image or target region. This group includes image classification ([22]) and object categorization ([23]) datasets.
Image Captioning. Given an image and optionally a bounding box, generate a natural language description of that image or target region. We include both crowd-sourced ([24]) and webly supervised ([25]) captions.
Vision & Language. A broad category for other tasks that require jointly reasoning over image content and a natural language query. There are many popular vision and language datasets, and we categorize these datasets into 3 tasks – visual question answering ([26]), relationship detection ([27]), and grounded VQA ([28]).
NLP. Tasks with text as the only input and output modalities, including text classification ([29]), question answering ([30]) and text summarization ([31]).
Language Modeling. The masked language modeling pre-training task (see Section 3.3) using text from C4 ([1]) and Wikipedia ([32]), which we include to ensure the knowledge gained from language pre-training is not lost during multi-task training. Other pre-training tasks are not included because the relevant datasets are already used in other supervised tasks (e.g., for captioning or classification).
Table 1 shows the details of tasks and groups. We list an example dataset source, the number of datasets, the number of examples, the percent of the total number of examples, and the sampling rate during training (Section 3.3) for each group and task. Subsequent columns show which modalities are required for the inputs and outputs. We defer additional task details, inference details, the complete list of datasets, and visualizations to Appendix A.1.

3. Unified-IO

Our goal is to build a single unified model that can support a diverse set of tasks across computer vision and language with little to no need for task-specific customizations and parameters. Such unified architectures can be applied to new tasks with little to no knowledge of the underlying machinery, enable general pre-training to benefit many diverse downstream applications, can be jointly trained on a large number of tasks, and allow knowledge to be better shared between tasks.

3.1 Unified Task Representations

Supporting a variety of modalities such as images, language, boxes, binary masks, segmentation masks, etc., without task-specific heads requires representing these modalities in a shared and unified space. To do this, we discretize the text, images, and other structured outputs in our tasks and represent them with tokens drawn from a unified and finite vocabulary.
Figure 2: Unified-IO. A schematic of the model with four demonstrative tasks: object segmentation, visual question answering, depth estimation and object localization.

Text representation. Following [1], text inputs and outputs are tokenized using SentencePiece ([33]). Following past works such as [34, 1, 9, 10], we also specify each task with a natural language prompt (excluding some tasks like VQA, which are fully specified by their text inputs) in order to indicate what task should be performed. For example, "What is the depth map of the image?" for depth estimation or "What region does "cat" describe?" for object localization.
Images and dense structures representation. A variety of tasks in computer vision require the model to produce high-dimensional outputs such as images (e.g., image in-painting) or per-pixel labels (e.g., depth estimation). To handle these modalities, we first convert per-pixel labels into RGB images. For depth, we construct a grayscale image by normalizing the depth map. For surface normal estimation, we convert the x/y/z orientations into r/g/b values. For segmentation, we map each instance present in the image to a unique color. We randomly select colors for each instance and specify the color-to-class mapping in the text instead of using a universal color-to-class mapping. This avoids requiring a fixed list of classes and avoids having colors that may only be marginally different due to the presence of a large number of classes.
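As a concrete illustration, below is a minimal NumPy sketch of how such per-pixel labels might be rendered as RGB images before tokenization; the maximum-depth constant, the normal-to-color scaling, and the function names are assumptions for illustration rather than the paper's actual implementation.

```python
import numpy as np

def depth_to_gray(depth, max_depth=10.0):
    """Normalize a metric depth map into a 3-channel grayscale image.
    The max_depth constant is an assumed placeholder."""
    gray = np.clip(depth / max_depth, 0.0, 1.0) * 255.0
    return np.stack([gray, gray, gray], axis=-1).astype(np.uint8)

def normals_to_rgb(normals):
    """Map unit surface normals with x/y/z in [-1, 1] to r/g/b in [0, 255]."""
    return ((normals + 1.0) / 2.0 * 255.0).astype(np.uint8)

def segmentation_to_rgb(instance_masks, rng):
    """Fill each binary instance mask with a randomly chosen color; the
    color-to-class mapping would be listed in the text prompt."""
    h, w = instance_masks[0].shape
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    colors = []
    for mask in instance_masks:
        color = rng.integers(0, 256, size=3, dtype=np.uint8)
        canvas[mask.astype(bool)] = color
        colors.append(color)
    return canvas, colors

# Tiny usage example with a 4x4 toy mask.
rng = np.random.default_rng(0)
toy_mask = np.zeros((4, 4), dtype=bool)
toy_mask[1:3, 1:3] = True
canvas, colors = segmentation_to_rgb([toy_mask], rng)
```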
Then we encode these images as discrete tokens using a VQ-GAN. In particular, we use the ImageNet-pretrained VQ-GAN from [11] with 256×256 resolution, a compression ratio of 16, and a codebook size of 16384. The VQ-GAN codebook is added to the vocabulary as additional tokens that can be generated by the decoder. During training, the tokens for the target image are used as targets. During inference, the VQ-GAN decoder is used to convert the generated image tokens into an output image.
Sparse structures representation. We encode sparse structures such as bounding boxes or human joints by adding 1000 special tokens to the vocabulary to represent discretized image coordinates ([35]). Points are then encoded with a sequence of two such tokens, one for the x and one for the y coordinate, and boxes are encoded using a sequence of four tokens, two for the upper right corner and two for the lower left corner. Labeled boxes are encoded as a box followed by a text class label, and joints are encoded as a sequence of points followed by a text visibility label. This allows us to handle a wide variety of tasks that use these elements in their inputs or outputs (see Appendix A.1 for examples).
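The location-token encoding can be sketched as follows; the `<loc_i>` token names, the bin assignment, and the corner ordering are illustrative assumptions consistent with the description above, not the exact tokenizer used by the authors.

```python
NUM_BINS = 1000  # number of special location tokens added to the vocabulary

def coord_to_token(v, size):
    """Quantize a pixel coordinate in [0, size) into one of NUM_BINS location tokens."""
    bin_idx = min(int(v / size * NUM_BINS), NUM_BINS - 1)
    return f"<loc_{bin_idx}>"  # hypothetical token name

def encode_point(x, y, width, height):
    return [coord_to_token(x, width), coord_to_token(y, height)]

def encode_labeled_box(box, label, width, height):
    """Encode a box as four location tokens (one corner, then the other)
    followed by its text class label."""
    x1, y1, x2, y2 = box
    return (encode_point(x1, y1, width, height)
            + encode_point(x2, y2, width, height)
            + [label])

# e.g. encode_labeled_box((48, 32, 320, 280), "cat", width=384, height=384)
```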

3.2 Unified Architecture

Universally representing a wide variety of tasks as input and output sequences of discrete tokens enables us to employ architectures that have been proven successful in natural language processing. In Unified-IO, we propose a pure transformer model largely following the design of T5 ([1]). In particular, Unified-IO is an encoder-decoder architecture where both the encoder and decoder are composed of stacked transformer layers, which in turn are composed of self-attention layers, cross-attention layers (in the decoder), and feed-forward neural networks. The layers are applied residually, and layer norms are applied before each attention and feed-forward network. See [1] for details.
We make a few architectural changes to adapt the T5 architecture to our setting. First, to handle input images, we reshape the image into a sequence of patches that are embedded with linear projection similar to [36]. Second, we expand the vocabulary to include the location tokens and the image tokens used in the VQ-GAN. Third, we extend the 1-d relative embedding ([36]) to 2-d with a fixed number of learned embeddings. We also add absolute position embedding to the token embedding following [37], since the absolute position information is essential to image tasks.
We use a maximum of 256 and 128 text tokens for inputs and outputs respectively, a maximum length of 576 (i.e., a 24×24 patch encoding from a 384×384 image) for image inputs, and 256 (i.e., 16×16 latent codes from a 256×256 image) for image outputs. In this work, we present four versions of Unified-IO ranging from 71 million to 2.9 billion parameters, as detailed in Table 2.
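For reference, a small sketch of the patch-based image input under the stated sizes (384×384 inputs, 24×24 = 576 patches): the 16-pixel patch size is implied by the numbers above, while the projection weights are placeholders and the model dimension is the Base value from Table 2.

```python
import numpy as np

PATCH = 16       # 384 / 16 = 24 patches per side (implied by the text)
D_MODEL = 768    # model dimension of the Base variant (Table 2)

def patchify(image):
    """Reshape an (H, W, 3) image into a sequence of flattened patches."""
    h, w, c = image.shape
    patches = image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, PATCH * PATCH * c)
    return patches  # (576, 768) for a 384x384x3 input

def embed_patches(patches, projection):
    """Linear projection of flattened patches into model-dimension embeddings."""
    return patches @ projection  # projection: (PATCH*PATCH*3, D_MODEL)

image = np.zeros((384, 384, 3), dtype=np.float32)
proj = np.zeros((PATCH * PATCH * 3, D_MODEL), dtype=np.float32)  # placeholder weights
tokens = embed_patches(patchify(image), proj)
print(tokens.shape)  # (576, 768)
```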

3.3 Training

Table 2: Size variants of Unified-IO. Both encoder and decoder are based on the T5 implementation ([1]). Parameters of the VQ-GAN ([11]) are not included in the total parameter count.

| Model | Encoder Layers | Decoder Layers | Model Dims | MLP Dims | Heads | Total Params |
| --- | --- | --- | --- | --- | --- | --- |
| Unified-IO Small | 8 | 8 | 512 | 1024 | 6 | 71M |
| Unified-IO Base | 12 | 12 | 768 | 2048 | 12 | 241M |
| Unified-IO Large | 24 | 24 | 1024 | 2816 | 16 | 776M |

Unified-IO is trained in two stages: a pre-training stage that uses unsupervised losses from text, image, and paired image-text data, and a massive multi-task stage where the model is jointly trained on a large variety of tasks. Since our goal is to examine whether a single unified model can solve a variety of tasks simultaneously, we do not perform task-specific fine-tuning, although prior work ([38, 10]) shows it can further improve task performance.
Pre-training. To learn good representations from large-scale webly supervised image and text data, we consider two pre-training tasks: text span denoising and masked image denoising. The text span denoising task follows [1] – we randomly corrupt 15% of the tokens and replace consecutive corrupted tokens with a unique mask token. The masked image denoising task follows [39] and [40] – we randomly mask 75% of the image patches, and the goal is to recover the whole image. When another modality is present, i.e., image or text, the model can use information from that modality to complete the task.
We construct the pre-training dataset by incorporating publicly available language data (i.e., plain text from Common Crawl), vision data (i.e., raw images from different datasets), and V&L data (i.e., image-caption and image-label pairs). For V&L data, we add a simple prompt, "An image of", at the beginning of the caption or category labels to indicate it is multi-modal data ([41]).
We pre-train Unified-IO on this combination of datasets with an in-batch mixing strategy. We sample data equally between the text and image denoising objectives. For text denoising, half of the samples are from pure text data, i.e., C4 and Wikipedia. The other half is constructed from image and class data, such as ImageNet21k ([42]), or image and caption data, such as YFCC15M ([43]). For image denoising, we also use the same caption and class data along with some image-only data from datasets for our vision tasks. We sample from datasets in proportion to dataset size. See Appendix A.2 for details.
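A toy sketch of the two denoising objectives described above (15% text span corruption and 75% image-patch masking); the sentinel token naming follows the T5 convention and is an assumption, and the real pipeline operates on token ids rather than strings.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_text_spans(tokens, corrupt_rate=0.15):
    """Text span denoising sketch: drop ~15% of tokens and replace each run of
    dropped tokens with a single sentinel. Sentinel naming is assumed."""
    drop = rng.random(len(tokens)) < corrupt_rate
    corrupted, sentinel = [], 0
    for tok, dropped in zip(tokens, drop):
        if dropped:
            if not corrupted or not corrupted[-1].startswith("<extra_id_"):
                corrupted.append(f"<extra_id_{sentinel}>")
                sentinel += 1
        else:
            corrupted.append(tok)
    return corrupted

def mask_image_patches(num_patches=576, mask_rate=0.75):
    """Masked image denoising sketch: choose 75% of patch indices to hide;
    the model is trained to recover the whole image from the rest."""
    num_masked = int(num_patches * mask_rate)
    return rng.choice(num_patches, size=num_masked, replace=False)
```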
Multi-tasking. To build a single unified model for diverse vision, language, and V&L tasks, we construct a massive multi-tasking dataset by ensembling 95 datasets from 62 publicly available data sources. See Section 2 for task details and Appendix A.1 for dataset visualizations.
We jointly train Unified-IO on this large set of datasets by mixing examples from these datasets within each batch. We sample each group equally (1/8), except for image synthesis (3/16) and dense labelling (1/16), since dense labelling has significantly less data and image synthesis has significantly more data than the other groups. Within each group, we sample datasets proportional to the square root of their size to better expose the model to underrepresented tasks. Due to the large variance in dataset size, some tasks are still rarely sampled (e.g., depth estimation only has a 0.43% chance of being sampled). See Appendix A.3 for details and visualizations.
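The sampling scheme can be written down directly from the rates above; the dataset names and sizes in the usage example are hypothetical.

```python
import numpy as np

# Group-level sampling rates from the text: 1/8 for most groups, 3/16 for
# image synthesis, and 1/16 for dense labelling (the rates sum to 1).
GROUP_RATES = {
    "image_synthesis": 3 / 16, "sparse_labelling": 1 / 8, "dense_labelling": 1 / 16,
    "classification": 1 / 8, "captioning": 1 / 8, "vision_language": 1 / 8,
    "nlp": 1 / 8, "language_modeling": 1 / 8,
}

def dataset_rates(group_rate, dataset_sizes):
    """Within a group, sample datasets proportional to the square root of
    their size so under-represented datasets are seen more often."""
    weights = np.sqrt(np.array(list(dataset_sizes.values()), dtype=np.float64))
    weights /= weights.sum()
    return {name: group_rate * w for name, w in zip(dataset_sizes, weights)}

# Hypothetical dense-labelling group with dataset sizes in examples:
print(dataset_rates(1 / 16, {"nyu_depth_v2": 47_000, "framenet": 1_000_000}))
```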

3.4 Implementation Details

The total vocabulary size is 49536, with 32152 language tokens, 1000 location tokens, and 16384 vision tokens. During training, we randomly sub-sample 128 image patches during the pre-training stage and 256 image patches (out of 576) during the multi-task stage. We do not use dropout. The Adafactor ([44]) optimizer is used to save memory. We use a learning rate of $10^{-2}$ for the first 10,000 steps and then decay it at a rate of $1/\sqrt{k}$. We train with $\beta_1 = 0.9$ and $\beta_2 = 1.0 - k^{-0.8}$, where $k$ is the step number. We use global-norm gradient clipping at 1.0 and find this is crucial to stabilize XL training. We train the Small, Base, and Large models with a batch size of 2048 and XL with a batch size of 1024 due to memory considerations. 4-way in-layer parallelism and 128-way data parallelism are used to scale the 3B model training. For all models, we train for 1000k steps – 500k each for pre-training and multi-task training.
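A sketch of the optimizer schedule described above; the exact anchoring of the inverse-square-root decay to the 10k-step warmup (so the schedule is continuous at the boundary) is an assumption.

```python
import math

def learning_rate(step, warmup_steps=10_000, base_lr=1e-2):
    """Constant 1e-2 for the first 10k steps, then inverse-square-root decay.
    Scaling the decay so it is continuous at the warmup boundary is assumed."""
    if step <= warmup_steps:
        return base_lr
    return base_lr * math.sqrt(warmup_steps) / math.sqrt(step)

def adafactor_beta2(step):
    """Step-dependent second-moment decay, beta2 = 1 - k^(-0.8), for step k >= 1."""
    return 1.0 - step ** -0.8

print(learning_rate(10_000), learning_rate(500_000), adafactor_beta2(500_000))
```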

4. Experiments

We now present results for Unified-IO on the GRIT benchmark (Section 4.1), evaluate generalization to same and new concepts (Section 4.2), ablate task groups using the GRIT ablation benchmark (Section 4.3), and evaluate Unified-IO on 16 other benchmarks in computer vision and NLP (Section 4.4). Section 4.5 studies prompt generalization on referring expressions. Qualitative examples are in Appendix A.4.

4.1 Results on GRIT

The General Robust Image Task (GRIT) Benchmark ([12]) is an evaluation-only benchmark designed to measure the performance of models across multiple tasks, concepts, and data sources. GRIT aims to encourage the building of unified and general-purpose vision models and is thus well suited to evaluate Unified-IO. GRIT has seven tasks that cover a range of visual skills with varying input and output modalities and formats: categorization, localization, VQA, referring expression, segmentation, keypoint, and surface normal estimation.
Unified-IO is the first model to support all seven tasks in GRIT. As seen in Table 3, Unified-IO XL outperforms all prior submissions to GRIT, obtaining an average accuracy of 64.3 on test. The next best submission is GPV-2 ([45]), which obtains 32.0 and can only support 4 out of 7 tasks. Unified-IO XL also outperforms the multi-task checkpoint of OFA Large ([10]) on VQA, referring expression, and categorization.

Table 3: Comparison of our Unified-IO models to recent SOTA on the GRIT benchmark. Unified-IO is the first model to support all seven tasks in GRIT. Results of CLIP and OFA are obtained from the GRIT challenge.

| # | Model | Categorization (abl / test) | Localization (abl / test) | VQA (abl / test) | Refexp (abl / test) | Segmentation (abl / test) | Keypoint (abl / test) | Normal (abl / test) | All (abl / test) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | NLL-AngMF ([21]) | - | - | - | - | - | - | 49.6 / 50.5 | 7.2 / 7.1 |
| 1 | Mask R-CNN ([3]) | - | 44.7 / 45.1 | - | - | 26.2 / 26.2 | 70.8 / 70.6 | - | 20.2 / 20.3 |
| 2 | GPV-1 ([9]) | 33.2 / 33.2 | 42.8 / 42.7 | 50.6 / 49.8 | 25.8 / 26.8 | - | - | - | 21.8 / 21.8 |

Mask R-CNN ([3]) is a strong baseline for core vision tasks. Unified-IO XL outperforms Mask R-CNN on localization and segmentation, in part because Unified-IO XL shows little degradation in performance between same and new concepts, as discussed in Section 4.2. On keypoint, our model is worse than Mask R-CNN (68.1 vs 70.8), likely because we use a 2-stage inference procedure for keypoints – first locating the person using the object localization prompt, then finding keypoints for each detected region.
NLL-AngMF ([21]) is a SOTA model for surface normal estimation. Our model obtains competitive results compared to NLL-AngMF (44.3 vs 49.6). Since our image tokenizer is only pre-trained on ImageNet without any surface normal data, the upper bound of our method through reconstruction is 59.8 on FrameNet ([95]). This suggests our score could be considerably improved by training a stronger image tokenizer.

4.2 Evaluation on same concept and new concept

Table 4: Generalization to new concepts on the GRIT ablation set.

| # | Model | Restricted | Params (M) | Categorization (same / new) | Localization (same / new) | VQA (same / new) | Refexp (same / new) | Segmentation (same / new) | Keypoint (same / new) | Normal (same / new) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | NLL-AngMF | ✓ | 72 | - | - | - | - | - | - | 50.7 / - |
| 1 | Mask R-CNN | ✓ | 58 | - | 51.9 / 40.8 | - | - | 44.9 / 0.3 | 70.9 / - | - |
| 2 | GPV-1 | ✓ | 236 | 58.7 / 0.8 | 48.3 / 37.8 | 58.4 / 74.0 | 29.7 / 23.1 | - | - | - |

GRIT provides a breakdown of metrics into two groups: same for samples that only contain concepts seen in the primary training data (a set of common datasets like COCO, ImageNet, and Visual Genome), and new for samples containing at least one concept unseen in the primary training data. Table 4 shows results for Unified-IO and other leaderboard entries on the ablation set, divided into same and new concepts.
Unified-IO XL shows little degradation in performance between same and new concepts compared to competing entries. On some tasks, Unified-IO even performs better on the new split than on the same split. This indicates that the volume of training data used to train Unified-IO has broad coverage of concepts and provides almost as effective a level of supervision as large standard vision datasets like COCO. Furthermore, since Unified-IO is a uniquely unified architecture with no task-specific parameters, it is very likely able to effectively transfer knowledge across different tasks.
In comparison to Mask R-CNN (row 1), GRIT metrics show Unified-IO (row 14) is better by a large margin on new concepts, i.e., non-COCO examples (74.4 vs 40.8 for localization and 64.2 vs 0.3 for segmentation), and is also superior on the COCO-like examples (65.6 vs 51.9 for localization and 53.0 vs 44.9 for segmentation). Unified-IO is also able to beat GPV-2 (row 5) on new concepts by large margins across all 4 tasks supported by GPV-2, even though GPV-2 is exposed to these concepts via webly supervised data and is designed to transfer concept knowledge across skills.

4.3 Ablations on Task Group

Table 5: Ablation study on holding out task groups and evaluating on GRIT and MNLI ([29]).

| Model | Step | Categorization | Localization | VQA | Refexp | Segmentation | Keypoint | Normal | MNLI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Unified-IO Large | 250k | 50.3 | 63.4 | 65.7 | 73.4 | 51.8 | 69.2 | 40.7 | 85.1 |
| w/o Image Synthesis | 200k | 52.7 | 62.9 | 64.2 | 72.0 | 53.6 | 18.3 | 42.2 | 84.3 |
| w/o Sparse | 220k | 52.6 | - | 64.1 | - | 51.3 | - | 38.5 | 83.8 |
| w/o Dense | 235k | 49.5 | 62.4 | 65.6 | 72.9 | - | 66.7 | - | 84.8 |

To better understand how multi-tasking affects learning, we perform ablations by leaving out individual task groups from multi-task training. Due to computational constraints, we ablate Unified-IO Large and train for 250k steps. When ablating a task group, we reduce the number of training steps so that all models are trained on approximately the same number of examples for each of the remaining task groups. Results on GRIT and MNLI ([29]) are shown in Table 5.
In spite of supporting a large number of heterogeneous tasks, Unified-IO is able to perform well across all tasks. Reducing this heterogeneity by removing task groups does not impact the performance of individual tasks significantly. This is notable since removing a task group significantly reduces the scope of what a model needs to learn while keeping the model capacity fixed. This empirically demonstrates the effectiveness of the proposed unified architecture for massive heterogeneous task support.
An exception is that removing the NLP group significantly boosts categorization, which might indicate that the sentence classification task interferes with image classification. Removing captioning also boosts performance on VQA and a few other tasks, which might be because captioning requires a relatively large amount of model capacity to learn free-form text generation, in contrast to VQA, which requires short answer phrases from a limited vocabulary. Removing image synthesis causes a major regression in keypoint. Manual inspection shows that the model predicts standing-human-shaped keypoints even for people in very different postures, suggesting the model learned to rely on priors instead of the image content. We also see minor regressions in localization and referring expression, suggesting that image synthesis tasks, possibly image in-painting in particular, had a surprising positive transfer to understanding sparse structured outputs. It is possible that an ablation analysis on the XL model may yield different outcomes, but we are unable to perform an XL-based analysis due to limited compute.

4.4 Results on Additional Tasks

Table 6: Comparing the jointly trained Unified-IO to specialized and benchmark-fine-tuned state-of-the-art models across vision, V&L, and language tasks. Benchmarks used for evaluation are: NYUv2 ([20]), ImageNet ([22]), Places365 ([46]), VQA 2.0 ([47]), A-OKVQA ([48]), VizWizVQA ([49]), VizWizG ([28]), Swig ([50]), SNLI-VE ([51]), VisComet ([52]), nocaps ([53]), COCO Captions ([24]), MRPC ([54]), BoolQ ([55]), and SciTail ([56]).

For each benchmark, the table reports the evaluation split (val, test, test-dev, or test-std), the metric (RMSE, accuracy, IoU, CIDEr, or F1), and, where available, the best prior result from a unified model, including UViM (0.467 RMSE on depth), Flamingo (57.8 on OK-VQA and 49.8 on VizWiz-QA), T5 (92.20 F1), and PaLM (92.2 accuracy).

We report results on 16 additional tasks used in our training setup. For these tasks, we do not expect to obtain state-of-the-art results, since specialized models are usually designed and hyper-parameter tuned for a single task, while we are evaluating a single jointly trained model. We also avoid extensive task-specific tricks like color jittering, horizontal flipping, CIDEr optimization, and label smoothing, which are often responsible for considerable gains in individual task performance. We leave such task-specific tuning for future work. See Table 6 for the results. When possible, we additionally report the best prior result on these tasks from a unified model, meaning a model trained in a multi-task setting with a unified architecture (no task-specific heads or customizations) on at least three other tasks.
Unified-IO provides strong performance on all these tasks despite being massively multi-tasked. We review more fine-grained results below.
Depth Estimation. On depth estimation, Unified-IO achieves 0.385 RMSE, which is behind SOTA ([57]) but ahead of the recently proposed unified model UViM ([58]), despite being trained to do far more tasks.
Image Classification. Unified-IO achieves 79.1 on ImageNet and 53.2 on Places365, showing the model was able to retain knowledge of many fine-grained classes despite being massively multi-tasked. Notably, we achieve this without the extensive data augmentation methods typically used by SOTA models ([59, 40]).
Visual Question Answering. Unified-IO is competitive with fine-tuned models on VQA ([60, 45, 61]) and achieves SOTA results on A-OKVQA. Relative to Flamingo, Unified-IO performs better on VizWiz-QA but worse on OK-VQA.
Image Captioning. Despite the lack of CIDEr optimization, Unified-IO is a strong captioning model and generalizes well to nocaps. Since Unified-IO is trained on many captioning datasets, it is likely that the use of style tags following [62] would offer additional improvement by signaling Unified-IO to specifically generate COCO-style captions during inference.
NLP tasks. Unified-IO achieves respectable results on three NLP tasks but lags behind SOTA models ([63, 64, 65]). This can partly be attributed to scale: modern NLP models contain over 100 billion parameters and undergo more extensive NLP pre-training.

4.5 Prompt Generalization Case Study

Table 7: Case study on GRIT referring expressions using different prompts. The first prompt is the one used during training; the others are paraphrases. `REFEXP` is replaced by the referring expression text of individual examples during evaluation.

| # | Prompt | Refexp Score |
| --- | --- | --- |
| 0 | Which region does the text “ `REFEXP` " describe ? | 78.9 |
| 1 | Which region does the text “`REFEXP`" describe? | 76.7 |
| 2 | Which region matches the text “ `REFEXP` " ? | 77.4 |
| 3 | Locate the “ `REFEXP` " . | 65.6 |

To better understand how different prompts affect Unified-IO, we conduct a case study on referring expressions. In particular, we re-evaluate Unified-IO on the GRIT referring expression ablation set while replacing the prompt used during training (first row in the table) with a paraphrase (following rows). Results are shown in Table 7.
Overall, we find that the model has some capacity to generalize to paraphrases of the prompt (e.g., row 3 works reasonably well despite using completely different words), but there are paraphrases that result in a very significant performance decrease (e.g., rows 5, 6, and 8). We also find that removing the spaces around the punctuation sometimes results in minor regressions (row 0 vs row 1) and sometimes in sharply reduced performance (row 6 vs row 7), showing Unified-IO can be sensitive to formatting details. We hypothesize that this is caused by the SentencePiece tokenizer changing the tokenization of the referring expression if the quotes are not separated from it by spaces. Building multi-task models that can generalize to different prompts, and ideally to prompts for completely new tasks, is an exciting avenue for future work.

4.6 Limitations

For object detection, while Unified-IO generally produces accurate outputs (see Appendix A.4), we find that recall is often poor in cluttered images. Prior work ([35]) has shown this can be overcome with extensive data augmentation techniques, but these methods are not currently integrated into Unified-IO. Our use of a pre-trained VQ-GAN greatly simplifies our training and is surprisingly effective for dense prediction tasks. However, it does mean Unified-IO has limited image generation capabilities (recent works ([66]) have shown this method can be greatly improved, but they were not available at the time of development). We also found in a small-scale study that our model does not always understand prompts not in the training data (see Section 4.5).

5. Related Work

Vision and language pre-training has become standard practice for multi-modal models, including unified models and non-unified models requiring task-specific heads trained from scratch during fine-tuning. Many initial pre-training strategies were inspired by BERT ([37]) and included masked-language-modeling, image-text-matching, or masked-region-modeling objectives, often supplemented with objectives using the predictions of a strong object detector model (e.g., VILBERT ([4]), LXMERT ([6]), VisualBERT ([5])). More recently, contrastive image-text losses ([43, 67, 68]) or auto-regressive generation losses ([41, 69, 59]) have become common. Several works have also directly used object detection or segmentation datasets for pre-training ([70, 10, 71]). The generalized masked-data-modeling pre-training objective used in Unified-IO is similar to ones used in several recent works ([72, 73, 74]).
Constructing models that can learn to solve many different tasks has been of long-standing interest to researchers. A traditional approach to this problem is to build models with task-specialized heads on top of shared backbones ([3, 75, 38]). However, this requires manually designing a specialized head for each task, potentially limiting transfer across tasks. An alternative is to build unified models – models that can complete many different tasks without task-specialized components. In NLP, this approach has achieved great success using pre-trained generative models ([1, 2, 76]).
Inspired by this success, there has been a recent trend to build unified models that can be applied to tasks with visual or structured inputs and outputs. Many models have been proposed for tasks with text and/or image input and text output ([8, 41, 67, 77, 78, 71, 79, 72]). However, these models cannot produce any structured or visual output.
More recent unified models can additionally support image locations, which allows tasks like object detection or region captioning. This can be done by using bounding boxes proposed by an object detector ([8, 45]) or by including a bounding box output head ([9, 80, 81, 82, 83]). Alternatively, image locations can be encoded as special tokens in the input/output text ([84, 85, 86]) following [35]. Unified-IO follows this design but applies it to a wider set of tasks than previous works, including keypoint estimation, image in-painting, and region captioning.
Some recent unified models have extended these capabilities in other directions ([87, 88, 69, 60, 89, 90, 91]). Gato ([90]) supports additional modalities, including button presses in Atari games or joint torques for a robot arm, and Flamingo ([60]) supports interleaved sequences of text, images, and videos as input. However, neither of these models supports image outputs or image location references, limiting the computer vision tasks they can support. Perceiver-IO ([89]) supports a range of modalities and proposes a non-auto-regressive decoding approach using task-specific latent query vectors. While effective for some tasks, this method is not as effective as auto-regressive decoding on classic generative tasks like captioning or image generation. Uni-Perceiver ([92]) also supports images, text, and videos and shows good zero-shot performance, but does not support generative tasks.
Concurrent to our work, OFA ([10]) proposes a similar approach that also supports image locations and text-to-image synthesis. However, OFA does not support dense labeling tasks such as depth estimation, segmentation, and surface normal estimation. Other closely related models include UViM ([58]), which generates a discrete guiding code for a D-VAE to build an autoregressive model for panoptic segmentation, depth prediction, and colorization, and Pix2Seq v2 ([81]), which extends Pix2Seq to segmentation, keypoint estimation, and image captioning. Unified-IO covers all these tasks and more and focuses on multi-tasking rather than task-specific fine-tuning.

6. Conclusion

We have presented Unified-IO, a unified architecture that supports a large variety of computer vision and NLP tasks with diverse inputs and outputs, including images, continuous maps, binary masks, segmentation masks, text, bounding boxes, and keypoints. This unification is made possible by homogenizing each of these modalities into a sequence of discrete tokens. The 2.9B-parameter Unified-IO XL model is jointly trained on 90+ datasets, is the first model to perform all 7 tasks on the GRIT benchmark, and obtains impressive results across 16 other vision and NLP benchmarks, with no benchmark fine-tuning or task-specific modifications.

Acknowledgements

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). We thank Zak Stone and the Google Cloud TPU team for providing access to the TPU machines used for conducting experiments, and Keisuke Sakaguchi for providing early feedback on the project. We thank Amita Kamath for conducting the GRIT referring expression experiment using different prompts. We also thank the ReVIZ team at the Allen Institute for AI, including Sam Stuesser, Sam Skjonsberg, Jon Borchardt, Carissa Schoenick, and Michael Schmitz, for helping set up the demo website: unified-io.allenai.org.

Appendix

A.1 Tasks Details

Unified-IO is jointly trained on a large and diverse set of vision, language, and vision & language tasks. In this section, we describe these tasks in detail and show the prompts we use during training and inference (text on the left of the example cards). We also provide qualitative examples of both the ground truth and the predictions made by Unified-IO.
A.1.1 Image Synthesis Tasks
Image Synthesis from Text. This task requires generating an image that matches a sentence. Training data comes from 4 captioning datasets: COCO Captions ([24]), Conceptual Captions 3M and 12M ([25]), and RedCaps ([13]), as well as datasets used for image classification, using the object class as the input caption. Specialized image generation models like DALL·E 2 ([93]) use an order of magnitude more data, but we limit our sources to these sets for training efficiency.
Image Inpainting. This task requires filling in a region of an image with a target object. Training data for this task is built from object bounding box annotations from Open Images ([16]), Visual Genome ([14]) and COCO ([18]). For each object, the input image becomes the source image with the object's bounding box blanked out. The input prompt provides the bounding box's location and the target category. The target output is the original image.
Image Synthesis from Segmentation. This task involves generating an image that matches an input semantic segmentation, i.e., a set of class labels for some or all of the pixels in the image. Unified-IO is trained for this task using segmentation annotations from COCO ([18]), Open Images ([16]), and LVIS ([15]) as input. Following the method from Section 3.1, the segmentation input is converted into an RGB image paired with a prompt listing the color-to-class mapping, and the target output is the source image.
A.1.2 Sparse Labelling Tasks
Object Detection. Unified-IO is trained on object detection annotations from Visual Genome, Open Images, and COCO. For this task, the input is a static prompt and an image, and the output text includes the bounding boxes and class names of all objects in the image. We randomize the order of the output objects during training but, for simplicity, leave integrating more complex data-augmentation techniques ([35]) to future work.
Object Localization. Object localization requires returning bounding boxes around all objects of a given category. Training data is derived from our object detection training data by constructing a training example from each category of objects present in an image. The input is then the image, a prompt specifying the target class, and the output is a list of all boxes that contain an instance of that class. The class for each box (which is always the class specified in the prompt) is included in the output for the sake of keeping the output format consistent with the object detection output. Object localization can use input categories which are not present in the image. To handle this, we construct negative samples by randomly selecting categories not present in the image to use as input, in which case the output is an empty sequence.
Referring Expression Comprehension. This task requires the model to localize an image region described by a natural language expression. The annotation is similar to Object Localization, except that the target is specified with a natural language expression instead of a class name. Datasets for this task include RefCOCO ([19]), RefCOCO+ ([19]) and RefCOCOg ([94]).
Keypoint Estimation. Keypoint estimation requires returning the location of 17 keypoints on a human body (e.g., eyes, nose, feet, etc.) for each person in an image. While it is possible to perform this task in one pass by listing the keypoints of all people in the image in a single output sequence, this can result in an extremely long output sequence, so Unified-IO uses a multi-step approach instead. To do this, Unified-IO is trained to complete the subtask of detecting the keypoints for a single person in a given region. For this subtask, the input prompt specifies the target region and the output is a list of 17 points (a pair of location tokens for the x and y coordinates), each followed by a visibility label (1 for not visible, 2 for partly visible, 3 for fully visible). Non-visible points are preceded by two copies of a new special token that indicates there are no valid coordinates. The keypoint metric does not award points for correctly identifying non-visible points, so during inference we mask that special token so the model makes a best-effort guess for the coordinates of every single point. Training data for this subtask comes from COCO human pose data ([18]) with the ground-truth person regions as input. During inference, we locate person regions using the object localization prompt, then apply Unified-IO again to find keypoints for each detected region. A sketch of parsing this output format is shown below.
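A minimal sketch of parsing the per-person keypoint output, reusing the hypothetical `<loc_i>` token naming from the Section 3.1 sketch; the real model emits token ids rather than strings, so this is illustrative only.

```python
import re

LOC = re.compile(r"<loc_(\d+)>")

def parse_keypoints(tokens):
    """Parse a generated keypoint sequence into (x_bin, y_bin, visibility)
    triples: each point is two location tokens followed by a visibility label
    (1 = not visible, 2 = partly visible, 3 = fully visible)."""
    points = []
    for i in range(0, len(tokens), 3):
        x, y, vis = tokens[i], tokens[i + 1], tokens[i + 2]
        points.append((int(LOC.match(x).group(1)), int(LOC.match(y).group(1)), int(vis)))
    return points

# 17 keypoints per person are expected; a (truncated) example sequence:
print(parse_keypoints(["<loc_412>", "<loc_105>", "3", "<loc_430>", "<loc_98>", "2"]))
```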
A.1.3 Dense Labelling Tasks
Object Segmentation. Object segmentation requires finding the binary segmentation mask of each instance of a particular category in an image. The input is an image and a prompt that includes the target class, while the output is an RGB image with a black background and instances of that class filled in with unique colors, following the method in Section 3.1. The output image is resized to match the input image if needed using nearest-neighbor resizing, and binary masks are built from each unique color. In practice the output image from Unified-IO can have slightly non-uniform colors or extraneous background pixels, likely due to limitations in what the D-VAE can encode and decode, so the output pixels are clustered by color and connected components of fewer than 8 pixels are removed to build cleaned instance masks. Segmentation annotations come from Open Images, LVIS, and COCO.
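A minimal sketch of this post-processing is given below. The exact color-clustering method is not specified in the paper; rounding colors to a coarse palette is an assumption made here purely for illustration, while the 8-pixel component threshold comes from the text.

```python
# Sketch of recovering instance masks from a generated RGB segmentation image.
import numpy as np
from scipy import ndimage


def masks_from_rgb(pred_rgb, min_pixels=8, palette_step=32):
    """pred_rgb: (H, W, 3) uint8 image generated by the model."""
    pred_rgb = np.asarray(pred_rgb)
    quantized = (pred_rgb // palette_step) * palette_step  # crude color clustering
    masks = []
    for color in np.unique(quantized.reshape(-1, 3), axis=0):
        if (color == 0).all():
            continue  # skip the black background
        region = (quantized == color).all(axis=-1)
        labeled, num = ndimage.label(region)
        for i in range(1, num + 1):
            component = labeled == i
            if component.sum() >= min_pixels:  # drop components smaller than 8 pixels
                masks.append(component)
    return masks
```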
Depth Estimation. Depth estimation requires assigning each pixel in an image a depth value. This task uses a static prompt as input, and the output is a grayscale image representing the normalized depth at each pixel. The generated output image is resized to the same size as the input image and then pixel values are rescaled by the maximum depth in the training data to get an output depth map. Training data comes from the NYU Depth Dataset V2 ([20]).
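A minimal sketch of this decoding step follows, assuming the generated output is an 8-bit grayscale image and that 10 meters is the normalization constant (NYUv2 depths are roughly in the 0-10 m range; the exact constant and resizing filter used by the authors are not stated here).

```python
# Sketch of converting a generated grayscale image back into a metric depth map.
import numpy as np
from PIL import Image

MAX_DEPTH_METERS = 10.0  # assumed normalization constant


def depth_from_grayscale(pred_gray: Image.Image, input_size):
    resized = pred_gray.resize(input_size, Image.BILINEAR)  # match input resolution
    normalized = np.asarray(resized, dtype=np.float32) / 255.0
    return normalized * MAX_DEPTH_METERS  # per-pixel depth in meters
```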
Surface Normal Estimation. Unified-IO is trained on the FrameNet ([95]) and BlendedMVS ([96]) surface normal estimation datasets. For this task the input is a static prompt and an image, and the output is an RGB representation of the x/y/z orientation of the surface at each pixel. The generated output image is resized to match the input image and converted back to x/y/z orientations to produce the final output.
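A sketch of the RGB-to-normal decoding is shown below, using the common convention n = 2 * (rgb / 255) - 1 followed by re-normalization to unit length; whether Unified-IO uses exactly this mapping is an assumption.

```python
# Sketch of decoding an RGB normal map back to unit x/y/z surface normals.
import numpy as np


def normals_from_rgb(pred_rgb):
    n = np.asarray(pred_rgb, dtype=np.float32) / 255.0 * 2.0 - 1.0
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.clip(norm, 1e-6, None)  # re-normalize to unit length
```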
A.1.4 Image Classification Tasks
Image Classification. Unified-IO is trained on 6 image classification datasets: ImageNet 2012 ([22]), ImageNet21k ([42]), Places365 ([46]), Sun397 ([97]), iNaturalist ([98]) and Caltech Birds 2011 ([99]). For this task the input is an image and a static prompt, and the output is a class name. During inference we compute the log-probability of each class label in the dataset being evaluated and return the highest-scoring one. This ensures Unified-IO does not return a category from a different categorization dataset that is a synonym or hypernym of the correct label.
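The constrained inference described above amounts to scoring every label in the evaluated dataset's vocabulary and returning the argmax, as in the sketch below. `sequence_log_prob` is a hypothetical helper that would sum the model's token log-probabilities for a candidate output sequence; it is not part of the released code.

```python
# Sketch of constrained classification inference over a fixed label set.
def classify(model, image, prompt, class_names, sequence_log_prob):
    scores = {
        name: sequence_log_prob(model, image, prompt, target=name)
        for name in class_names
    }
    return max(scores, key=scores.get)  # highest-scoring label wins
```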
Object Categorization. This task identifies which label, from a given set, best corresponds to an image region defined by an input image and bounding box. The input is the image and a prompt specifying the image region, and the output is the target class name. We convert object detection annotations from Visual Genome, Open Images, and COCO for this task. Inference is constrained to return a valid label from the target label set, just as with image classification.
A.1.5 Image Captioning Tasks
Image Captioning. Image captioning data comes from the same manually annotated and unsupervised sources used for Image Generation. In this case the inputs and outputs are reversed: the input is an image and a static prompt, and the output is a caption that matches the image.
Region Captioning. Region captioning tasks a model with generating a caption that describes a specific region in the image. Our format for this task is identical to Image Captioning except the region is included in the input prompt. Visual Genome ([14]) is used for the training data.
A.1.6 Vision & Language Tasks
Visual Question Answering. Unified-IO is trained on a collection of VQA datasets including VQA 2.0 ([47]), Visual Genome, VizWizVQA ([49]), OKVQA ([100]) and A-OKVQA ([48]). For VQA, the question is used as the prompt, and the output is the answer text. For VQA, it is common to constrain the model during inference to predict an answer from a fixed list of common VQA answers ([10, 41]), but we avoid doing this since we find it does not benefit Unified-IO in practice.
We additionally convert data from several other datasets into a VQA format, including imSitu ([101]), where we treat predicting the verb and then the related slots as separate VQA questions; VisualCOMET ([52]), where we convert the before/after/intent annotations into questions by converting the input regions into location tokens; SNLI-VE ([51]), where we integrate the entailed text into an input question; and VCR ([102]), where we again integrate the input regions into the prompt by encoding them with location tokens and integrate the rationales into the target text for the answer-justification task.
Answer-Grounded Visual Question Answering. This task requires both answering a question and returning a binary mask specifying the region of the image used to answer the question. The format for this task follows the one for VQA, except that a binary mask is also produced as an additional output. Training data comes from VizWiz-VQA ([28]), a dataset designed to train models that could benefit people with visual impairments.
Relationship Detection. This task requires predicting a relationship between a pair of objects which are grounded by bounding boxes. The prompt contains both object regions, and the output is the predicted predicate. Two datasets are used for this task: Visual Genome ([14]) and Open Images ([16]).
A.1.7 Natural Language Processing Tasks
Question Answering. Following prior work in natural language processing ([1]), QA tasks are formatted by placing both the question and any text context (e.g., a paragraph containing the answer) into the prompt and training the model to generate the text answer. Unified-IO is trained on several QA datasets including SQuAD 2.0 ([30]), other training datasets from the MRQA ([103]) shared task ([104, 105, 106, 107, 108]), QA datasets from SuperGLUE ([109, 55, 110, 111]), Cosmos QA ([112]), OpenBookQA ([113]), and HellaSwag ([114]). If the text context is longer than our maximum sequence length we use a sliding-window approach following [37], which exposes the model to different windows of text from the context and returns the highest-confidence answer.
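The sketch below illustrates the sliding-window strategy in the spirit of BERT-style QA over long contexts. The window and stride sizes are illustrative, and `answer_with_score` is a hypothetical helper returning a `(answer_text, log_probability)` pair; neither value is specified in the paper.

```python
# Sketch of sliding-window QA: run the model on overlapping context windows and
# keep the highest-confidence answer.
def sliding_window_qa(model, question, context_tokens, answer_with_score,
                      window=384, stride=128):
    best_answer, best_score = None, float("-inf")
    for start in range(0, max(1, len(context_tokens) - window + stride), stride):
        chunk = context_tokens[start:start + window]
        answer, score = answer_with_score(model, question, chunk)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```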
Text Classification. Also following past work ([1]), text classification tasks are formatted by placing the input sentences and a query in the prompt and training the model to generate the target class. Datasets include tasks from GLUE and SuperGLUE ([115, 109, 116, 117, 54, 118, 119, 29, 120, 121, 122, 123, 124, 29, 125, 126]), as well as SNLI ([127]), SciTail ([56]), IMDB Reviews ([128]), and PAWS ([129]).
Text Summarization. Text summarization is again done by providing the input paragraph and a prompt as input and generating a summary as output. We use the Gigaword dataset ([31, 130]) for training data.
A.1.8 Language Modeling Tasks
Masked Language Modeling. Following T5 ([1]), the masked language modeling objective randomly samples and then drops out 15% of the tokens in the input sequence. All consecutive spans of dropped-out tokens are replaced by a single sentinel token. The target is to recover the dropped tokens given the sentinel tokens. We use the C4 ([1]) and Wikipedia ([32]) datasets.
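A simplified sketch of this span-corruption objective is given below. It drops tokens i.i.d. rather than sampling span lengths as T5 does, and it omits the closing sentinel that T5 appends to the target; the `<extra_id_*>` sentinel names follow the T5 convention.

```python
# Simplified sketch of T5-style span corruption: drop ~15% of tokens and replace
# each run of consecutive dropped tokens with a single sentinel token.
import random


def span_corrupt(tokens, drop_rate=0.15):
    dropped = [random.random() < drop_rate for _ in tokens]
    inputs, targets, sentinel = [], [], 0
    prev_dropped = False
    for tok, is_dropped in zip(tokens, dropped):
        if is_dropped:
            if not prev_dropped:  # start of a new dropped span
                inputs.append(f"<extra_id_{sentinel}>")
                targets.append(f"<extra_id_{sentinel}>")
                sentinel += 1
            targets.append(tok)  # target recovers the dropped tokens
        else:
            inputs.append(tok)
        prev_dropped = is_dropped
    return inputs, targets
```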

A.2 Pre-Training Data Distribution

**Figure 3:** Pre-training objectives (inner circle), annotation types (middle circle) and datasets (outer circle) used in pre-training of $\textsc{Unified-IO}$. Sizes correspond to the sampling rate in the training distribution. Best viewed in color.

Figure 3 shows a visualization of the pre-training data distribution used by Unified-IO. As discussed in Section 3.3, we equally sample data for the text denoising and image denoising objectives (inner circle of Figure 3). For text denoising, half of the samples come from pure text data, i.e., C4 and Wikipedia. The other half is constructed from image-and-class data, such as ImageNet21k ([42]), or image-and-caption data, such as YFCC15M ([43]). For image denoising, we use the text information when a class or caption is present in the data source and sample each dataset proportionally to its size. For both text and image denoising, when both text and image are given as inputs, we randomly drop each modality 10% of the time.

A.3 Multi-Tasking Data Distribution

Figure 4 shows a visualization of the multi-task training distribution used by Unified-IO from Table 1. As discussed in Section 3.3, we equally sample each group (1/8) except image synthesis (3/16) and dense labeling (1/16), since dense labeling has a much smaller sample size compared to image synthesis. We sample tasks and datasets (middle and outer circle) with a temperature-scaled mixing strategy to make sure the model is sufficiently exposed to underrepresented tasks. We raise each task's mixing rate to the power of 1/T and then renormalize the rates so that they sum to 1. Following [1], we use T = 2 in our experiments.
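The temperature scaling described above can be written as a few lines of code; the example rates below are made up purely to show the effect of T = 2 on a rare task.

```python
# Temperature-scaled mixing rates: raise each rate to 1/T and renormalize.
def temperature_scale(rates, T=2.0):
    scaled = [r ** (1.0 / T) for r in rates]
    total = sum(scaled)
    return [s / total for s in scaled]


# Example: with T = 2, a task with a 1% raw share grows to roughly 7% of the mixture.
print(temperature_scale([0.90, 0.09, 0.01]))
```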
Due to the large variance in dataset size, some of the tasks are rarely sampled. For example, the depth estimation task only has the NYU Depth dataset as a source ([20]), so its sampling rate is only 0.43%. However, the model still works well for depth estimation, even outperforming concurrent work ([58]) (0.385 vs. 0.467 RMSE). We suspect the large model capacity and masked image denoising pre-training improve the performance. Similarly, Grounding VQA ([28]) has a 0.15% sampling rate, but the model can still achieve state-of-the-art performance on this task, partly because it is trained on many related datasets for VQA and segmentation.
**Figure 4:** Task groups (inner circle), tasks (middle circle) and datasets (outer circle) used in multi-task training of $\textsc{Unified-IO}$. Sizes correspond to the sampling rate in the training distribution. Best viewed in color.


A.4 Qualitative Examples

Here we present qualitative examples of predictions from Unified-IO for all training tasks. For brevity, if prompts are identical for each example we only show the prompt once, and if the prompt follows the same template for each example we show the template with parts that would be substituted with different words or location tokens underlined, and then show just the substitution with individual examples.
**Figure 5:** Image synthesis qualitative examples.

**Figure 6:** Sparse labelling qualitative examples.

**Figure 7:** Dense labelling qualitative examples.


Image classification qualitative examples.


Image captioning qualitative examples.

**Figure 8:** Vision and language qualitative examples.

**Figure 9:** Natural language processing qualitative examples.


References

[1] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In NeurIPS, 2020.
[3] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[4] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019.
[5] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. In arXiv, 2019.
[6] Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. In EMNLP, 2019.
[7] Ronghang Hu and Amanpreet Singh. Unit: Multimodal multitask learning with a unified transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 1439–1449, 2021.
[8] Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. In ICML, 2021.
[9] Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, and Derek Hoiem. Towards general purpose vision systems: An end-to-end task-agnostic vision-language architecture. In CVPR, 2022a.
[10] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv, 2022b.
[11] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
[12] Tanmay Gupta, Ryan Marten, Aniruddha Kembhavi, and Derek Hoiem. GRIT: General robust image task benchmark. arXiv, 2022b.
[13] Karan Desai, Gaurav Kaul, Zubin Aysola, and Justin Johnson. RedCaps: Web-curated image-text data created by the people, for the people. arXiv, 2021.
[14] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
[15] Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
[16] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
[17] Anthony D Rhodes, Max H Quinn, and Melanie Mitchell. Fast on-line kernel density estimation for active object localization. In 2017 international joint conference on neural networks (IJCNN), pp.\ 454–462. IEEE, 2017.
[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[19] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
[20] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[21] Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. Estimating and exploiting the aleatoric uncertainty in surface normal estimation. In ICCV, 2021.
[22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[23] Axel Pinz et al. Object categorization. Foundations and Trends® in Computer Graphics and Vision, 1(4):255–353, 2006.
[24] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv, 2015.
[25] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.
[26] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), 2015.
[27] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision, 2016.
[28] Chongyan Chen, Samreen Anjum, and Danna Gurari. Grounding answers for visual questions asked by visually impaired people. In CVPR, 2022a.
[29] Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.\ 1112–1122. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/N18-1101.
[30] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.
[31] David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. English gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34, 2003.
[32] Wikimedia Foundation. Wikimedia downloads. URL https://dumps.wikimedia.org.
[33] Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP, 2018.
[34] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv, 2018.
[35] Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. In ICLR, 2022b.
[36] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[37] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
[38] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, D. Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. In CVPR, 2020.
[39] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. In ICLR, 2022.
[40] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
[41] Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision. In ICLR, 2022d.
[42] Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. Imagenet-21k pretraining for the masses. arXiv, 2021.
[43] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.
[44] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pp.\ 4596–4604. PMLR, 2018.
[45] Amita Kamath, Christopher Clark, Tanmay Gupta, Eric Kolve, Derek Hoiem, and Aniruddha Kembhavi. Webly supervised concept expansion for general purpose vision models. arXiv, 2022.
[46] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. TPAMI, 2017.
[47] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
[48] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. arXiv, 2022.
[49] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, 2018.
[50] Sarah Pratt, Mark Yatskar, Luca Weihs, Ali Farhadi, and Aniruddha Kembhavi. Grounded situation recognition. In ECCV, 2020.
[51] Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. Visual entailment: A novel task for fine-grained image understanding. arXiv, 2019.
[52] Jae Sung Park, Chandra Bhagavatula, Roozbeh Mottaghi, Ali Farhadi, and Yejin Choi. Visualcomet: Reasoning about the dynamic context of a still image. In In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[53] Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In ICCV, 2019.
[54] William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In IWP, 2005.
[55] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In NAACL, 2019.
[56] Tushar Khot, Ashish Sabharwal, and Peter Clark. Scitail: A textual entailment dataset from science question answering. In AAAI, 2018.
[57] Zhenyu Li, Xuyang Wang, Xianming Liu, and Junjun Jiang. Binsformer: Revisiting adaptive bins for monocular depth estimation. arXiv, 2022e.
[58] Alexander Kolesnikov, André Susano Pinto, Lucas Beyer, Xiaohua Zhai, Jeremiah Harmsen, and Neil Houlsby. Uvim: A unified modeling approach for vision with learned guiding codes. arXiv, 2022.
[59] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022a.
[60] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning. arXiv, 2022.
[61] Liangke Gui, Borui Wang, Qiuyuan Huang, Alex Hauptmann, Yonatan Bisk, and Jianfeng Gao. Kat: A knowledge augmented transformer for vision-and-language. arXiv, 2021.
[62] Marcella Cornia, Lorenzo Baraldi, Giuseppe Fiameni, and Rita Cucchiara. Universal captioner: long-tail vision-and-language model training through content-style separation. arXiv, 2021.
[63] Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zheng, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv, 2022.
[64] Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. Designing effective sparse expert models. arXiv, 2022.
[65] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced bert with disentangled attention. In ICLR, 2021.
[66] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022b.
[67] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022b.
[68] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021.
[69] Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022a.
[70] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
[71] Benyuan Sun, Jin Dai, Zihao Liang, Congying Liu, Yi Yang, and Bo Bai. Gppf: A general perception pre-training framework via sparsely activated multi-task learning. arXiv preprint arXiv:2208.02148, 2022.
[72] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022c.
[73] Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, and Furu Wei. Beit v2: Masked image modeling with vector-quantized visual tokenizers. ArXiv, abs/2208.06366, 2022.
[74] Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15638–15650, 2022.
[75] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.\ 4487–4496. Association for Computational Linguistics, 2019. URL https://www.aclweb.org/anthology/P19-1441.
[76] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[77] Wenhui Wang, Hangbo Bao, Li Dong, and Furu Wei. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358, 2021.
[78] Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. One model to learn them all. arXiv preprint arXiv:1706.05137, 2017.
[79] Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022d.
[80] Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, et al. Coarse-to-fine vision-language pre-training with fusion in the backbone. arXiv preprint arXiv:2206.07643, 2022.
[81] Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J Fleet, and Geoffrey Hinton. A unified sequence interface for vision tasks. arXiv preprint arXiv:2206.07669, 2022c.
[82] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 1780–1790, 2021.
[83] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 10965–10975, 2022d.
[84] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, and Lijuan Wang. Unitab: Unifying text and box outputs for grounded vision-language modeling. 2021.
[85] Yuan Yao, Qianyu Chen, Ao Zhang, Wei Ji, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. Pevl: Position-enhanced pre-training and prompt tuning for vision-language models. arXiv preprint arXiv:2205.11169, 2022.
[86] Chaoyang Zhu, Yiyi Zhou, Yunhang Shen, Gen Luo, Xingjia Pan, Mingbao Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, and Rongrong Ji. Seqtr: A simple yet universal network for visual grounding. arXiv preprint arXiv:2203.16265, 2022a.
[87] Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, and Lijuan Wang. Lavender: Unifying video-language understanding as masked language modeling. arXiv preprint arXiv:2206.07160, 2022c.
[88] Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, et al. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022a.
[89] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier J Henaff, Matthew Botvinick, Andrew Zisserman, Oriol Vinyals, and Joao Carreira. Perceiver io: A general architecture for structured inputs & outputs. In ICLR, 2022.
[90] Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and Nando de Freitas. A generalist agent. arXiv, 2022.
[91] Paul Pu Liang, Yiwei Lyu, Xiang Fan, Shengtong Mo, Dani Yogatama, Louis-Philippe Morency, and Ruslan Salakhutdinov. Highmmt: Towards modality and task generalization for high-modality representation learning. arXiv preprint arXiv:2203.01311, 2022.
[92] Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Hongsheng Li, Xiaohua Wang, and Jifeng Dai. Uni-perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks. In CVPR, 2022b.
[93] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv, 2022.
[94] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
[95] Jingwei Huang, Yichao Zhou, Thomas Funkhouser, and Leonidas J Guibas. Framenet: Learning local canonical frames of 3d surfaces from a single rgb image. In ICCV, 2019a.
[96] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In CVPR, 2020.
[97] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
[98] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In CVPR, 2018.
[99] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[100] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[101] Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 5534–5542, 2016.
[102] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 6720–6731, 2019a.
[103] Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. Mrqa 2019 shared task: Evaluating generalization in reading comprehension. arXiv preprint arXiv:1910.09753, 2019.
[104] Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pp.\ 191–200, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi:10.18653/v1/W17-2623. URL https://aclanthology.org/W17-2623.
[105] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi:10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.
[106] Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. Searchqa: A new q&a dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179, 2017.
[107] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.\ 2369–2380, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi:10.18653/v1/D18-1259. URL https://aclanthology.org/D18-1259.
[108] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019. doi:10.1162/tacl_a_00276. URL https://aclanthology.org/Q19-1026.
[109] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. In NeurIPS, 2019.
[110] Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface:a challenge set for reading comprehension over multiple sentences. In Proceedings of North American Chapter of the Association for Computational Linguistics (NAACL), 2018.
[111] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.
[112] Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Cosmos qa: Machine reading comprehension with contextual commonsense reasoning. arXiv, 2019b.
[113] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018.
[114] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019b.
[115] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv, 2018.
[116] Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471, 2018.
[117] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp.\ 1631–1642, 2013.
[118] Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. First quora dataset release: Question pairs, 2017. URL https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs.
[119] Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017.
[120] Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop, pp.\ 177–190. Springer, 2005.
[121] Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. The second pascal recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment, volume 6, pp.\ 6–4. Venice, 2006.
[122] Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third pascal recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pp.\ 1–9. Association for Computational Linguistics, 2007.
[123] Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. The fifth pascal recognizing textual entailment challenge. In TAC, 2009.
[124] Hector Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012.
[125] Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. The CommitmentBank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung 23, 2019.
[126] Mohammad Taher Pilehvar and José Camacho-Collados. WiC: 10,000 example pairs for evaluating context-sensitive representations. CoRR, abs/1808.09121, 2018. URL http://arxiv.org/abs/1808.09121.
[127] Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. In EMNLP, 2015.
[128] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.\ 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1015.
[129] Yuan Zhang, Jason Baldridge, and Luheng He. PAWS: Paraphrase Adversaries from Word Scrambling. In Proc. of NAACL, 2019.
[130] Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015. doi:10.18653/v1/d15-1044. URL http://dx.doi.org/10.18653/v1/D15-1044.