Visual Instruction Tuning
1) Purpose
Researchers aimed to create an open-source AI model that understands images and text together and follows user instructions, like a visual assistant. Current vision models handle fixed tasks but lack flexibility to adapt to varied instructions. Large language models excel at text instructions, so the work extends this idea to vision-language tasks. This matters now due to recent successes in text-based chatbots and the need for general-purpose multimodal AI.
2) Approach
The team used GPT-4 to generate 158,000 instruction-following examples from public images, covering chats, descriptions, and reasoning. They linked a pre-trained image encoder (CLIP) to an open language model (Vicuna) with a simple projection layer. Training occurred in two stages: first align image features to text on 595,000 filtered image-text pairs; then fine-tune end-to-end on the new data for chat or science questions.
3) Key results
- LLaVA matched behaviors of proprietary GPT-4 on unseen images in visual chats, outperforming rivals like BLIP-2 and OpenFlamingo.
- On a new benchmark with COCO images, it scored 85% relative to GPT-4 using ground-truth image details.
- On diverse real-world images, it reached a 67% relative score, far above competitors (BLIP-2 at 38%, OpenFlamingo at 19%).
- On its own, it reached 90.92% accuracy on ScienceQA questions; combined with GPT-4 as a judge, it set a new record of 92.53%.
4) Interpretation
Results show instruction tuning works for vision-language tasks, enabling the model to follow specific prompts rather than just describe images. It handles complex reasoning, counts objects, and links visuals to knowledge. Open-sourcing data, code, and models speeds research. Training finishes in hours on standard hardware, making it accessible.
5) What the findings mean
Strong results boost performance on chat and question-answering tasks, reducing the need for proprietary tools like GPT-4. This lowers costs for developers and shortens timelines to deploy visual assistants. Safety improves with built-in filters for harmful text or images, though risks like biases persist from the base models. It outperforms prior open models, signaling a shift to instruction-tuned multimodal AI.
6) Recommendations and next steps
Use LLaVA as a base for visual chatbots or QA systems; fine-tune on domain data for specific needs. Explore ensembling with GPT-4 for top accuracy. Next, scale to larger models or datasets, test on more benchmarks, and refine for real-world apps like navigation or editing. If building production systems, run pilots on user tasks first.
7) Limitations and confidence
Limited training data may miss edge cases; the model can hallucinate details or inherit biases from CLIP and Vicuna. Evaluation relies partly on GPT-4 judging, which assumes its reliability. High confidence in benchmark gains and chat demos; caution on novel domains or high-stakes use without more tests.
1 Introduction
In this section, humans' multimodal interaction via vision and language inspires the quest for a general-purpose AI assistant that follows diverse real-world instructions, yet current vision models remain task-specific with fixed interfaces, while language-only LLMs excel at instruction-following but ignore visuals. Visual instruction tuning bridges this gap by using GPT-4 to generate vision-language data from image-text pairs, powering LLaVA, a large multimodal model that connects CLIP's visual encoder to Vicuna's LLM and fine-tunes end-to-end. This yields impressive chat abilities rivaling GPT-4, state-of-the-art ScienceQA accuracy via ensembling, new benchmarks (LLaVA-Bench), and open-sourced data, models, and demos to advance the field. In summary, the paper makes the following contributions:
- Multimodal instruction-following data. One key challenge is the lack of vision-language instruction-following data. We present a data reformation perspective and pipeline to convert image-text pairs into an appropriate instruction-following format, using ChatGPT/GPT-4.
- Large multimodal models. We develop a large multimodal model (LMM) by connecting the open-set visual encoder of CLIP [5] with the language decoder Vicuna [29], and fine-tuning end-to-end on our generated instructional vision-language data. Our empirical study validates the effectiveness of using generated data for LMM instruction-tuning, and suggests practical tips for building a general-purpose instruction-following visual agent. When ensembled with GPT-4, our approach achieves SoTA on the Science QA [31] multimodal reasoning dataset.
- Multimodal instruction-following benchmark. We present LLaVA-Bench with two challenging benchmarks, with a diverse selection of paired images, instructions and detailed annotations.
- Open-source. We release the following assets to the public: the generated multimodal instruction data, the codebase, the model checkpoints, and a visual chat demo.
2 Related Work
In this section, developing multimodal instruction-following agents remains fragmented, split between end-to-end models for narrow tasks like navigation or image editing and LLM-orchestrated systems like Visual ChatGPT. Instruction tuning, proven in NLP to boost LLMs' zero-shot generalization via explicit human-like directives, inspires vision-language extension, yet prior large multimodal models such as Flamingo, BLIP-2, OpenFlamingo, and LLaMA-Adapter—trained mainly on image-text pairs—lack targeted vision-language instruction data, yielding weaker multimodal than language-only performance. The paper bridges this void through visual instruction tuning, distinct from parameter-efficient visual prompt tuning, to enhance broad instruction adherence.
In this section, the challenge of curating balanced, diverse image-text data for multimodal pre-training is tackled through varied prompting strategies and dataset filtering. Brief and detailed image descriptions are elicited using lists of semantically equivalent natural-language instructions (Tables 11–12) to enrich captions via GPT models. For CC3M, noun phrases are extracted with Spacy, rare ones (frequency <3) are discarded, and captions are added iteratively starting from the lowest-frequency phrases, capping subsets at 100 for high-frequency ones to prioritize tail concepts. This yields a 595K-pair subset with superior coverage of low-frequency concepts versus the original, as visualized in noun-phrase frequency statistics.
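A rough sketch of this filtering procedure, assuming the CC3M captions are available as a list of strings (spaCy's noun_chunks supplies the noun phrases; the thresholds follow the description above, and the helper name is illustrative):

```python
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def filter_captions(captions, min_freq=3, cap=100):
    """Build a concept-balanced subset: drop rare noun phrases, then add
    captions starting from the least frequent phrases, limiting the number
    of captions contributed by any high-frequency phrase to `cap`."""
    # noun phrases per caption
    phrases = [{c.text.lower() for c in doc.noun_chunks} for doc in nlp.pipe(captions)]
    freq = Counter(p for ps in phrases for p in ps)
    kept = [p for p, n in freq.items() if n >= min_freq]     # discard rare phrases

    selected = set()
    for phrase in sorted(kept, key=lambda p: freq[p]):        # rarest first
        idxs = [i for i, ps in enumerate(phrases) if phrase in ps]
        if len(idxs) > cap:                                   # high-frequency concept:
            idxs = idxs[:cap]                                 # keep only a capped subset
        selected.update(idxs)
    return [captions[i] for i in selected]
```

The exact sampling and ordering details may differ from the released pipeline; the point is how rarest-first selection with a per-phrase cap skews coverage toward tail concepts.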
3 GPT-assisted Visual Instruction Data Generation
In this section, the scarcity of multimodal instruction-following data despite abundant image-text pairs poses a key challenge for training visual assistants. To address it, GPT-4 is prompted with symbolic representations—captions and bounding boxes from COCO images—to generate diverse instruction-response pairs via in-context learning from seed examples, yielding three types: conversations probing objects, actions, and positions; detailed descriptions curated from targeted question lists; and complex reasoning requiring step-by-step logic. This pipeline produces 158K unique language-image samples (58K conversations, 23K descriptions, 77K reasoning), with ablations confirming GPT-4's superiority over ChatGPT for high-quality outputs like spatial reasoning.
- Conversation. We design a conversation between the assistant and a person asking questions about this photo. The answers are written in a tone as if the assistant is seeing the image and answering the question. A diverse set of questions is asked about the visual content of the image, including object types, object counts, object actions, object locations, and relative positions between objects. Only questions that have definite answers are considered. Please see the Appendix for the detailed prompt.
- Detailed description. To include a rich and comprehensive description for an image, we create a list of questions with such an intent. We prompt GPT-4 and then curate the list (see detailed prompts and the curation process in the Appendix). For each image, we randomly sample one question from the list to ask GPT-4 to generate the detailed description.
- Complex reasoning. The above two types focus on the visual content itself, based on which we further create in-depth reasoning questions. The answers typically require a step-by-step reasoning process by following rigorous logic.
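To make the data-generation pipeline above concrete, here is a minimal sketch of how an image's symbolic representation (captions plus bounding boxes) might be serialized into the text context sent to GPT-4; the caption, categories, coordinates, and helper name are illustrative placeholders rather than the paper's exact prompt format:

```python
def symbolic_context(captions, boxes):
    """Serialize an image as text: its captions followed by one line per
    object giving the category and a normalized [x1, y1, x2, y2] box."""
    lines = list(captions)
    for category, (x1, y1, x2, y2) in boxes:
        lines.append(f"{category}: [{x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f}]")
    return "\n".join(lines)

context = symbolic_context(
    captions=["A group of people standing outside of a black vehicle with various luggage."],
    boxes=[("person", (0.681, 0.242, 0.774, 0.694)),
           ("suitcase", (0.598, 0.507, 0.678, 0.710))],
)
# `context` is then combined with a system prompt and few-shot examples in a
# text-only GPT-4 query to elicit a conversation, a detailed description, or
# a complex-reasoning question-answer pair for the image.
```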
4 Visual Instruction Tuning
In this section, visual instruction tuning tackles the challenge of linking a pre-trained vision encoder and large language model to enable general-purpose multimodal instruction-following. It connects CLIP ViT-L/14 visual features—extracted as grid representations—to Vicuna's word embedding space via a simple trainable linear projection layer, producing visual tokens inserted into multi-turn conversation sequences, with images randomly placed before or after the first question prompt. Training uses a two-stage process: initial pre-training on 595K filtered CC3M image-caption pairs aligns features by optimizing only the projection while freezing encoders; subsequent end-to-end fine-tuning optimizes the projection and LLM on 158K generated instruction data for chatbot dialogues or Science QA tasks, yielding models with strong zero-shot generalization to visual reasoning and chat.
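A minimal PyTorch sketch of the connection described above, assuming pre-extracted CLIP ViT-L/14 grid features; the dimensions, class name, and the simple "prepend visual tokens" placement are illustrative assumptions rather than the released implementation:

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Map CLIP grid features into the LLM's word-embedding space, producing
    one visual token per image patch."""
    def __init__(self, clip_dim=1024, llm_dim=4096):
        super().__init__()
        # the single trainable linear projection described in the text
        self.projection = nn.Linear(clip_dim, llm_dim)

    def forward(self, clip_features, text_embeddings):
        # clip_features:   (batch, num_patches, clip_dim) grid features from CLIP ViT-L/14
        # text_embeddings: (batch, seq_len, llm_dim) embeddings of the tokenized prompt
        visual_tokens = self.projection(clip_features)          # (batch, num_patches, llm_dim)
        # the visual tokens are spliced into the conversation sequence; here we
        # simply place them before the text embeddings for illustration
        return torch.cat([visual_tokens, text_embeddings], dim=1)
```

During stage-1 pre-training only the projection is optimized while CLIP and the LLM stay frozen; stage-2 fine-tuning updates the projection together with the LLM weights, as described in the training paragraphs below.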
4.2 Training
For LLaVA model training, we consider a two-stage instruction-tuning procedure: Stage 1 pre-trains only the projection layer for feature alignment on the filtered CC3M pairs, and Stage 2 fine-tunes end-to-end on instruction-following data in two scenarios.
- Multimodal Chatbot. We develop a chatbot by fine-tuning on the 158K language-image instruction-following data from Section 3. Among the three types of responses, conversation is multi-turn while the other two are single-turn; they are sampled uniformly during training.
- Science QA. We study our method on the ScienceQA benchmark [31], the first large-scale multimodal science question dataset that annotates the answers with detailed lectures and explanations. Each question is provided with a context in the form of natural language or an image. The assistant provides the reasoning process in natural language and selects the answer among multiple choices. For training in Equation 2, we organize the data as a single-turn conversation, with the question & context as $\mathbf{X}_{\text{instruct}}$ and the reasoning & answer as $\mathbf{X}_{\text{a}}$.
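For context, the training objective invoked here is the standard autoregressive likelihood over the answer tokens (the loss is computed only on the assistant's predictions); in the paper's notation it can be written as

$$ p(\mathbf{X}_{\text{a}} \mid \mathbf{X}_{\text{v}}, \mathbf{X}_{\text{instruct}}) = \prod_{i=1}^{L} p_{\theta}\!\left(x_i \mid \mathbf{X}_{\text{v}},\, \mathbf{X}_{\text{instruct},<i},\, \mathbf{X}_{\text{a},<i}\right), $$

where $\mathbf{X}_{\text{v}}$ is the image, $L$ is the sequence length, and $\theta$ denotes the trainable parameters (the projection matrix and, in stage 2, the LLM weights).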
5 Experiments
In this section, experiments rigorously assess LLaVA's instruction-following and visual reasoning in multimodal chatbot and ScienceQA settings, using two-stage tuning on filtered CC3M pairs and 158K GPT-generated instruction data with the Vicuna LLM and projected CLIP features. Qualitative demos rival GPT-4 on complex prompts, outperforming BLIP-2 and OpenFlamingo by focusing on user intent over mere description. GPT-4-judged LLaVA-Bench yields an 85% relative score on COCO tasks and 67% in the wild, with the best results when all three data types are used. ScienceQA accuracy reaches 90.92%, and ensembling with GPT-4 as the judge sets a new state of the art at 92.53%. Ablations validate pre-training for alignment, larger model scale, penultimate CLIP features, and reasoning-first ordering for faster convergence, underscoring visual instruction tuning's efficacy.
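A minimal sketch of the GPT-4-judged relative scoring behind these numbers, assuming the judge has already assigned 1-10 ratings to each model's answer on every question (the function name and the example ratings are illustrative):

```python
def relative_score(candidate_ratings, reference_ratings):
    """Relative score (%): total judge rating given to the candidate model
    divided by the total rating given to the text-only GPT-4 reference."""
    assert len(candidate_ratings) == len(reference_ratings)
    return 100.0 * sum(candidate_ratings) / sum(reference_ratings)

print(relative_score([7, 8, 6], [8, 9, 8]))  # 84.0
```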
5.1 Multimodal Chatbot
5.2 ScienceQA
We compare against representative baselines, including the GPT-3.5 model (text-davinci-002) with and without chain-of-thought (CoT), LLaMA-Adapter [56], as well as multimodal chain-of-thought (MM-CoT) [62], which is the current SoTA method on this dataset. For more baseline numbers, please see [31].
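As a sketch of the "GPT-4 as the judge" ensembling mentioned in the results, assuming both models' multiple-choice answers are available as strings and the judge call is supplied by the caller (all names here are hypothetical):

```python
def ensemble_with_gpt4_judge(question, llava_answer, gpt4_answer, ask_gpt4):
    """If LLaVA and text-only GPT-4 agree, keep the shared answer; otherwise
    ask GPT-4 once more for a final answer given both candidates."""
    if llava_answer == gpt4_answer:
        return llava_answer
    prompt = (
        f"Question: {question}\n"
        f"Answer 1: {llava_answer}\n"
        f"Answer 2: {gpt4_answer}\n"
        "The two answers disagree. Give your own final answer."
    )
    return ask_gpt4(prompt)  # the caller supplies the actual GPT-4 API call
```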
6 Conclusion
In this section, visual instruction tuning tackles the challenge of enabling multimodal models to follow human intent across diverse visual tasks. An automatic pipeline generates high-quality language-image instruction-following data, powering the training of LLaVA, which delivers state-of-the-art accuracy on ScienceQA after fine-tuning and excels in visual chat with multimodal data. The work introduces the first benchmark for multimodal instruction-following, marks an initial focus on real-life applications with extended academic results in [63], and aims to inspire future development of advanced multimodal models.
A Broader Impact
In this section, LLaVA's release as a general-purpose visual assistant entails benefits like advancing research and applications alongside risks such as malicious inputs, hallucinations, biases inherited from CLIP and LLaMA/Vicuna, high energy demands at scale, and complex evaluations needing finer hallucination and visual understanding metrics. Mitigations employ OpenAI text filters and NSFW image blockers. Ultimately, community-driven investigation outweighs harms, spurring mitigation strategies, innovation, and responsible vision-language foundation model progress.
B More Results
In this section, additional qualitative results reveal LLaVA's emergent behaviors and untapped potential beyond its training data. Through examples like generating functional HTML/JS/CSS code from sketches, delivering detailed conversational responses to visual prompts, linking images to pretrained textual knowledge, recognizing unseen figures such as Elon Musk in headshots and memes, and performing strong OCR on rare training instances, LLaVA showcases generalization powered by CLIP's visual encoding and LLaMA/Vicuna's language capabilities. These findings, illustrated across tables and figures, affirm LLaVA's versatility for diverse applications while calling for future probes into underlying mechanisms to foster more robust, bias-reduced vision-language models.
C Training Details
In this section, efficient training of LLaVA balances multimodal alignment with instruction-following on modest hardware. Pre-training on the filtered CC-595K subset runs one epoch at learning rate 2e-3 and batch size 128, followed by fine-tuning on LLaVA-Instruct-158K for three epochs at 2e-5 learning rate and batch size 32, employing Adam optimization without weight decay, cosine scheduling with 3% warmup, FSDP, gradient checkpointing, BF16, and TF32 for memory and precision efficiency. Conducted on eight A100 GPUs, pre-training finishes in 4 hours, instruction fine-tuning in 10 hours, and ScienceQA fine-tuning in 4 hours, enabling rapid iteration toward state-of-the-art performance.
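The hyperparameters above can be collected into a hedged, HuggingFace-style configuration sketch; the per-device batch sizes assume the stated global batches of 128 and 32 split evenly across eight A100s, and the released LLaVA training scripts remain the authoritative reference:

```python
from transformers import TrainingArguments

# Stage 1: feature-alignment pre-training on CC-595K (1 epoch, global batch 128)
pretrain_args = TrainingArguments(
    output_dir="./llava-pretrain",
    num_train_epochs=1,
    learning_rate=2e-3,
    per_device_train_batch_size=16,    # 16 x 8 GPUs = 128
    weight_decay=0.0,                  # Adam without weight decay
    warmup_ratio=0.03,                 # 3% warmup
    lr_scheduler_type="cosine",
    bf16=True,
    tf32=True,
    gradient_checkpointing=True,
    fsdp="full_shard auto_wrap",
)

# Stage 2: instruction fine-tuning on LLaVA-Instruct-158K (3 epochs, global batch 32)
finetune_args = TrainingArguments(
    output_dir="./llava-finetune",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=4,     # 4 x 8 GPUs = 32
    weight_decay=0.0,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,
    tf32=True,
    gradient_checkpointing=True,
    fsdp="full_shard auto_wrap",
)
```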
D Assets
In this section, the LLaVA project addresses reproducibility by uploading essential assets to an anonymized GitHub repository. It includes source code, a README, web demo instructions, GPT-4 prompts with few-shot examples, the LLaVA-Instruct-158K dataset, and LLaVA-Bench evaluations on COCO and in-the-wild images. Because the compressed model checkpoints (25GB) exceed GitHub LFS limits, they await public release or reviewer requests. This comprehensive sharing accelerates multimodal research and enables direct extension of visual instruction tuning.
- Source Code: link
- README: link
- Instructions to launch the demo: link
- All prompts and few shot examples for querying GPT-4: link
- LLaVA-Instruct-158K: link
- LLaVA-Bench: COCO, In-The-Wild
- Model checkpoints. The size of the model checkpoints after compression is 25GB, which exceeds the 5GB limit of GitHub LFS (Large File Storage). We will release the checkpoints to the public, or share them with the reviewers of this submission upon request.
E Data
Figures for this section could not be rendered.
F Prompts
In this section, prompts enable text-only GPT-4/ChatGPT to generate multimodal instruction-following data from image contexts like captions and bounding boxes, bypassing direct image input. Table 13 details the construction process using few-shot in-context learning—drawing examples from fewshot_samples to form final messages that elicit responses such as conversations—while Tables 14 and 15 illustrate full examples of contexts prompting detailed descriptions, complex reasoning, and dialogues. This streamlined approach yields diverse, structured data like LLaVA-Instruct-158K, powering effective visual instruction tuning without visual encoders during generation.
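A rough Python rendering of the few-shot message construction summarized above; the system prompt and the structure of fewshot_samples are paraphrased placeholders, and the exact prompt text lives in the paper's tables:

```python
def build_messages(system_prompt, fewshot_samples, query_context):
    """Assemble a chat-style message list for text-only GPT-4/ChatGPT: system
    prompt, then (context, response) few-shot pairs, then the new image's
    symbolic context as the final user turn."""
    messages = [{"role": "system", "content": system_prompt}]
    for sample in fewshot_samples:
        messages.append({"role": "user", "content": sample["context"]})
        messages.append({"role": "assistant", "content": sample["response"]})
    messages.append({"role": "user", "content": query_context})
    return messages

# The returned list can be passed to a chat-completion endpoint to obtain a
# conversation, detailed description, or complex-reasoning sample for the image.
```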
References
In this section, the references compile 63 pivotal works addressing the challenge of integrating vision and language in foundation models to create capable multimodal assistants. Core contributions span transferable visual models like CLIP, large language models such as LLaMA and Vicuna, instruction-tuning paradigms from Alpaca to GPT-4, and datasets including CC3M, LAION-5B, and COCO, alongside benchmarks for reasoning, navigation, and generation. This synthesis reveals rapid evolution toward efficient, aligned vision-language systems, enabling emergent abilities like OCR and generalization in models like LLaVA while highlighting paths to mitigate biases and hallucinations.