BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Junnan Li
Salesforce Research

Dongxu Li
Salesforce Research

Caiming Xiong
Salesforce Research

Steven Hoi
Salesforce Research

Abstract

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.

Executive Summary: Vision-language models, which combine computer vision and natural language processing to handle tasks like describing images or answering questions about visuals, have seen rapid progress but face key challenges. Existing models typically specialize in either understanding tasks, such as matching images to text descriptions, or generation tasks, like creating captions from images. They also depend heavily on vast but noisy datasets scraped from the web, where text often poorly matches the visuals, limiting their accuracy and versatility. With growing demand for unified AI systems in applications like search engines, virtual assistants, and content analysis, there is an urgent need for models that perform well across both task types while making better use of available data.

This paper introduces BLIP, a new framework for vision-language pre-training aimed at unifying understanding and generation tasks while effectively handling noisy web data. It seeks to demonstrate that a flexible model architecture and data improvement technique can outperform current methods on a broad set of benchmarks without requiring massive increases in data or computing power.

The approach centers on two main innovations. First, developers created a multimodal mixture of encoder-decoder model, a flexible architecture that can switch between encoding images and text for matching tasks or decoding to generate text from visuals. This model was pre-trained using three objectives: aligning image-text features, matching pairs accurately, and generating language conditioned on images. Second, they developed Captioning and Filtering, a bootstrapping method to clean and enrich datasets. Starting with a small set of high-quality, human-annotated image-text pairs (about 200,000 from sources like COCO), they fine-tuned specialized modules—a caption generator for synthetic descriptions and a filter to remove mismatches—from the main model. These were applied to larger web datasets totaling 14 million to 129 million noisy pairs, filtering out about 25% of poor texts and adding diverse synthetic ones. Pre-training occurred over 20 epochs on standard hardware, using vision transformers for image processing and BERT-like components for text, with assumptions that web-scale data remains valuable if cleaned.

Key findings highlight BLIP's superior performance. On image-text retrieval, BLIP improved average recall by 2.7 percentage points over the prior best method, ALBEF, using the same 14 million images—meaning it finds relevant matches about 3% more often in top results. For image captioning, it boosted CIDEr scores by 2.8 points on the NoCaps dataset, producing more accurate and diverse descriptions. In visual question answering, accuracy rose by 1.6 points on VQA benchmarks, outperforming models trained on 13 times more data. BLIP also excelled in visual reasoning and dialog tasks, often leading state-of-the-art results. Notably, without any video-specific training, it achieved top zero-shot performance on video retrieval and question answering, surpassing video-tuned models by up to 12 percentage points in recall, by simply sampling frames as static images.

These results imply that BLIP reduces reliance on raw data volume, cutting costs and timelines for training multimodal AI while enhancing safety and reliability through better alignment between visuals and text—crucial for real-world uses like automated content moderation or accessible tech. Unlike expectations that bigger datasets alone drive gains, the work shows noisy web data hurts more than helps without processing, and diverse synthetic captions add unique value over repetitive web texts. This challenges prior focus on scale, proving targeted data quality yields outsized performance with less compute, as BLIP matches or beats larger rivals like SimVLM despite using far smaller setups.

Next steps should involve adopting BLIP as a base for vision-language applications, fine-tuning it on domain-specific data for tasks like e-commerce search or medical imaging. For broader impact, pursue multiple rounds of data bootstrapping to refine datasets further, generate several captions per image to expand training scale, or ensemble multiple models for robustness—though the latter adds complexity. Before full deployment, pilot zero-shot transfers to new domains like audio-visual tasks.

While robust across 14 experiments on standard benchmarks, limitations include dependence on initial human-annotated data for module fine-tuning, potential biases in web sources, and untested scalability beyond 129 million pairs. Confidence in the core findings is high, given consistent gains across tasks and ablations confirming benefits stem from data quality, not just extended training; caution is advised for highly specialized or low-data scenarios where more tailored validation is needed.

1. Introduction

Section Summary: Vision-language pre-training has achieved great success in handling tasks that combine images and text, but current methods struggle with inflexible model architectures that don't easily switch between understanding and generating text, and with noisy data from the web that limits learning. To address this, researchers introduce BLIP, a new framework called Bootstrapping Language-Image Pre-training, which uses a versatile model architecture known as Multimodal Mixture of Encoder-Decoder for multi-task training and a data improvement technique called Captioning and Filtering to create cleaner, synthetic captions from noisy web images. Experiments demonstrate that BLIP outperforms existing approaches on a broad array of vision-language tasks like image captioning and visual question answering, and even excels in zero-shot applications for video tasks.

Vision-language pre-training has recently received tremendous success on various multimodal downstream tasks. However, existing methods have two major limitations:

(1) Model perspective: most methods either adopt an encoder-based model [1, 2], or an encoder-decoder [3, 4] model. However, encoder-based models are less straightforward to directly transfer to text generation tasks (e.g. image captioning), whereas encoder-decoder models have not been successfully adopted for image-text retrieval tasks.

(2) Data perspective: most state-of-the-art methods (e.g., CLIP [1], ALBEF [2], SimVLM [4]) pre-train on image-text pairs collected from the web. Despite the performance gain obtained by scaling up the dataset, our paper shows that the noisy web text is suboptimal for vision-language learning.

To this end, we propose BLIP: Bootstrapping Language-Image Pre-training for unified vision-language understanding and generation. BLIP is a new VLP framework which enables a wider range of downstream tasks than existing methods. It introduces two contributions from the model and data perspective, respectively:

(a) Multimodal mixture of Encoder-Decoder (MED): a new model architecture for effective multi-task pre-training and flexible transfer learning. An MED can operate either as a unimodal encoder, or an image-grounded text encoder, or an image-grounded text decoder. The model is jointly pre-trained with three vision-language objectives: image-text contrastive learning, image-text matching, and image-conditioned language modeling.

(b) Captioning and Filtering (CapFilt): a new dataset boostrapping method for learning from noisy image-text pairs. We finetune a pre-trained MED into two modules: a captioner to produce synthetic captions given web images, and a filter to remove noisy captions from both the original web texts and the synthetic texts.

**Figure 1:** We use a Captioner (Cap) to generate synthetic captions for web images, and a Filter (Filt) to remove noisy captions.

We perform extensive experiments and analysis, and make the following key observations.

We show that the captioner and the filter work together to achieve substantial performance improvement on various downstream tasks by bootstrapping the captions. We also find that more diverse captions yield larger gains.
BLIP achieves state-of-the-art performance on a wide range of vision-language tasks, including image-text retrieval, image captioning, visual question answering, visual reasoning, and visual dialog. We also achieve state-of-the-art zero-shot performance when directly transferring our models to two video-language tasks: text-to-video retrieval and videoQA.

2. Related Work

Section Summary: Vision-language pre-training involves training models on vast collections of web-crawled image-text pairs to boost performance on tasks like image retrieval and captioning, though the noisy quality of these texts often hinders results, which the authors address with their CapFilt method for cleaner data use. Efforts to combine understanding and generation tasks in single models have faced architectural challenges, but the proposed multimodal mixture of encoder-decoder design provides greater flexibility and efficiency. Related techniques like knowledge distillation transfer insights from stronger models to improve vision-language learning, while data augmentation via synthetic captions from generative language models proves beneficial for large-scale pre-training.

2.1 Vision-language Pre-training

Vision-language pre-training (VLP) aims to improve performance of downstream vision and language tasks by pre-training the model on large-scale image-text pairs. Due to the prohibitive expense of acquiring human-annotated texts, most methods [5, 6, 2, 4, 1] use image and alt-text pairs crawled from the web [7, 8, 9], Despite the use of simple rule-based filters, noise is still prevalent in the web texts. However, the negative impact of the noise has been largely overlooked, shadowed by the performance gain obtained from scaling up the dataset. Our paper shows that the noisy web texts are suboptimal for vision-language learning, and proposes CapFilt that utilizes web datasets in a more effective way.

There have been many attempts to unify various vision and language tasks into a single framework [10, 3, 4]. The biggest challenge is to design model architectures that can perform both understanding-based tasks (e.g. image-text retrieval) and generation-based tasks (e.g. image captioning). Neither encoder-based models [2, 11, 1] nor encoder-decoder models [3, 4] can excel at both types of tasks, whereas a single unified encoder-decoder [10] also limits the model's capability. Our proposed multimodal mixture of encoder-decoder model offers more flexibility and better performance on a wide range of downstream tasks, in the meantime keeping the pre-training simple and efficient.

2.2 Knowledge Distillation

Knowledge distillation (KD) [12] aims to improve the performance of a student model by distilling knowledge from a teacher model. Self-distillation is a special case of KD where the teacher and student have equal sizes. It has been shown to be effective for image classification [13], and recently for VLP [2]. Different from mostly existing KD methods which simply enforce the student to have the same class predictions as the teacher, our proposed CapFilt can be interpreted as a more effective way to perform KD in the context of VLP, where the captioner distills its knowledge through semantically-rich synthetic captions, and the filter distills its knowledge by removing noisy captions.

2.3 Data Augmentation

While data augmentation (DA) has been widely adopted in computer vision [14], DA for language tasks is less straightforward. Recently, generative language models have been used to synthesize examples for various NLP tasks [15, 16, 17, 18]. Different from these methods which focus on the low-resource language-only tasks, our method demonstrates the advantage of synthetic captions in large-scale vision-language pre-training.

3. Method

Section Summary: BLIP introduces a versatile model architecture called MED that processes images and text together using a visual transformer for images and a BERT-like encoder for text, allowing it to encode information separately, infuse visuals into text processing, or generate text descriptions from images. The model trains on noisy image-text pairs through three objectives: aligning image and text features via contrastive learning, fine-tuning their matches with a classification task, and predicting text autoregressively for captioning, while sharing most parameters to boost efficiency. To enhance low-quality web data, BLIP's CapFilt method uses a caption generator and filter—both based on the pre-trained model and refined on clean datasets—to create better descriptions and remove mismatched pairs.

We propose BLIP, a unified VLP framework to learn from noisy image-text pairs. This section first introduces our new model architecture MED and its pre-training objectives, and then delineates CapFilt for dataset bootstrapping.

3.1 Model Architecture

We employ a visual transformer [19] as our image encoder, which divides an input image into patches and encodes them as a sequence of embeddings, with an additional [CLS] token to represent the global image feature. Compared to using pre-trained object detectors for visual feature extraction [5], using a ViT is more computation-friendly and has been adopted by the more recent methods [2, 20].

In order to pre-train a unified model with both understanding and generation capabilities, we propose multimodal mixture of encoder-decoder (MED), a multi-task model which can operate in one of the three functionalities:

(1) Unimodal encoder, which separately encodes image and text. The text encoder is the same as BERT [21], where a [CLS] token is appended to the beginning of the text input to summarize the sentence.

(2) Image-grounded text encoder, which injects visual information by inserting one additional cross-attention (CA) layer between the self-attention (SA) layer and the feed forward network (FFN) for each transformer block of the text encoder. A task-specific [Encode] token is appended to the text, and the output embedding of [Encode] is used as the multimodal representation of the image-text pair.

(3) Image-grounded text decoder, which replaces the bi-directional self-attention layers in the image-grounded text encoder with causal self-attention layers. A [Decode] token is used to signal the beginning of a sequence, and an end-of-sequence token is used to signal its end.

3.2 Pre-training Objectives

We jointly optimize three objectives during pre-training, with two understanding-based objectives and one generation-based objective. Each image-text pair only requires one forward pass through the computational-heavier visual transformer, and three forward passes through the text transformer, where different functionalities are activated to compute the three losses as delineated below.

Image-Text Contrastive Loss (ITC) activates the unimodal encoder. It aims to align the feature space of the visual transformer and the text transformer by encouraging positive image-text pairs to have similar representations in contrast to the negative pairs. It has been shown to be an effective objective for improving vision and language understanding [1, 2]. We follow the ITC loss by [2], where a momentum encoder is introduced to produce features, and soft labels are created from the momentum encoder as training targets to account for the potential positives in the negative pairs.

Image-Text Matching Loss (ITM) activates the image-grounded text encoder. It aims to learn image-text multimodal representation that captures the fine-grained alignment between vision and language. ITM is a binary classification task, where the model uses an ITM head (a linear layer) to predict whether an image-text pair is positive (matched) or negative (unmatched) given their multimodal feature. In order to find more informative negatives, we adopt the hard negative mining strategy by [2], where negatives pairs with higher contrastive similarity in a batch are more likely to be selected to compute the loss.

Language Modeling Loss (LM) activates the image-grounded text decoder, which aims to generate textual descriptions given an image. It optimizes a cross entropy loss which trains the model to maximize the likelihood of the text in an autoregressive manner. We apply a label smoothing of 0.1 when computing the loss. Compared to the MLM loss that has been widely-used for VLP, LM enables the model with the generalization capability to convert visual information into coherent captions.

In order to perform efficient pre-training while leveraging multi-task learning, the text encoder and text decoder share all parameters except for the SA layers. The reason is that the differences between the encoding and decoding tasks are best captured by the SA layers. In particular, the encoder employs bi-directional self-attention to build representations for the current input tokens, while the decoder employs causal self-attention to predict next tokens. On the other hand, the embedding layers, CA layers and FFN function similarly between encoding and decoding tasks, therefore sharing these layers can improve training efficiency while benefiting from multi-task learning,

**Figure 3:** Learning framework of BLIP. We introduce a captioner to produce synthetic captions for web images, and a filter to remove noisy image-text pairs. The captioner and filter are initialized from the same pre-trained model and finetuned individually on a small-scale human-annotated dataset. The bootstrapped dataset is used to pre-train a new model.

3.3 CapFilt

Due to the prohibitive annotation cost, there exist a limited number of high-quality human-annotated image-text pairs ${(I_h, T_h)}$ (e.g., COCO [22]). Recent work [2, 4] utilizes a much larger number of image and alt-text pairs ${(I_w, T_w)}$ that are automatically collected from the web. However, the alt-texts often do not accurately describe the visual content of the images, making them a noisy signal that is suboptimal for learning vision-language alignment.

We propose Captioning and Filtering (CapFilt), a new method to improve the quality of the text corpus. Figure 3 gives an illustration of CapFilt. It introduces two modules: a captioner to generate captions given web images, and a filter to remove noisy image-text pairs. Both the captioner and the filter are initialized from the same pre-trained MED model, and finetuned individually on the COCO dataset. The finetuning is a lightweight procedure.

Specifically, the captioner is an image-grounded text decoder. It is finetuned with the LM objective to decode texts given images. Given the web images $I_w$, the captioner generates synthetic captions $T_s$ with one caption per image. The filter is an image-grounded text encoder. It is finetuned with the ITC and ITM objectives to learn whether a text matches an image. The filter removes noisy texts in both the original web texts $T_w$ and the synthetic texts $T_s$, where a text is considered to be noisy if the ITM head predicts it as unmatched to the image. Finally, we combine the filtered image-text pairs with the human-annotated pairs to form a new dataset, which we use to pre-train a new model.

4. Experiments and Discussions

Section Summary: The section begins by detailing the pre-training process for the models, which uses PyTorch on powerful GPUs, starts with pre-trained vision and language components, and trains on large datasets of about 14 million images from curated and web sources for around 20 epochs to handle noisy text data. It then analyzes the CapFilt method, showing through experiments that combining a caption generator and noise filter improves performance on tasks like matching images to text descriptions and generating captions, especially when scaled to bigger datasets or models, with examples illustrating cleaner and more relevant text outputs. Further discussions highlight the benefits of diverse synthetic captions via random sampling over predictable methods, and the advantages of partially sharing model parameters between text processing parts to boost efficiency without losing effectiveness, while avoiding full sharing during fine-tuning to prevent biases.

In this section, we first introduce pre-training details. Then we provide a detailed experimental analysis on our method.

4.1 Pre-training Details

Our models are implemented in PyTorch [23] and pre-trained on two 16-GPU nodes. The image transformer is initialized from ViT pre-trained on ImageNet [24, 19], and the text transformer is initialized from BERT$_\mathrm{base}~$[21]. We explore two variants of ViTs: ViT-B/16 and ViT-L/16. Unless otherwise specified, all results reported in this paper as " BLIP" uses ViT-B. We pre-train the model for 20 epochs using a batch size of 2880 (ViT-B) / 2400 (ViT-L). We use AdamW [25] optimizer with a weight decay of 0.05. The learning rate is warmed-up to $3e$-4 (ViT-B) / $2e$-4 (ViT-L) and decayed linearly with a rate of 0.85. We take random image crops of resolution $224\times224$ during pre-training, and increase the image resolution to $384\times384$ during finetuning. We use the same pre-training dataset as [2] with 14M images in total, including two human-annotated datasets (COCO and Visual Genome [26]), and three web datasets (Conceptual Captions [8], Conceptual 12M [8], SBU captions [27]). We also experimented with an additional web dataset, LAION [28], which contains 115M images with more noisy texts[^1]. More details about the datasets can be found in the appendix.

[^1]: We only download images whose shorter edge is larger than 256 pixels from the original LAION400M. Due to the large size of LAION, we only use $1/5$ of it each epoch during pre-training.

::: {caption="Table 1: Evaluation of the effect of the captioner (C) and filter (F) for dataset bootstrapping. Downstream tasks include image-text retrieval and image captioning with finetuning (FT) and zero-shot (ZS) settings. TR / IR@1: recall@1 for text retrieval / image retrieval. ✓$_{B/L}$: captioner or filter uses ViT-B / ViT-L as vision backbone."}

:::

**Figure 4:** Examples of the web text $T_w$ and the synthetic text $T_s$. Green texts are accepted by the filter, whereas red texts are rejected.

::: {caption="Table 2: Comparison between beam search and nucleus sampling for synthetic caption generation. Models are pre-trained on 14M images."}

:::

::: {caption="Table 3: Comparison between different parameter sharing strategies for the text encoder and decoder during pre-training."}

:::

4.2 Effect of CapFilt

In Table 1, we compare models pre-trained on different datasets to demonstrate the efficacy of CapFilt on downstream tasks, including image-text retrieval and image captioning with finetuned and zero-shot settings.

When only the captioner or the filter is applied to the dataset with 14M images, performance improvement can be observed. When applied together, their effects compliment each other, leading to substantial improvements compared to using the original noisy web texts.

CapFilt can further boost performance with a larger dataset and a larger vision backbone, which verifies its scalability in both the data size and the model size. Furthermore, by using a large captioner and filter with ViT-L, performance of the base model can also be improved.

In Figure 4, we show some example captions and their corresponding images, which qualitatively demonstrate the effect of the captioner to generate new textual descriptions, and the filter to remove noisy captions from both the original web texts and the synthetic texts. More examples can be found in the appendix.

4.3 Diversity is Key for Synthetic Captions

In CapFilt, we employ nucleus sampling [29] to generate synthetic captions. Nucleus sampling is a stochastic decoding method, where each token is sampled from a set of tokens whose cumulative probability mass exceeds a threshold $p$ ($p=0.9$ in our experiments). In Table 2, we compare it with beam search, a deterministic decoding method which aims to generate captions with the highest probability. Nucleus sampling leads to evidently better performance, despite being more noisy as suggested by a higher noise ratio from the filter. We hypothesis that the reason is that nucleus sampling generates more diverse and surprising captions, which contain more new information that the model could benefit from. On the other hand, beam search tends to generate safe captions that are common in the dataset, hence offering less extra knowledge.

4.4 Parameter Sharing and Decoupling

::: {caption="Table 4: Effect of sharing parameters between the captioner and filter. Models are pre-trained on 14M images."}

:::

During pre-training, the text encoder and decoder share all parameters except for the self-attention layers. In Table 3, we evaluate models pre-trained with different parameter sharing strategies, where pre-training is performed on the 14M images with web texts. As the result shows, sharing all layers except for SA leads to better performance compared to not sharing, while also reducing the model size thus improveing training efficiency. If the SA layers are shared, the model's performance would degrade due to the conflict between the encoding task and the decoding task.

During CapFilt, the captioner and the filter are end-to-end finetuned individually on COCO. In Table 4, we study the effect if the captioner and filter share parameters in the same way as pre-training. The performance on the downstream tasks decreases, which we mainly attribute to confirmation bias. Due to parameter sharing, noisy captions produced by the captioner are less likely to be filtered out by the filter, as indicated by the lower noise ratio (8% compared to 25%).

5. Comparison with State-of-the-arts

Section Summary: This section compares the BLIP model to leading vision-language processing methods across tasks like finding matching images and text, generating image captions, answering questions about pictures, reasoning between sentences and image pairs, and holding visual conversations. BLIP generally outperforms competitors, such as ALBEF and SimVLM, by achieving higher accuracy on benchmarks like COCO and Flickr30K with less training data, simpler setups without heavy object detectors, and faster processing times. While it excels in most areas, gains from extra web images are limited in some reasoning tasks due to differences between training and real-world data.

::: {caption="Table 5: Comparison with state-of-the-art image-text retrieval methods, finetuned on COCO and Flickr30K datasets. BLIP $_\text{CapFilt-L}$ pre-trains a model with ViT-B backbone using a dataset bootstrapped by captioner and filter with ViT-L."}

:::

::: {caption="Table 6: Zero-shot image-text retrieval results on Flickr30K."}

:::

In this section, we compare BLIP to existing VLP methods on a wide range of vision-language downstream tasks[^2]. Next we briefly introduce each task and finetuning strategy. More details can be found in the appendix.

[^2]: we omit SNLI-VE from the benchmark because its test data has been reported to be noisy [31]

5.1 Image-Text Retrieval

We evaluate BLIP for both image-to-text retrieval (TR) and text-to-image retrieval (IR) on COCO and Flickr30K [32] datasets. We finetune the pre-trained model using ITC and ITM losses. To enable faster inference speed, we follow [2] and first select $k$ candidates based on the image-text feature similarity, and then rerank the selected candidates based on their pairwise ITM scores. We set $k=256$ for COCO and $k=128$ for Flickr30K.

As shown in Table 5, BLIP achieves substantial performance improvement compared with existing methods. Using the same 14M pre-training images, BLIP outperforms the previous best model ALBEF by +2.7% in average recall@1 on COCO. We also perform zero-shot retrieval by directly transferring the model finetuned on COCO to Flickr30K. The result is shown in Table 6, where BLIP also outperforms existing methods by a large margin.

5.2 Image Captioning

::: {caption="Table 7: Comparison with state-of-the-art image captioning methods on NoCaps and COCO Caption. All methods optimize the cross-entropy loss during finetuning. C: CIDEr, S: SPICE, B@4: BLEU@4. BLIP $\text{CapFilt-L}$ is pre-trained on a dataset bootstrapped by captioner and filter with ViT-L. VinVL† and LEMON† require an object detector pre-trained on 2.5M images with human-annotated bounding boxes and high resolution (800 $\times$ 1333) input images. SimVLM$\mathrm{huge}$ uses 13 $\times$ more training data and a larger vision backbone than ViT-L."}

:::

We consider two datasets for image captioning: NoCaps [35] and COCO, both evaluated using the model finetuned on COCO with the LM loss. Similar as [4], we add a prompt "a picture of" at the beginning of each caption, which leads to slightly better results. As shown in Table 7, BLIP with 14M pre-training images substantially outperforms methods using a similar amount of pre-training data. BLIP with 129M images achieves competitive performance as LEMON with 200M images. Note that LEMON requires a computational-heavy pre-trained object detector and higher resolution (800 $\times$ 1333) input images, leading to substantially slower inference time than the detector-free BLIP which uses lower resolution (384 $\times$ 384) input images.

5.3 Visual Question Answering (VQA)

**Figure 5:** Model architecture for the downstream tasks. Q: question; C: caption; QA: question-answer pair.

::: {caption="Table 8: Comparison with state-of-the-art methods on VQA and NLVR$^2$. ALBEF performs an extra pre-training step for NLVR$^2$. SimVLM† uses 13 $\times$ more training data and a larger vision backbone (ResNet+ViT) than BLIP."}

:::

VQA [36] requires the model to predict an answer given an image and a question. Instead of formulating VQA as a multi-answer classification task [5, 6], we follow [2] and consider it as an answer generation task, which enables open-ended VQA. As shown in Figure 5(a), during finetuning, we rearrange the pre-trained model, where an image-question is first encoded into multimodal embeddings and then given to an answer decoder. The VQA model is finetuned with the LM loss using ground-truth answers as targets.

The results are shown in Table 8. Using 14M images, BLIP outperforms ALBEF by +1.64% on the test set. Using 129M images, BLIP achieves better performance than SimVLM which uses $13\times$ more pre-training data and a larger vision backbone with an additional convolution stage.

5.4 Natural Language Visual Reasoning (NLVR$^2$)

NLVR$^2$ [37] asks the model to predict whether a sentence describes a pair of images. In order to enable reasoning over two images, we make a simple modification to our pre-trained model which leads to a more computational-efficient architecture than previous approaches [2, 4]. As shown in Figure 5(b), for each transformer block in the image-grounded text encoder, there exist two cross-attention layers to process the two input images, and their outputs are merged and fed to the FFN. The two CA layers are intialized from the same pre-trained weights. The merge layer performs simple average pooling in the first 6 layers of the encoder, and performs concatenation followed by a linear projection in layer 6-12. An MLP classifier is applied on the output embedding of the [Encode] token. As shown in Table 8, BLIP outperforms all existing methods except for ALBEF which performs an extra step of customized pre-training. Interestingly, performance on NLVR$^2$ does not benefit much from additional web images, possibly due to the domain gap between web data and downstream data.

5.5 Visual Dialog (VisDial)

VisDial [38] extends VQA in a natural conversational setting, where the model needs to predict an answer not only based on the image-question pair, but also considering the dialog history and the image's caption. We follow the discriminative setting where the model ranks a pool of answer candidates [39, 40, 41]. As shown in Figure 5(c), we concatenate image and caption embeddings, and pass them to the dialog encoder through cross-attention. The dialog encoder is trained with the ITM loss to discriminate whether the answer is true or false for a question, given the entire dialog history and the image-caption embeddings. As shown in Table 9, our method achieves state-of-the-art performance on VisDial v1.0 validation set.

5.6 Zero-shot Transfer to Video-Language Tasks

: Table 9: Comparison with state-of-the-art methods on VisDial v1.0 validation set. VD-ViLBERT† [41] pre-trains ViLBERT [42] with additional VQA data.

Method	MRR $\uparrow$	R@1 $\uparrow$	R@5 $\uparrow$	R@10 $\uparrow$	MR $\downarrow$
VD-BERT	67.44	54.02	83.96	92.33	3.53
VD-ViLBERT†	69.10	55.88	85.50	93.29	3.25
BLIP	69.41	56.44	85.90	93.30	3.20

::: {caption="Table 10: Comparisons with state-of-the-art methods for text-to-video retrieval on the 1k test split of the MSRVTT dataset."}

:::

::: {caption="Table 11: Comparisons with state-of-the-art methods for video question answering. We report top-1 test accuracy on two datasets."}

:::

Our image-language model has strong generalization ability to video-language tasks. In Table 10 and Table 11, we perform zero-shot transfer to text-to-video retrieval and video question answering, where we directly evaluate the models trained on COCO-retrieval and VQA, respectively. To process video input, we uniformly sample $n$ frames per video ($n=8$ for retrieval and $n=16$ for QA), and concatenate the frame features into a single sequence. Note that this simple approach ignores all temporal information.

Despite the domain difference and lack of temporal modeling, our models achieve state-of-the-art performance on both video-language tasks. For text-to-video retrieval, zero-shot BLIP even outperforms models finetuned on the target video dataset by +12.4% in recall@1. Further performance improvement can be achieved if the BLIP model is used to initialize a video-language model with temporal modeling (e.g. replace our ViT with a TimeSformer [52]) and finetuned on video data.

::: {caption="Table 12: The original web texts are replicated to have the same number of samples per epoch as the bootstrapped dataset. Results verify that the improvement from CapFilt is not due to longer training time."}

:::

::: {caption="Table 13: Continue training the pre-trained model offers less gain compared to training a new model with the bootstrapped dataset."}

:::

6. Additional Ablation Study

Section Summary: Researchers tested whether CapFilt's benefits come from longer training time caused by the larger bootstrapped dataset, but by duplicating original web texts to match the length without adding quality, they found that extended training on noisy data does not improve results. They also examined continuing to train an existing pre-trained model on the bootstrapped dataset, but this approach failed to boost performance, reinforcing the need to start fresh, much like in knowledge distillation where new models learn directly from scratch.

In this section, we provide additional ablation experiments on CapFilt.

Improvement with CapFilt is not due to longer training. Since the bootstrapped dataset contains more texts than the original dataset, training for the same number of epochs takes longer with the bootstrapped dataset. To verify that the effectiveness of CapFilt is not due to longer training, we replicate the web text in the original dataset so that it has the same number of training samples per epoch as the bootstrapped dataset. As shown in Table 12, longer training using the noisy web texts does not improve performance.

A new model should be trained on the bootstrapped dataset. The bootstrapped dataset is used to pre-train a new model. We investigate the effect of continue training from the previous pre-trained model, using the bootstrapped dataset. Table 13 hows that continue training does not help. This observation agrees with the common practice in knowledge distillation, where the student model cannot be initialized from the teacher.

7. Conclusion

Section Summary: BLIP is a new framework for combining vision and language processing that achieves top results on various tasks involving image understanding and text generation. It trains a versatile model using a cleaned-up dataset created from large collections of imperfect image-text pairs by adding helpful synthetic captions and filtering out the bad ones, and this improved dataset is now publicly available for other researchers. The authors suggest ways to boost BLIP further, like repeating the data-cleaning process multiple times or using teams of models to generate and refine captions, and they encourage ongoing efforts to advance both the technology and the data quality in this field.

We propose BLIP, a new VLP framework with state-of-the-art performance on a wide range of downstream vision-language tasks, including understanding-based and generation-based tasks. BLIP pre-trains a multimodal mixture of encoder-decoder model using a dataset bootstrapped from large-scale noisy image-text pairs by injecting diverse synthetic captions and removing noisy captions. Our bootstrapped dataset are released to facilitate future vision-language research.

There are a few potential directions that can further enhance the performance of BLIP: (1) Multiple rounds of dataset bootstrapping; (2) Generate multiple synthetic captions per image to further enlarge the pre-training corpus; (3) Model ensemble by training multiple different captioners and filters and combining their forces in CapFilt. We hope that our paper motivates future work to focus on making improvements in both the model aspect and the data aspect, the bread and butter of vision-language research.

Appendix

Section Summary: The appendix outlines the technical details for fine-tuning the model's performance on various vision-language tasks, such as matching images to text descriptions, generating captions, answering questions about images, and handling dialogue about visuals, using specific settings like learning rates, batch sizes, and dataset splits from sources like COCO and Flickr30K. It features examples of clean synthetic captions replacing filtered-out web text to improve training quality and provides statistics on pre-training datasets, including the number of images and texts from collections like Visual Genome and LAION. The section also includes a list of references to related studies on image-text learning and data augmentation techniques.

A. Downstream Task Details

Table 14 shows the hyperparameters that we use for finetuning on the downstream vision-language tasks. All tasks uses AdamW optimizer with a weight decay of 0.05 and a cosine learning rate schedule. We use an image resolution of $384\times384$, except for VQA where we follow [4] and use $480\times480$ images. Next we delineate the dataset details.

Image-Text Retrieval. We use the Karpathy split [53] for both COCO and Flickr30K. COCO contains 113/5k/5k images for train/validation/test, and Flickr30K contains 29k/1k/1k images for train/validation/test.

Image Captioning. We finetune on COCO's Karpathy train split, and evaluate on COCO's Karpathy test split and NoCaps validation split. During inference, we use beam search with a beam size of 3, and set the maximum generation length as 20.

VQA. We experiment with the VQA2.0 dataset [54], which contains 83k/41k/81k images for training/validation/test. Following [2], we use both training and validation splits for training, and include additional training samples from Visual Genome. During inference on VQA, we use the decoder to rank the 3, 128 candidate answers [2, 55].

NLVR$^2$. We conduct experiment on the official split [37].

VisDial. We finetune on the training split of VisDial v1.0 and evaluate on its validation set.

: Table 14: Finetuning hyperparameters for downstream tasks.

Task	init LR (ViT-L)	batch size	#epoch
Retrieval	$1e^{-5}$ ($5e^{-6}$)	256	6
Captioning	$1e^{-5}$ ($2e^{-6}$)	256	5
VQA	$2e^{-5}$	256	10
NLVR$^2$	$3e^{-5}$	256	15
VisDial	$2e^{-5}$	240	20

**Figure 6:** Examples of the web text $T_w$ and the synthetic text $T_s$. Green texts are accepted by the filter, whereas red texts are rejected.

B. Additional Examples of Synthetic Captions

In Figure 6, we show additional examples of images and texts where the web captions are filtered out, and the synthetic captions are kept as clean training samples.

C. Pre-training Dataset Details

Table 15 shows the statistics of the pre-training datasets.

: Table 15: Statistics of the pre-training datasets.

	COCO	VG	SBU	CC3M	CC12M	LAION
# image	113K	100K	860K	3M	10M	115M
# text	567K	769K	860K	3M	10M	115M

References

[1] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.

[2] Li, J., Selvaraju, R. R., Gotmare, A. D., Joty, S., Xiong, C., and Hoi, S. Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021a.

[3] Cho, J., Lei, J., Tan, H., and Bansal, M. Unifying vision-and-language tasks via text generation. arXiv preprint arXiv:2102.02779, 2021.

[4] Wang, Z., Yu, J., Yu, A. W., Dai, Z., Tsvetkov, Y., and Cao, Y. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904, 2021.

[5] Chen, Y., Li, L., Yu, L., Kholy, A. E., Ahmed, F., Gan, Z., Cheng, Y., and Liu, J. UNITER: universal image-text representation learning. In ECCV, volume 12375, pp. 104–120, 2020.

[6] Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., Choi, Y., and Gao, J. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV, pp. 121–137, 2020.

[7] Sharma, P., Ding, N., Goodman, S., and Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Gurevych, I. and Miyao, Y. (eds.), ACL, pp. 2556–2565, 2018.

[8] Changpinyo, S., Sharma, P., Ding, N., and Soricut, R. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.

[9] Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q. V., Sung, Y., Li, Z., and Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918, 2021.

[10] Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J. J., and Gao, J. Unified vision-language pre-training for image captioning and VQA. In AAAI, pp. 13041–13049, 2020.

[11] Li, W., Gao, C., Niu, G., Xiao, X., Liu, H., Liu, J., Wu, H., and Wang, H. UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), ACL, pp. 2592–2607, 2021b.

[12] Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[13] Xie, Q., Luong, M., Hovy, E. H., and Le, Q. V. Self-training with noisy student improves imagenet classification. In CVPR, pp. 10684–10695, 2020.

[14] Shorten, C. and Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. J. Big Data, 6:60, 2019.

[15] Kumar, V., Choudhary, A., and Cho, E. Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245, 2020.

[16] Anaby-Tavor, A., Carmeli, B., Goldbraich, E., Kantor, A., Kour, G., Shlomov, S., Tepper, N., and Zwerdling, N. Do not have enough data? deep learning to the rescue! In AAAI, pp. 7383–7390, 2020.

[17] Puri, R., Spring, R., Shoeybi, M., Patwary, M., and Catanzaro, B. Training question answering models from synthetic data. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), EMNLP, pp. 5811–5826, 2020.

[18] Yang, Y., Malaviya, C., Fernandez, J., Swayamdipta, S., Bras, R. L., Wang, J., Bhagavatula, C., Choi, Y., and Downey, D. G-daug: Generative data augmentation for commonsense reasoning. In Cohn, T., He, Y., and Liu, Y. (eds.), EMNLP Findings, pp. 1008–1025, 2020.

[19] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

[20] Kim, W., Son, B., and Kim, I. Vilt: Vision-and-language transformer without convolution or region supervision. arXiv preprint arXiv:2102.03334, 2021.

[21] Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T. (eds.), NAACL, pp. 4171–4186, 2019.

[22] Lin, T., Maire, M., Belongie, S. J., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: common objects in context. In Fleet, D. J., Pajdla, T., Schiele, B., and Tuytelaars, T. (eds.), ECCV, volume 8693, pp. 740–755, 2014.

[23] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. NeurIPS, 32:8026–8037, 2019.

[24] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.

[25] Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

[26] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L., Shamma, D. A., Bernstein, M. S., and Fei-Fei, L. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73, 2017.

[27] Ordonez, V., Kulkarni, G., and Berg, T. L. Im2text: Describing images using 1 million captioned photographs. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), NIPS, pp. 1143–1151, 2011.

[28] Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., and Komatsuzaki, A. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.

[29] Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. In ICLR, 2020.

[30] Gan, Z., Chen, Y., Li, L., Zhu, C., Cheng, Y., and Liu, J. Large-scale adversarial training for vision-and-language representation learning. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), NeurIPS, 2020.

[31] Do, V., Camburu, O.-M., Akata, Z., and Lukasiewicz, T. e-snli-ve: Corrected visual-textual entailment with natural language explanations. arXiv preprint arXiv:2004.03744, 2020.

[32] Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., and Lazebnik, S. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, pp. 2641–2649, 2015.

[33] Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., and Gao, J. Vinvl: Making visual representations matter in vision-language models. arXiv preprint arXiv:2101.00529, 2021.

[34] Hu, X., Gan, Z., Wang, J., Yang, Z., Liu, Z., Lu, Y., and Wang, L. Scaling up vision-language pre-training for image captioning, 2021.

[35] Agrawal, H., Anderson, P., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., and Lee, S. nocaps: novel object captioning at scale. In ICCV, pp. 8947–8956, 2019.

[36] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., and Parikh, D. VQA: visual question answering. In ICCV, pp. 2425–2433, 2015.

[37] Suhr, A., Zhou, S., Zhang, A., Zhang, I., Bai, H., and Artzi, Y. A corpus for reasoning about natural language grounded in photographs. In Korhonen, A., Traum, D. R., and Màrquez, L. (eds.), ACL, pp. 6418–6428, 2019.

[38] Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J. M. F., Parikh, D., and Batra, D. Visual dialog. In CVPR, pp. 1080–1089, 2017.

[39] Gan, Z., Cheng, Y., Kholy, A. E., Li, L., Liu, J., and Gao, J. Multi-step reasoning via recurrent dual attention for visual dialog. In Korhonen, A., Traum, D. R., and Màrquez, L. (eds.), ACL, pp. 6463–6474, 2019.

[40] Wang, Y., Joty, S. R., Lyu, M. R., King, I., Xiong, C., and Hoi, S. C. H. VD-BERT: A unified vision and dialog transformer with BERT. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), EMNLP, pp. 3325–3338, 2020.

[41] Murahari, V., Batra, D., Parikh, D., and Das, A. Large-scale pretraining for visual dialog: A simple state-of-the-art baseline. In Vedaldi, A., Bischof, H., Brox, T., and Frahm, J. (eds.), ECCV, pp. 336–352, 2020.

[42] Lu, J., Batra, D., Parikh, D., and Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), NeurIPS, pp. 13–23, 2019.

[43] Zhu, L. and Yang, Y. Actbert: Learning global-local video-text representations. In CVPR, pp. 8746–8755, 2020.

[44] Patrick, M., Huang, P.-Y., Asano, Y., Metze, F., Hauptmann, A. G., Henriques, J. F., and Vedaldi, A. Support-set bottlenecks for video-text representation learning. In ICLR, 2021.

[45] Miech, A., Alayrac, J.-B., Smaira, L., Laptev, I., Sivic, J., and Zisserman, A. End-to-end learning of visual representations from uncurated instructional videos. In CVPR, pp. 9879–9889, 2020.

[46] Xu, H., Ghosh, G., Huang, P.-Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., and Feichtenhofer, C. Videoclip: Contrastive pre-training for zero-shot video-text understanding. In EMNLP, pp. 6787–6800, 2021.

[47] Bain, M., Nagrani, A., Varol, G., and Zisserman, A. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, 2021.

[48] Lei, J., Li, L., Zhou, L., Gan, Z., Berg, T. L., Bansal, M., and Liu, J. Less is more: Clipbert for video-and-language learning via sparse sampling. In CVPR, pp. 7331–7341, 2021.

[49] Yang, A., Miech, A., Sivic, J., Laptev, I., and Schmid, C. Just ask: Learning to answer questions from millions of narrated videos. In ICCV, pp. 1686–1697, 2021.

[50] Fan, C., Zhang, X., Zhang, S., Wang, W., Zhang, C., and Huang, H. Heterogeneous memory enhanced multimodal attention model for video question answering. In CVPR, pp. 1999–2007, 2019.

[51] Le, T. M., Le, V., Venkatesh, S., and Tran, T. Hierarchical conditional relation networks for video question answering. In CVPR, pp. 9972–9981, 2020.

[52] Bertasius, G., Wang, H., and Torresani, L. Is space-time attention all you need for video understanding? In ICML, 2021.

[53] Karpathy, A. and Li, F. Deep visual-semantic alignments for generating image descriptions. In CVPR, pp. 3128–3137, 2015.

[54] Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, pp. 6325–6334, 2017.

[55] Kim, J., Jun, J., and Zhang, B. Bilinear attention networks. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), NIPS, pp. 1571–1581, 2018.