Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

Madhuri Shanbhogue$^{*}$, Zhe Li$^{*}$, Shanfeng Zhang$^{*}$, Gustavo Hernández Ábrego$^{*}$, Shih-Cheng Huang$^{*}$, Aashi Jain$^{*}$, Daniel Salz, Sonam Goenka, Chaitra Hegde, Ji Ma, Feiyang Chen, Jiaxing Wu, Tanmaya Dabral, Babak Samari, Kevin Poulet, Daniel Cer, Kaifeng Chen, Paul Suganathan, Hui Hui, Jovan Andonov, Philippe Schlattner, Jay Han, Iftekhar Naim, Wing Lowe, Vladimir Pchelin, Albert Yang, Yi-Ting Chen, Zhongli Ding, Grace Zhang, Georg Heigold, Yichang Chen, Antoine Reveillon, Brendan McCloskey, Wenlei Zhou, Dahun Kim, Rui Meng, Emma Wang, Jack Zheng, Halley Fede, Zhen Yang, Keegan Mosley, Brian Potetz, Sahil Dua, Henrique Schechter Vera, Shen Gao, Hesen Zhang, Andreas Hess, Hengxuan Ying, Alberto Montes, Karan Gill, Min Choi, Sebastian Russo, Anja Hauth, Jinhyuk Lee, Michael Boratko, Megan Barnes, Vikram Rao, Claudiu Musat, Cyril Allauzen, Ehsan Variani, Shankar Kumar, Tom Bagby, Junyi Jiao, Yang Gu, Tengxin Li, Ayush Agrawal, Roberto Santana, Dev Nath, Stephen Karukas, Shuoxuan Han, Lucia Loher, Alice Twu, Nidhi Vyas, Siddharth Bhai, Frank Palma Gomez, Wangyuan Zhang, Chaoren Liu, Jizheng Yang, Steve Qiu, Shijie Zhang, Sujay Kulkarni, Sascha Rothe, Sean Nakamoto, Raphael Hoffmann, Zach Gleicher, Yunhsuan Sung, Qin Yin, Tom Duerig and Mojtaba Seyedhosseini
Gemini Embedding Team, Google$^{1}$

Abstract

We introduce Gemini Embedding 2, a native multimodal embedding model that allows embedding video, audio, image, and text modalities in a unified representation space. We leverage the multimodal capabilities of Gemini to produce embeddings for arbitrary combinations of interleaved inputs across all these modalities that generalize well across a wide variety of tasks. Applying large-scale contrastive learning in a multi-task multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks including unimodal, cross-modal, and multimodal retrieval spanning a diverse set of tasks. We show that our embedding model demonstrates strong performance (with a score of 62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual and 84.0 on MTEB Code) across a variety of tasks surpassing the performance of specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as RAG, recommendation and search. Furthermore, its robust zero-shot performance across distinct fields — from astronomy and bioscience to fine arts and the culinary arts — establishes it as a highly reliable, out-of-the-box representation even for specialized domains.

$^{1}$ See Contributions and Acknowledgments section.
$^{*}$ Equal contributions.

Executive Summary: Gemini Embedding 2 is a new embedding model that produces a single vector representation for any combination of text, images, video, and audio, including interleaved inputs. Prior multimodal embedders relied on separate encoders and paired data, which limited their ability to capture rich cross-modal interactions and handle mixed inputs common in real documents, videos, and enterprise search. As multimodal foundation models advance, organizations need embeddings that deliver strong performance across retrieval, recommendation, and RAG without brittle pipelines that convert audio to text or images to captions.

The work set out to build and evaluate a single model that generalizes across unimodal, cross-modal, and multimodal tasks while matching or exceeding specialized systems. The team initialized the model from Gemini, applied bidirectional attention, and trained it with large-scale contrastive learning. Training occurred in two main stages—pre-fine-tuning on image-text-code data followed by fine-tuning on a broad mixture of text, code, document, image, audio, and video tasks—plus model souping to combine checkpoints. They evaluated on academic and enterprise benchmarks spanning retrieval, code, multilingual text, and specialized domains such as astronomy, microscopy, and recipes.

The model reaches state-of-the-art results on major benchmarks, scoring 62.9 R@1 on MSCOCO image-text retrieval, 68.8 NDCG@10 on Vatex video retrieval, 69.9 on the multilingual MTEB suite, and 84.0 on MTEB Code tasks. Native audio processing improves retrieval over ASR transcription by roughly 3.6 points on average and widens the gap in cross-lingual settings. Zero-shot performance in specialized domains is consistently high and far less variable than competing models, often doubling prior results in astronomy and microscopy while exceeding 90 R@5 on recipe data. Adding targeted video data lifts in-domain scores but can slightly reduce out-of-domain performance; model souping restores balance.

These outcomes indicate that a single model can now replace multiple modality-specific embedders and eliminate intermediate transcription or captioning steps, lowering latency, error propagation, and maintenance costs for search, recommendation, and agentic RAG systems. The robust zero-shot behavior across domains also reduces the need for per-domain fine-tuning. Organizations can deploy the model immediately for document retrieval, video search, and multimodal RAG, and they should test it first on their own data distributions before committing to narrower alternatives.

Further gains are likely from incorporating ranking signals and end-to-end training on full RAG workflows. The main limitations are sensitivity of the multi-task recipe to sampling rates and batch sizes, plus the need for additional evaluation frameworks that measure interleaved multimodal retrieval at scale. Results rest on extensive academic and internal benchmarks and show consistent patterns, supporting high confidence for most enterprise use cases while advising light validation on proprietary data.

1. Introduction

Section Summary: Embedding models turn data like text, images, video, and audio into dense vectors that capture meaning for tasks such as search and recommendations. Current approaches often process modalities separately before combining them, which limits how well they handle mixed inputs or interactions between different types of content. Gemini Embedding 2 overcomes these issues by using a multimodal large language model to create unified representations for any combination of inputs, achieving strong results across retrieval benchmarks through multi-task training.

Embedding models provide dense vector representations that capture semantic information and can be adapted to a wide range of downstream tasks. As foundation models become natively multimodal and rapidly more capable, embedding models must capture semantic information within and across all modalities in a coherent manner. Such general-purpose embedding models also improve applications like video recommendation and document search, whose content is rich in information but heterogeneous in modality and therefore benefits from semantic signals drawn from all modalities.

Existing multimodal embedding models like CLIP ([1]), ALIGN ([2]), SigLIP 2 ([3]), and CoCa ([4]) embed heterogeneous modalities by using paired cross-modal data and training modality-specific encoders to encode them into a unified vector space. This late-fusion approach yields good unimodal and cross-modal capabilities, but it handles mixed-modality inputs poorly and lacks richness because it does not exploit interactions between modalities. With advances in Multimodal Large Language Models (MLLMs), it is now possible to achieve semantically richer embeddings enabled by the deep fusion of cross-modal interactions.

**Figure 1:** Conceptual overview of the Gemini Embedding 2 workflow. The model natively processes heterogeneous inputs—text, images, video, audio, documents, and their combinations—mapping them into a single, unified high-dimensional vector space where cross-modal semantic relationships are preserved.

In this work, we introduce a generalizable multimodal embedding model that embeds video, audio, image, and text modalities, and any arbitrary combination thereof, into a single representation space. The multimodal Gemini Embedding 2 is trained by leveraging Gemini's ([5]) capabilities and multi-task training over a diverse set of tasks, resulting in a model that captures various interactions between modalities. Figure 1 shows a high-level representation of how multimodal Gemini Embedding 2 maps heterogeneous sources into a unified vector space. The curated set of tasks helps the model generalize across a wide variety of enterprise use cases like document retrieval, video recommendation, audio-based search, and RAG applications ([6]). Crucially, enabling the model to handle interleaved sequences of images, text, and video facilitates complex, novel retrieval paradigms, such as zeroing in on specific temporal events in a video using combined visual and textual prompts. Using Gemini's capabilities, we also show that native audio understanding and native multimodal understanding outperform text-based alternatives like ASR or captioning.

**Figure 2:** Gemini Embedding 2 shows strong performance across multimodal retrieval tasks spanning image, text, video, and document modalities. $^*$MTEB number is reported for Voyage-3.5 since Voyage-3.5-multimodal does not report MTEB.

We evaluate comprehensively on a wide variety of benchmarks, both academic-focused and enterprise-focused. As shown in Figure 2, our model achieves state-of-the-art performance compared to other models. For evaluating the text embedding capabilities, we rely on the Massive Multilingual Text Embedding Benchmark (MMTEB) ([7]), which consists of multilingual tasks spanning key downstream embedding use cases such as retrieval, clustering, and classification. Gemini Embedding 2 achieves state-of-the-art performance on the multilingual and code tasks, surpassing existing models on the leaderboard. We demonstrate strong numbers on a broad range of cross-modal retrieval benchmarks like MSCOCO ([8]), Flickr30k ([9]), and MSR-VTT ([10]). We also demonstrate the model's ability to generalize to most multimodal retrieval tasks, in both general and specialized domains.

2. Related Work

Section Summary: Recent work has moved text embedding models from simple encoder-based systems like BERT to powerful decoder-only large language models that are instruction-tuned or distilled from larger teachers, enabling strong performance across search and classification tasks. At the same time, researchers have extended these ideas to multimodal settings that combine text, images, and documents in one shared space, and they have introduced techniques to give causal language models bidirectional understanding and to handle large enterprise documents such as PDFs. Although these advances tackle individual challenges, most prior models address them separately rather than within a single unified system.

Large Language Models as Text Embedders

The paradigm of text embedding models has matured from relying on purely encoder-only architectures (e.g., BERT ([11]), RoBERTa ([12])) to utilizing decoder-only or massive LLM backbones. Models such as the BGE ([13]) series and E5 ([14]) established instruction-tuned representations, effectively unifying downstream tasks—like semantic search, clustering, and classification—into a single model via task-specific prefixes. Recognizing the rich semantic understanding capabilities of LLMs, recent research has focused heavily on LLM-augmented training and distillation. The Gecko model ([15]) demonstrated that lightweight, highly-efficient retrievers can be trained through a two-step distillation pipeline that leverages the vast knowledge of massive LLM teachers. Concurrently, NV-Embed ([16]) achieved strong performance on the MMTEB leaderboard ([17]) by transforming decoder-only LLMs into generalist embedders using instruction-tuned contrastive learning and the aggressive integration of synthetic, non-retrieval data. Gemini Embedding ([18]) demonstrated state-of-the-art performance on the MMTEB leaderboard due to utilizing synthetic data and excellent generalization to multilingual tasks through the powerful pre-training of Gemini.

Evolution of Multimodal Embedders

Early multimodal embedding paradigms, exemplified by dual-tower models like CLIP ([1]) and ALIGN ([2]), were limited by their reliance on narrow contrastive learning objectives over simple image–text pairs. Today, the field is gravitating towards multimodal architectures capable of mapping text, code, images, structured documents, audio, and video into a single, unified, continuous semantic space. Embedding models are trained by extending existing MLLMs for retrieval via multi-stage contrastive training thereby enabling excellent cross-modal retrieval capabilities. SAIL-Embedding ([19]) further illustrates this shift by employing a content-aware progressive training methodology mapping multimodal representations seamlessly into industrial recommendation environments (e.g., sequence-to-item prediction). Similarly, Amazon Nova MME ([20]) and SigLIP 2 ([3]) have demonstrated strong performance in unifying disparate modalities for cross-modal retrieval workflows.

Architectural Adaptations for Bidirectional Attention

While causal (autoregressive) LLMs excel in generative tasks, their inherently unidirectional attention mechanism imposes unnecessary limits when generating dense, context-aware embeddings. Several innovative frameworks have emerged to circumvent this limitation. MoCa ([21]) directly addresses this by introducing modality-aware continual pre-training, utilizing a joint reconstruction objective that denoises interleaved text and image inputs to force bidirectional context-aware reasoning on top of a causal backbone. Similarly, MM-Embed ([22]) tackles the problem of modality bias through modality-aware hard negative mining, ensuring that embedding models do not disproportionally favor text-to-text resonance at the expense of cross-modal relevance.

Adaptation to Enterprise Use Cases

With enterprise and agentic needs scaling to massive contexts and focusing increasingly on documents, modern embedders are required to ingest vast informational payloads efficiently. Models use specialized visual-document processing (such as tiled mixtures of vision encoders) to embed complex PDFs, charts, and tables, which makes the quality of the RAG system dependent on several parts of the processing pipeline, such as the chunking strategy.

While these preceding architectures have successfully pushed the boundaries of multi-stage distillation, LLM backbone adaptation, and applications to enterprise use cases, they predominantly address these axes in isolation. Gemini Embedding 2 unifies these capabilities into a single model that spans a breadth of use cases across which the model can be used out-of-the-box.

3. Multimodal Gemini Embedding

Section Summary: The Multimodal Gemini Embedding 2 model builds on the Gemini foundation model to produce unified vector representations of text, images, video, audio, and their combinations for downstream uses such as retrieval and classification. Raw inputs are tokenized, passed through a bidirectional transformer, averaged via mean pooling, and projected to a target embedding dimension, with support for multiple sizes through a specialized loss adaptation. Training follows a multi-stage, multi-task recipe that begins with large-scale pre-fine-tuning on noisy data using a contrastive objective with in-batch negatives, progressively refining the model across single-modality, multimodal, and cross-modal tasks.

In this section we provide technical details of the Multimodal Gemini Embedding 2 in terms of the model architecture, the objective function, and the training recipe.

3.1 Model Architecture

The Gemini Embedding 2 model is built to create holistic representations of inputs of different modalities and of inputs that combine such modalities. These representations can be used in diverse downstream tasks including retrieval, clustering, classification, and ranking. Gemini Embedding 2 leverages the multimodal and cross-modal power of Gemini to build such representations. The embedding model is initialized from Gemini and further fine-tuned with task-specific, modality-specific, and cross-modality training. This allows Gemini Embedding 2 to build representations on top of the vast knowledge already present in the Gemini parameters. In this sense, initializing Gemini Embedding 2 from Gemini can be understood as the "pre-training" stage of the embedding model.

Gemini Embedding 2 constructs representations in a manner similar to our previous Gemini Embedding model ([18]), but with the important difference that different modalities require different steps to convert the raw format into a sequence of tokens. In Gemini Embedding 2 we leverage Gemini to do these types of data and format conversions. In this way, the model can take as input raw images, video or audio in the formats natively supported by Gemini.

After tokenization, an input sequence $\mathbf{T}$ of $L$ tokens is processed by $\mathcal{M}$, a transformer with bidirectional attention initialized from Gemini, producing a sequence of token embeddings $\mathbf{T}_\mathrm{embed} = \mathcal{M}(\mathbf{T}) \in \mathbb{R}^{L \times d_\mathcal{M}}$, where $d_\mathcal{M}$ is the transformer model dimension. To generate a single embedding representing all the information in the input, a pooler $\mathcal{P}$ is applied, $\mathbf{P}_\mathrm{embed} = \mathcal{P}(\mathbf{T}_\mathrm{embed}) \in \mathbb{R}^{d_\mathcal{M}}$. Prior research ([23]) demonstrated that simple pooling strategies can be effective in model adaptation. Therefore we choose mean pooling, and simply average the token embeddings along the sequence axis. Finally, a randomly initialized linear projection $\mathit{f}$ is applied to scale the embedding to the target dimension, $\mathbf{E} = \mathit{f}(\mathbf{P}_\mathrm{embed}) \in \mathbb{R}^{d}$, where $d$ is the output embedding dimension.
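The pooling and projection steps can be illustrated with a minimal sketch in plain Python. It is not the model's implementation: the token embeddings stand in for the output of $\mathcal{M}$, and the bias-free projection, its Gaussian initialization, and all function names are assumptions made for illustration.

```python
import random

def mean_pool(token_embeddings):
    """Average L token vectors (each of length d_model) along the sequence axis."""
    seq_len = len(token_embeddings)
    d_model = len(token_embeddings[0])
    return [sum(tok[k] for tok in token_embeddings) / seq_len for k in range(d_model)]

def init_projection(d_model, d_out, seed=0):
    """Randomly initialize a (d_out x d_model) matrix. The Gaussian init is an assumption."""
    rng = random.Random(seed)
    scale = d_model ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(d_model)] for _ in range(d_out)]

def linear_projection(vector, weights):
    """Apply a bias-free linear map f (the absence of a bias term is an assumption)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def embed(token_embeddings, weights):
    """E = f(mean_pool(T_embed)), mirroring the text above."""
    return linear_projection(mean_pool(token_embeddings), weights)

# Toy stand-in for T_embed with L = 3 tokens and d_model = 4.
token_embeddings = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
pooled = mean_pool(token_embeddings)
weights = init_projection(d_model=4, d_out=2)
embedding = embed(token_embeddings, weights)
```

Here `pooled` has length $d_\mathcal{M} = 4$ and `embedding` has the target length $d = 2$.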

3.2 Training Objective

The multimodal nature of Gemini Embedding 2 requires a multi-task and multi-stage type of training. This way different modalities can be trained in separate tasks. We used a multitude of single-modality tasks, multimodal tasks, as well as cross-modal tasks.

Similar to our previous version ([18]), the multimodal Gemini Embedding 2 model was trained with a noise-contrastive estimation (NCE) loss with in-batch negatives ([24]). The exact loss differs slightly depending on the task being trained. In general, a training example includes a query $q_i$, a positive target $p_i^+$ and (optionally) a hard negative target $p_{i}^-$. In text-only training tasks, each example also has a prescribed task string $t$, for example "question answering" or "fact checking", describing the nature of the task. During training, we randomly drop the task string $t$ so that the model stays robust on inputs from other modalities, where task strings are not used. The query and passages are embedded as vectors in $\mathbb R^d$:

$ \mathbf q_i = f(\texttt{mean\_pool}(\mathcal M(t \oplus q_i))), \quad \mathbf p^\pm_i = f(\texttt{mean\_pool}(\mathcal M(p^\pm_i))). $

Given a batch of size $B$ the loss applied to these embeddings is as follows:

$ \mathcal L = \frac 1 B \sum_{i=1}^B \left[-\log \frac{e^{\operatorname{sim}(\mathbf q_i, \mathbf p_i^+)/\tau}}{e^{\operatorname{sim}(\mathbf q_i, \mathbf p_i^+)/\tau} + e^{\operatorname{sim}(\mathbf q_i, \mathbf p_{i}^-)/\tau} + \sum_{j=1}^B \texttt{mask}(i, j) e^{\operatorname{sim}(\mathbf q_i, \mathbf p_j^+) / \tau}}\right] $

where $\operatorname{sim}(\mathbf x, \mathbf y)= \mathbf x^\top \mathbf y / \lVert \mathbf x \rVert \lVert \mathbf y \rVert$ is cosine similarity, and

$ \texttt{mask}(i, j) = \begin{cases} 0 \quad & \text{if }q_i=q_j \text{ or } p_i^+=p_j^+, \\ 1 \quad & \text{otherwise.} \end{cases} $

This masking term is particularly relevant for classification tasks, where the number of targets (labels) is small. It should be noted that the second term in the denominator is omitted if no hard negatives are provided.
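The loss above can be written out as a short plain-Python sketch. It follows the formula term by term but is not the training code: the temperature value is an assumption, and comparing integer identifiers (`query_ids`, `positive_ids`, both hypothetical names) stands in for the equality tests $q_i = q_j$ and $p_i^+ = p_j^+$ in the mask.

```python
import math

def cosine(x, y):
    """Cosine similarity sim(x, y)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def nce_loss(queries, positives, hard_negatives=None,
             query_ids=None, positive_ids=None, tau=0.05):
    """In-batch NCE loss with masking. tau=0.05 is an assumed default."""
    batch = len(queries)
    # Without explicit ids, treat every query and target as distinct (assumption).
    query_ids = list(range(batch)) if query_ids is None else query_ids
    positive_ids = list(range(batch)) if positive_ids is None else positive_ids
    total = 0.0
    for i in range(batch):
        pos = math.exp(cosine(queries[i], positives[i]) / tau)
        denom = pos
        if hard_negatives is not None:
            # Second denominator term, omitted when no hard negatives exist.
            denom += math.exp(cosine(queries[i], hard_negatives[i]) / tau)
        for j in range(batch):
            masked = query_ids[i] == query_ids[j] or positive_ids[i] == positive_ids[j]
            if not masked:  # mask(i, j) = 1; note that j == i is always masked
                denom += math.exp(cosine(queries[i], positives[j]) / tau)
        total += -math.log(pos / denom)
    return total / batch

queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
loss = nce_loss(queries, positives, tau=1.0)
# Two examples that share one label: the in-batch negatives are masked out.
masked_loss = nce_loss(queries, positives, positive_ids=[7, 7], tau=1.0)
```

In the second call both examples share a target id, as happens in classification batches with few labels, so no example is penalized for matching another example's identical target.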

To support different embedding dimensions with a single model, we adapt the above loss using Matryoshka Representation Learning (MRL) ([25]) into $k$ separate losses across $k$ nested sub-dimensions of the embedding (e.g. multi-loss training with one loss for the first 768 embedding dimensions, another for the first 1,536 dimensions, and so on). Gemini Embedding 2 provides $d=3{,}072$ dimensional embeddings, with the MRL support optimized for 768 and 1,536 dimensions.
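A rough sketch of the multi-loss idea follows: the same contrastive loss is evaluated on nested prefixes of the embedding and the results are summed. Equal weighting of the $k$ losses, the inclusion of the full 3,072 dimensions as one of the prefixes, and the names `truncate` and `mrl_loss` are assumptions; `toy_loss` is a placeholder and not the NCE loss.

```python
MRL_DIMS = (768, 1536, 3072)  # prefixes named in the text; including 3072 is an assumption

def truncate(embedding, dim):
    """Keep the first `dim` coordinates of an embedding (a nested prefix)."""
    if dim > len(embedding):
        raise ValueError("dim exceeds embedding length")
    return embedding[:dim]

def mrl_loss(base_loss, queries, positives, dims=MRL_DIMS, weights=None):
    """Sum `base_loss` over nested prefixes. Equal weights are an assumption."""
    if weights is None:
        weights = [1.0] * len(dims)
    total = 0.0
    for weight, dim in zip(weights, dims):
        q = [truncate(v, dim) for v in queries]
        p = [truncate(v, dim) for v in positives]
        total += weight * base_loss(q, p)
    return total

def toy_loss(queries, positives):
    """Placeholder loss: mean squared distance between paired vectors."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(q, p)) for q, p in zip(queries, positives)
    ) / len(queries)

# Toy 4-dimensional embeddings with prefixes of size 2 and 4.
queries = [[1.0, 0.0, 1.0, 0.0]]
positives = [[0.0, 0.0, 0.0, 0.0]]
total = mrl_loss(toy_loss, queries, positives, dims=(2, 4))
```

At inference time the same `truncate` step yields the smaller embedding; if cosine similarity is used downstream, the truncated vector is renormalized implicitly by the similarity itself.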

3.3 Recipe

We heavily lean on the multi-task nature of our training setup to let the model learn from each of the different tasks that, as mentioned in Section 3.2, contribute in different ways to build the unified embedding space across the different modalities. We adopt the multi-stage training from previous models like Gecko ([15]) and Gemini Embedding ([18]) as described below.

Pre-Fine-Tuning (PFT)

To adapt the model parameters from auto-regressive generation to encoding, this stage trains on a large number of potentially noisy query–target pairs in a multi-task setup. In this stage we find it beneficial to use large batch sizes, which provide more stable gradients and mitigate the impact of the noisy inputs. Only image, text, and code tasks are used in the multi-task setup during this stage. Examples from each task are sampled at pre-specified sampling rates, and every training batch is built from a single task.

Fine-Tuning (FT)

The fine-tuning stage for this model is based on training with a large number of text, code, document, image, audio, and video tasks. Many, but not all, of the tasks in this fine-tuning include examples that contain query, target, and hard negative target triplets. For this training stage we found it beneficial to tune batch sizes for each task to improve quality on corresponding evaluations. In this stage we also sample examples from one single task to build the training batches. The alignment between modalities is based on training multiple single-modality batches as well as cross-modality ones. As in the previous stage, training with all the different tasks and modalities requires a multi-task training setup, and the sampling rates of the different tasks are set empirically. We found that balancing overall performance across all modalities was sensitive to hyper-parameters such as sampling rates and batch sizes in the multi-task setup.
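Both stages build each batch from a single task, chosen according to per-task sampling rates. The sketch below shows one way such a sampler could look. The task names, rates, and batch sizes are hypothetical, and sampling examples with replacement is an assumption rather than a detail given in the text.

```python
import random

def sample_batches(task_examples, sampling_rates, batch_sizes, num_batches, seed=0):
    """Yield (task, batch) pairs where every batch comes from exactly one task."""
    rng = random.Random(seed)
    names = list(task_examples)
    weights = [sampling_rates[name] for name in names]
    for _ in range(num_batches):
        # Pick the task for this batch in proportion to its sampling rate.
        task = rng.choices(names, weights=weights, k=1)[0]
        # Draw the batch from that task only (with replacement: an assumption).
        batch = rng.choices(task_examples[task], k=batch_sizes[task])
        yield task, batch

# Hypothetical tasks, rates, and per-task batch sizes.
tasks = {"text_retrieval": ["t0", "t1", "t2"], "image_text": ["i0", "i1"]}
rates = {"text_retrieval": 0.75, "image_text": 0.25}
sizes = {"text_retrieval": 4, "image_text": 2}
batches = list(sample_batches(tasks, rates, sizes, num_batches=5))
```

Keeping batches single-task means the in-batch negatives of the contrastive loss always come from the same task and modality pairing as the query.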

Model Soup

To systematize the combination of different checkpoints and obtain additional generalization performance across the different modalities, we average the parameters obtained from individual fine-tuning runs. We experimented with different combinations of parameters, including averaging checkpoints from the same training run ([26]), from different training runs ([27]), as well as various weighted averages.
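Parameter averaging itself is simple to sketch. The example below assumes that each checkpoint is a dictionary from parameter name to a flat list of floats and that all checkpoints share the same names and shapes. The function name and the normalization of the weights are illustrative choices, not details from the training setup.

```python
def average_checkpoints(checkpoints, weights=None):
    """Return the (optionally weighted) average of several checkpoints."""
    if weights is None:
        weights = [1.0] * len(checkpoints)  # uniform soup
    total_weight = sum(weights)
    soup = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        soup[name] = [
            sum(w * ckpt[name][k] for w, ckpt in zip(weights, checkpoints)) / total_weight
            for k in range(size)
        ]
    return soup

# Two toy checkpoints with a single parameter tensor "w".
ckpt_a = {"w": [1.0, 3.0]}
ckpt_b = {"w": [3.0, 5.0]}
uniform = average_checkpoints([ckpt_a, ckpt_b])
weighted = average_checkpoints([ckpt_a, ckpt_b], weights=[3.0, 1.0])
```

The uniform call corresponds to a plain average, while the weighted call favors the first checkpoint three to one.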

4. Evaluation

Section Summary: Gemini Embedding 2 was tested on a wide range of benchmarks covering text, image, video, and audio tasks, where it delivered leading results in both multimodal and single-modality settings without needing task-specific prompts. It outperformed competing models on retrieval tests involving images, videos, and documents, including strong results on challenging captioning and layout-understanding tasks. The model also surpassed prior text-only versions and other multimodal systems on language understanding and code retrieval benchmarks, showing that added multimodal support enhanced rather than reduced its text performance.

We rigorously evaluate Gemini Embedding 2 across a comprehensive suite of multimodal and unimodal benchmarks, demonstrating its state-of-the-art capabilities in text, image, video, and audio understanding. Unlike competing models that often rely on brittle, task-specific instructions, Gemini Embedding 2 provides a robust, unified latent space that delivers high performance in zero-shot settings without the need for manual prompt engineering.

4.1 Multimodal Retrieval

::: {caption="Table 1: Comparison of embedding models on retrieval benchmarks. Our model shows strong performance across a variety of unimodal, cross-modal, and multimodal retrieval tasks. †: Average over intersection of tasks where the metrics are available for all models. Modality abbreviations: V=Video, A=Audio, I=Image, T=Text. ‡: Reported by accessing available APIs unless self-reported."}

:::

We evaluate Gemini Embedding 2 against other multimodal embedding models, namely Voyage-3.5-multimodal ([36]), Amazon Nova MME ([20]), and Google's legacy model multimodalembedding@001 ([37]), across a diverse suite of unimodal, cross-modal, and multimodal retrieval benchmarks spanning image, text, and video modalities (see Table 1). For unimodal image evaluation, we use the Google Universal Image Embedding Challenge (GUIEC) ([38]), which requires instance-level retrieval over a large index of 200,000 images. We also evaluate cross-modal retrieval quality on image-to-text and text-to-image benchmarks including MSCOCO ([8]), Flickr30K ([9]), DOCCI ([30]), and TextCaps ([31]). These tasks range from basic image captioning to long captions that require spatial reasoning and scene text understanding. We embed the images and texts separately using Gemini Embedding 2 and then retrieve using cosine similarity between queries and documents over the whole test set. We also evaluate multimodal embedding capabilities by embedding images and texts together: we cast visual question answering as a retrieval task using EncyclopedicVQA ([34]), where we embed the image along with the question to retrieve the correct answer. For text-to-video retrieval, we evaluate on Vatex ([32]), MSR-VTT ([39]), and YouCook2 ([33]), where each video is embedded at 1 FPS, up to 32 frames.
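The cosine-similarity retrieval protocol and a Recall@K score can be sketched as follows. The toy vectors stand in for real embeddings, and counting a query as a hit when any of its relevant documents appears in the top $k$ is an assumption about how the metric is scored.

```python
import math

def cosine(x, y):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def rank_documents(query, documents):
    """Return document indices sorted by decreasing cosine similarity to the query."""
    scores = [cosine(query, doc) for doc in documents]
    return sorted(range(len(documents)), key=lambda j: scores[j], reverse=True)

def recall_at_k(queries, documents, relevant, k):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = 0
    for query, relevant_set in zip(queries, relevant):
        top_k = rank_documents(query, documents)[:k]
        if any(j in relevant_set for j in top_k):
            hits += 1
    return hits / len(queries)

# Toy embeddings: two queries retrieved against an index of three documents.
documents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9]]
relevant = [{0}, {2}]  # indices of the correct documents for each query
r_at_1 = recall_at_k(queries, documents, relevant, k=1)
r_at_2 = recall_at_k(queries, documents, relevant, k=2)
```

In this toy case the second query ranks its relevant document second, so it counts as a miss at $k=1$ and a hit at $k=2$.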

Gemini Embedding 2 achieves the highest global mean score and leads decisively on unimodal image retrieval, text-to-image, image-to-text, and text-to-video tasks, with particularly strong results on long-caption benchmarks such as DOCCI and TextCaps. The model generalizes well to third-party evaluation tasks like Vatex, MSR-VTT, and YouCook2, even though the training mixture does not include any in-domain training splits of those datasets.

On the ViDoRe Benchmark V2 ([35]) document retrieval benchmark, as presented in Table 1, Gemini Embedding 2 achieves a score of 64.9, delivering competitive performance in a task that demands understanding of page-level visual structure, layout, and embedded text. This places Gemini Embedding 2 ahead of Amazon Nova MME (60.6) and within close range of Voyage-3.5-multimodal (65.5). Gemini Embedding 2 also stands out as one of only two models in this comparison to support the full Video/Audio/Image/Text modality set (alongside Amazon Nova MME), making its document retrieval performance particularly noteworthy given the breadth of tasks it is simultaneously optimized for.

4.2 MMTEB

::: {caption="Table 2: Comparison of multimodal and text-only embedding models on the Massive Text Embedding Benchmark, MTEB(Multilingual), MTEB Code v1, and CoIR benchmarks. Modality abbreviations: V=Video, A=Audio, I=Image, T=Text. $^*$: only self-reported the aggregated MTEB(Multilingual) mean score. †: $\textsc{voyage-3.5}$ for MTEB(Multilingual) and $\textsc{voyage-code-3}$ in CoIR. ‡: Results were not reported."}

:::

The multilingual benchmark MMTEB ([7]) consists of a large collection of individual evaluation tasks covering 250+ languages and 10 task types: Bitext Mining, Classification, Clustering, Instruction Retrieval, Multilabel Classification, Pair Classification, Reranking, Retrieval, STS, and Summarization. The overall performance of Gemini Embedding 2, along with that of other multimodal models, is presented in Table 2, where we also list the modalities supported by each model.

The MMTEB results demonstrate that Gemini Embedding 2 outperforms other multimodal models on this text-only benchmark, indicating that its expanded multimodal capabilities do not compromise its performance on purely textual tasks. Relative to our previous text-only Gemini Embedding model, the new multimodal Gemini Embedding 2 is stronger, raising the Mean (by task) from 68.32 to 69.9. Moreover, Gemini Embedding 2 sets a new state of the art in task-specific evaluations such as MTEB Code v1 ([7]), which consists of 12 code retrieval tasks in 15 programming languages, and the Code Information Retrieval benchmark, CoIR ([40]), which includes 10 code retrieval tasks in 9 programming languages. Table 2 also shows that Gemini Embedding 2 performs considerably better on these benchmarks than our previous text-only Gemini Embedding model. Notably, it also outperforms other text-only models as well as domain-specific models such as voyage-code-3.

4.3 MSEB

\begin{tabular}{@{}lccc@{}}
\toprule
\multirow{2}{*}{\textbf{Model Setup}} & \textbf{Average} & \multicolumn{2}{c}{\textbf{Retrieval Split (mrr@10)}} \\
\cmidrule(lr){3-4}
 & & Passage In-Lang & Passage Cross-Lang \\
\midrule
Gemini Embedding 2 w/ ASR & 70.40 & 73.58 & 67.55 \\
Gemini Embedding 2 w/ Native Audio & \textbf{73.99} & \textbf{75.58} & \textbf{72.56} \\
\bottomrule
\end{tabular}

To rigorously evaluate the auditory capabilities of Gemini Embedding 2, we benchmark the model on the Massive Sound Embedding Benchmark (MSEB) ([41]). We focus our evaluation on the retrieval split of MSEB. The model is given a spoken query and the task is to find the most relevant information for the query in a large corpus of text documents.

4.3.1 Experimental Setup

A persistent challenge in multimodal retrieval is the bottleneck introduced by standard pipelined approaches, where audio is typically transcribed to text before producing the embeddings. To isolate the impact of our unified multimodal architecture, we juxtapose two distinct input modalities:

  1. Gemini Embedding 2 with ASR: A cascaded baseline where the raw audio signal is first transcribed into text via an Automatic Speech Recognition (ASR) system, and the resulting text is subsequently encoded.
  2. Gemini Embedding 2 with audio: Our proposed approach, which directly processes raw audio inputs without intermediate textual transcription.

We utilize Mean Reciprocal Rank at 10 (mrr@10) as our principal evaluation metric. The retrieval setup is further stratified into two key partitions to assess generalization: PassageInLang (intra-lingual retrieval within the same language) and PassageCrossLang (cross-lingual retrieval).
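For reference, mrr@10 can be computed as in the short sketch below. The document identifiers are made up, and the result is a fraction in $[0, 1]$; reported scores such as 73.99 appear to be this value scaled by 100, which is an assumption about the reporting convention.

```python
def mrr_at_k(ranked_lists, relevant_ids, k=10):
    """Mean reciprocal rank of the first relevant result within the top k."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_ids):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_lists)

# Three toy queries: relevant document at rank 1, at rank 3, and not retrieved.
rankings = [["a", "b", "c"], ["x", "y", "z"], ["m", "n"]]
relevant = [{"a"}, {"z"}, {"q"}]
score = mrr_at_k(rankings, relevant, k=10)
```

A query whose relevant passage does not appear in the top 10 contributes zero, so the metric rewards placing the correct passage near the top of the ranking.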

4.3.2 Results

As shown in Table 3, native audio processing significantly improves retrieval performance over the ASR baseline. Gemini Embedding 2 with native audio achieves an average retrieval mrr@10 of 73.99, a substantial improvement over the ASR-based approach (70.40).

Breaking down the task partitions, we observe consistent gains across varying degrees of linguistic complexity:

PassageInLang:

Direct audio modeling improves same-language retrieval by +2.0 points (75.58 vs. 73.58). The performance gap between the cascade baseline and Gemini Embedding 2 highlights a structural flaw in pipeline architectures. The cascade system in this experiment (ASR → Retrieval) suffers heavily from error propagation. If the ASR system misinterprets an ambiguous audio snippet and commits to an incorrect text output, the downstream retrieval system faces a fundamentally altered query, leading to poor search results. Gemini Embedding 2 overcomes this bottleneck by encoding the raw audio natively. Instead of forcing a "hard" textual decision (e.g., "recognize speech" vs. "wreck a nice beach"), the resulting embedding preserves the inherent ambiguity of the original acoustic signal. This robust, continuous representation gives the system a significantly better chance of surfacing the correct retrieval results by preserving rich acoustic cues (e.g., prosody, intonation, and emphasis).

PassageCrossLang:

Notably, the performance delta widens in cross-lingual setups. Native audio embeddings yield a striking +5.01 point enhancement (72.56 vs. 67.55). The dramatic jump in PassageCrossLang validates that the modality-agnostic latent space of Gemini Embedding 2 deeply aligns semantic features regardless of the source audio's spoken language, generalizing robustly beyond the strict phonetic bounds parameterized by an intermediate ASR transcriber.

In aggregate, the MSEB benchmark corroborates that Gemini Embedding 2 successfully models contiguous raw audio, effectively consolidating a holistic representation that significantly outperforms transcription-reliant bottlenecks.

5. Ablation Study

Section Summary: The ablation study examines how various training elements contribute to Gemini Embedding 2's strong results across diverse tasks and domains. Experiments show that generating high-quality synthetic data with Gemini, applying targeted fine-tuning (especially with video examples), and including domain-specific data each deliver clear gains, with the model maintaining more consistent performance than alternatives even in specialized areas like astronomy or microscopy. Overall, these components combine to produce a robust multimodal embedding system that generalizes reliably without sharp drops on unfamiliar data.

To better understand how Gemini Embedding 2 achieves great performance across many different tasks and languages, we provide a systematic analysis of our training recipe.

::: {caption="Table 4: Image-to-Text Retrieval (R@5) performance across various Specialized Domains."}

:::

5.1 Generalization to specialized domains

To rigorously assess the versatility and multimodal alignment of Gemini Embedding 2 in specialized contexts, we evaluated its zero-shot image-to-text retrieval capabilities across a diverse suite of domain-specific datasets. To ensure a comprehensive evaluation, we selected datasets corresponding to distinct real-world applications: microscopy and bioscience (MicroVQA [42]), fine art (ArtCap [43]), astronomy (AstroLLaVA [44]), and culinary arts (Recipe1M [45]). Formulated as a standard Recall@5 (R@5) benchmark, we compared our model against an array of open-source and proprietary vision-language models (see Table 4).
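The evaluation protocol can be illustrated with a small sketch. It assumes cosine similarity between image and text embeddings and exactly one gold text per image; the function names and the two-dimensional toy vectors are hypothetical stand-ins for real embeddings, and this is not the evaluation code used for Table 4.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(image_embs, text_embs, gold_indices, k=5):
    """Fraction of image queries whose gold text is among the top-k texts."""
    hits = 0
    for query, gold in zip(image_embs, gold_indices):
        scores = [cosine(query, cand) for cand in text_embs]
        top_k = sorted(range(len(text_embs)), key=lambda i: scores[i], reverse=True)[:k]
        hits += gold in top_k
    return hits / len(image_embs)


# Toy data (hypothetical): two image queries and six candidate texts.
image_embs = [(1.0, 0.0), (0.0, 1.0)]
text_embs = [(1.0, 0.1), (0.9, 0.4), (0.5, 0.5), (0.2, 1.0), (0.0, 1.0), (1.0, -0.5)]
gold_indices = [0, 5]  # gold text index for each image

r_at_5 = recall_at_k(image_embs, text_embs, gold_indices, k=5)
print(r_at_5)  # first query hits, second misses -> 0.5
```

In the toy data, the gold text for the second image has the lowest similarity of the six candidates, so it falls outside the top five.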

Our findings demonstrate that Gemini Embedding 2 achieves state-of-the-art performance across all evaluated domains, frequently establishing substantial margins of improvement over existing baselines. For instance, in astronomy (AstroLLaVA) and microscopy (MicroVQA), Gemini Embedding 2 achieves an R@5 of 64.4 and 79.3, respectively, effectively doubling the performance of the baselines in astronomy and outperforming them by over 48% in microscopy. On the Recipe1M dataset, it breaks the 90.0 barrier for retrieving both ingredients (90.2) and instructions (92.1), decisively outperforming the next-best model, SigLIP2-Giant (81.2 and 80.4).

Beyond absolute performance margins, our evaluation highlights a notable difference in cross-domain consistency. While the performance of existing model families often fluctuates significantly depending on the target domain, Gemini Embedding 2 maintains a robust, general-purpose alignment. As shown in Table 4, many baseline architectures exhibit incidental performance peaks and valleys across different specialized domains. For instance, the TIPS ([46]) model family demonstrates strong alignment in the fine art domain, with TIPS-G14 achieving an R@5 of 65.2 on ArtCap, yet its performance is much lower on microscopic biological imagery (20.0 on MicroVQA). Similarly, while the SigLIP2 lineage excels on the Recipe1M dataset (scoring up to 81.2), it struggles to capture the visual semantics of ArtCap (dropping to 8.4). In contrast, Gemini Embedding 2 does not exhibit these sharp, domain-dependent fluctuations. Instead, it offers a consistently reliable multimodal embedding space that generalizes predictably across a diverse array of highly specialized tasks.

Ultimately, these results underscore the out-of-the-box robustness of Gemini Embedding 2's representations. Users ranging from bench biologists and astrophysicists to culinary platforms and digital humanities researchers can readily integrate Gemini Embedding 2 into their workflows to power highly accurate, domain-aware, multimodal retrieval systems.

5.2 Impact of synthetic data

The text-only Gemini Embedding model ([18]) showed that Gemini can be used to improve the quality of the text data used to train an embedding model. For Gemini Embedding 2, we again used Gemini to improve the quality of the training data. We illustrate this with a subset of the MTEB Code tasks, as an example of the impact of using Gemini to synthesize high-quality training data. The results are shown in Table 5. Taking the text-only Gemini Embedding model as the baseline, the multimodal Gemini Embedding 2 model shows some improvement even before any synthetic data is added. This is notable because, as observed in other text-only evaluations, the new multimodal model surpasses the performance of our previous text-only version (refer to Table 2 for an MMTEB comparison). Adding synthetic data generated with Gemini yields very noticeable improvements on the three MTEB Code tasks analyzed here, especially on CodeFeedbackMT ([47]) and also on SyntheticText2SQL and CodeFeedbackST ([40]). Overall, the use of synthetic data gives an improvement of +15.81 points on average over our previous Gemini Embedding model on these challenging code retrieval tasks.

::: {caption="Table 5: Results on selected MTEB Code v1 tasks using synthetic datasets. Ablation models exclude souping."}

:::

5.3 Impact of Fine-Tuning and Pre-Fine-Tuning

**Figure 3:** Comparing Pre-Fine-Tuning (PFT) and Fine-Tuning (FT) checkpoints on multimodal evals.

We compare the performance of the Pre-Fine-Tuning (PFT) checkpoint and the final Fine-Tuning (FT) checkpoint across various image and video understanding tasks. As shown in Figure 3, FT improves performance over PFT across almost all evaluated benchmarks. The improvements on image tasks, while consistent, are relatively modest. The most significant improvements are concentrated in the video evaluations due to the additional video training data in FT.

5.4 Impact of In-Domain Video Data

::: {caption="Table 6: Summary of video metrics (NDCG@10 in %) for fine-tuned and souped models. The Delta columns indicate absolute percentage point differences relative to the Gemini Embedding 2 baseline. Adding targeted data improves in-domain performance but can slightly degrade out-of-domain tasks (e.g., YouCook2 dipping by 0.6%), whereas model souping effectively balances these task-specific gains with the original model's robustness."}

:::
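For readers unfamiliar with the metric in Table 6, NDCG@10 can be sketched as below. This is a minimal version of the standard definition with a linear gain; it assumes the input lists the relevance grade of every candidate in ranked order, and the function names are hypothetical. The exact gain formulation used in our evaluation harness is not restated here.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance grades."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Toy example: the single relevant item is ranked second out of three.
score = ndcg_at_k([0, 1, 0], k=10)
print(score)  # 1 / log2(3), roughly 0.63
```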

Comparing the fine-tuned models built on top of Gemini Embedding 2, Table 6 shows that the evaluation metrics are highly sensitive to the addition of targeted, in-domain data. Note that we add the in-domain data to the fine-tuning mixture and train for one epoch on the added data. With only a few thousand steps of training and modest O(k) data quantities, we can drive significant improvements in targeted tasks (e.g., adding the training splits of MSR-VTT and Vatex pushes MSR-VTT to 76.1% and Vatex to 79.5%). However, this narrow focus can lead to slight degradations in out-of-domain tasks (such as YouCook2 dipping to 55.3%). Interestingly, the newly fine-tuned weights remain highly compatible with the original base model through model souping. Simple interpolation of the souping weights (such as the $2 \times \text{Gemini Embedding 2} + 1 \times \text{fine-tuned}$ or $1 \times \text{Gemini Embedding 2} + 1 \times \text{fine-tuned}$ mixtures) effectively brings back the video performance gains, in several cases yielding better results across the board than the baseline by balancing task-specific knowledge with the robustness of the original model.
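The weight interpolation described above can be sketched as a normalized weighted average of parameters. This is an illustrative reading of the $2 \times$ and $1 \times$ mixtures in the spirit of model soups ([27]), not the exact procedure used for Table 6; the `soup` function, the parameter names, and the toy values are hypothetical, and checkpoints are assumed to share identical parameter names and shapes.

```python
def soup(checkpoints, weights):
    """Weighted average of checkpoints given as {name: list of floats}."""
    total = float(sum(weights))
    merged = {}
    for name in checkpoints[0]:
        params = [ckpt[name] for ckpt in checkpoints]
        merged[name] = [
            sum(w * p[i] for w, p in zip(weights, params)) / total
            for i in range(len(params[0]))
        ]
    return merged


# Toy checkpoints (hypothetical values standing in for real model weights).
base = {"proj.weight": [0.0, 3.0], "proj.bias": [1.0]}
fine_tuned = {"proj.weight": [3.0, 0.0], "proj.bias": [4.0]}

two_to_one = soup([base, fine_tuned], [2, 1])  # 2 x base + 1 x fine-tuned
one_to_one = soup([base, fine_tuned], [1, 1])  # 1 x base + 1 x fine-tuned

print(two_to_one)  # {'proj.weight': [1.0, 2.0], 'proj.bias': [2.0]}
print(one_to_one)  # {'proj.weight': [1.5, 1.5], 'proj.bias': [2.5]}
```

A larger weight on the base model keeps the merged parameters closer to the original checkpoint, which matches the intuition of trading task-specific gains against robustness.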

6. Future Work

Section Summary: Gemini Embedding 2’s built-in ability to handle text, images, video, and other formats together opens the door to practical business tools such as smart document search, video suggestions, and mixed-media retrieval without extra conversion steps. The authors suggest that feeding additional signals from search engines, such as ranking data, into these models could further boost performance, and they see value in training complete retrieval systems end-to-end for specific company needs. They also call on outside researchers to develop new ways to test and measure these expanding multimodal capabilities.

The vast native multimodal capabilities of Gemini Embedding 2 unlock the potential for numerous enterprise use cases, such as agentic RAG, video recommendation, and interleaved multimodal retrieval, without the need for conversion to intermediate modalities. With LLM backbones being highly capable, we believe that including other signals from search systems, such as ranking, can be hugely beneficial for improving the retrieval capabilities of embeddings. Agentic RAG use cases also point towards a potential future direction: training RAG systems end-to-end, with embeddings fine-tuned for these enterprise use cases. As the scope of interleaved multimodal applications continues to expand, we invite the broader academic community to contribute novel evaluation frameworks to help benchmark these emerging capabilities.

7. Conclusion

Section Summary: Gemini Embedding 2 is a new AI model that creates versatile data representations by accepting any mix of text, images, audio, and video at once, improving on an earlier text-only version. It delivers strong results across everyday and specialized tasks—from code searches to fields like microscopy or cooking—while using native audio processing and avoiding complicated setup steps for greater efficiency. Overall, the approach supports more capable AI systems that can retrieve and connect information across different data types.

Gemini Embedding 2 represents a transformative step forward in general-purpose representation, delivering a state-of-the-art multimodal successor to our text-only Gemini Embedding model. Gemini Embedding 2 generalizes well across a wide variety of tasks by seamlessly producing embeddings for arbitrary combinations of interleaved inputs across all modalities including text, image, audio, and video. By leveraging Gemini’s core multimodal, multilingual and code-centric foundations, the Gemini Embedding 2 model achieves landmark performance on well-known embedding benchmarks like MSCOCO, Vatex and MMTEB with a particularly significant leap in code retrieval. Our findings highlight its remarkable versatility, showing that it excels not only in general tasks but also across specialized domains such as microscopy, astronomy, and the culinary arts. Furthermore, by demonstrating that native audio input outperforms traditional ASR in retrieval tasks and removing the need for costly task-specific instructions, Gemini Embedding 2 offers a highly efficient architecture. This unified approach to embedding facilitates a sophisticated cross-data retrieval setup, providing the essential infrastructure for building next-generation agentic systems in tandem with Gemini.

8. Full Results

Section Summary: This section presents the complete performance numbers for Gemini Embedding 2 on two MTEB benchmark categories. Table 7 covers the multilingual tasks, while Table 8 lists scores for twelve code-related retrieval tasks, most of which exceed 90, with a few considerably lower. These results give a complete view of the model's effectiveness in multilingual and programming domains.

::: {caption="Table 7: Full results of Gemini Embedding 2 on MTEB (Multilingual)."}

:::

: Table 8: Full results of Gemini Embedding 2 on MTEB (Code).

| Task Name | Performance |
|---|---|
| AppsRetrieval | 98.60 |
| COIRCodeSearchNetRetrieval | 91.90 |
| CodeEditSearchRetrieval | 91.94 |
| CodeFeedbackMT | 92.30 |
| CodeFeedbackST | 88.59 |
| CodeSearchNetCCRetrieval | 96.25 |
| CodeSearchNetRetrieval | 92.96 |
| CodeTransOceanContest | 93.19 |
| CodeTransOceanDL | 33.72 |
| CosQA | 52.05 |
| StackOverflowQA | 97.89 |
| SyntheticText2SQL | 78.11 |
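As a quick consistency check, the unweighted mean of the twelve task scores in Table 8 can be computed directly. The sketch below assumes the headline MTEB Code number is a simple average over tasks, which is not stated explicitly in this section; under that assumption the result agrees with the 84.0 reported in the abstract.

```python
# Per-task scores copied from Table 8.
scores = {
    "AppsRetrieval": 98.60,
    "COIRCodeSearchNetRetrieval": 91.90,
    "CodeEditSearchRetrieval": 91.94,
    "CodeFeedbackMT": 92.30,
    "CodeFeedbackST": 88.59,
    "CodeSearchNetCCRetrieval": 96.25,
    "CodeSearchNetRetrieval": 92.96,
    "CodeTransOceanContest": 93.19,
    "CodeTransOceanDL": 33.72,
    "CosQA": 52.05,
    "StackOverflowQA": 97.89,
    "SyntheticText2SQL": 78.11,
}

# Unweighted mean over tasks (assumed aggregation).
mean_score = sum(scores.values()) / len(scores)
print(f"{mean_score:.2f}")  # 83.96
```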

9. Contributions and Acknowledgments

Section Summary: This section recognizes the many individuals who helped create the work, starting with a long list of core contributors who performed the main research and development, several of whom contributed equally. It then names a smaller leadership group that guided the effort. Finally, it thanks additional colleagues for their support, feedback, and assistance throughout the project.

Core Contributors ($^*$: equal contributions)

Madhuri Shanbhogue$^*$

Zhe Li$^*$

Shanfeng Zhang$^*$

Gustavo Hernández Ábrego$^*$

Shih-Cheng Huang$^*$

Aashi Jain$^*$

Daniel Salz

Sonam Goenka

Chaitra Hegde

Ji Ma

Feiyang Chen

Jiaxing Wu

Tanmaya Dabral

Babak Samari

Kevin Poulet

Daniel Cer

Kaifeng Chen

Paul Suganathan

Hui Hui

Jovan Andonov

Philippe Schlattner

Jay Han

Iftekhar Naim

Wing Lowe

Vladimir Pchelin

Albert Yang

Yi-Ting Chen

Zhongli Ding

Grace Zhang

Georg Heigold

Yichang Chen

Antoine Reveillon

Brendan McCloskey

Wenlei Zhou

Dahun Kim

Rui Meng

Emma Wang

Jack Zheng

Halley Fede

Zhen Yang

Keegan Mosley

Brian Potetz

Sahil Dua

Henrique Schechter Vera

Shen Gao

Hesen Zhang

Andreas Hess

Hengxuan Ying

Alberto Montes

Karan Gill

Min Choi

Sebastian Russo

Anja Hauth

Jinhyuk Lee

Michael Boratko

Megan Barnes

Vikram Rao

Claudiu Musat

Cyril Allauzen

Ehsan Variani

Shankar Kumar

Tom Bagby

Junyi Jiao

Yang Gu

Tengxin Li

Ayush Agrawal

Roberto Santana

Dev Nath

Stephen Karukas

Shuoxuan Han

Lucia Loher

Alice Twu

Nidhi Vyas

Siddharth Bhai

Frank Palma Gomez

Wangyuan Zhang

Chaoren Liu

Jizheng Yang

Steve Qiu

Shijie Zhang

Sujay Kulkarni

Sascha Rothe

Sean Nakamoto

Leadership

Raphael Hoffmann

Zach Gleicher

Yunhsuan Sung

Qin Yin

Tom Duerig

Mojtaba Seyedhosseini

Acknowledgement

James Gan, Jon Matthews, Luciano Martins, Patrick Löber, Anna Kelly, Kristen Quan, Roxanne Daniel, Ryan Trostle, Tania Bedrax-Weiss, Srinivasan (Cheenu) Venkatachary, Howard Zhou, Tomas Izo.

References

Section Summary: The references section compiles a list of academic papers, conference proceedings, and technical reports primarily focused on artificial intelligence topics such as vision-language models, text and multimodal embeddings, and related benchmarks. These citations draw from foundational works on contrastive learning and transformer architectures as well as more recent advances in scalable representation learning and retrieval systems. Many entries point to arXiv preprints or proceedings from major machine learning venues, reflecting contributions from both academic and industry research groups.

[1] Radford et al. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning. pp. 8748–8763.

[2] Jia et al. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning. pp. 4904–4916.

[3] Tschannen et al. (2025). SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features.

[4] Jiahui Yu et al. (2022). CoCa: Contrastive Captioners are Image-Text Foundation Models. https://arxiv.org/abs/2205.01917. arXiv:2205.01917.

[5] Comanici et al. (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

[6] Lewis et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in neural information processing systems. 33. pp. 9459–9474.

[7] Enevoldsen et al. (2025). MMTEB: Massive Multilingual Text Embedding Benchmark. arXiv preprint arXiv:2502.13595.

[8] Xinlei Chen et al. (2015). Microsoft COCO Captions: Data Collection and Evaluation Server. https://arxiv.org/abs/1504.00325. arXiv:1504.00325.

[9] Bryan A. Plummer et al. (2016). Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. https://arxiv.org/abs/1505.04870. arXiv:1505.04870.

[10] Jun Xu et al. (2016). MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5288-5296. https://api.semanticscholar.org/CorpusID:206594535.

[11] Jacob Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers). pp. 4171–4186.

[12] Yinhan Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://arxiv.org/abs/1907.11692. arXiv:1907.11692.

[13] Jianlv Chen et al. (2025). M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. https://arxiv.org/abs/2402.03216. arXiv:2402.03216.

[14] Liang Wang et al. (2024). Text Embeddings by Weakly-Supervised Contrastive Pre-training. https://arxiv.org/abs/2212.03533. arXiv:2212.03533.

[15] Jinhyuk Lee et al. (2024). Gecko: Versatile Text Embeddings Distilled from Large Language Models. https://arxiv.org/abs/2403.20327. arXiv:2403.20327.

[16] Chankyu Lee et al. (2025). NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. ArXiv. https://arxiv.org/abs/2405.17428. arXiv:2405.17428.

[17] Muennighoff et al. (2023). MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. pp. 2006–2029.

[18] Jinhyuk Lee et al. (2025). Gemini Embedding: Generalizable Embeddings from Gemini. https://arxiv.org/abs/2503.07891. arXiv:2503.07891.

[19] Lin Lin et al. (2025). SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model. https://arxiv.org/abs/2510.12709. arXiv:2510.12709.

[20] Danilo Poccia (2025). Amazon Nova Multimodal Embeddings: State-of-the-art embedding model for agentic RAG and semantic search. https://aws.amazon.com/blogs/aws/amazon-nova-multimodal-embeddings-now-available-in-amazon-bedrock/.

[21] Haonan Chen et al. (2025). MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings. https://arxiv.org/abs/2506.23115. arXiv:2506.23115.

[22] Sheng-Chieh Lin et al. (2025). MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs. https://arxiv.org/abs/2411.02571. arXiv:2411.02571.

[23] Paul Suganthan et al. (2025). Adapting Decoder-Based Language Models for Diverse Encoder Downstream Tasks. https://arxiv.org/abs/2503.02656. arXiv:2503.02656.

[24] Oord et al. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

[25] Kusupati et al. (2022). Matryoshka representation learning. Advances in Neural Information Processing Systems. 35. pp. 30233–30249.

[26] Izmailov et al. (2018). Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407.

[27] Wortsman et al. (2022). Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International conference on machine learning. pp. 23965–23998.

[28] Qin et al. (2022). Introducing the Google Universal Image Embedding Challenge. https://research.google/blog/introducing-the-google-universal-image-embedding-challenge/.

[29] Deng et al. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops). pp. 248-255. doi:10.1109/CVPR.2009.5206848. https://doi.ieeecomputersociety.org/10.1109/CVPR.2009.5206848.

[30] Yasumasa Onoe et al. (2024). DOCCI: Descriptions of Connected and Contrasting Images. https://arxiv.org/abs/2404.19753. arXiv:2404.19753.

[31] Sidorov et al. (2020). TextCaps: A Dataset for Image Captioning with Reading Comprehension. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II. pp. 742–758. doi:10.1007/978-3-030-58536-5_44. https://doi.org/10.1007/978-3-030-58536-5_44.

[32] Xin Wang et al. (2020). VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. https://arxiv.org/abs/1904.03493. arXiv:1904.03493.

[33] Luowei Zhou et al. (2017). Towards Automatic Learning of Procedures from Web Instructional Videos. https://arxiv.org/abs/1703.09788. arXiv:1703.09788.

[34] Mensink et al. (2023). Encyclopedic VQA: Visual Questions About Detailed Properties of Fine-Grained Categories. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 3113-3124.

[35] Quentin Macé et al. (2025). ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval. https://arxiv.org/abs/2505.17166. arXiv:2505.17166.

[36] Voyage AI (2026). Voyage Multimodal 3.5. https://blog.voyageai.com/2026/01/15/voyage-multimodal-3-5/.

[37] Google Cloud. Multimodal Embeddings API. https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api.

[38] Andre Araujo et al. (2022). Google Universal Image Embedding. https://kaggle.com/competitions/google-universal-image-embedding.

[39] Xu et al. (2016). MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Xiangyang Li et al. (2024). CoIR: A Comprehensive Benchmark for Code Information Retrieval Models. https://arxiv.org/abs/2407.02883. arXiv:2407.02883.

[41] Georg Heigold et al. (2026). Massive Sound Embedding Benchmark (MSEB). https://arxiv.org/abs/2602.07143. arXiv:2602.07143.

[42] Burgess et al. (2025). MicroVQA: A multimodal reasoning benchmark for microscopy-based scientific research. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19552–19564.

[43] Lu et al. (2022). ArtCap: A dataset for image captioning of fine art paintings. IEEE Transactions on Computational Social Systems. 11(1). pp. 576–587.

[44] Zaman et al. (2025). AstroLLaVA: towards the unification of astronomical data and natural language. arXiv preprint arXiv:2504.08583.

[45] Marín et al. (2021). Recipe1M+: A dataset for learning cross-modal embeddings for cooking recipes and food images. IEEE Transactions on Pattern Analysis and Machine Intelligence. 43(1). pp. 187–203.

[46] Kevis-Kokitsi Maninis et al. (2025). TIPS: Text-Image Pretraining with Spatial awareness. https://arxiv.org/abs/2410.16512. arXiv:2410.16512.

[47] Tianyu Zheng et al. (2024). OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement. https://arxiv.org/abs/2402.14658. arXiv:2402.14658.