Model Overview
Description:
llama-nemotron-embed-vl-1b-v2 was developed by NVIDIA for multimodal question-answering retrieval. The model can embed document pages in the form of image, text, or combined image–text inputs. Documents can then be retrieved given a user query in text form. The model supports page images containing text, tables, charts, and infographics. We report the evaluation of this model on two internal multimodal retrieval benchmarks, as well as on the popular ViDoRe V1 and V2 benchmarks and the new ViDoRe V3 benchmark.
An embedding model is a crucial component of a retrieval system because it transforms information into dense vector representations. An embedding model is typically a transformer encoder that processes tokens of input text or images (for example, questions, passages, or page images) to output an embedding. llama-nemotron-embed-vl-1b-v2 is a combined language and vision model.
The llama-nemotron-embed-vl-1b-v2 is part of the Nemotron RAG collection of open models available on HuggingFace. It is also available for optimized inference as a NIM (NVIDIA Inference Microservice) from NVIDIA NeMo Retriever, which provides state-of-the-art, commercially-ready models and microservices optimized for the lowest latency and highest throughput. It features a production-ready information retrieval pipeline with enterprise support. The models that form the core of this solution have been trained using responsibly selected, auditable data sources. With multiple pre-trained models available as starting points, developers can readily customize them for domain-specific use cases, such as information technology and human resources help assistants, or research & development assistants.
This model is ready for commercial use.
License/Terms of use
The use of this model is governed by the NVIDIA Open Model License Agreement; the post-processing scripts are licensed under Apache 2.0. Additional Information: Llama 3.2 Community Model License Agreement. Built with Llama.
Deployment Geography:
Global
Use Case:
The llama-nemotron-embed-vl-1b-v2 is suitable for users who want to build a multimodal question-and-answer application over a large corpus, leveraging the latest dense retrieval technology.
The input of the model is a text or a document image, and the output is a fixed-size embedding vector.
The embedding model is a bi-encoder that supports context in textual format (e.g. the query or the OCR text of a page or a section of a document) or the image of a document page.
Typically, the embedding model is used first to embed (vectorize) the whole corpus (document images or text chunks), and the embeddings are stored in a vector database, each associated with its raw content (image or text). Then, at inference time, the embedding model is used to embed the query. The embeddings of the query and of the relevant context from the corpus should be close in the embedding space.
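The two-phase flow described above can be sketched as follows. Note that `embed_documents` and `embed_query` are hypothetical stand-ins (returning random vectors here) for the model's real `encode_documents`/`encode_queries` calls; they are used only to illustrate the indexing and lookup mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 2048  # this model outputs 2048-dimensional embeddings

def embed_documents(docs):  # offline phase: vectorize the whole corpus once
    return rng.normal(size=(len(docs), DIM))

def embed_query(query):     # online phase: vectorize each incoming query
    return rng.normal(size=(DIM,))

corpus = ["page-1 image or text", "page-2 image or text", "page-3 image or text"]

# Embed the corpus and normalize; these vectors would be stored in a vector DB
index = embed_documents(corpus)
index /= np.linalg.norm(index, axis=1, keepdims=True)

# At inference time, embed the query and score it against the whole index
q = embed_query("user question")
q /= np.linalg.norm(q)
scores = index @ q                 # cosine similarity per document
best = int(np.argmax(scores))      # most relevant page for this query
print(corpus[best])
```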
Release Date:
12/18/2025 via https://huggingface.co/nvidia/llama-nemotron-embed-vl-1b-v2
References(s):
Technical report - "Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model"
Citation
@inproceedings{moreira2025_nvretriever,
author = {Moreira, Gabriel de Souza P. and Osmulski, Radek and Xu, Mengyao and Ak, Ronay and Schifferer, Benedikt and Oldridge, Even},
title = {Improving Text Embedding Models with Positive-aware Hard-negative Mining},
year = {2025},
isbn = {9798400720406},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3746252.3761254},
doi = {10.1145/3746252.3761254},
pages = {2169–2178},
numpages = {10},
keywords = {contrastive learning, distillation, embedding models, hard-negative mining, rag, text retrieval, transformers},
location = {Seoul, Republic of Korea},
series = {CIKM '25}
}
Model Architecture
Architecture Type: Transformer
Network Architecture: Eagle VLM architecture with Llama 3.2 1B language model and SigLip2 400M image encoder
The llama-nemotron-embed-vl-1b-v2 embedding model is a transformer encoder with approximately 1.7B parameters. It is a fine-tuned version of the NVIDIA Eagle family of models, using the Llama 3.2 1B language model and the SigLip2 400M image encoder. The language model submodule has 16 layers with an embedding size of 2048 and is pre-trained on public datasets. Embedding models for retrieval are typically trained with a bi-encoder architecture, which encodes the query and the document independently. The model applies mean pooling over the output token embeddings from the language model, so that it outputs a single embedding with 2048 dimensions. Contrastive learning is used to train the embedding model to maximize the similarity between the query and the document page that contains the answer, while minimizing the similarity between the query and sampled negative pages that are not useful for answering the question.
The vision-language model encoder incorporates key innovations from NVIDIA, including Eagle 2 research and nemoretriever-parse, which use a tiling-based VLM architecture. This architecture, available on Hugging Face, significantly enhances multimodal understanding through its dynamic tiling and mixture of vision encoders design. It particularly improves performance on tasks that involve high-resolution images and complex visual content.
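The mean pooling and in-batch contrastive objective described above can be illustrated with a minimal, self-contained sketch. The tensor shapes, temperature value, and random inputs are illustrative assumptions, not the model's actual training configuration:

```python
import torch
import torch.nn.functional as F

batch, seq_len, dim = 4, 32, 2048

token_embs_q = torch.randn(batch, seq_len, dim)  # query token embeddings
token_embs_d = torch.randn(batch, seq_len, dim)  # positive-document token embeddings
mask = torch.ones(batch, seq_len)                # attention mask (1 = real token)

def mean_pool(token_embs, mask):
    # Average only over non-padding positions -> one vector per sequence
    m = mask.unsqueeze(-1)
    return (token_embs * m).sum(dim=1) / m.sum(dim=1).clamp(min=1)

q = F.normalize(mean_pool(token_embs_q, mask), dim=-1)
d = F.normalize(mean_pool(token_embs_d, mask), dim=-1)

# Similarity of every query to every document in the batch; the diagonal
# holds the positive pairs, and off-diagonal entries act as in-batch negatives.
temperature = 0.05  # assumed value for illustration
logits = (q @ d.T) / temperature
labels = torch.arange(batch)
loss = F.cross_entropy(logits, labels)  # maximizes positive-pair similarity
print(loss.item())
```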
Number of model parameters:
- Llama 3.2 1B language model: 1.23 B (Transformer parameters: 973 M, Token embedding parameters: 262 M)
- SigLip 2 image encoder: 428.77 M
Input(s):
Input Type(s): Image, Text
Input Format(s):
- Image: Red, Green, Blue (RGB)
- Text: String
Input Parameters:
- Image: Two-Dimensional (2D)
- Text: One-Dimensional (1D)
Output Parameters:
- Image/Text Embedding (2D) - embedding of 2048 dimensions
Other Properties Related to Input:
- The maximum context length we evaluated is 10240 tokens.
- Each image tile consumes 256 tokens. We have tested this model extensively with the following settings in config.json: `max_input_tiles = 6`, `use_thumbnail = True`, so that every image is split into a maximum of 6 tiles plus 1 thumbnail (the whole image at lower resolution), consuming about 1792 visual tokens. If you embed both the page image and its text (e.g., page OCR), the sum of the visual tokens (explained above) and the text tokens should not exceed 10240 tokens.
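As a quick sanity check, the visual token budget above works out as follows (constants taken from the settings described in this section):

```python
TOKENS_PER_TILE = 256    # each image tile consumes 256 tokens
MAX_INPUT_TILES = 6      # max_input_tiles setting
THUMBNAIL_TILES = 1      # use_thumbnail adds one whole-page tile at lower resolution
MAX_CONTEXT = 10240      # maximum evaluated context length

visual_tokens = (MAX_INPUT_TILES + THUMBNAIL_TILES) * TOKENS_PER_TILE
print(visual_tokens)       # 1792 visual tokens, as stated above

# In image+text mode, whatever remains can be spent on the page's OCR text
text_token_budget = MAX_CONTEXT - visual_tokens
print(text_token_budget)   # 8448 tokens left for text
```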
Output(s)
Output Type: Floats
Output Format: List of float arrays (embeddings)
Output: Model outputs embedding vectors of maximum dimension 2048 for each input.
Other Properties Related to Output: N/A
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (such as GPU cores) and software frameworks (such as CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Installation
The model requires `transformers>=4.47.1` and flash-attention.

```shell
pip install "transformers>=4.47.1,<5.0.0"
pip install "flash-attn>=2.6.3,<2.8" --no-build-isolation
```
Transformers Usage
```python
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

modality = "image"  # one of "image", "image_text", "text"

# Load model
model_name_or_path = "nvidia/llama-nemotron-embed-vl-1b-v2"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    device_map="auto",
).eval()

# Set max number of tokens (p_max_length) based on modality
if modality == "image":
    p_max_length = 2048
elif modality == "image_text":
    p_max_length = 10240
elif modality == "text":
    p_max_length = 8192
model.processor.p_max_length = p_max_length

# Max number of tiles an image can be split into. Each tile consumes 256 tokens.
model.processor.max_input_tiles = 6
# Enables an extra tile with the full image at lower resolution
model.processor.use_thumbnail = True

# Example usage: single query with multiple image documents
query = "How is AI improving the intelligence and capabilities of robots?"
image_paths = [
    "https://developer.download.nvidia.com/images/isaac/nvidia-isaac-lab-1920x1080.jpg",
    "https://blogs.nvidia.com/wp-content/uploads/2018/01/automotive-key-visual-corp-blog-level4-av-og-1280x680-1.png",
    "https://developer-blogs.nvidia.com/wp-content/uploads/2025/02/hc-press-evo2-nim-25-featured-b.jpg",
]

# Load all images (load_image handles both local paths and URLs)
images = [load_image(img_path) for img_path in image_paths]

# Text descriptions corresponding to each image/document (used in image_text and text modalities)
document_texts = [
    "AI enables robots to perceive, plan, and act autonomously.",
    "AI is transforming autonomous vehicles by enabling safer, smarter, and more reliable decision-making on the road.",
    "A biological foundation model designed to analyze and generate DNA, RNA, and protein sequences.",
]

# Run inference (common to all modalities)
with torch.inference_mode():
    queries_embeddings = model.encode_queries([query])
    if modality == "image_text":
        documents_embeddings = model.encode_documents(images=images, texts=document_texts)
    elif modality == "image":
        documents_embeddings = model.encode_documents(images=images)
    elif modality == "text":
        documents_embeddings = model.encode_documents(texts=document_texts)

def _l2_normalize(x: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    return x / (x.norm(p=2, dim=-1, keepdim=True) + eps)

# Cosine similarity between the query embeddings and the document embeddings
# (normalizing first makes the dot product a cosine similarity)
cos_sim = _l2_normalize(queries_embeddings) @ _l2_normalize(documents_embeddings).T

# Flatten the similarities to 1D (handles both [num_docs] and [1, num_docs] shapes)
cos_sim_flat = cos_sim.flatten()

# Get sorted indices (highest to lowest similarity)
sorted_indices = torch.argsort(cos_sim_flat, descending=True)

print(f"\nQuery: {query}\n")
print(f"\nRanking (highest to lowest relevance for the modality {modality}):")
for rank, idx in enumerate(sorted_indices, 1):
    doc_idx = idx.item()
    sim_val = cos_sim_flat[doc_idx].item()
    if modality == "text":
        print(f"  Rank {rank}: cos_sim={sim_val:.4f} | Text: {document_texts[doc_idx]}")
    else:  # image or image_text modality
        print(f"  Rank {rank}: cos_sim={sim_val:.4f} | Image: {image_paths[doc_idx]}")
```
Software Integration:
Runtime Engine(s): TensorRT, Triton, NeMo Retriever Embedding NIM (upcoming)
Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere, NVIDIA Blackwell, NVIDIA Hopper, NVIDIA Lovelace
Preferred/Supported Operating System(s): Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s):
llama-nemotron-embed-vl-1b-v2
Training and Evaluation Datasets:
Training Dataset
The development of large-scale, public, open-QA datasets has enabled tremendous progress in powerful embedding models. However, the following issues limit the use of these models in commercial settings.
- One popular dataset, MS MARCO, restricts commercial use in its license.
- Many multimodal datasets use synthetic data generation with proprietary models.
NVIDIA's training dataset is based on public QA datasets, and only includes datasets that have a license for commercial applications.
Properties: The text component was trained with semi-supervised pre-training on 12M samples from public datasets and fine-tuning on 1.5M samples from public datasets. The VLM component uses only commercially-viable data from the Eagle2 training data and other public datasets.
Data Modality: Image, Text
Image Training Data Size
- 1 Million to 1 Billion Images (about 2.57 million)
Text Training Data Size
- 1 Billion to 10 Trillion Tokens (about 1.6 Billion)
Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
Evaluation Datasets
Vision document retrieval benchmarks
We evaluated llama-nemotron-embed-vl-1b-v2 on the popular ViDoRe V1 and V2 benchmarks and on the new ViDoRe V3.
More details can be found on the ViDoRe leaderboard.
We also evaluated the llama-nemotron-embed-vl-1b-v2 on two internal visual document retrieval datasets:
- DigitalCorpora-10k: A dataset with questions based on a corpus of 10k documents from DigitalCorpora that have a good mixture of text, tables, and charts.
- Earnings V2: an internal retrieval dataset of 287 questions based on 500 PDFs, mostly consisting of earnings reports from big tech companies.
For those interested in reproducing our results, one of our internal datasets (DigitalCorpora-10k) can be created by following the instructions in this notebook (download script) from the NeMo Retriever Extraction GitHub repository.
Text retrieval benchmarks
We evaluated llama-nemotron-embed-vl-1b-v2 on 92 text retrieval datasets, from the benchmarks BEIR, MIRACL (multi-language), MLQA (cross-language) and MLDR (long-context).
Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
Evaluation Results
Visual Document Retrieval (page retrieval)
In this section, we compare the performance of llama-nemotron-embed-vl-1b-v2 with its previous version, llama-3.2-nemoretriever-1b-vl-embed-v1 (closed weights), available as a NIM here. You can see here how the previous model compares to other small-sized VLMs.
The table below shows that the new llama-nemotron-embed-vl-1b-v2 provides much better retrieval accuracy (Recall@5) for the image and image+text modalities than its predecessor.
Note: Image+Text modality means that both the page image and its text (that might be extracted by some OCR library like NV-Ingest) are fed as input to the embedding model for more accurate representation and retrieval.
Recall@5 by modality:

| Model | Text | Image | Image + Text |
|---|---|---|---|
| llama-nemotron-embed-1b-v2 (former name: llama-3_2-nv-embedqa-1b-v2) | 69.35% | - | - |
| llama-3.2-nemoretriever-1b-vlm-embed-v1 (closed weights, NIM-only) | 71.07% | 70.46% | 71.71% |
| llama-nemotron-embed-vl-1b-v2 | 71.04% | 71.20% | 73.24% |
Text Retrieval benchmarks (chunk retrieval)
The llama-nemotron-embed-vl-1b-v2 also improves retrieval accuracy on text retrieval benchmarks compared to our competitive text-only embedding model llama-nemotron-embed-1b-v2. That means you can deploy the single VLM-based model llama-nemotron-embed-vl-1b-v2 regardless of whether the corpus to be retrieved consists of images or text.
| Model | BEIR retrieval + TechQA | MIRACL | MLQA | MLDR | Average |
|---|---|---|---|---|---|
| llama-nemotron-embed-1b-v2 (former name: llama-3_2-nv-embedqa-1b-v2) | 68.60% | 60.75% | 79.86% | 59.55% | 67.19% |
| llama-nemotron-embed-vl-1b-v2 | 69.19% | 60.48% | 79.90% | 60.09% | 67.42% |
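As a sanity check, the "Average" column is the plain mean of the four benchmark scores (values copied from the table above):

```python
# Per-model scores on BEIR retrieval + TechQA, MIRACL, MLQA, and MLDR,
# copied from the text retrieval benchmark table.
scores = {
    "llama-nemotron-embed-1b-v2":    [68.60, 60.75, 79.86, 59.55],
    "llama-nemotron-embed-vl-1b-v2": [69.19, 60.48, 79.90, 60.09],
}
for name, vals in scores.items():
    avg = sum(vals) / len(vals)
    print(f"{name}: {avg:.2f}%")  # matches the table's Average column
```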
Inference
Acceleration Engine: TensorRT
Test Hardware: H100, A100, L40S, A10G, B200, RTX PRO 6000
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, see the Model Card++ tab for the Explainability, Bias, Safety & Security, and Privacy subcards.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
Bias
| Field | Response |
|---|---|
| Participation considerations from adversely impacted groups protected classes in model design and testing | None |
| Measures taken to mitigate against unwanted bias | None |
Explainability
| Field | Response |
|---|---|
| Intended Application & Domain: | Document and query embedding for question and answer retrieval. |
| Model Type: | Transformer encoder. |
| Intended User: | Generative AI creators working with conversational AI models. Users who want to build a question-and-answer application over a large corpus, leveraging the latest dense retrieval technologies. The corpus can be images of PDF pages containing text, tables, charts, or infographics, or extracted plain text. |
| Output: | Array of float numbers (Dense Vector Representation for the input text). |
| Describe how the model works: | Model transforms the input into a dense vector representation. |
| Technical Limitations: | The model's max sequence length is 10240. Longer text inputs should be truncated. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | N/A |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | Accuracy, Throughput, and Latency. |
| Potential Known Risks: | This model does not guarantee to always retrieve the correct passage(s) for a given query. |
| Licensing & Terms of Use: | The use of this model is governed by the NVIDIA Open Model License Agreement; the post-processing scripts are licensed under Apache 2.0. Additional Information: Llama 3.2 Community Model License Agreement. Built with Llama. |
Privacy
| Field | Response |
|---|---|
| Generatable or reverse engineerable personal data? | None |
| Personal data used to create this model? | None Known |
| How often is dataset reviewed? | Dataset is initially reviewed upon addition, and subsequent reviews are conducted as needed or upon request for changes. |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data. |
| Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/ |
Safety
| Field | Response |
|---|---|
| Model Application(s): | Document Embedding for Retrieval. User queries can be text and documents can be text, document page images, charts, tables, and infographics. |
| Describe the life critical impact (if present) | Not applicable |
| Use Case Restrictions: | The use of this model is governed by the NVIDIA Open Model License Agreement; the post-processing scripts are licensed under Apache 2.0. Additional Information: Llama 3.2 Community Model License Agreement. Built with Llama. |
| Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Access restrictions are enforced on datasets during training, and dataset license constraints are adhered to. |