---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- visual-document-retrieval
- cross-modal-distillation
- knowledge-distillation
- document-retrieval
- multilingual
- nanovdr
base_model: distilbert/distilbert-base-uncased
language:
- en
- de
- fr
- es
- it
- pt
license: apache-2.0
datasets:
- openbmb/VisRAG-Ret-Train-Synthetic-data
- openbmb/VisRAG-Ret-Train-In-domain-data
- vidore/colpali_train_set
- llamaindex/vdr-multilingual-train
model-index:
- name: NanoVDR-S-Multi
results:
- task:
type: retrieval
dataset:
name: ViDoRe v1
type: vidore/vidore-benchmark-667173f98e70a1c0fa4d
metrics:
- name: NDCG@5
type: ndcg_at_5
value: 82.2
- task:
type: retrieval
dataset:
name: ViDoRe v2
type: vidore/vidore-benchmark-v2
metrics:
- name: NDCG@5
type: ndcg_at_5
value: 61.9
---
<p align="center">
<img width="560" src="banner.png" alt="NanoVDR"/>
</p>
<p align="center">
<a href="https://arxiv.org/abs/2603.12824">Paper</a> |
<a href="https://huggingface.co/blog/Ryenhails/nanovdr">Blog</a> |
<a href="https://huggingface.co/collections/nanovdr/nanovdr">All Models</a>
</p>
> **Paper**: [NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval](https://arxiv.org/abs/2603.12824)
# NanoVDR-S-Multi
**The recommended NanoVDR model for production use.**
NanoVDR-S-Multi is a **69M-parameter multilingual text-only** query encoder for visual document retrieval. It encodes text queries into the same embedding space as a frozen 2B VLM teacher ([Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B)), so you can retrieve document page images using **only a DistilBERT forward pass** — no vision model at query time.
### Highlights
- **95.1% teacher retention** — a 69M text-only model recovers 95% of a 2B VLM teacher across 22 ViDoRe datasets
- **Outperforms DSE-Qwen2 (2B)** on multilingual v2 (+6.2) and v3 (+4.1) with **32x fewer parameters**
- **Outperforms ColPali (~3B)** on multilingual v2 (+7.2) and v3 (+4.5) with **single-vector cosine** retrieval (no MaxSim)
- **Single-vector retrieval** — queries and documents share the same 2048-dim embedding space as [Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B); retrieval is a plain dot product, FAISS-compatible, **4 KB per page** (float16)
- **Lightweight on storage** — 274 MB checkpoint; at float16, the document index costs 64× less than ColPali's multi-vector patch embeddings
- **51 ms CPU query latency** — 50x faster than DSE-Qwen2, 143x faster than ColPali
- **6 languages**: English, German, French, Spanish, Italian, Portuguese — all >92% teacher retention
---
## Results
| Model | Type | Params | ViDoRe v1 | ViDoRe v2 | ViDoRe v3 | Avg Retention |
|-------|------|--------|-----------|-----------|-----------|---------------|
| Tomoro-8B | VLM | 8.0B | 90.6 | 65.0 | 59.0 | — |
| Qwen3-VL-Emb (Teacher) | VLM | 2.0B | 84.3 | 65.3 | 50.0 | — |
| DSE-Qwen2 | VLM | 2.2B | 85.1 | 55.7 | 42.4 | — |
| ColPali | VLM | ~3B | 84.2 | 54.7 | 42.0 | — |
| **NanoVDR-S-Multi** | **Text-only** | **69M** | **82.2** | **61.9** | **46.5** | **95.1%** |
<sub>NDCG@5 (×100). v1 = 10 English datasets, v2 = 4 multilingual datasets, v3 = 8 multilingual datasets.</sub>
### Per-Language Retention (v2 + v3, 19,157 queries)
| Language | #Queries | Teacher | NanoVDR-S-Multi | Retention |
|----------|----------|---------|-----------------|-----------|
| English | 6,237 | 64.0 | 60.3 | 94.3% |
| French | 2,694 | 51.0 | 47.8 | 93.6% |
| Portuguese | 2,419 | 48.7 | 46.1 | 94.6% |
| Spanish | 2,694 | 51.4 | 47.8 | 93.1% |
| Italian | 2,419 | 49.0 | 45.7 | 93.3% |
| German | 2,694 | 49.3 | 45.4 | 92.0% |
All 6 languages achieve **>92%** of the 2B teacher's performance.
---
## Quick Start
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nanovdr/NanoVDR-S-Multi")

queries = [
    "What was the revenue growth in Q3 2024?",                  # English
    "Quel est le chiffre d'affaires du trimestre?",             # French
    "Wie hoch war das Umsatzwachstum im dritten Quartal?",      # German
    "¿Cuál fue el crecimiento de ingresos en el Q3?",           # Spanish
    "Qual foi o crescimento da receita no terceiro trimestre?", # Portuguese
    "Qual è stata la crescita dei ricavi nel terzo trimestre?", # Italian
]

query_embeddings = model.encode(queries)
print(query_embeddings.shape)  # (6, 2048)

# Cosine similarity against pre-indexed document embeddings:
# scores = query_embeddings @ doc_embeddings.T
```
### Prerequisites: Document Indexing with Teacher Model
NanoVDR is a **query encoder only**. Documents must be indexed offline using the teacher VLM ([Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B)), which encodes page images into 2048-d embeddings. This is a one-time cost.
```python
# pip install transformers qwen-vl-utils torch
from scripts.qwen3_vl_embedding import Qwen3VLEmbedder

teacher = Qwen3VLEmbedder(model_name_or_path="Qwen/Qwen3-VL-Embedding-2B")

# Encode document page images
documents = [
    {"image": "page_001.png"},
    {"image": "page_002.png"},
    # ... all document pages in your corpus
]

doc_embeddings = teacher.process(documents)  # (N, 2048), L2-normalized
```
> **Note:** The `Qwen3VLEmbedder` class and full usage guide (including vLLM/SGLang acceleration) are available at the [Qwen3-VL-Embedding-2B model page](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B). Document indexing requires a GPU; once indexed, retrieval uses only CPU.
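Since indexing is a one-time GPU cost, it is worth persisting the result for query-time serving. A minimal sketch — the filename and the float16 cast are illustrative choices, not part of NanoVDR, and random unit vectors stand in for real teacher output:

```python
import numpy as np

# Stand-in for teacher.process(...) output: L2-normalized (N, 2048) float32 rows.
doc_embeddings = np.random.randn(3, 2048).astype(np.float32)
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

# Store at float16: 4 KB per page, halving disk and RAM versus float32.
np.save("doc_embeddings.npy", doc_embeddings.astype(np.float16))

# At query time, load once and cast back up for scoring.
loaded = np.load("doc_embeddings.npy").astype(np.float32)
```

Whether float16 storage is lossless enough for your corpus is an assumption worth validating against a float32 baseline.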
### Full Retrieval Pipeline
```python
import numpy as np
from sentence_transformers import SentenceTransformer

# doc_embeddings: (N, 2048) numpy array from teacher indexing above

# Step 1: Encode text queries with NanoVDR (CPU, ~51 ms per query)
student = SentenceTransformer("nanovdr/NanoVDR-S-Multi")
query_emb = student.encode("Quel est le chiffre d'affaires?")  # shape: (2048,)

# Step 2: Retrieve via cosine similarity
scores = query_emb @ doc_embeddings.T
top_k_indices = np.argsort(scores)[-5:][::-1]
```
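Because both sides are L2-normalized, the dot product above is exactly cosine similarity, so any inner-product index (e.g. FAISS `IndexFlatIP`) applies unchanged. For plain numpy, `argpartition` avoids fully sorting a large corpus. A self-contained sketch, with synthetic unit vectors standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for real embeddings: L2-normalized rows, as both encoders produce.
doc_embeddings = rng.standard_normal((10_000, 2048)).astype(np.float32)
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

# A query that is a lightly perturbed copy of document 42.
query_emb = doc_embeddings[42] + 0.01 * rng.standard_normal(2048).astype(np.float32)
query_emb /= np.linalg.norm(query_emb)

# Inner product == cosine similarity for unit vectors.
scores = doc_embeddings @ query_emb

# Partial selection of the top-5 in O(N), then order just those 5.
k = 5
top_k = np.argpartition(scores, -k)[-k:]
top_k = top_k[np.argsort(scores[top_k])[::-1]]
print(top_k[0])  # 42 — the perturbed source document ranks first
```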
---
## How It Works
NanoVDR uses **asymmetric cross-modal distillation** to decouple query and document encoding:
| | Document Encoding (offline) | Query Encoding (online) |
|-|----------------------------|------------------------|
| **Model** | Qwen3-VL-Embedding-2B (frozen) | NanoVDR-S-Multi (69M) |
| **Input** | Page images | Text queries (6 languages) |
| **Output** | 2048-d embedding | 2048-d embedding |
| **Hardware** | GPU (one-time indexing) | CPU (real-time serving) |
The student is trained to **align query embeddings** with the teacher's query embeddings via pointwise cosine loss — no document embeddings or hard negatives are needed during training. At inference, student query embeddings are directly compatible with teacher document embeddings.
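The paper's exact formulation is not reproduced here, but a pointwise cosine alignment objective as described — each student query embedding pulled toward the frozen teacher's embedding of the same query, with no negatives — might be sketched in PyTorch as:

```python
import torch
import torch.nn.functional as F

def cosine_alignment_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """Pointwise loss: 1 - cos(student, teacher), averaged over the batch.

    No document embeddings or hard negatives are involved; each query is
    aligned to the teacher's (precomputed, frozen) embedding independently.
    """
    return (1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()

# Toy batch: 4 queries, 2048-d embeddings.
torch.manual_seed(0)
teacher = F.normalize(torch.randn(4, 2048), dim=-1)
student = teacher + 0.01 * torch.randn(4, 2048)  # imperfect student

loss = cosine_alignment_loss(student, teacher)
print(float(loss))  # positive; approaches 0 as directions align
```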
---
## Training
| | Value |
|--|-------|
| Base model | `distilbert/distilbert-base-uncased` (66M) |
| Projector | 2-layer MLP: 768 → 768 → 2048 (2.4M params) |
| Total params | 69M |
| Objective | Pointwise cosine alignment with teacher query embeddings |
| Training data | 1.49M pairs — 711K original + 778K translated queries |
| Languages | EN (original) + DE, FR, ES, IT, PT (translated via [Helsinki-NLP Opus-MT](https://huggingface.co/Helsinki-NLP)) |
| Epochs | 10 |
| Batch size | 1,024 (effective) |
| Learning rate | 3e-4 (OneCycleLR, 3% warmup) |
| Hardware | 1× H200 GPU |
| Training time | ~10 GPU-hours |
| Embedding caching | ~1 GPU-hour (teacher encodes all queries in text mode) |
### Multilingual Augmentation Pipeline
1. Extract 489K English queries from the 711K training set
2. Translate each to 5 languages using Helsinki-NLP Opus-MT → 778K translated queries
3. Re-encode translated queries with the frozen teacher in text mode (15 min on H200)
4. Combine: 711K original + 778K translated = **1.49M training pairs**
5. Train with halved epochs (10 vs 20) and slightly higher lr (3e-4 vs 2e-4) to match total steps
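Step 5 can be sanity-checked with the numbers from the table above: with the same effective batch size, halving the epochs on the roughly doubled dataset yields about the same optimizer-step budget as the English-only run.

```python
# Monolingual run: 711K pairs for 20 epochs; multilingual: 1.49M pairs for 10 epochs.
mono_examples = 711_000 * 20        # 14.22M examples seen
multi_examples = 1_490_000 * 10     # 14.9M examples seen

# With the same effective batch size (1,024), steps scale with examples seen.
mono_steps = mono_examples // 1024   # ~13.9K steps
multi_steps = multi_examples // 1024 # ~14.5K steps
print(mono_steps, multi_steps)       # comparable totals, within ~5%
```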
---
## Efficiency
| | NanoVDR-S-Multi | DSE-Qwen2 | ColPali | Tomoro-8B |
|--|-----------------|-----------|---------|-----------|
| Parameters | **69M** | 2,209M | ~3B | 8,000M |
| Query latency (CPU, B=1) | **51 ms** | 2,539 ms | 7,300 ms | GPU only |
| Checkpoint size | **274 MB** | 8.8 GB | 11.9 GB | 35.1 GB |
| Index type | Single-vector | Single-vector | Multi-vector | Multi-vector |
| Scoring | Cosine | Cosine | MaxSim | MaxSim |
| Index storage (500K pages) | **4.1 GB** | 3.1 GB | 128 GB | 128 GB |
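The single-vector storage row follows directly from the embedding shape. A quick check — assuming (not stated in the table) that the 4.1 GB figure stores float32, while the "4 KB per page" highlight earlier in this card assumes float16:

```python
dims, pages = 2048, 500_000

per_page_fp16 = dims * 2                # 4,096 bytes = 4 KB per page at float16
total_fp32_gb = pages * dims * 4 / 1e9  # 4.096 GB -> the table's 4.1 GB
total_fp16_gb = pages * dims * 2 / 1e9  # 2.048 GB if the index is kept at float16

print(per_page_fp16, round(total_fp32_gb, 1), round(total_fp16_gb, 1))
```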
---
## Model Variants
NanoVDR-S-Multi is the **recommended model**. The other variants are provided for research and ablation purposes.
| Model | Backbone | Params | v1 | v2 | v3 | Retention | Latency | Recommended |
|-------|----------|--------|----|----|----|-----------|---------| ------------|
| **[NanoVDR-S-Multi](https://huggingface.co/nanovdr/NanoVDR-S-Multi)** | **DistilBERT** | **69M** | **82.2** | **61.9** | **46.5** | **95.1%** | **51 ms** | **Yes** |
| [NanoVDR-S](https://huggingface.co/nanovdr/NanoVDR-S) | DistilBERT | 69M | 82.2 | 60.5 | 43.5 | 92.4% | 51 ms | EN-only |
| [NanoVDR-M](https://huggingface.co/nanovdr/NanoVDR-M) | BERT-base | 112M | 82.1 | 62.2 | 44.7 | 94.0% | 101 ms | Ablation |
| [NanoVDR-L](https://huggingface.co/nanovdr/NanoVDR-L) | ModernBERT | 151M | 82.4 | 61.5 | 44.2 | 93.4% | 109 ms | Ablation |
## Key Properties
| Property | Value |
|----------|-------|
| Output dimension | 2048 (aligned with Qwen3-VL-Embedding-2B) |
| Max sequence length | 512 tokens |
| Supported languages | EN, DE, FR, ES, IT, PT |
| Similarity function | Cosine similarity |
| Pooling | Mean pooling |
| Normalization | L2-normalized |
## Citation
```bibtex
@article{nanovdr2026,
title={NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval},
author={Liu, Zhuchenyang and Zhang, Yao and Xiao, Yu},
journal={arXiv preprint arXiv:2603.12824},
year={2026}
}
```
## License
Apache 2.0