---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- visual-document-retrieval
- cross-modal-distillation
- knowledge-distillation
- document-retrieval
- multilingual
- nanovdr
base_model: distilbert/distilbert-base-uncased
language:
- en
- de
- fr
- es
- it
- pt
license: apache-2.0
datasets:
- openbmb/VisRAG-Ret-Train-Synthetic-data
- openbmb/VisRAG-Ret-Train-In-domain-data
- vidore/colpali_train_set
- llamaindex/vdr-multilingual-train
model-index:
- name: NanoVDR-S-Multi
  results:
  - task:
      type: retrieval
    dataset:
      name: ViDoRe v1
      type: vidore/vidore-benchmark-667173f98e70a1c0fa4d
    metrics:
    - name: NDCG@5
      type: ndcg_at_5
      value: 82.2
  - task:
      type: retrieval
    dataset:
      name: ViDoRe v2
      type: vidore/vidore-benchmark-v2
    metrics:
    - name: NDCG@5
      type: ndcg_at_5
      value: 61.9
---

<p align="center">
  <img width="560" src="banner.png" alt="NanoVDR"/>
</p>

<p align="center">
  <a href="https://arxiv.org/abs/2603.12824">Paper</a> |
  <a href="https://huggingface.co/blog/Ryenhails/nanovdr">Blog</a> |
  <a href="https://huggingface.co/collections/nanovdr/nanovdr">All Models</a>
</p>

> **Paper**: [NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval](https://arxiv.org/abs/2603.12824)

# NanoVDR-S-Multi

**The recommended NanoVDR model for production use.**

NanoVDR-S-Multi is a **69M-parameter multilingual text-only** query encoder for visual document retrieval. It encodes text queries into the same embedding space as a frozen 2B VLM teacher ([Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B)), so you can retrieve document page images using **only a DistilBERT forward pass** — no vision model at query time.

### Highlights

- **95.1% teacher retention** — a 69M text-only model recovers 95% of a 2B VLM teacher across 22 ViDoRe datasets
- **Outperforms DSE-Qwen2 (2B)** on multilingual v2 (+6.2) and v3 (+4.1) with **32x fewer parameters**
- **Outperforms ColPali (~3B)** on multilingual v2 (+7.2) and v3 (+4.5) with **single-vector cosine** retrieval (no MaxSim)
- **Single-vector retrieval** — queries and documents share the same 2048-dim embedding space as [Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B); retrieval is a plain dot product, FAISS-compatible, **4 KB per page** (float16)
- **Lightweight on storage** — 274 MB checkpoint; the document index costs 64× less than ColPali's multi-vector patches
- **51 ms CPU query latency** — 50x faster than DSE-Qwen2, 143x faster than ColPali
- **6 languages**: English, German, French, Spanish, Italian, Portuguese — all with at least 92% teacher retention

---

## Results

| Model | Type | Params | ViDoRe v1 | ViDoRe v2 | ViDoRe v3 | Avg Retention |
|-------|------|--------|-----------|-----------|-----------|---------------|
| Tomoro-8B | VLM | 8.0B | 90.6 | 65.0 | 59.0 | — |
| Qwen3-VL-Emb (Teacher) | VLM | 2.0B | 84.3 | 65.3 | 50.0 | — |
| DSE-Qwen2 | VLM | 2.2B | 85.1 | 55.7 | 42.4 | — |
| ColPali | VLM | ~3B | 84.2 | 54.7 | 42.0 | — |
| **NanoVDR-S-Multi** | **Text-only** | **69M** | **82.2** | **61.9** | **46.5** | **95.1%** |

<sub>NDCG@5 (×100). v1 = 10 English datasets, v2 = 4 multilingual datasets, v3 = 8 multilingual datasets.</sub>

### Per-Language Retention (v2 + v3, 19,157 queries)

| Language | #Queries | Teacher | NanoVDR-S-Multi | Retention |
|----------|----------|---------|-----------------|-----------|
| English | 6,237 | 64.0 | 60.3 | 94.3% |
| French | 2,694 | 51.0 | 47.8 | 93.6% |
| Portuguese | 2,419 | 48.7 | 46.1 | 94.6% |
| Spanish | 2,694 | 51.4 | 47.8 | 93.1% |
| Italian | 2,419 | 49.0 | 45.7 | 93.3% |
| German | 2,694 | 49.3 | 45.4 | 92.0% |

All six languages retain at least **92%** of the 2B teacher's performance.

---

## Quick Start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nanovdr/NanoVDR-S-Multi")

queries = [
    "What was the revenue growth in Q3 2024?",           # English
    "Quel est le chiffre d'affaires du trimestre?",       # French
    "Wie hoch war das Umsatzwachstum im dritten Quartal?", # German
    "¿Cuál fue el crecimiento de ingresos en el Q3?",     # Spanish
    "Qual foi o crescimento da receita no terceiro trimestre?", # Portuguese
    "Qual è stata la crescita dei ricavi nel terzo trimestre?", # Italian
]
query_embeddings = model.encode(queries)
print(query_embeddings.shape)  # (6, 2048)

# Cosine similarity against pre-indexed document embeddings
# scores = query_embeddings @ doc_embeddings.T
```

### Prerequisites: Document Indexing with Teacher Model

NanoVDR is a **query encoder only**. Documents must be indexed offline using the teacher VLM ([Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B)), which encodes page images into 2048-d embeddings. This is a one-time cost.

```python
# pip install transformers qwen-vl-utils torch
from scripts.qwen3_vl_embedding import Qwen3VLEmbedder

teacher = Qwen3VLEmbedder(model_name_or_path="Qwen/Qwen3-VL-Embedding-2B")

# Encode document page images
documents = [
    {"image": "page_001.png"},
    {"image": "page_002.png"},
    # ... all document pages in your corpus
]
doc_embeddings = teacher.process(documents)  # (N, 2048), L2-normalized
```

> **Note:** The `Qwen3VLEmbedder` class and full usage guide (including vLLM/SGLang acceleration) are available at the [Qwen3-VL-Embedding-2B model page](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B). Document indexing requires a GPU; once indexed, retrieval uses only CPU.
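
Once computed, the index is just a NumPy array that can be persisted and reloaded without a GPU. A minimal sketch, using a synthetic stand-in for the teacher output and float16 storage (which is where the ~4 KB/page figure in the highlights comes from):

```python
import numpy as np

# Stand-in for the teacher output: (N, 2048) L2-normalized page embeddings.
rng = np.random.default_rng(0)
doc_embeddings = rng.standard_normal((100, 2048)).astype(np.float32)
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

# float16 halves storage: 2048 dims x 2 bytes = 4 KB per page.
np.save("doc_embeddings.npy", doc_embeddings.astype(np.float16))
restored = np.load("doc_embeddings.npy").astype(np.float32)
print(restored.shape)  # (100, 2048)
```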

### Full Retrieval Pipeline

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# doc_embeddings: (N, 2048) numpy array from teacher indexing above

# Step 1: Encode text queries with NanoVDR (CPU, ~51ms per query)
student = SentenceTransformer("nanovdr/NanoVDR-S-Multi")
query_emb = student.encode("Quel est le chiffre d'affaires?")  # shape: (2048,)

# Step 2: Retrieve via cosine similarity
scores = query_emb @ doc_embeddings.T
top_k_indices = np.argsort(scores)[-5:][::-1]
```
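
With many queries at once, the same dot product vectorizes naturally. A pure-NumPy sketch of batched top-k retrieval (`batch_top_k` is illustrative only, not part of the library):

```python
import numpy as np

def batch_top_k(query_embs, doc_embs, k=5):
    """Sorted top-k doc indices per query; inputs assumed L2-normalized."""
    scores = query_embs @ doc_embs.T                  # (Q, N) cosine scores
    idx = np.argpartition(-scores, k, axis=1)[:, :k]  # unordered top-k (avoids a full sort)
    order = np.argsort(np.take_along_axis(-scores, idx, axis=1), axis=1)
    return np.take_along_axis(idx, order, axis=1)     # sorted top-k indices

# Toy check: one-hot "documents", queries pointing at docs 3 and 5.
docs = np.eye(8, dtype=np.float32)
queries = docs[[3, 5]]
print(batch_top_k(queries, docs, k=2)[:, 0])  # [3 5]
```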

---

## How It Works

NanoVDR uses **asymmetric cross-modal distillation** to decouple query and document encoding:

| | Document Encoding (offline) | Query Encoding (online) |
|-|----------------------------|------------------------|
| **Model** | Qwen3-VL-Embedding-2B (frozen) | NanoVDR-S-Multi (69M) |
| **Input** | Page images | Text queries (6 languages) |
| **Output** | 2048-d embedding | 2048-d embedding |
| **Hardware** | GPU (one-time indexing) | CPU (real-time serving) |

The student is trained to **align query embeddings** with the teacher's query embeddings via pointwise cosine loss — no document embeddings or hard negatives are needed during training. At inference, student query embeddings are directly compatible with teacher document embeddings.
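
Concretely, the pointwise objective reduces to one minus cosine per query pair, averaged over the batch. A minimal NumPy sketch (not the actual training code):

```python
import numpy as np

def pointwise_cosine_loss(student_q, teacher_q):
    # 1 - cos(s_i, t_i), averaged over the batch; no negatives, no doc embeddings.
    s = student_q / np.linalg.norm(student_q, axis=1, keepdims=True)
    t = teacher_q / np.linalg.norm(teacher_q, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

t = np.random.default_rng(1).standard_normal((4, 2048))
print(pointwise_cosine_loss(t, t))  # ~0 when perfectly aligned
```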

---

## Training

| | Value |
|--|-------|
| Base model | `distilbert/distilbert-base-uncased` (66M) |
| Projector | 2-layer MLP: 768 → 768 → 2048 (2.4M params) |
| Total params | 69M |
| Objective | Pointwise cosine alignment with teacher query embeddings |
| Training data | 1.49M pairs — 711K original + 778K translated queries |
| Languages | EN (original) + DE, FR, ES, IT, PT (translated via [Helsinki-NLP Opus-MT](https://huggingface.co/Helsinki-NLP)) |
| Epochs | 10 |
| Batch size | 1,024 (effective) |
| Learning rate | 3e-4 (OneCycleLR, 3% warmup) |
| Hardware | 1× H200 GPU |
| Training time | ~10 GPU-hours |
| Embedding caching | ~1 GPU-hour (teacher encodes all queries in text mode) |

### Multilingual Augmentation Pipeline

1. Extract 489K English queries from the 711K training set
2. Translate each to 5 languages using Helsinki-NLP Opus-MT → 778K translated queries
3. Re-encode translated queries with the frozen teacher in text mode (15 min on H200)
4. Combine: 711K original + 778K translated = **1.49M training pairs**
5. Train with half the epochs (10 vs 20) and a slightly higher learning rate (3e-4 vs 2e-4), keeping total optimizer steps comparable to the English-only run
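
The step-matching in step 5 is easy to sanity-check with back-of-envelope arithmetic (exact counts depend on the trainer's drop-last behavior):

```python
# Halving epochs on the ~2.1x larger multilingual set keeps total steps comparable.
batch = 1024
en_only_steps = (711_000 // batch) * 20         # 20 epochs on 711K pairs
multilingual_steps = (1_489_000 // batch) * 10  # 10 epochs on 1.49M pairs
print(en_only_steps, multilingual_steps)        # totals within ~5% of each other
```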

---

## Efficiency

| | NanoVDR-S-Multi | DSE-Qwen2 | ColPali | Tomoro-8B |
|--|-----------------|-----------|---------|-----------|
| Parameters | **69M** | 2,209M | ~3B | 8,000M |
| Query latency (CPU, B=1) | **51 ms** | 2,539 ms | 7,300 ms | GPU only |
| Checkpoint size | **274 MB** | 8.8 GB | 11.9 GB | 35.1 GB |
| Index type | Single-vector | Single-vector | Multi-vector | Multi-vector |
| Scoring | Cosine | Cosine | MaxSim | MaxSim |
| Index storage (500K pages) | **4.1 GB** | 3.1 GB | 128 GB | 128 GB |
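
The single-vector index figure follows directly from the embedding size; a quick check, assuming float32 storage:

```python
# 500K pages x 2048 dims x 4 bytes (float32)
pages, dim, bytes_per_float = 500_000, 2048, 4
index_gb = pages * dim * bytes_per_float / 1e9
print(round(index_gb, 1))  # 4.1
```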

---

## Model Variants

NanoVDR-S-Multi is the **recommended model**. The other variants are provided for research and ablation purposes.

| Model | Backbone | Params | v1 | v2 | v3 | Retention | Latency | Recommended |
|-------|----------|--------|----|----|----|-----------|---------| ------------|
| **[NanoVDR-S-Multi](https://huggingface.co/nanovdr/NanoVDR-S-Multi)** | **DistilBERT** | **69M** | **82.2** | **61.9** | **46.5** | **95.1%** | **51 ms** | **Yes** |
| [NanoVDR-S](https://huggingface.co/nanovdr/NanoVDR-S) | DistilBERT | 69M | 82.2 | 60.5 | 43.5 | 92.4% | 51 ms | EN-only |
| [NanoVDR-M](https://huggingface.co/nanovdr/NanoVDR-M) | BERT-base | 112M | 82.1 | 62.2 | 44.7 | 94.0% | 101 ms | Ablation |
| [NanoVDR-L](https://huggingface.co/nanovdr/NanoVDR-L) | ModernBERT | 151M | 82.4 | 61.5 | 44.2 | 93.4% | 109 ms | Ablation |

## Key Properties

| Property | Value |
|----------|-------|
| Output dimension | 2048 (aligned with Qwen3-VL-Embedding-2B) |
| Max sequence length | 512 tokens |
| Supported languages | EN, DE, FR, ES, IT, PT |
| Similarity function | Cosine similarity |
| Pooling | Mean pooling |
| Normalization | L2-normalized |
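
The pooling and normalization rows above can be sketched as follows (illustrative NumPy, assuming a standard padding attention mask; the real model does this inside sentence-transformers):

```python
import numpy as np

def mean_pool_l2(token_embs, attention_mask):
    # Mean over non-padding tokens, then L2-normalize, per the table above.
    mask = attention_mask[..., None].astype(token_embs.dtype)
    pooled = (token_embs * mask).sum(axis=1) / mask.sum(axis=1)
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

tok = np.ones((2, 4, 2048), dtype=np.float32)   # (batch, seq_len, hidden)
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])   # second half of query 1 is padding
out = mean_pool_l2(tok, mask)
print(out.shape)  # (2, 2048)
```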

## Citation

```bibtex
@article{nanovdr2026,
  title={NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval},
  author={Liu, Zhuchenyang and Zhang, Yao and Xiao, Yu},
  journal={arXiv preprint arXiv:2603.12824},
  year={2026}
}
```

## License

Apache 2.0