Paper | GitHub | All Models
NanoVDR-S-Multi
The recommended NanoVDR model for production use.
NanoVDR-S-Multi is a 69M-parameter multilingual text-only query encoder for visual document retrieval. It encodes text queries into the same embedding space as a frozen 2B VLM teacher (Qwen3-VL-Embedding-2B), so you can retrieve document page images using only a DistilBERT forward pass — no vision model at query time.
Highlights
- 95.1% teacher retention — a 69M text-only model recovers 95% of a 2B VLM teacher across 22 ViDoRe datasets
- Outperforms DSE-Qwen2 (2B) on multilingual v2 (+6.2) and v3 (+4.1) with 32x fewer parameters
- Outperforms ColPali (~3B) on multilingual v2 (+7.2) and v3 (+4.5) with single-vector cosine retrieval (no MaxSim)
- Single-vector retrieval — queries and documents share the same 2048-dim embedding space as Qwen3-VL-Embedding-2B; retrieval is a plain dot product, FAISS-compatible, 4 KB per page (float16)
- Lightweight on storage — 282 MB model file; doc index costs 64× less than ColPali's multi-vector patches
- 51 ms CPU query latency — 50x faster than DSE-Qwen2, 143x faster than ColPali
- 6 languages: English, German, French, Spanish, Italian, Portuguese — all >92% teacher retention
Results
| Model | Type | Params | ViDoRe v1 | ViDoRe v2 | ViDoRe v3 | Avg Retention |
|---|---|---|---|---|---|---|
| Tomoro-8B | VLM | 8.0B | 90.6 | 65.0 | 59.0 | — |
| Qwen3-VL-Emb (Teacher) | VLM | 2.0B | 84.3 | 65.3 | 50.0 | — |
| DSE-Qwen2 | VLM | 2.2B | 85.1 | 55.7 | 42.4 | — |
| ColPali | VLM | ~3B | 84.2 | 54.7 | 42.0 | — |
| NanoVDR-S-Multi | Text-only | 69M | 82.2 | 61.9 | 46.5 | 95.1% |
NDCG@5 (×100). v1 = 10 English datasets, v2 = 4 multilingual datasets, v3 = 8 multilingual datasets.
Per-Language Retention (v2 + v3, 19,537 queries)
| Language | #Queries | Teacher | NanoVDR-S-Multi | Retention |
|---|---|---|---|---|
| English | 6,237 | 64.0 | 60.3 | 94.3% |
| French | 2,694 | 51.0 | 47.8 | 93.6% |
| Portuguese | 2,419 | 48.7 | 46.1 | 94.6% |
| Spanish | 2,694 | 51.4 | 47.8 | 93.1% |
| Italian | 2,419 | 49.0 | 45.7 | 93.3% |
| German | 2,694 | 49.3 | 45.4 | 92.0% |
All 6 languages achieve >92% of the 2B teacher's performance.
Quick Start
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nanovdr/NanoVDR-S-Multi")

queries = [
    "What was the revenue growth in Q3 2024?",                     # English
    "Quel est le chiffre d'affaires du trimestre?",                # French
    "Wie hoch war das Umsatzwachstum im dritten Quartal?",         # German
    "¿Cuál fue el crecimiento de ingresos en el Q3?",              # Spanish
    "Qual foi o crescimento da receita no terceiro trimestre?",    # Portuguese
    "Qual è stata la crescita dei ricavi nel terzo trimestre?",    # Italian
]

query_embeddings = model.encode(queries)
print(query_embeddings.shape)  # (6, 2048)

# Cosine similarity against pre-indexed document embeddings
# scores = query_embeddings @ doc_embeddings.T
```
Prerequisites: Document Indexing with Teacher Model
NanoVDR is a query encoder only. Documents must be indexed offline using the teacher VLM (Qwen3-VL-Embedding-2B), which encodes page images into 2048-d embeddings. This is a one-time cost.
```python
# pip install transformers qwen-vl-utils torch
from scripts.qwen3_vl_embedding import Qwen3VLEmbedder

teacher = Qwen3VLEmbedder(model_name_or_path="Qwen/Qwen3-VL-Embedding-2B")

# Encode document page images
documents = [
    {"image": "page_001.png"},
    {"image": "page_002.png"},
    # ... all document pages in your corpus
]
doc_embeddings = teacher.process(documents)  # (N, 2048), L2-normalized
```
Note: The `Qwen3VLEmbedder` class and full usage guide (including vLLM/SGLang acceleration) are available on the Qwen3-VL-Embedding-2B model page. Document indexing requires a GPU; once indexed, retrieval runs entirely on CPU.
Full Retrieval Pipeline
```python
import numpy as np
from sentence_transformers import SentenceTransformer

# doc_embeddings: (N, 2048) numpy array from teacher indexing above

# Step 1: Encode the text query with NanoVDR (CPU, ~51 ms per query)
student = SentenceTransformer("nanovdr/NanoVDR-S-Multi")
query_emb = student.encode("Quel est le chiffre d'affaires?")  # shape: (2048,)

# Step 2: Retrieve via cosine similarity
scores = query_emb @ doc_embeddings.T
top_k_indices = np.argsort(scores)[-5:][::-1]
```
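Because both encoders emit L2-normalized vectors, the dot product above is exactly cosine similarity, and the scoring step can be sanity-checked without loading either model. A minimal sketch with synthetic stand-in embeddings (only the 2048-dim shape is carried over from the real pipeline; the perturbed "page 42" query is an illustrative construction):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 100 L2-normalized "page" embeddings.
doc_embeddings = rng.standard_normal((100, 2048)).astype(np.float32)
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

# A query that is a lightly perturbed copy of page 42.
query_emb = doc_embeddings[42] + 0.01 * rng.standard_normal(2048).astype(np.float32)
query_emb /= np.linalg.norm(query_emb)

scores = query_emb @ doc_embeddings.T   # (100,) cosine similarities
top_k = np.argsort(scores)[-5:][::-1]   # indices of the 5 best-scoring pages
print(top_k[0])  # page 42 ranks first
```

The same normalized vectors can be dropped into any inner-product index (e.g. a flat FAISS IP index) without changing the ranking.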
How It Works
NanoVDR uses asymmetric cross-modal distillation to decouple query and document encoding:
| | Document Encoding (offline) | Query Encoding (online) |
|---|---|---|
| Model | Qwen3-VL-Embedding-2B (frozen) | NanoVDR-S-Multi (69M) |
| Input | Page images | Text queries (6 languages) |
| Output | 2048-d embedding | 2048-d embedding |
| Hardware | GPU (one-time indexing) | CPU (real-time serving) |
The student is trained to align query embeddings with the teacher's query embeddings via pointwise cosine loss — no document embeddings or hard negatives are needed during training. At inference, student query embeddings are directly compatible with teacher document embeddings.
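The training objective described above can be sketched in a few lines. This is an illustrative numpy version of a pointwise cosine alignment loss, not the card's actual training code:

```python
import numpy as np

def cosine_alignment_loss(student_emb: np.ndarray, teacher_emb: np.ndarray) -> float:
    """Mean of 1 - cos(student_i, teacher_i) over a batch of paired embeddings."""
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))
```

The loss is 0 when each student embedding points in the same direction as its teacher target and 2 when it points the opposite way; note that no in-batch negatives or document embeddings appear anywhere in the objective.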
Training
| | Value |
|---|---|
| Base model | distilbert/distilbert-base-uncased (66M) |
| Projector | 2-layer MLP: 768 → 768 → 2048 (2.4M params) |
| Total params | 69M |
| Objective | Pointwise cosine alignment with teacher query embeddings |
| Training data | 1.49M pairs — 711K original + 778K translated queries |
| Languages | EN (original) + DE, FR, ES, IT, PT (translated via Helsinki-NLP Opus-MT) |
| Epochs | 10 |
| Batch size | 1,024 (effective) |
| Learning rate | 3e-4 (OneCycleLR, 3% warmup) |
| Hardware | 1× H200 GPU |
| Training time | ~10 GPU-hours |
| Embedding caching | ~1 GPU-hour (teacher encodes all queries in text mode) |
Multilingual Augmentation Pipeline
- Extract 489K English queries from the 711K training set
- Translate each to 5 languages using Helsinki-NLP Opus-MT → 778K translated queries
- Re-encode translated queries with the frozen teacher in text mode (15 min on H200)
- Combine: 711K original + 778K translated = 1.49M training pairs
- Train with halved epochs (10 vs 20) and slightly higher lr (3e-4 vs 2e-4) to match total steps
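The "match total steps" choice in the last bullet can be checked with quick arithmetic, using the effective batch size of 1,024 from the Training table (the English-only 20-epoch baseline is taken from that bullet):

```python
# Optimizer steps per configuration (effective batch size 1,024).
batch = 1_024

en_only_steps = 711_000 // batch * 20    # 711K pairs, 20 epochs
multi_steps = 1_490_000 // batch * 10    # 1.49M pairs, 10 epochs

print(en_only_steps, multi_steps)  # comparable step counts
```

Doubling the dataset while halving the epochs keeps the total number of gradient updates roughly constant (within a few percent), so the two runs see similar optimization budgets.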
Efficiency
| | NanoVDR-S-Multi | DSE-Qwen2 | ColPali | Tomoro-8B |
|---|---|---|---|---|
| Parameters | 69M | 2,209M | ~3B | 8,000M |
| Query latency (CPU, B=1) | 51 ms | 2,539 ms | 7,300 ms | GPU only |
| Checkpoint size | 274 MB | 8.8 GB | 11.9 GB | 35.1 GB |
| Index type | Single-vector | Single-vector | Multi-vector | Multi-vector |
| Scoring | Cosine | Cosine | MaxSim | MaxSim |
| Index storage (500K pages) | 4.1 GB | 3.1 GB | 128 GB | 128 GB |
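The storage row can be reproduced with back-of-envelope arithmetic. Two assumptions not stated in the table: the single-vector figure appears to count float32 vectors (float16 would give the "4 KB per page" figure from the Highlights, i.e. about 2 GB), and the multi-vector estimate assumes roughly 1,024 patch vectors of 128 dims per page in float16:

```python
# Back-of-envelope index sizes for 500K pages (assumptions noted above).
PAGES = 500_000

# Single-vector (NanoVDR / DSE): one 2048-dim vector per page.
single_fp32_gb = 2048 * 4 * PAGES / 1e9        # ~4.1 GB, matches the table
single_fp16_gb = 2048 * 2 * PAGES / 1e9        # ~2.0 GB, i.e. 4 KB per page

# Multi-vector (ColPali-style): ~1,024 x 128-dim patch vectors per page, float16.
multi_fp16_gb = 1024 * 128 * 2 * PAGES / 1e9   # ~131 GB, same order as the table

print(round(single_fp32_gb, 1), round(single_fp16_gb, 1), round(multi_fp16_gb))
```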
Model Variants
NanoVDR-S-Multi is the recommended model. The other variants are provided for research and ablation purposes.
| Model | Backbone | Params | v1 | v2 | v3 | Retention | Latency | Recommended |
|---|---|---|---|---|---|---|---|---|
| NanoVDR-S-Multi | DistilBERT | 69M | 82.2 | 61.9 | 46.5 | 95.1% | 51 ms | Yes |
| NanoVDR-S | DistilBERT | 69M | 82.2 | 60.5 | 43.5 | 92.4% | 51 ms | EN-only |
| NanoVDR-M | BERT-base | 112M | 82.1 | 62.2 | 44.7 | 94.0% | 101 ms | Ablation |
| NanoVDR-L | ModernBERT | 151M | 82.4 | 61.5 | 44.2 | 93.4% | 109 ms | Ablation |
Key Properties
| Property | Value |
|---|---|
| Output dimension | 2048 (aligned with Qwen3-VL-Embedding-2B) |
| Max sequence length | 512 tokens |
| Supported languages | EN, DE, FR, ES, IT, PT |
| Similarity function | Cosine similarity |
| Pooling | Mean pooling |
| Normalization | L2-normalized |
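The pooling and normalization rows describe a standard mean-pool-then-normalize head. A minimal numpy sketch of that computation (the function name and shapes are illustrative, not the model's actual code):

```python
import numpy as np

def mean_pool_normalize(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool token embeddings over non-padding positions, then L2-normalize.

    token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    pooled = (token_embeddings * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```

Padding tokens are excluded from the mean via the attention mask, and the final L2 normalization is what lets the dot product against document embeddings act as cosine similarity.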
Citation
```bibtex
@article{nanovdr2026,
  title={NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval},
  author={Liu, Zhuchenyang and Zhang, Yao and Xiao, Yu},
  journal={arXiv preprint arXiv:2502.XXXXX},
  year={2026}
}
```
License
Apache 2.0