Paper | GitHub | All Models
NanoVDR-S-Multi
The recommended NanoVDR model for production use.
NanoVDR-S-Multi is a 69M-parameter multilingual text-only query encoder for visual document retrieval. It encodes text queries into the same embedding space as a frozen 2B VLM teacher (Qwen3-VL-Embedding-2B), so you can retrieve document page images using only a DistilBERT forward pass — no vision model at query time.
Highlights
- 95.1% teacher retention — a 69M text-only model recovers 95% of a 2B VLM teacher across 22 ViDoRe datasets
- Outperforms DSE-Qwen2 (2B) on multilingual v2 (+6.2) and v3 (+4.1) with 32x fewer parameters
- Outperforms ColPali (~3B) on multilingual v2 (+7.2) and v3 (+4.5) with single-vector cosine retrieval (no MaxSim)
- Single-vector retrieval — queries and documents share the same 2048-dim embedding space as Qwen3-VL-Embedding-2B; retrieval is a plain dot product, FAISS-compatible, 4 KB per page (float16)
- Lightweight on storage — 282 MB model file; doc index costs 64× less than ColPali's multi-vector patches
- 51 ms CPU query latency — 50x faster than DSE-Qwen2, 143x faster than ColPali
- 6 languages: English, German, French, Spanish, Italian, Portuguese — all >92% teacher retention
Results
| Model | Type | Params | ViDoRe v1 | ViDoRe v2 | ViDoRe v3 | Avg Retention |
|---|---|---|---|---|---|---|
| Tomoro-8B | VLM | 8.0B | 90.6 | 65.0 | 59.0 | — |
| Qwen3-VL-Emb (Teacher) | VLM | 2.0B | 84.3 | 65.3 | 50.0 | — |
| DSE-Qwen2 | VLM | 2.2B | 85.1 | 55.7 | 42.4 | — |
| ColPali | VLM | ~3B | 84.2 | 54.7 | 42.0 | — |
| NanoVDR-S-Multi | Text-only | 69M | 82.2 | 61.9 | 46.5 | 95.1% |
NDCG@5 (×100). v1 = 10 English datasets, v2 = 4 multilingual datasets, v3 = 8 multilingual datasets.
Per-Language Retention (v2 + v3, 19,537 queries)
| Language | #Queries | Teacher | NanoVDR-S-Multi | Retention |
|---|---|---|---|---|
| English | 6,237 | 64.0 | 60.3 | 94.3% |
| French | 2,694 | 51.0 | 47.8 | 93.6% |
| Portuguese | 2,419 | 48.7 | 46.1 | 94.6% |
| Spanish | 2,694 | 51.4 | 47.8 | 93.1% |
| Italian | 2,419 | 49.0 | 45.7 | 93.3% |
| German | 2,694 | 49.3 | 45.4 | 92.0% |
All 6 languages achieve >92% of the 2B teacher's performance.
Quick Start
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nanovdr/NanoVDR-S-Multi")

queries = [
    "What was the revenue growth in Q3 2024?",                     # English
    "Quel est le chiffre d'affaires du trimestre?",                # French
    "Wie hoch war das Umsatzwachstum im dritten Quartal?",         # German
    "¿Cuál fue el crecimiento de ingresos en el Q3?",              # Spanish
    "Qual foi o crescimento da receita no terceiro trimestre?",    # Portuguese
    "Qual è stata la crescita dei ricavi nel terzo trimestre?",    # Italian
]

query_embeddings = model.encode(queries)
print(query_embeddings.shape)  # (6, 2048)

# Cosine similarity against pre-indexed document embeddings
# scores = query_embeddings @ doc_embeddings.T
```
Prerequisites: Document Indexing with Teacher Model
NanoVDR is a query encoder only. Documents must be indexed offline using the teacher VLM (Qwen3-VL-Embedding-2B), which encodes page images into 2048-d embeddings. This is a one-time cost.
```python
# pip install transformers qwen-vl-utils torch
from scripts.qwen3_vl_embedding import Qwen3VLEmbedder

teacher = Qwen3VLEmbedder(model_name_or_path="Qwen/Qwen3-VL-Embedding-2B")

# Encode document page images
documents = [
    {"image": "page_001.png"},
    {"image": "page_002.png"},
    # ... all document pages in your corpus
]
doc_embeddings = teacher.process(documents)  # (N, 2048), L2-normalized
```
Note: The `Qwen3VLEmbedder` class and full usage guide (including vLLM/SGLang acceleration) are available on the Qwen3-VL-Embedding-2B model page. Document indexing requires a GPU; once indexed, retrieval runs entirely on CPU.
Full Retrieval Pipeline
```python
import numpy as np
from sentence_transformers import SentenceTransformer

# doc_embeddings: (N, 2048) numpy array from teacher indexing above

# Step 1: Encode the text query with NanoVDR (CPU, ~51 ms per query)
student = SentenceTransformer("nanovdr/NanoVDR-S-Multi")
query_emb = student.encode("Quel est le chiffre d'affaires?")  # shape: (2048,)

# Step 2: Retrieve via cosine similarity
scores = query_emb @ doc_embeddings.T
top_k_indices = np.argsort(scores)[-5:][::-1]
```
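Because both encoders emit L2-normalized vectors, the dot product above is exactly cosine similarity, and the scoring step can be sanity-checked without loading either model. A minimal sketch with synthetic stand-in embeddings (only the 2048-dim shape is carried over from the real pipeline; the perturbed "page 42" query is an illustrative construction):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 100 L2-normalized "page" embeddings.
doc_embeddings = rng.standard_normal((100, 2048)).astype(np.float32)
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

# A query that is a lightly perturbed copy of page 42.
query_emb = doc_embeddings[42] + 0.01 * rng.standard_normal(2048).astype(np.float32)
query_emb /= np.linalg.norm(query_emb)

scores = query_emb @ doc_embeddings.T   # (100,) cosine similarities
top_k = np.argsort(scores)[-5:][::-1]   # indices of the 5 best-scoring pages
print(top_k[0])  # page 42 ranks first
```

The same normalized vectors can be dropped into any inner-product index (e.g. a flat FAISS IP index) without changing the ranking.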
How It Works
NanoVDR uses asymmetric cross-modal distillation to decouple query and document encoding:
| | Document Encoding (offline) | Query Encoding (online) |
|---|---|---|
| Model | Qwen3-VL-Embedding-2B (frozen) | NanoVDR-S-Multi (69M) |
| Input | Page images | Text queries (6 languages) |
| Output | 2048-d embedding | 2048-d embedding |
| Hardware | GPU (one-time indexing) | CPU (real-time serving) |
The student is trained to align query embeddings with the teacher's query embeddings via pointwise cosine loss — no document embeddings or hard negatives are needed during training. At inference, student query embeddings are directly compatible with teacher document embeddings.
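The training objective described above can be sketched in a few lines. This is an illustrative numpy version of a pointwise cosine alignment loss, not the card's actual training code:

```python
import numpy as np

def cosine_alignment_loss(student_emb: np.ndarray, teacher_emb: np.ndarray) -> float:
    """Mean of 1 - cos(student_i, teacher_i) over a batch of paired embeddings."""
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))
```

The loss is 0 when each student embedding points in the same direction as its teacher target and 2 when it points the opposite way; note that no in-batch negatives or document embeddings appear anywhere in the objective.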
Training
| | Value |
|---|---|
| Base model | distilbert/distilbert-base-uncased (66M) |
| Projector | 2-layer MLP: 768 → 768 → 2048 (2.4M params) |
| Total params | 69M |
| Objective | Pointwise cosine alignment with teacher query embeddings |
| Training data | 1.49M pairs — 711K original + 778K translated queries |
| Languages | EN (original) + DE, FR, ES, IT, PT (translated via Helsinki-NLP Opus-MT) |
| Epochs | 10 |
| Batch size | 1,024 (effective) |
| Learning rate | 3e-4 (OneCycleLR, 3% warmup) |
| Hardware | 1× H200 GPU |
| Training time | ~10 GPU-hours |
| Embedding caching | ~1 GPU-hour (teacher encodes all queries in text mode) |
Multilingual Augmentation Pipeline
- Extract 489K English queries from the 711K training set
- Translate each to 5 languages using Helsinki-NLP Opus-MT → 778K translated queries
- Re-encode translated queries with the frozen teacher in text mode (15 min on H200)
- Combine: 711K original + 778K translated = 1.49M training pairs
- Train with halved epochs (10 vs 20) and slightly higher lr (3e-4 vs 2e-4) to match total steps
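The "match total steps" choice in the last bullet can be checked with quick arithmetic, using the effective batch size of 1,024 from the Training table (the English-only 20-epoch baseline is taken from that bullet):

```python
# Optimizer steps per configuration (effective batch size 1,024).
batch = 1_024

en_only_steps = 711_000 // batch * 20    # 711K pairs, 20 epochs
multi_steps = 1_490_000 // batch * 10    # 1.49M pairs, 10 epochs

print(en_only_steps, multi_steps)  # comparable step counts
```

Doubling the dataset while halving the epochs keeps the total number of gradient updates roughly constant (within a few percent), so the two runs see similar optimization budgets.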
Efficiency
| | NanoVDR-S-Multi | DSE-Qwen2 | ColPali | Tomoro-8B |
|---|---|---|---|---|
| Parameters | 69M | 2,209M | ~3B | 8,000M |
| Query latency (CPU, B=1) | 51 ms | 2,539 ms | 7,300 ms | GPU only |
| Checkpoint size | 274 MB | 8.8 GB | 11.9 GB | 35.1 GB |
| Index type | Single-vector | Single-vector | Multi-vector | Multi-vector |
| Scoring | Cosine | Cosine | MaxSim | MaxSim |
| Index storage (500K pages) | 4.1 GB | 3.1 GB | 128 GB | 128 GB |
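The storage row can be reproduced with back-of-envelope arithmetic. Two assumptions not stated in the table: the single-vector figure appears to count float32 vectors (float16 would give the "4 KB per page" figure from the Highlights, i.e. about 2 GB), and the multi-vector estimate assumes roughly 1,024 patch vectors of 128 dims per page in float16:

```python
# Back-of-envelope index sizes for 500K pages (assumptions noted above).
PAGES = 500_000

# Single-vector (NanoVDR / DSE): one 2048-dim vector per page.
single_fp32_gb = 2048 * 4 * PAGES / 1e9        # ~4.1 GB, matches the table
single_fp16_gb = 2048 * 2 * PAGES / 1e9        # ~2.0 GB, i.e. 4 KB per page

# Multi-vector (ColPali-style): ~1,024 x 128-dim patch vectors per page, float16.
multi_fp16_gb = 1024 * 128 * 2 * PAGES / 1e9   # ~131 GB, same order as the table

print(round(single_fp32_gb, 1), round(single_fp16_gb, 1), round(multi_fp16_gb))
```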
Model Variants
NanoVDR-S-Multi is the recommended model. The other variants are provided for research and ablation purposes.
| Model | Backbone | Params | v1 | v2 | v3 | Retention | Latency | Recommended |
|---|---|---|---|---|---|---|---|---|
| NanoVDR-S-Multi | DistilBERT | 69M | 82.2 | 61.9 | 46.5 | 95.1% | 51 ms | Yes |
| NanoVDR-S | DistilBERT | 69M | 82.2 | 60.5 | 43.5 | 92.4% | 51 ms | EN-only |
| NanoVDR-M | BERT-base | 112M | 82.1 | 62.2 | 44.7 | 94.0% | 101 ms | Ablation |
| NanoVDR-L | ModernBERT | 151M | 82.4 | 61.5 | 44.2 | 93.4% | 109 ms | Ablation |
Key Properties
| Property | Value |
|---|---|
| Output dimension | 2048 (aligned with Qwen3-VL-Embedding-2B) |
| Max sequence length | 512 tokens |
| Supported languages | EN, DE, FR, ES, IT, PT |
| Similarity function | Cosine similarity |
| Pooling | Mean pooling |
| Normalization | L2-normalized |
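The pooling and normalization rows describe a standard mean-pool-then-normalize head. A minimal numpy sketch of that computation (the function name and shapes are illustrative, not the model's actual code):

```python
import numpy as np

def mean_pool_normalize(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool token embeddings over non-padding positions, then L2-normalize.

    token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    pooled = (token_embeddings * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```

Padding tokens are excluded from the mean via the attention mask, and the final L2 normalization is what lets the dot product against document embeddings act as cosine similarity.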
Citation
```bibtex
@article{nanovdr2026,
  title={NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval},
  author={Liu, Zhuchenyang and Zhang, Yao and Xiao, Yu},
  journal={arXiv preprint arXiv:2502.XXXXX},
  year={2026}
}
```
License
Apache 2.0