NanoVDR

Paper | GitHub | All Models

NanoVDR-S-Multi

The recommended NanoVDR model for production use.

NanoVDR-S-Multi is a 69M-parameter multilingual text-only query encoder for visual document retrieval. It encodes text queries into the same embedding space as a frozen 2B VLM teacher (Qwen3-VL-Embedding-2B), so you can retrieve document page images using only a DistilBERT forward pass — no vision model at query time.

Highlights

  • 95.1% teacher retention — a 69M text-only query encoder recovers 95.1% of its 2B VLM teacher's nDCG@5 across 22 ViDoRe datasets
  • Outperforms DSE-Qwen2 (2B) on multilingual v2 (+6.2) and v3 (+4.1) with 32x fewer parameters
  • Outperforms ColPali (~3B) on multilingual v2 (+7.2) and v3 (+4.5) with single-vector cosine retrieval (no MaxSim)
  • Single-vector retrieval — queries and documents share the same 2048-dim embedding space as Qwen3-VL-Embedding-2B; retrieval is a plain dot product, FAISS-compatible, 4 KB per page (float16)
  • Lightweight on storage — 282 MB model file; doc index costs 64× less than ColPali's multi-vector patches
  • 51 ms CPU query latency — 50x faster than DSE-Qwen2, 143x faster than ColPali
  • 6 languages: English, German, French, Spanish, Italian, Portuguese — all >92% teacher retention
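
The single-vector claim above can be made concrete: with one L2-normalized vector per query and per page, scoring is one matrix product, whereas late-interaction retrievers such as ColPali keep many vectors per page and score with MaxSim. A numpy sketch with random stand-in embeddings (dimensions shrunk for readability; nothing here is real model output):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    # Scale vectors to unit length so a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# --- Single-vector retrieval (NanoVDR / DSE style) ---
# One vector per query and per page (8-d here instead of 2048-d).
query = l2_normalize(rng.normal(size=8))
pages = l2_normalize(rng.normal(size=(5, 8)))           # 5 indexed pages
single_vec_scores = pages @ query                        # (5,) cosine scores

# --- Multi-vector MaxSim (ColPali style), for contrast ---
# One vector per query token and per page patch; each query token takes its
# best-matching patch, and the per-token maxima are summed per page.
q_tokens = l2_normalize(rng.normal(size=(4, 8)))         # 4 query tokens
p_patches = l2_normalize(rng.normal(size=(5, 32, 8)))    # 32 patches per page
maxsim_scores = np.einsum("td,pnd->ptn", q_tokens, p_patches).max(-1).sum(-1)

print(single_vec_scores.shape, maxsim_scores.shape)  # (5,) (5,)
```

The single-vector path needs only one stored vector per page, which is what makes the index FAISS-compatible and keeps storage at a few KB per page.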

Results

Model                   Type       Params  ViDoRe v1  ViDoRe v2  ViDoRe v3  Avg Retention
Tomoro-8B               VLM        8.0B    90.6       65.0       59.0       -
Qwen3-VL-Emb (Teacher)  VLM        2.0B    84.3       65.3       50.0       -
DSE-Qwen2               VLM        2.2B    85.1       55.7       42.4       -
ColPali                 VLM        ~3B     84.2       54.7       42.0       -
NanoVDR-S-Multi         Text-only  69M     82.2       61.9       46.5       95.1%

NDCG@5 (×100). v1 = 10 English datasets, v2 = 4 multilingual datasets, v3 = 8 multilingual datasets.

Per-Language Retention (v2 + v3, 19,537 queries)

Language    #Queries  Teacher  NanoVDR-S-Multi  Retention
English     6,237     64.0     60.3             94.3%
French      2,694     51.0     47.8             93.6%
Portuguese  2,419     48.7     46.1             94.6%
Spanish     2,694     51.4     47.8             93.1%
Italian     2,419     49.0     45.7             93.3%
German      2,694     49.3     45.4             92.0%

All 6 languages achieve >92% of the 2B teacher's performance.


Quick Start

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nanovdr/NanoVDR-S-Multi")

queries = [
    "What was the revenue growth in Q3 2024?",           # English
    "Quel est le chiffre d'affaires du trimestre?",       # French
    "Wie hoch war das Umsatzwachstum im dritten Quartal?", # German
    "¿Cuál fue el crecimiento de ingresos en el Q3?",     # Spanish
    "Qual foi o crescimento da receita no terceiro trimestre?", # Portuguese
    "Qual è stata la crescita dei ricavi nel terzo trimestre?", # Italian
]
query_embeddings = model.encode(queries)
print(query_embeddings.shape)  # (6, 2048)

# Cosine similarity against pre-indexed document embeddings
# scores = query_embeddings @ doc_embeddings.T

Prerequisites: Document Indexing with Teacher Model

NanoVDR is a query encoder only. Documents must be indexed offline using the teacher VLM (Qwen3-VL-Embedding-2B), which encodes page images into 2048-d embeddings. This is a one-time cost.

# pip install transformers qwen-vl-utils torch
from scripts.qwen3_vl_embedding import Qwen3VLEmbedder

teacher = Qwen3VLEmbedder(model_name_or_path="Qwen/Qwen3-VL-Embedding-2B")

# Encode document page images
documents = [
    {"image": "page_001.png"},
    {"image": "page_002.png"},
    # ... all document pages in your corpus
]
doc_embeddings = teacher.process(documents)  # (N, 2048), L2-normalized

Note: The Qwen3VLEmbedder class and full usage guide (including vLLM/SGLang acceleration) are available at the Qwen3-VL-Embedding-2B model page. Document indexing requires a GPU; once indexed, retrieval uses only CPU.
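
Once indexed, the document embeddings can be packed compactly. A sketch of the float16 storage behind the 4 KB-per-page figure, using random stand-in vectors in place of real teacher output:

```python
import numpy as np

# Stand-in for the (N, 2048) embeddings produced by the teacher; replace
# with the real `doc_embeddings` from the indexing step above.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(100, 2048)).astype(np.float32)
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

# Half precision halves the index: 2048 dims x 2 bytes = 4 KB per page.
index_fp16 = doc_embeddings.astype(np.float16)
bytes_per_page = index_fp16.nbytes // index_fp16.shape[0]
print(bytes_per_page)  # 4096

# Persist with e.g. np.save("doc_index.npy", index_fp16) and np.load at
# serving time; cast back to float32 before scoring if desired.
```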

Full Retrieval Pipeline

import numpy as np
from sentence_transformers import SentenceTransformer

# doc_embeddings: (N, 2048) numpy array from teacher indexing above

# Step 1: Encode text queries with NanoVDR (CPU, ~51ms per query)
student = SentenceTransformer("nanovdr/NanoVDR-S-Multi")
query_emb = student.encode("Quel est le chiffre d'affaires?")  # shape: (2048,)

# Step 2: Retrieve via cosine similarity
scores = query_emb @ doc_embeddings.T
top_k_indices = np.argsort(scores)[-5:][::-1]
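
The `np.argsort` call in Step 2 sorts the full score vector; for large corpora, `np.argpartition` selects the top-k in linear time and only sorts those k. A self-contained variant with stand-in embeddings (swap in the real arrays; page 42 is deliberately planted as a near-duplicate of the query):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in index: 10,000 L2-normalized 2048-d page embeddings.
doc_embeddings = rng.normal(size=(10_000, 2048)).astype(np.float32)
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

# Stand-in query: page 42 plus a little noise, so it should rank first.
query_emb = doc_embeddings[42] + 0.01 * rng.normal(size=2048).astype(np.float32)
query_emb /= np.linalg.norm(query_emb)

k = 5
scores = doc_embeddings @ query_emb
# argpartition: O(N) selection of the k best indices, then sort only those k.
top_k = np.argpartition(scores, -k)[-k:]
top_k = top_k[np.argsort(scores[top_k])[::-1]]
print(top_k[0])  # 42
```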

How It Works

NanoVDR uses asymmetric cross-modal distillation to decouple query and document encoding:

          Document Encoding (offline)     Query Encoding (online)
Model     Qwen3-VL-Embedding-2B (frozen)  NanoVDR-S-Multi (69M)
Input     Page images                     Text queries (6 languages)
Output    2048-d embedding                2048-d embedding
Hardware  GPU (one-time indexing)         CPU (real-time serving)

The student is trained to align query embeddings with the teacher's query embeddings via pointwise cosine loss — no document embeddings or hard negatives are needed during training. At inference, student query embeddings are directly compatible with teacher document embeddings.
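
The objective reduces to a few lines. A numpy sketch of the pointwise cosine alignment loss; anything beyond the cosine term (loss weighting, scheduling) is not specified by the card and is omitted here:

```python
import numpy as np

def cosine_alignment_loss(student_q, teacher_q):
    """Pointwise cosine loss: pull each student query embedding onto its
    teacher counterpart. No document embeddings or negatives involved."""
    s = student_q / np.linalg.norm(student_q, axis=1, keepdims=True)
    t = teacher_q / np.linalg.norm(teacher_q, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 2048))     # cached teacher query embeddings

aligned = cosine_alignment_loss(teacher, teacher)              # perfect student
random_s = cosine_alignment_loss(rng.normal(size=(8, 2048)), teacher)
print(aligned < 1e-9, random_s > 0.5)  # True True
```

Because the teacher embeddings are cached once up front, each training step touches only the 69M-parameter student.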


Training

Setting            Value
Base model         distilbert/distilbert-base-uncased (66M)
Projector          2-layer MLP: 768 → 768 → 2048 (2.4M params)
Total params       69M
Objective          Pointwise cosine alignment with teacher query embeddings
Training data      1.49M pairs — 711K original + 778K translated queries
Languages          EN (original) + DE, FR, ES, IT, PT (translated via Helsinki-NLP Opus-MT)
Epochs             10
Batch size         1,024 (effective)
Learning rate      3e-4 (OneCycleLR, 3% warmup)
Hardware           1× H200 GPU
Training time      ~10 GPU-hours
Embedding caching  ~1 GPU-hour (teacher encodes all queries in text mode)
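
At a shape level, the projector maps DistilBERT's 768-d mean-pooled output into the teacher's 2048-d space. A numpy sketch; the activation function and initialization below are assumptions, since the card specifies only the layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# 2-layer MLP projector, 768 -> 768 -> 2048. Weight scale and the ReLU
# activation are illustrative assumptions, not confirmed training details.
W1, b1 = rng.normal(scale=0.02, size=(768, 768)), np.zeros(768)
W2, b2 = rng.normal(scale=0.02, size=(768, 2048)), np.zeros(2048)

def project(pooled):
    # pooled: (batch, 768) mean-pooled DistilBERT token states.
    h = np.maximum(pooled @ W1 + b1, 0.0)   # hidden layer (ReLU assumed)
    return h @ W2 + b2                       # (batch, 2048) teacher-space output

pooled = rng.normal(size=(4, 768))
out = project(pooled)
print(out.shape)  # (4, 2048)
```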

Multilingual Augmentation Pipeline

  1. Extract 489K English queries from the 711K training set
  2. Translate each to 5 languages using Helsinki-NLP Opus-MT → 778K translated queries
  3. Re-encode translated queries with the frozen teacher in text mode (15 min on H200)
  4. Combine: 711K original + 778K translated = 1.49M training pairs
  5. Train with halved epochs (10 vs 20) and slightly higher lr (3e-4 vs 2e-4) to match total steps

Efficiency

                            NanoVDR-S-Multi  DSE-Qwen2      ColPali       Tomoro-8B
Parameters                  69M              2,209M         ~3B           8,000M
Query latency (CPU, B=1)    51 ms            2,539 ms       7,300 ms      GPU only
Checkpoint size             274 MB           8.8 GB         11.9 GB       35.1 GB
Index type                  Single-vector    Single-vector  Multi-vector  Multi-vector
Scoring                     Cosine           Cosine         MaxSim        MaxSim
Index storage (500K pages)  4.1 GB           3.1 GB         128 GB        128 GB
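
The storage row follows from simple per-page arithmetic; a quick sanity check (float32 for the on-disk index sizes in the table, float16 for the 4 KB-per-page figure quoted in the highlights):

```python
# Single-vector index: one 2048-d vector per page.
dims, pages = 2048, 500_000

# float32 index, as in the table row above.
fp32_bytes = pages * dims * 4
print(f"{fp32_bytes / 1e9:.1f} GB")   # 4.1 GB

# float16 packing, as in the 4 KB-per-page claim.
fp16_per_page = dims * 2
print(fp16_per_page)                  # 4096 bytes per page
```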

Model Variants

NanoVDR-S-Multi is the recommended model. The other variants are provided for research and ablation purposes.

Model            Backbone    Params  v1    v2    v3    Retention  Latency  Recommended
NanoVDR-S-Multi  DistilBERT  69M     82.2  61.9  46.5  95.1%      51 ms    Yes
NanoVDR-S        DistilBERT  69M     82.2  60.5  43.5  92.4%      51 ms    EN-only
NanoVDR-M        BERT-base   112M    82.1  62.2  44.7  94.0%      101 ms   Ablation
NanoVDR-L        ModernBERT  151M    82.4  61.5  44.2  93.4%      109 ms   Ablation

Key Properties

Property             Value
Output dimension     2048 (aligned with Qwen3-VL-Embedding-2B)
Max sequence length  512 tokens
Supported languages  EN, DE, FR, ES, IT, PT
Similarity function  Cosine similarity
Pooling              Mean pooling
Normalization        L2-normalized

Citation

@article{nanovdr2026,
  title={NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval},
  author={Liu, Zhuchenyang and Zhang, Yao and Xiao, Yu},
  journal={arXiv preprint arXiv:2502.XXXXX},
  year={2026}
}

License

Apache 2.0
