NanoVDR-S
English-only baseline variant. For production use (especially with multilingual queries), we recommend NanoVDR-S-Multi.
NanoVDR-S is a 69M-parameter text-only query encoder for visual document retrieval, trained via asymmetric cross-modal distillation from Qwen3-VL-Embedding-2B. It uses DistilBERT + a 2-layer MLP projector to encode text queries into the teacher's embedding space.
Highlights
- Single-vector retrieval: queries and documents share the 2048-dim embedding space of Qwen3-VL-Embedding-2B, so retrieval is a plain dot product, is FAISS-compatible, and costs 4 KB per page (float16)
- Lightweight on storage: the model is 282 MB, and the document index is 64× smaller than ColPali's multi-vector patch index
- Asymmetric setup: a tiny 69M text encoder runs at query time, while the large VLM indexes documents offline once
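The storage figures in the list above follow from simple arithmetic. The sketch below works through the 4 KB per page figure and scales it to a hypothetical corpus size. The last part is an assumption, not something this card states: it uses a multi-vector page shape of about 1030 patch vectors × 128 dimensions, the shape commonly cited for ColPali, to show one way a ratio near 64× could arise.

```python
# Per-page index cost for a single 2048-dim float16 vector.
EMBED_DIM = 2048           # embedding size stated in this card
BYTES_PER_FLOAT16 = 2

bytes_per_page = EMBED_DIM * BYTES_PER_FLOAT16
print(bytes_per_page)      # 4096 bytes = 4 KB


def index_size_mib(num_pages: int) -> float:
    """Size in MiB of a flat float16 index holding one vector per page."""
    return num_pages * bytes_per_page / 2**20


# Hypothetical corpus size, for illustration only.
print(index_size_mib(1_000_000))   # 3906.25

# ASSUMPTION (not stated in this card): a multi-vector page of about
# 1030 patch vectors x 128 dims, the shape commonly cited for ColPali.
ASSUMED_PATCHES, ASSUMED_PATCH_DIM = 1030, 128
ratio = (ASSUMED_PATCHES * ASSUMED_PATCH_DIM) / EMBED_DIM
print(round(ratio, 1))             # 64.4
```

The ratio compares element counts at equal precision; a different patch count or dtype would change it.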
Results
| Model | Params | ViDoRe v1 | ViDoRe v2 | ViDoRe v3 | Avg Retention |
|---|---|---|---|---|---|
| Qwen3-VL-Emb (Teacher) | 2.0B | 84.3 | 65.3 | 50.0 | — |
| NanoVDR-S | 69M | 82.2 | 60.5 | 43.5 | 92.4% |
| NanoVDR-S-Multi | 69M | 82.2 | 61.9 | 46.5 | 95.1% |
Scores are NDCG@5 (×100). Retention is the student/teacher score ratio, averaged across ViDoRe v1, v2, and v3.
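As a worked check of the retention column, the sketch below recomputes it from the table scores. It reads "averaged" as the mean of the three per-benchmark ratios, an interpretation that reproduces the reported values; the function and variable names are illustrative.

```python
# NDCG@5 (x100) scores copied from the table above, ordered v1, v2, v3.
teacher_scores = [84.3, 65.3, 50.0]
students = {
    "NanoVDR-S": [82.2, 60.5, 43.5],
    "NanoVDR-S-Multi": [82.2, 61.9, 46.5],
}


def retention(student, teacher):
    """Mean of per-benchmark student/teacher ratios, as a percentage."""
    ratios = [s / t for s, t in zip(student, teacher)]
    return 100 * sum(ratios) / len(ratios)


for name, scores in students.items():
    print(f"{name}: {retention(scores, teacher_scores):.1f}%")
# NanoVDR-S: 92.4%
# NanoVDR-S-Multi: 95.1%
```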
Usage
Prerequisite: Documents must be indexed offline using Qwen3-VL-Embedding-2B (the teacher model). See the NanoVDR-S-Multi model page for a complete indexing guide.
from sentence_transformers import SentenceTransformer
import numpy as np
# doc_embeddings: (N, 2048) array from teacher indexing (see prerequisite above);
# it must already be loaded before the lines below run.
model = SentenceTransformer("nanovdr/NanoVDR-S")
query_embeddings = model.encode(["What was the revenue growth in Q3?"])  # (1, 2048)
scores = query_embeddings @ doc_embeddings.T  # (1, N) dot-product scores
top_k_indices = np.argsort(scores[0])[-5:][::-1]  # indices of the 5 best pages, best first
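For several queries or a larger index, the same dot-product ranking can be done without fully sorting every score row. The following is a minimal sketch using random stand-in embeddings rather than real model output; the `top_k` helper is hypothetical and not part of the NanoVDR release.

```python
import numpy as np


def top_k(query_embeddings, doc_embeddings, k=5):
    """Return (indices, scores) of the k highest dot-product pages per query."""
    scores = query_embeddings @ doc_embeddings.T             # (Q, N)
    k = min(k, scores.shape[1])
    # argpartition finds the k largest per row without a full sort ...
    part = np.argpartition(scores, -k, axis=1)[:, -k:]       # (Q, k), unordered
    part_scores = np.take_along_axis(scores, part, axis=1)
    # ... then only those k candidates are sorted, best first.
    order = np.argsort(-part_scores, axis=1)
    idx = np.take_along_axis(part, order, axis=1)
    return idx, np.take_along_axis(scores, idx, axis=1)


# Synthetic stand-ins for real embeddings (random data, illustration only).
rng = np.random.default_rng(0)
doc_embeddings = rng.standard_normal((1000, 2048)).astype(np.float32)
query_embeddings = rng.standard_normal((3, 2048)).astype(np.float32)

idx, best = top_k(query_embeddings, doc_embeddings, k=5)
print(idx.shape, best.shape)   # (3, 5) (3, 5)
```

For a single query this returns the same indices as the `np.argsort` line above, barring exact score ties.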
Training Details
| Item | Value |
|---|---|
| Architecture | DistilBERT (66M) + MLP projector (768 → 768 → 2048, 2.4M) = 69M |
| Objective | Pointwise cosine alignment with teacher query embeddings |
| Data | 711K query-document pairs |
| Epochs / lr | 20 / 2e-4 |
| Training cost | ~10 GPU-hours (1× H200) |
| CPU query latency | 51 ms |
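To make the objective row concrete, here is a minimal sketch of a pointwise cosine alignment loss. The exact form used here, 1 minus cosine similarity averaged over the batch, is an assumption: the card says only "pointwise cosine alignment". NumPy covers the forward computation only, since actual training would need an autodiff framework, and the inputs are random stand-ins.

```python
import numpy as np


def cosine_alignment_loss(student, teacher, eps=1e-8):
    """Mean of (1 - cosine similarity) over paired rows of two (B, D) arrays."""
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    cos = np.sum(s * t, axis=1)   # one similarity per query (pointwise)
    return float(np.mean(1.0 - cos))


# Random stand-ins for a batch of 4 teacher query embeddings (illustration only).
rng = np.random.default_rng(0)
teacher_batch = rng.standard_normal((4, 2048))

print(cosine_alignment_loss(teacher_batch, teacher_batch))    # close to 0: perfect alignment
print(cosine_alignment_loss(-teacher_batch, teacher_batch))   # close to 2: opposite direction
```

Each student embedding is compared only with the teacher embedding of the same query, with no in-batch negatives, which is what "pointwise" is taken to mean here.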
All NanoVDR Models
| Model | Backbone | Params | v1 | v2 | v3 | Retention |
|---|---|---|---|---|---|---|
| NanoVDR-S-Multi | DistilBERT | 69M | 82.2 | 61.9 | 46.5 | 95.1% |
| NanoVDR-S | DistilBERT | 69M | 82.2 | 60.5 | 43.5 | 92.4% |
| NanoVDR-M | BERT-base | 112M | 82.1 | 62.2 | 44.7 | 94.0% |
| NanoVDR-L | ModernBERT | 151M | 82.4 | 61.5 | 44.2 | 93.4% |
Citation
@article{nanovdr2026,
title={NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval},
author={Liu, Zhuchenyang and Zhang, Yao and Xiao, Yu},
journal={arXiv preprint arXiv:2502.XXXXX},
year={2026}
}
License
Apache 2.0
Base model: distilbert/distilbert-base-uncased