NanoVDR

NanoVDR-M

BERT-base ablation variant. For production use, we recommend NanoVDR-S-Multi.

NanoVDR-M is a 112M-parameter text-only query encoder for visual document retrieval, trained via asymmetric cross-modal distillation from Qwen3-VL-Embedding-2B. It pairs a BERT-base backbone with a 2-layer MLP projector and achieves the highest ViDoRe v2 score (62.2 NDCG@5) among all NanoVDR variants.

Highlights

  • Single-vector retrieval: queries and documents share the same 2048-dim embedding space as Qwen3-VL-Embedding-2B, so retrieval is a plain dot product that works with FAISS and costs 4 KB per page (float16)
  • Lightweight on storage: the model is 454 MB, and the document index is 64× smaller than ColPali's multi-vector patch index
  • Asymmetric setup: a small 112M text encoder runs at query time, while the large VLM indexes documents once, offline
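
The storage figures in the highlights can be checked with a little arithmetic. The sketch below reproduces the 4 KB per page number from the 2048-dim float16 embedding. The ColPali layout (roughly 1030 vectors of 128 dims per page) is an assumption that is not stated in this card, and `index_size_gib` is a hypothetical helper name.

```python
# Back-of-envelope index sizes. The ColPali figures are assumptions,
# not values stated in this model card.
EMBED_DIM = 2048   # NanoVDR / teacher embedding size (from the card)
BYTES_FP16 = 2     # float16 stores each value in 2 bytes

single_vector_bytes = EMBED_DIM * BYTES_FP16

# Assumed ColPali layout: about 1030 vectors per page, 128 dims each.
COLPALI_VECTORS_PER_PAGE = 1030
COLPALI_DIM = 128
multi_vector_bytes = COLPALI_VECTORS_PER_PAGE * COLPALI_DIM * BYTES_FP16

ratio = multi_vector_bytes / single_vector_bytes


def index_size_gib(num_pages: int) -> float:
    """Size in GiB of a float16 single-vector index for num_pages pages."""
    return num_pages * single_vector_bytes / 1024**3


print(single_vector_bytes)   # 4096 bytes, i.e. 4 KB per page
print(round(ratio, 1))       # 64.4
print(round(index_size_gib(1_000_000), 2))
```

Under these assumptions, a one-million-page index fits in under 4 GiB of float16 vectors.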

Results

| Model | Params | ViDoRe v1 | ViDoRe v2 | ViDoRe v3 | Avg Retention |
|---|---|---|---|---|---|
| Qwen3-VL-Emb (Teacher) | 2.0B | 84.3 | 65.3 | 50.0 | — |
| NanoVDR-M | 112M | 82.1 | 62.2 | 44.7 | 94.0% |
| NanoVDR-S-Multi | 69M | 82.2 | 61.9 | 46.5 | 95.1% |

Scores are NDCG@5 (×100). Retention is the student-to-teacher score ratio, averaged across ViDoRe v1, v2, and v3.
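
As a worked example, the retention column can be reproduced from the scores in the table by averaging the per-benchmark student/teacher ratios. The function name `retention` is a hypothetical helper, not part of any released code.

```python
# NDCG@5 (x100) scores copied from the results table.
teacher = {"v1": 84.3, "v2": 65.3, "v3": 50.0}
students = {
    "NanoVDR-M": {"v1": 82.1, "v2": 62.2, "v3": 44.7},
    "NanoVDR-S-Multi": {"v1": 82.2, "v2": 61.9, "v3": 46.5},
}


def retention(student: dict, teacher: dict) -> float:
    """Mean of per-benchmark student/teacher ratios, as a percentage."""
    ratios = [student[k] / teacher[k] for k in teacher]
    return 100.0 * sum(ratios) / len(ratios)


for name, scores in students.items():
    print(f"{name}: {retention(scores, teacher):.1f}%")
# NanoVDR-M: 94.0%
# NanoVDR-S-Multi: 95.1%
```

Note that this is a mean of ratios, not a ratio of mean scores; the two give slightly different numbers.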

Usage

Prerequisite: Documents must be indexed offline using Qwen3-VL-Embedding-2B (the teacher model). See the NanoVDR-S-Multi model page for a complete indexing guide.

from sentence_transformers import SentenceTransformer
import numpy as np

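# doc_embeddings: (N, 2048) array from teacher indexing (see prerequisite above).
# "doc_embeddings.npy" is a placeholder path; point it at your own saved index.
doc_embeddings = np.load("doc_embeddings.npy")

model = SentenceTransformer("nanovdr/NanoVDR-M")
query_embeddings = model.encode(["What was the revenue growth in Q3?"])  # (1, 2048)

# Dot-product similarity between the query and every indexed page: (1, N)
scores = query_embeddings @ doc_embeddings.T
# Indices of the 5 highest-scoring pages, best match first
top_k_indices = np.argsort(scores[0])[-5:][::-1]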
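
For larger indexes or batches of queries, the same dot-product ranking can be written without a full sort. The following is a minimal sketch that runs on synthetic random embeddings, so it needs neither the model nor a real index. The helper name `top_k` and the random data are illustrative assumptions.

```python
import numpy as np


def top_k(query_embeddings: np.ndarray, doc_embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Return (num_queries, k) document indices ranked by dot product, best first."""
    # Upcast so a float16 index and float32 queries multiply cleanly.
    scores = query_embeddings.astype(np.float32) @ doc_embeddings.astype(np.float32).T
    k = min(k, scores.shape[1])
    # argpartition selects the k largest scores per row without sorting everything.
    part = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    part_scores = np.take_along_axis(scores, part, axis=1)
    # Sort only the k selected candidates, highest score first.
    order = np.argsort(-part_scores, axis=1)
    return np.take_along_axis(part, order, axis=1)


# Synthetic stand-ins for real embeddings (random data, illustration only).
rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 2048)).astype(np.float16)
queries = docs[[42, 7]].astype(np.float32)  # queries identical to two documents

ranked = top_k(queries, docs, k=5)
print(ranked[:, 0])  # each query's own document should rank first
```

With real data, `docs` would be the teacher-produced index and `queries` the output of `model.encode(...)`.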

Training Details

| | Value |
|---|---|
| Architecture | BERT-base (109M) + MLP projector (768 → 768 → 2048, 2.4M) = 112M |
| Objective | Pointwise cosine alignment with teacher query embeddings |
| Data | 711K query-document pairs |
| Epochs / lr | 20 / 2e-4 |
| Training cost | ~10.5 GPU-hours (1× H200) |
| CPU query latency | 101 ms |
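
The objective row describes pointwise cosine alignment between student and teacher query embeddings. One common way to express that is the mean of `1 - cos(student, teacher)` over a batch. The sketch below assumes that form and that reduction; the card does not spell out the exact loss, and `cosine_alignment_loss` is a hypothetical name.

```python
import numpy as np


def cosine_alignment_loss(student: np.ndarray, teacher: np.ndarray, eps: float = 1e-8) -> float:
    """Mean of (1 - cosine similarity) over paired rows of shape (batch, dim)."""
    # Normalize each row to unit length; eps guards against division by zero.
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    cos = np.sum(s * t, axis=1)
    return float(np.mean(1.0 - cos))


# Random stand-ins for teacher query embeddings (illustration only).
rng = np.random.default_rng(0)
teacher_q = rng.standard_normal((4, 2048))

loss_match = cosine_alignment_loss(teacher_q, teacher_q)      # close to 0 for a perfect match
loss_opposite = cosine_alignment_loss(-teacher_q, teacher_q)  # close to 2 for opposite vectors
print(loss_match, loss_opposite)
```

Because the loss compares each student embedding only with its own teacher embedding, it needs no negatives and no document images at training time.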

All NanoVDR Models

| Model | Backbone | Params | v1 | v2 | v3 | Retention |
|---|---|---|---|---|---|---|
| NanoVDR-S-Multi | DistilBERT | 69M | 82.2 | 61.9 | 46.5 | 95.1% |
| NanoVDR-S | DistilBERT | 69M | 82.2 | 60.5 | 43.5 | 92.4% |
| NanoVDR-M | BERT-base | 112M | 82.1 | 62.2 | 44.7 | 94.0% |
| NanoVDR-L | ModernBERT | 151M | 82.4 | 61.5 | 44.2 | 93.4% |

Citation

@article{nanovdr2026,
  title={NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval},
  author={Liu, Zhuchenyang and Zhang, Yao and Xiao, Yu},
  journal={arXiv preprint arXiv:2502.XXXXX},
  year={2026}
}

License

Apache 2.0
