---
title: NanoVDR
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---
# NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

Demo | GitHub | Dataset
> **Paper**: Our arXiv preprint is currently on hold. Details on the training methodology, ablations, and full results will be available once the paper is published.
---
**NanoVDR** distills a frozen 2B VLM teacher ([Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B)) into tiny **text-only** query encoders (69–151M parameters) for visual document retrieval. Documents are indexed offline by the teacher; queries are encoded on CPU in **51 ms** via a DistilBERT forward pass — **no vision model at query time**.
Queries and documents both map to the same **2048-dim single vector** inherited from the teacher's embedding space, so retrieval is a plain dot product — FAISS-compatible with no MaxSim pooling. The doc index stores just **4 KB per page** (float16), making NanoVDR **64× more storage-efficient** than multi-vector retrievers like ColPali.
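The 64× storage claim can be checked with quick arithmetic. The sketch below assumes a ColPali-style layout of roughly 1024 patch vectors of 128 dims per page (illustrative figures, not measured here); NanoVDR's side follows directly from the stated 2048-dim float16 vector.

```python
# Back-of-envelope storage per indexed page.
DIM = 2048        # single vector per page, inherited from the teacher
BYTES_FP16 = 2    # bytes per float16 value

nanovdr_page = DIM * BYTES_FP16  # 4096 bytes = 4 KB per page

# Assumed ColPali-style multi-vector layout: ~1024 patch vectors x 128 dims.
colpali_page = 1024 * 128 * BYTES_FP16  # 262144 bytes = 256 KB per page

print(nanovdr_page, colpali_page, colpali_page // nanovdr_page)
# 4096 262144 64
```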
## Models
| Model | Backbone | Params | ViDoRe v1 | v2 | v3 | Retention | CPU Latency |
|-------|----------|--------|-----------|----|----|-----------|-------------|
| [**NanoVDR-S-Multi**](https://huggingface.co/nanovdr/NanoVDR-S-Multi) ⭐ | DistilBERT | **69M** | **82.2** | **61.9** | **46.5** | **95.1%** | **51 ms** |
| [NanoVDR-S](https://huggingface.co/nanovdr/NanoVDR-S) | DistilBERT | 69M | 82.2 | 60.5 | 43.5 | 92.4% | 51 ms |
| [NanoVDR-M](https://huggingface.co/nanovdr/NanoVDR-M) | BERT-base | 112M | 82.1 | 62.2 | 44.7 | 94.0% | 101 ms |
| [NanoVDR-L](https://huggingface.co/nanovdr/NanoVDR-L) | ModernBERT | 151M | 82.4 | 61.5 | 44.2 | 93.4% | 109 ms |
NDCG@5 (×100) on the ViDoRe benchmark (22 datasets). Retention = Student / Teacher. Teacher = Qwen3-VL-Embedding-2B (2.0B).
## Quick Start
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("nanovdr/NanoVDR-S-Multi")
query_emb = model.encode(["What was the revenue growth in Q3 2024?"]) # (1, 2048)
# Retrieve via dot product against teacher-indexed document embeddings
# scores = query_emb @ doc_embeddings.T
```
> Documents must be indexed offline with the teacher VLM. See the [NanoVDR-S-Multi](https://huggingface.co/nanovdr/NanoVDR-S-Multi#prerequisites-document-indexing-with-teacher-model) model page for a complete guide.
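Since queries and documents share one single-vector space, end-to-end retrieval is just a matrix multiply and a sort. Here is a minimal sketch with synthetic embeddings standing in for real teacher/student outputs, assuming both sides are L2-normalized (so the dot product equals cosine similarity); for larger corpora the same vectors can go straight into a FAISS inner-product index.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    # Normalize each row to unit length so dot product == cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Synthetic stand-ins: 5 "document pages" in a 2048-dim space, stored as
# float16 (the 4 KB/page format), and one query that matches page 2.
doc_embeddings = l2_normalize(rng.standard_normal((5, 2048))).astype(np.float16)
query_emb = l2_normalize(doc_embeddings[2:3].astype(np.float32))

# Score in float32 for numerical stability, then rank pages by similarity.
scores = query_emb @ doc_embeddings.astype(np.float32).T  # shape (1, 5)
top_k = np.argsort(-scores[0])[:3]
print(top_k[0])  # the matching page ranks first
```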
## Acknowledgements
This project has received funding from the **Business Finland** co-innovation programme under grant agreement No. 69/31/2025. It is supported by the [AiWo: Human-centric AI-enabled Collaborative Fieldwork Operations](https://aifieldwork.aalto.fi/events/) project (2025–2027), which aims to revolutionize fieldwork operations and enhance human-AI collaboration across the manufacturing, construction, and industrial design sectors. The calculations presented in this project were performed using computer resources within the Aalto University School of Science "Science-IT" project.