---
title: NanoVDR
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---
# NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

Demo | GitHub | Dataset
> **Paper**: Our arXiv preprint is currently on hold. Details on the training methodology, ablations, and full results will be available once the paper is published.
---
**NanoVDR** distills a frozen 2B VLM teacher ([Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B)) into tiny **text-only** query encoders (69–151M parameters) for visual document retrieval. Documents are indexed offline by the teacher; queries are encoded on CPU in **51 ms** via a DistilBERT forward pass — **no vision model at query time**.
Queries and documents both map to the same **2048-dim single vector** inherited from the teacher's embedding space, so retrieval is a plain dot product — FAISS-compatible with no MaxSim pooling. The doc index stores just **4 KB per page** (float16), making NanoVDR **64× more storage-efficient** than multi-vector retrievers like ColPali.
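The 64× storage claim can be checked with quick arithmetic. The sketch below assumes a ColPali-style layout of roughly 1024 patch vectors of 128 dims per page (illustrative figures, not measured here); NanoVDR's side follows directly from the stated 2048-dim float16 vector.

```python
# Back-of-envelope storage per indexed page.
DIM = 2048        # single vector per page, inherited from the teacher
BYTES_FP16 = 2    # bytes per float16 value

nanovdr_page = DIM * BYTES_FP16  # 4096 bytes = 4 KB per page

# Assumed ColPali-style multi-vector layout: ~1024 patch vectors x 128 dims.
colpali_page = 1024 * 128 * BYTES_FP16  # 262144 bytes = 256 KB per page

print(nanovdr_page, colpali_page, colpali_page // nanovdr_page)
# 4096 262144 64
```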
## Models
| Model | Backbone | Params | ViDoRe v1 | v2 | v3 | Retention | CPU Latency |
|-------|----------|--------|-----------|----|----|-----------|-------------|
| [**NanoVDR-S-Multi**](https://huggingface.co/nanovdr/NanoVDR-S-Multi) ⭐ | DistilBERT | **69M** | **82.2** | **61.9** | **46.5** | **95.1%** | **51 ms** |
| [NanoVDR-S](https://huggingface.co/nanovdr/NanoVDR-S) | DistilBERT | 69M | 82.2 | 60.5 | 43.5 | 92.4% | 51 ms |
| [NanoVDR-M](https://huggingface.co/nanovdr/NanoVDR-M) | BERT-base | 112M | 82.1 | 62.2 | 44.7 | 94.0% | 101 ms |
| [NanoVDR-L](https://huggingface.co/nanovdr/NanoVDR-L) | ModernBERT | 151M | 82.4 | 61.5 | 44.2 | 93.4% | 109 ms |
NDCG@5 (×100) on the ViDoRe benchmark (22 datasets). Retention = Student / Teacher. Teacher = Qwen3-VL-Embedding-2B (2.0B).
## Quick Start
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("nanovdr/NanoVDR-S-Multi")
query_emb = model.encode(["What was the revenue growth in Q3 2024?"]) # (1, 2048)
# Retrieve via dot product against teacher-indexed document embeddings
# scores = query_emb @ doc_embeddings.T
```
> Documents must be indexed offline with the teacher VLM. See the [NanoVDR-S-Multi](https://huggingface.co/nanovdr/NanoVDR-S-Multi#prerequisites-document-indexing-with-teacher-model) model page for a complete guide.
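Since queries and documents share one single-vector space, end-to-end retrieval is just a matrix multiply and a sort. Here is a minimal sketch with synthetic embeddings standing in for real teacher/student outputs, assuming both sides are L2-normalized (so the dot product equals cosine similarity); for larger corpora the same vectors can go straight into a FAISS inner-product index.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    # Normalize each row to unit length so dot product == cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Synthetic stand-ins: 5 "document pages" in a 2048-dim space, stored as
# float16 (the 4 KB/page format), and one query that matches page 2.
doc_embeddings = l2_normalize(rng.standard_normal((5, 2048))).astype(np.float16)
query_emb = l2_normalize(doc_embeddings[2:3].astype(np.float32))

# Score in float32 for numerical stability, then rank pages by similarity.
scores = query_emb @ doc_embeddings.astype(np.float32).T  # shape (1, 5)
top_k = np.argsort(-scores[0])[:3]
print(top_k[0])  # the matching page ranks first
```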
## Acknowledgements
This project has received funding from the **Business Finland** co-innovation programme under grant agreement No. 69/31/2025. It is supported by the [AiWo: Human-centric AI-enabled Collaborative Fieldwork Operations](https://aifieldwork.aalto.fi/events/) project (2025–2027), which aims to revolutionize fieldwork operations and enhance human-AI collaboration across the manufacturing, construction, and industrial design sectors. The calculations presented in this project were performed using computer resources within the Aalto University School of Science "Science-IT" project.