---
title: NanoVDR
emoji: π
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---

<p align="center">
  <img width="480" src="https://huggingface.co/nanovdr/NanoVDR-S-Multi/resolve/main/banner.png" alt="NanoVDR"/>
</p>

<h3 align="center">Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder<br>for Visual Document Retrieval</h3>

<p align="center">
  <a href="https://huggingface.co/spaces/nanovdr/NanoVDR-Demo">Demo</a> |
  <a href="https://github.com/nanovdr/nanovdr">GitHub</a> |
  <a href="https://huggingface.co/datasets/nanovdr/NanoVDR-Train">Dataset</a>
</p>

> **Paper**: Our arXiv preprint is currently on hold. Details on training methodology, ablations, and full results will be available once the paper is published.

---

**NanoVDR** distills a frozen 2B VLM teacher ([Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B)) into tiny **text-only** query encoders (69–151M parameters) for visual document retrieval. Documents are indexed offline by the teacher; queries are encoded on CPU in **51 ms** via a DistilBERT forward pass, with no vision model at query time.

Queries and documents both map to the same **2048-dim single vector** inherited from the teacher's embedding space, so retrieval is a plain dot product: FAISS-compatible, with no MaxSim pooling. The doc index stores just **4 KB per page** (float16), making NanoVDR **64× more storage-efficient** than multi-vector retrievers like ColPali.

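The storage figure follows directly from the vector layout: 2048 float16 dimensions occupy 4096 bytes per page, and scoring is a single matrix-vector product. A minimal sketch, using randomly generated stand-ins for the teacher-produced page embeddings and the student-encoded query:

```python
import numpy as np

DIM = 2048       # embedding width inherited from the teacher
N_PAGES = 1_000  # hypothetical corpus size

# Stand-in for teacher-produced page embeddings, stored as float16
doc_embeddings = np.random.randn(N_PAGES, DIM).astype(np.float16)

# 2048 dims x 2 bytes (float16) = 4096 bytes = 4 KB per page
page_bytes = doc_embeddings[0].nbytes

# Stand-in for a student-encoded query; retrieval is a plain dot product
query = np.random.randn(DIM).astype(np.float32)
scores = doc_embeddings.astype(np.float32) @ query
top5 = np.argsort(-scores)[:5]  # indices of the 5 best-matching pages
```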
## Models

| Model | Backbone | Params | ViDoRe v1 | v2 | v3 | Retention | CPU Latency |
|-------|----------|--------|-----------|----|----|-----------|-------------|
| [**NanoVDR-S-Multi**](https://huggingface.co/nanovdr/NanoVDR-S-Multi) | DistilBERT | **69M** | **82.2** | **61.9** | **46.5** | **95.1%** | **51 ms** |
| [NanoVDR-S](https://huggingface.co/nanovdr/NanoVDR-S) | DistilBERT | 69M | 82.2 | 60.5 | 43.5 | 92.4% | 51 ms |
| [NanoVDR-M](https://huggingface.co/nanovdr/NanoVDR-M) | BERT-base | 112M | 82.1 | 62.2 | 44.7 | 94.0% | 101 ms |
| [NanoVDR-L](https://huggingface.co/nanovdr/NanoVDR-L) | ModernBERT | 151M | 82.4 | 61.5 | 44.2 | 93.4% | 109 ms |

<sub>NDCG@5 (×100) on the ViDoRe benchmark (22 datasets). Retention = Student / Teacher. Teacher = Qwen3-VL-Embedding-2B (2.0B).</sub>

## Quick Start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nanovdr/NanoVDR-S-Multi")
query_emb = model.encode(["What was the revenue growth in Q3 2024?"])  # (1, 2048)

# Score against teacher-indexed document embeddings with a plain dot product
# scores = query_emb @ doc_embeddings.T
```

> Documents must be indexed offline with the teacher VLM. See the [NanoVDR-S-Multi](https://huggingface.co/nanovdr/NanoVDR-S-Multi#prerequisites-document-indexing-with-teacher-model) model page for a complete guide.

## Acknowledgements

This project has received funding from the **Business Finland** co-innovation programme under grant agreement No. 69/31/2025. It is supported by the [AiWo: Human-centric AI-enabled Collaborative Fieldwork Operations](https://aifieldwork.aalto.fi/events/) project (2025–2027), which aims to revolutionize fieldwork operations and enhance human-AI collaboration across the manufacturing, construction, and industrial design sectors. The calculations presented in this project were performed using computer resources within the Aalto University School of Science "Science-IT" project.