---
title: NanoVDR
emoji: π
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---

<p align="center">
  <img width="480" src="https://huggingface.co/nanovdr/NanoVDR-S-Multi/resolve/main/banner.png" alt="NanoVDR"/>
</p>

<h3 align="center">Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder<br>for Visual Document Retrieval</h3>

<p align="center">
  <a href="https://huggingface.co/spaces/nanovdr/NanoVDR-Demo">Demo</a> |
  <a href="https://github.com/nanovdr/nanovdr">GitHub</a> |
  <a href="https://huggingface.co/datasets/nanovdr/NanoVDR-Train">Dataset</a>
</p>

> **Paper**: Our arXiv preprint is currently on hold. Details on training methodology, ablations, and full results will be available once the paper is published.

---

**NanoVDR** distills a frozen 2B VLM teacher ([Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B)) into tiny **text-only** query encoders (69–151M parameters) for visual document retrieval. Documents are indexed offline by the teacher; queries are encoded on CPU in **51 ms** via a DistilBERT forward pass, with no vision model at query time.

Queries and documents both map to the same **2048-dim single vector** inherited from the teacher's embedding space, so retrieval is a plain dot product: FAISS-compatible, with no MaxSim pooling. The doc index stores just **4 KB per page** (float16), making NanoVDR **64× more storage-efficient** than multi-vector retrievers like ColPali.

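The storage figure follows directly from the vector layout: 2048 float16 dimensions occupy 4096 bytes per page, and scoring is a single matrix-vector product. A minimal sketch, using randomly generated stand-ins for the teacher-produced page embeddings and the student-encoded query:

```python
import numpy as np

DIM = 2048       # embedding width inherited from the teacher
N_PAGES = 1_000  # hypothetical corpus size

# Stand-in for teacher-produced page embeddings, stored as float16
doc_embeddings = np.random.randn(N_PAGES, DIM).astype(np.float16)

# 2048 dims x 2 bytes (float16) = 4096 bytes = 4 KB per page
page_bytes = doc_embeddings[0].nbytes

# Stand-in for a student-encoded query; retrieval is a plain dot product
query = np.random.randn(DIM).astype(np.float32)
scores = doc_embeddings.astype(np.float32) @ query
top5 = np.argsort(-scores)[:5]  # indices of the 5 best-matching pages
```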
## Models

| Model | Backbone | Params | ViDoRe v1 | v2 | v3 | Retention | CPU Latency |
|-------|----------|--------|-----------|----|----|-----------|-------------|
| [**NanoVDR-S-Multi**](https://huggingface.co/nanovdr/NanoVDR-S-Multi) | DistilBERT | **69M** | **82.2** | **61.9** | **46.5** | **95.1%** | **51 ms** |
| [NanoVDR-S](https://huggingface.co/nanovdr/NanoVDR-S) | DistilBERT | 69M | 82.2 | 60.5 | 43.5 | 92.4% | 51 ms |
| [NanoVDR-M](https://huggingface.co/nanovdr/NanoVDR-M) | BERT-base | 112M | 82.1 | 62.2 | 44.7 | 94.0% | 101 ms |
| [NanoVDR-L](https://huggingface.co/nanovdr/NanoVDR-L) | ModernBERT | 151M | 82.4 | 61.5 | 44.2 | 93.4% | 109 ms |

<sub>NDCG@5 (×100) on the ViDoRe benchmark (22 datasets). Retention = Student / Teacher. Teacher = Qwen3-VL-Embedding-2B (2.0B).</sub>

## Quick Start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nanovdr/NanoVDR-S-Multi")
query_emb = model.encode(["What was the revenue growth in Q3 2024?"])  # (1, 2048)

# Score against teacher-indexed document embeddings with a plain dot product
# scores = query_emb @ doc_embeddings.T
```

> Documents must be indexed offline with the teacher VLM. See the [NanoVDR-S-Multi](https://huggingface.co/nanovdr/NanoVDR-S-Multi#prerequisites-document-indexing-with-teacher-model) model page for a complete guide.

## Acknowledgements

This project has received funding from the **Business Finland** co-innovation programme under grant agreement No. 69/31/2025. It is supported by the [AiWo: Human-centric AI-enabled Collaborative Fieldwork Operations](https://aifieldwork.aalto.fi/events/) project (2025–2027), which aims to revolutionize fieldwork operations and enhance human-AI collaboration across the manufacturing, construction, and industrial design sectors. The calculations presented in this project were performed using computer resources within the Aalto University School of Science "Science-IT" project.