---
title: NanoVDR
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---

# NanoVDR

**Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval**

Demo | GitHub | Dataset

**Paper:** Our arXiv preprint is currently on hold. Details on the training methodology, ablations, and full results will be available once the paper is published.


NanoVDR distills a frozen 2B VLM teacher (Qwen3-VL-Embedding-2B) into tiny text-only query encoders (69–151M parameters) for visual document retrieval. Documents are indexed offline by the teacher; queries are encoded on CPU in 51 ms via a DistilBERT forward pass β€” no vision model at query time.

Queries and documents both map into the same 2048-dim single-vector embedding space inherited from the teacher, so retrieval is a plain dot product — FAISS-compatible, with no MaxSim pooling. The document index stores just 4 KB per page (2048 float16 values), making NanoVDR 64× more storage-efficient than multi-vector retrievers like ColPali.
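The storage and scoring claims above can be checked in a few lines of NumPy. This is a minimal sketch with random placeholder embeddings (the names `doc_embeddings` and `query` are illustrative, not part of the NanoVDR API):

```python
import numpy as np

DIM = 2048  # single-vector dimension inherited from the teacher

# Storage per page: 2048 dims x 2 bytes (float16) = 4096 bytes = 4 KB
bytes_per_page = DIM * np.dtype(np.float16).itemsize
print(bytes_per_page)  # 4096

# Single-vector retrieval is a plain dot product over the page index
rng = np.random.default_rng(0)
doc_embeddings = rng.standard_normal((1000, DIM)).astype(np.float16)
query = rng.standard_normal(DIM).astype(np.float16)

scores = doc_embeddings.astype(np.float32) @ query.astype(np.float32)
top5 = np.argsort(-scores)[:5]  # indices of the 5 best-matching pages
```

By contrast, a multi-vector retriever must keep many vectors per page and score them with MaxSim, which is where the 64× storage gap comes from.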

## Models

| Model | Backbone | Params | ViDoRe v1 | ViDoRe v2 | ViDoRe v3 | Retention | CPU Latency |
|---|---|---|---|---|---|---|---|
| NanoVDR-S-Multi ⭐ | DistilBERT | 69M | 82.2 | 61.9 | 46.5 | 95.1% | 51 ms |
| NanoVDR-S | DistilBERT | 69M | 82.2 | 60.5 | 43.5 | 92.4% | 51 ms |
| NanoVDR-M | BERT-base | 112M | 82.1 | 62.2 | 44.7 | 94.0% | 101 ms |
| NanoVDR-L | ModernBERT | 151M | 82.4 | 61.5 | 44.2 | 93.4% | 109 ms |

*NDCG@5 (×100) on the ViDoRe benchmark (22 datasets). Retention = Student / Teacher. Teacher = Qwen3-VL-Embedding-2B (2.0B params).*

## Quick Start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nanovdr/NanoVDR-S-Multi")
query_emb = model.encode(["What was the revenue growth in Q3 2024?"])  # (1, 2048)

# Retrieve via dot product against teacher-indexed document embeddings:
# scores = query_emb @ doc_embeddings.T
```

Documents must be indexed offline with the teacher VLM. See the NanoVDR-S-Multi model page for a complete guide.

## Acknowledgements

This project has received funding from the Business Finland co-innovation programme under grant agreement No. 69/31/2025. It is supported by the AiWo: Human-centric AI-enabled Collaborative Fieldwork Operations project (2025–2027), which aims to revolutionize fieldwork operations and enhance human-AI collaboration across the manufacturing, construction, and industrial design sectors. The calculations presented in this project were performed using computer resources within the Aalto University School of Science "Science-IT" project.