---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- visual-document-retrieval
- cross-modal-distillation
- knowledge-distillation
- document-retrieval
- multilingual
- nanovdr
base_model: distilbert/distilbert-base-uncased
language:
- en
- de
- fr
- es
- it
- pt
license: apache-2.0
datasets:
- openbmb/VisRAG-Ret-Train-Synthetic-data
- openbmb/VisRAG-Ret-Train-In-domain-data
- vidore/colpali_train_set
- llamaindex/vdr-multilingual-train
model-index:
- name: NanoVDR-S-Multi
  results:
  - task:
      type: retrieval
    dataset:
      name: ViDoRe v1
      type: vidore/vidore-benchmark-667173f98e70a1c0fa4d
    metrics:
    - name: NDCG@5
      type: ndcg_at_5
      value: 82.2
  - task:
      type: retrieval
    dataset:
      name: ViDoRe v2
      type: vidore/vidore-benchmark-v2
    metrics:
    - name: NDCG@5
      type: ndcg_at_5
      value: 61.9
---

<p align="center">
  <img width="560" src="banner.png" alt="NanoVDR"/>
</p>

<p align="center">
  <a href="https://github.com/nanovdr/nanovdr">GitHub</a> |
  <a href="https://huggingface.co/collections/nanovdr/nanovdr">All Models</a>
</p>

> **Paper**: Our arXiv preprint is currently on hold. Details on training methodology, ablations, and full results will be available once the paper is published.
|
# NanoVDR-S-Multi

**The recommended NanoVDR model for production use.**

NanoVDR-S-Multi is a **69M-parameter multilingual text-only** query encoder for visual document retrieval. It encodes text queries into the same embedding space as a frozen 2B VLM teacher ([Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B)), so you can retrieve document page images with **only a DistilBERT forward pass**: no vision model runs at query time.

### Highlights

- **95.1% teacher retention**: a 69M text-only model recovers 95% of a 2B VLM teacher's quality across 22 ViDoRe datasets
- **Outperforms DSE-Qwen2 (2B)** on multilingual v2 (+6.2) and v3 (+4.1) with **32x fewer parameters**
- **Outperforms ColPali (~3B)** on multilingual v2 (+7.2) and v3 (+4.5) using **single-vector cosine** retrieval (no MaxSim)
- **Single-vector retrieval**: queries and documents share the 2048-dim embedding space of [Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B); retrieval is a plain dot product, FAISS-compatible, at **4 KB per page** (float16)
- **Lightweight on storage**: 274 MB checkpoint; the document index costs 64x less than ColPali's multi-vector patch index
- **51 ms CPU query latency**: 50x faster than DSE-Qwen2, 143x faster than ColPali
- **6 languages**: English, German, French, Spanish, Italian, Portuguese, each with >92% teacher retention
|
---

## Results

| Model | Type | Params | ViDoRe v1 | ViDoRe v2 | ViDoRe v3 | Avg Retention |
|-------|------|--------|-----------|-----------|-----------|---------------|
| Tomoro-8B | VLM | 8.0B | 90.6 | 65.0 | 59.0 | — |
| Qwen3-VL-Emb (Teacher) | VLM | 2.0B | 84.3 | 65.3 | 50.0 | — |
| DSE-Qwen2 | VLM | 2.2B | 85.1 | 55.7 | 42.4 | — |
| ColPali | VLM | ~3B | 84.2 | 54.7 | 42.0 | — |
| **NanoVDR-S-Multi** | **Text-only** | **69M** | **82.2** | **61.9** | **46.5** | **95.1%** |

<sub>NDCG@5 (×100). v1 = 10 English datasets, v2 = 4 multilingual datasets, v3 = 8 multilingual datasets.</sub>
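As a quick sanity check, the 95.1% figure is reproduced by averaging the per-benchmark student/teacher NDCG@5 ratios from the table (that "Avg Retention" is this unweighted mean over v1/v2/v3 is an assumption; a per-dataset weighting would give a slightly different number):

```python
# Reproduce the reported 95.1% average retention from the table above.
teacher = {"v1": 84.3, "v2": 65.3, "v3": 50.0}
student = {"v1": 82.2, "v2": 61.9, "v3": 46.5}

ratios = {v: student[v] / teacher[v] for v in teacher}
avg_retention = sum(ratios.values()) / len(ratios)

print({v: round(r * 100, 1) for v, r in ratios.items()})  # {'v1': 97.5, 'v2': 94.8, 'v3': 93.0}
print(round(avg_retention * 100, 1))                      # 95.1
```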
|
### Per-Language Retention (v2 + v3, 19,537 queries)

| Language | #Queries | Teacher | NanoVDR-S-Multi | Retention |
|----------|----------|---------|-----------------|-----------|
| English | 6,237 | 64.0 | 60.3 | 94.3% |
| French | 2,694 | 51.0 | 47.8 | 93.6% |
| Portuguese | 2,419 | 48.7 | 46.1 | 94.6% |
| Spanish | 2,694 | 51.4 | 47.8 | 93.1% |
| Italian | 2,419 | 49.0 | 45.7 | 93.3% |
| German | 2,694 | 49.3 | 45.4 | 92.0% |

All 6 languages achieve **>92%** of the 2B teacher's performance.
|
---

## Quick Start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nanovdr/NanoVDR-S-Multi")

queries = [
    "What was the revenue growth in Q3 2024?",                  # English
    "Quel est le chiffre d'affaires du trimestre?",             # French
    "Wie hoch war das Umsatzwachstum im dritten Quartal?",      # German
    "¿Cuál fue el crecimiento de ingresos en el Q3?",           # Spanish
    "Qual foi o crescimento da receita no terceiro trimestre?", # Portuguese
    "Qual è stata la crescita dei ricavi nel terzo trimestre?", # Italian
]
query_embeddings = model.encode(queries)
print(query_embeddings.shape)  # (6, 2048)

# Cosine similarity against pre-indexed document embeddings
# scores = query_embeddings @ doc_embeddings.T
```
|
### Prerequisites: Document Indexing with Teacher Model

NanoVDR is a **query encoder only**. Documents must be indexed offline using the teacher VLM ([Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B)), which encodes page images into 2048-d embeddings. This is a one-time cost.

```python
# pip install transformers qwen-vl-utils torch
from scripts.qwen3_vl_embedding import Qwen3VLEmbedder

teacher = Qwen3VLEmbedder(model_name_or_path="Qwen/Qwen3-VL-Embedding-2B")

# Encode document page images
documents = [
    {"image": "page_001.png"},
    {"image": "page_002.png"},
    # ... all document pages in your corpus
]
doc_embeddings = teacher.process(documents)  # (N, 2048), L2-normalized
```

> **Note:** The `Qwen3VLEmbedder` class and full usage guide (including vLLM/SGLang acceleration) are available at the [Qwen3-VL-Embedding-2B model page](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B). Document indexing requires a GPU; once indexed, retrieval uses only the CPU.
|
### Full Retrieval Pipeline

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# doc_embeddings: (N, 2048) numpy array from teacher indexing above

# Step 1: Encode the text query with NanoVDR (CPU, ~51 ms per query)
student = SentenceTransformer("nanovdr/NanoVDR-S-Multi")
query_emb = student.encode("Quel est le chiffre d'affaires?")  # shape: (2048,)

# Step 2: Retrieve via cosine similarity
scores = query_emb @ doc_embeddings.T
top_k_indices = np.argsort(scores)[-5:][::-1]
```
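For larger corpora, the full `argsort` above can be replaced by a partial sort (or handed to a FAISS inner-product index such as `IndexFlatIP`, since the embeddings are L2-normalized). A minimal NumPy sketch with synthetic stand-in embeddings; the random vectors and the perturbed-copy query are illustrative only, not real index data:

```python
import numpy as np

# Hypothetical stand-ins: random unit vectors in the model's 2048-dim space.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(10_000, 2048)).astype(np.float32)
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

# Query = a lightly perturbed copy of document 42, re-normalized.
query_emb = doc_embeddings[42] + 0.01 * rng.normal(size=2048).astype(np.float32)
query_emb /= np.linalg.norm(query_emb)

# O(N) top-k: argpartition avoids the O(N log N) full sort.
k = 5
scores = doc_embeddings @ query_emb
top_k = np.argpartition(scores, -k)[-k:]
top_k = top_k[np.argsort(scores[top_k])[::-1]]  # order the k hits by score
print(top_k[0])  # 42: the perturbed copy of document 42 ranks first
```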
|
---

## How It Works

NanoVDR uses **asymmetric cross-modal distillation** to decouple query and document encoding:

| | Document Encoding (offline) | Query Encoding (online) |
|-|----------------------------|------------------------|
| **Model** | Qwen3-VL-Embedding-2B (frozen) | NanoVDR-S-Multi (69M) |
| **Input** | Page images | Text queries (6 languages) |
| **Output** | 2048-d embedding | 2048-d embedding |
| **Hardware** | GPU (one-time indexing) | CPU (real-time serving) |

The student is trained to **align its query embeddings** with the teacher's query embeddings via a pointwise cosine loss; no document embeddings or hard negatives are needed during training. At inference, student query embeddings are directly compatible with teacher document embeddings.
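That objective can be sketched in a few lines of NumPy (illustrative only; the real training loop runs the DistilBERT student on query text and compares its output against cached teacher embeddings):

```python
import numpy as np

def cosine_alignment_loss(student_emb, teacher_emb):
    """Pointwise cosine loss: mean of 1 - cos(student, teacher) over the batch.

    No in-batch negatives and no document embeddings are involved; each
    student query embedding is pulled toward its cached teacher embedding.
    """
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

# Perfectly aligned directions give zero loss; scale does not matter.
t = np.random.default_rng(0).normal(size=(4, 2048))
assert abs(cosine_alignment_loss(3.0 * t, t)) < 1e-9
```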
|
---

## Training

| | Value |
|--|-------|
| Base model | `distilbert/distilbert-base-uncased` (66M) |
| Projector | 2-layer MLP: 768 → 768 → 2048 (2.4M params) |
| Total params | 69M |
| Objective | Pointwise cosine alignment with teacher query embeddings |
| Training data | 1.49M pairs: 711K original + 778K translated queries |
| Languages | EN (original) + DE, FR, ES, IT, PT (translated via [Helsinki-NLP Opus-MT](https://huggingface.co/Helsinki-NLP)) |
| Epochs | 10 |
| Batch size | 1,024 (effective) |
| Learning rate | 3e-4 (OneCycleLR, 3% warmup) |
| Hardware | 1× H200 GPU |
| Training time | ~10 GPU-hours |
| Embedding caching | ~1 GPU-hour (teacher encodes all queries in text mode) |
|
### Multilingual Augmentation Pipeline

1. Extract 489K English queries from the 711K training set
2. Translate each to 5 languages using Helsinki-NLP Opus-MT → 778K translated queries
3. Re-encode translated queries with the frozen teacher in text mode (15 min on H200)
4. Combine: 711K original + 778K translated = **1.49M training pairs**
5. Train with halved epochs (10 vs 20) and slightly higher lr (3e-4 vs 2e-4) to match total steps
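Step 5's step-matching can be checked with the batch size from the training table (that the dataloader rounds partial batches up is an assumption; either rounding gives the same conclusion):

```python
import math

batch = 1024  # effective batch size from the training table

# Original recipe: 711K pairs for 20 epochs.
en_only_steps = math.ceil(711_000 / batch) * 20
# Augmented recipe: 1.49M pairs for 10 epochs.
multi_steps = math.ceil(1_490_000 / batch) * 10

print(en_only_steps, multi_steps)  # 13900 14560: roughly the same optimizer-step budget
```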
|
---

## Efficiency

| | NanoVDR-S-Multi | DSE-Qwen2 | ColPali | Tomoro-8B |
|--|-----------------|-----------|---------|-----------|
| Parameters | **69M** | 2,209M | ~3B | 8,000M |
| Query latency (CPU, B=1) | **51 ms** | 2,539 ms | 7,300 ms | GPU only |
| Checkpoint size | **274 MB** | 8.8 GB | 11.9 GB | 35.1 GB |
| Index type | Single-vector | Single-vector | Multi-vector | Multi-vector |
| Scoring | Cosine | Cosine | MaxSim | MaxSim |
| Index storage (500K pages) | **4.1 GB** | 3.1 GB | 128 GB | 128 GB |
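The storage row follows from the embedding shapes. In the sketch below, the single-vector entry assumes float32 (which reproduces 4.1 GB; float16 gives the 4 KB/page highlight), and the multi-vector estimate assumes roughly 1,030 float16 patch vectors of 128 dims per page; the patch count and precision choices are assumptions, not figures stated on this card:

```python
pages = 500_000

# Single-vector: one 2048-dim vector per page.
single_fp32_gb = pages * 2048 * 4 / 1e9       # float32 bytes -> decimal GB
single_fp16_kb_per_page = 2048 * 2 / 1024     # float16 bytes -> KiB per page
print(round(single_fp32_gb, 1))               # 4.1
print(single_fp16_kb_per_page)                # 4.0

# Multi-vector: assumed ~1,030 patches x 128 dims, float16, per page.
multi_fp16_gb = pages * 1030 * 128 * 2 / 1e9
print(round(multi_fp16_gb))                   # 132, the order of the table's 128 GB
```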
|
---

## Model Variants

NanoVDR-S-Multi is the **recommended model**. The other variants are provided for research and ablation purposes.

| Model | Backbone | Params | v1 | v2 | v3 | Retention | Latency | Recommended |
|-------|----------|--------|----|----|----|-----------|---------|-------------|
| **[NanoVDR-S-Multi](https://huggingface.co/nanovdr/NanoVDR-S-Multi)** | **DistilBERT** | **69M** | **82.2** | **61.9** | **46.5** | **95.1%** | **51 ms** | **Yes** |
| [NanoVDR-S](https://huggingface.co/nanovdr/NanoVDR-S) | DistilBERT | 69M | 82.2 | 60.5 | 43.5 | 92.4% | 51 ms | EN-only |
| [NanoVDR-M](https://huggingface.co/nanovdr/NanoVDR-M) | BERT-base | 112M | 82.1 | 62.2 | 44.7 | 94.0% | 101 ms | Ablation |
| [NanoVDR-L](https://huggingface.co/nanovdr/NanoVDR-L) | ModernBERT | 151M | 82.4 | 61.5 | 44.2 | 93.4% | 109 ms | Ablation |
|
## Key Properties

| Property | Value |
|----------|-------|
| Output dimension | 2048 (aligned with Qwen3-VL-Embedding-2B) |
| Max sequence length | 512 tokens |
| Supported languages | EN, DE, FR, ES, IT, PT |
| Similarity function | Cosine similarity |
| Pooling | Mean pooling |
| Normalization | L2-normalized |
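The last two rows can be sketched as code: a minimal NumPy version of masked mean pooling followed by L2 normalization. Shapes here are illustrative only, and where the 2-layer projector sits relative to pooling is not specified on this card:

```python
import numpy as np

def mean_pool_normalize(token_embs, attention_mask):
    """Masked mean pooling over tokens, then L2 normalization.

    token_embs:     (batch, seq_len, dim) token-level encoder outputs
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embs.dtype)
    summed = (token_embs * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid divide-by-zero
    pooled = summed / counts
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Padding tokens do not affect the embedding, and outputs are unit-norm.
x = np.random.default_rng(0).normal(size=(2, 5, 8))
m = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
out = mean_pool_normalize(x, m)
assert np.allclose(np.linalg.norm(out, axis=1), 1.0)
```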
|
## Citation

```bibtex
@article{nanovdr2026,
  title={NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval},
  author={Liu, Zhuchenyang and Zhang, Yao and Xiao, Yu},
  journal={arXiv preprint arXiv:2502.XXXXX},
  year={2026}
}
```

## License

Apache 2.0
|