---
language:
- en
- es
- fr
- de
- it
- hi
- mr
- sa
- kn
- te
- ta
- ml
- zh
- ja
- ko
- ar
- bn
- gu
- or
- pa
- ru
- th
license: gemma
library_name: transformers
tags:
- vision-language
- retrieval
- multimodal
- multilingual
- document-retrieval
- matryoshka-embeddings
- dense-retrieval
- 22-languages
pipeline_tag: visual-document-retrieval
base_model:
- google/gemma-3-4b-it
model-index:
- name: NetraEmbed
  results:
  - task:
      type: image-text-retrieval
      name: Cross-Lingual Document Retrieval
    dataset:
      type: Cognitive-Lab/nayanair-bench
      name: Nayana-IR Cross-Lingual
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.716
      name: NDCG@5
    - type: recall_at_10
      value: 0.871
      name: Recall@10
    - type: map_at_10
      value: 0.703
      name: MAP@10
    - type: mrr_at_10
      value: 0.775
      name: MRR@10
  - task:
      type: image-text-retrieval
      name: Monolingual Document Retrieval
    dataset:
      type: Cognitive-Lab/nayanair-bench
      name: Nayana-IR Monolingual
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.738
      name: NDCG@5
    - type: recall_at_10
      value: 0.844
      name: Recall@10
    - type: map_at_10
      value: 0.709
      name: MAP@10
    - type: mrr_at_10
      value: 0.751
      name: MRR@10
  - task:
      type: image-text-retrieval
      name: English Document Retrieval
    dataset:
      type: vidore/vidore-benchmark
      name: ViDoRe v2
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.554
      name: NDCG@5
    - type: recall_at_10
      value: 0.637
      name: Recall@10
    - type: map_at_10
      value: 0.437
      name: MAP@10
    - type: mrr_at_10
      value: 0.647
      name: MRR@10
---
# NetraEmbed

![NetraEmbed Banner](https://cdn-uploads.huggingface.co/production/uploads/6442d975ad54813badc1ddf7/wNumrelVx2ldL9VffaiGS.png)

[![Paper](https://img.shields.io/badge/arXiv-2512.03514-b31b1b.svg)](https://arxiv.org/abs/2512.03514)
[![GitHub](https://img.shields.io/badge/GitHub-colpali-181717?logo=github)](https://github.com/adithya-s-k/colpali)
[![Model](https://img.shields.io/badge/πŸ€—%20HuggingFace-Model-yellow)](https://huggingface.co/Cognitive-Lab/NetraEmbed)
[![Blog](https://img.shields.io/badge/Blog-CognitiveLab-blue)](https://www.cognitivelab.in/blog/introducing-netraembed)
[![Demo](https://img.shields.io/badge/Demo-Try%20it%20out-green)](https://huggingface.co/spaces/AdithyaSK/NetraEmbed)
[![Colab](https://img.shields.io/badge/Colab-Run%20Code-F9AB00?logo=googlecolab&logoColor=white)](https://huggingface.co/Cognitive-Lab/NetraEmbed/blob/main/NetraEmbed_InferenceDemo.ipynb)
[![Colab](https://img.shields.io/badge/Colab-Gradio%20Demo-F9AB00?logo=googlecolab&logoColor=white)](https://huggingface.co/Cognitive-Lab/NetraEmbed/blob/main/NetraEmbed_Gradio_Demo_final.ipynb)

**NetraEmbed** is a state-of-the-art multilingual multimodal embedding model for visual document retrieval, built on the Gemma3 backbone with Matryoshka representation learning.

## Model Description

NetraEmbed is a multilingual multimodal embedding model that encodes both visual documents and text queries into single dense vectors. It supports 22 languages and, through Matryoshka representation learning, enables efficient similarity search at three embedding dimensions (768, 1536, and 2560).

- **Model Type:** Multilingual Multimodal Embedding Model with Matryoshka embeddings
- **Architecture:** BiEncoder with Gemma3-4B backbone
- **Embedding Dimensions:** 768, 1536, 2560 (Matryoshka)
- **Capabilities:** Multilingual, Multimodal (Vision + Text)
- **Use Case:** Visual document retrieval, multilingual semantic search, cross-lingual document understanding

## Paper

πŸ“„ **[M3DR: Towards Universal Multilingual Multimodal Document Retrieval](https://arxiv.org/abs/2512.03514)**

## Installation

```bash
pip install git+https://github.com/adithya-s-k/colpali.git
```

## Quick Start

```python
import torch
from PIL import Image
from colpali_engine.models import BiGemma3, BiGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/NetraEmbed"

# Load model once (supports all Matryoshka dimensions)
model = BiGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = BiGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_texts(queries).to(model.device)

# Choose embedding dimension at inference time: 768, 1536, or 2560
# Use lower dims for faster search, higher for better accuracy
embedding_dim = 1536  # Balanced performance

with torch.no_grad():
    image_embeddings = model(**batch_images, embedding_dim=embedding_dim)  # Shape: (num_images, embedding_dim)
    query_embeddings = model(**batch_queries, embedding_dim=embedding_dim)  # Shape: (num_queries, embedding_dim)

# Compute similarity scores using cosine similarity
scores = processor.score(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.4f})")
```
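
If you prefer not to use the `processor.score` helper, the scores can be reproduced directly in PyTorch. A minimal sketch: since the model card states that scoring uses cosine similarity, L2-normalizing both sides and taking a matrix product should match the helper up to numerical precision.

```python
import torch.nn.functional as F

# Cosine similarity by hand: L2-normalize, then matrix-multiply.
q = F.normalize(query_embeddings.float(), dim=-1)  # (num_queries, dim)
p = F.normalize(image_embeddings.float(), dim=-1)  # (num_images, dim)
manual_scores = q @ p.T                            # (num_queries, num_images)
```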

### Testing Multiple Dimensions

You can test different embedding dimensions without reloading the model:

```python
# Load model once
model = BiGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# Test all Matryoshka dimensions
for embedding_dim in [768, 1536, 2560]:
    print(f"\nTesting dimension: {embedding_dim}")

    with torch.no_grad():
        image_embeddings = model(**batch_images, embedding_dim=embedding_dim)
        query_embeddings = model(**batch_queries, embedding_dim=embedding_dim)

    scores = processor.score(qs=query_embeddings, ps=image_embeddings)
    print(f"Scores shape: {scores.shape}")
    print(f"Best match score: {scores.max().item():.4f}")
```

## Matryoshka Embeddings

NetraEmbed supports three embedding dimensions that can be selected **at inference time**:

| Dimension | Use Case | Speed | Accuracy |
|-----------|----------|-------|----------|
| 768 | Fast search, large-scale | ⚑⚑⚑ | ⭐⭐ |
| 1536 | Balanced performance | ⚑⚑ | ⭐⭐⭐ |
| 2560 | Maximum accuracy | ⚑ | ⭐⭐⭐⭐ |

**Key Advantage:** Load the model once and dynamically choose dimensions at inference time. No need to reload the model to test different dimensions or switch between accuracy/speed trade-offs!
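
Under the hood, Matryoshka representation learning trains the leading coordinates of the full vector to be useful on their own, so a smaller embedding behaves like a truncated, re-normalized prefix of the larger one. The sketch below is purely illustrative of that idea (the `embedding_dim` argument above is the supported interface; the model's exact internal slicing is not specified here):

```python
import torch
import torch.nn.functional as F

def truncate_matryoshka(emb: torch.Tensor, dim: int) -> torch.Tensor:
    """Illustrative only: keep the first `dim` coordinates and re-normalize
    so cosine similarities stay well-defined at the smaller size."""
    return F.normalize(emb[..., :dim].float(), dim=-1)

with torch.no_grad():
    full = model(**batch_images, embedding_dim=2560)
small = truncate_matryoshka(full, 768)  # compact 768-d view of the same vectors
```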

## Use Cases

- **Efficient Document Retrieval:** Fast search through millions of documents
- **Semantic Search:** Find visually similar documents
- **Scalable Vector Search:** Works with FAISS, Milvus, Pinecone, etc. (see the FAISS sketch after this list)
- **Cross-lingual Retrieval:** Multilingual visual document search
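
As a concrete example of the vector-search workflow, here is a minimal FAISS sketch reusing `model` and `processor` from the Quick Start. The FAISS specifics (`IndexFlatIP` over L2-normalized vectors to get cosine similarity) are illustrative choices, not something this model prescribes:

```python
import faiss  # pip install faiss-cpu
import numpy as np
import torch
import torch.nn.functional as F

dim = 768  # smallest Matryoshka dimension -> most compact index

# Encode documents and queries at the chosen dimension.
with torch.no_grad():
    doc_emb = model(**batch_images, embedding_dim=dim)
    query_emb = model(**batch_queries, embedding_dim=dim)

# L2-normalize so inner product equals cosine similarity.
docs = F.normalize(doc_emb.float(), dim=-1).cpu().numpy()
qrys = F.normalize(query_emb.float(), dim=-1).cpu().numpy()

index = faiss.IndexFlatIP(dim)
index.add(np.ascontiguousarray(docs, dtype=np.float32))
scores, ids = index.search(np.ascontiguousarray(qrys, dtype=np.float32),
                           k=min(5, index.ntotal))
```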

## Model Details

- **Base Model:** [Gemma3-4B-IT](https://huggingface.co/google/gemma-3-4b-it)
- **Vision Encoder:** SigLIP
- **Training Data:** Multilingual document datasets
- **Embedding Strategy:** Single-vector (BiEncoder)
- **Similarity Function:** Cosine similarity
- **Matryoshka Dimensions:** 768, 1536, 2560

## Performance

NetraEmbed achieves state-of-the-art performance on multilingual document retrieval benchmarks. It is evaluated on [Nayana-IR Bench](https://huggingface.co/collections/Cognitive-Lab/nayanair-bench) (22 languages) and on ViDoRe v2.

### Benchmark Results

**Nayana-IR Cross-Lingual**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| **NetraEmbed** | **0.716** | **0.871** | **0.703** | **0.775** |
| Jina-Embeddings-v4 | 0.435 | 0.435 | 0.390 | 0.548 |
| ColNomic-Embed-3B | 0.315 | 0.320 | 0.267 | 0.444 |
| ColPali-v1.3 | 0.284 | 0.347 | 0.249 | 0.403 |
| GME-Qwen2-VL-2B | 0.235 | 0.308 | 0.209 | 0.314 |
| ColQwen2.5-v0.2 | 0.143 | 0.160 | 0.127 | 0.220 |
| ColQwen2-v1.0 | 0.050 | 0.065 | 0.038 | 0.109 |

**Nayana-IR Monolingual**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| **NetraEmbed** | **0.738** | **0.844** | **0.709** | **0.751** |
| ColNomic-Embed-3B | 0.534 | 0.603 | 0.515 | 0.546 |
| ColQwen2.5-v0.2 | 0.453 | 0.513 | 0.437 | 0.464 |
| GME-Qwen2-VL-2B | 0.444 | 0.525 | 0.426 | 0.452 |
| ColQwen2-v1.0 | 0.413 | 0.466 | 0.398 | 0.422 |
| ColPali-v1.3 | 0.410 | 0.484 | 0.393 | 0.422 |

**ViDoRe v2**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| ColQwen2.5-v0.2 | 0.592 | 0.664 | 0.484 | 0.711 |
| Jina-Embeddings-v4 | 0.576 | 0.686 | - | - |
| GME-Qwen2-VL-2B | 0.574 | 0.630 | 0.466 | 0.690 |
| ColNomic-Embed-3B | 0.556 | 0.633 | 0.451 | 0.672 |
| **NetraEmbed** | **0.554** | **0.637** | **0.437** | **0.647** |
| ColQwen2-v1.0 | 0.545 | 0.640 | 0.438 | 0.653 |
| ColPali-v1.3 | 0.538 | 0.627 | 0.436 | 0.644 |

**Key Results:**
- πŸ† **State-of-the-art** on multilingual retrieval (0.716 NDCG@5 cross-lingual)
- πŸ“ˆ **152% improvement** over ColPali-v1.3 on cross-lingual tasks
- 🌍 Consistent performance across **22 languages** and diverse scripts
- ⚑ **250x more efficient** than multi-vector approaches (~10KB vs ~2.5MB per document; a quick sanity check follows below)
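
The storage figure is easy to sanity-check: a single 2560-dimensional float32 vector occupies 2560 Γ— 4 bytes β‰ˆ 10 KB per document, whereas multi-vector schemes store one vector per image patch, so their per-page footprint is orders of magnitude larger (the ~2.5 MB cited above).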

See our [paper](https://arxiv.org/abs/2512.03514) for comprehensive evaluation and per-language analysis.

## Citation

```bibtex
@misc{kolavi2025m3druniversalmultilingualmultimodal,
  title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval}, 
  author={Adithya S Kolavi and Vyoman Jain},
  year={2025},
  eprint={2512.03514},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2512.03514}
}
```

## License

This model is released under the same license as the base Gemma3 model.

## Acknowledgments

Compute credits for training, inference, and evaluation were provided by [Modal](https://modal.com), our compute sponsor. Dataset curation and synthesis were supported by the [Meta LLaMA Impact Grant](https://about.fb.com/news/2025/04/llama-impact-grant-recipients/?utm_source=AIatMeta&utm_medium=organic_social&utm_content=image&utm_campaign=llamacon) through our [Nayana initiative](https://www.cognitivelab.in/nayana). We thank Meta for its continued support of our research at [CognitiveLab](https://www.cognitivelab.in).

Built on top of the Gemma3 architecture with Matryoshka representation learning.