AdithyaSK committed on
Commit b90978c · verified · 1 Parent(s): 7ea8536

Update README.md

Files changed (1):
  1. README.md +184 -42

README.md CHANGED
@@ -1,15 +1,110 @@
  ---
- license: gemma
  language:
  - en
  base_model:
  - google/gemma-3-4b-it
- pipeline_tag: visual-document-retrieval
- library_name: transformers
  ---
-
  # NetraEmbed

  **NetraEmbed** is a state-of-the-art multilingual multimodal embedding model for visual document retrieval with Matryoshka representation learning, powered by the Gemma3 backbone.

  ## Model Description
@@ -42,14 +137,11 @@ from colpali_engine.models import BiGemma3, BiGemmaProcessor3
  # Load model and processor
  model_name = "Cognitive-Lab/NetraEmbed"

- # Choose embedding dimension: 768, 1536, or 2560
- embedding_dim = 1536  # Use lower dims for faster search, higher for better accuracy
-
  model = BiGemma3.from_pretrained(
      model_name,
-     dtype=torch.bfloat16,
      device_map="cuda",
-     embedding_dim=embedding_dim,  # Matryoshka dimension
  )
  processor = BiGemmaProcessor3.from_pretrained(model_name)
 
@@ -69,9 +161,13 @@ queries = [
  batch_images = processor.process_images(images).to(model.device)
  batch_queries = processor.process_texts(queries).to(model.device)

  with torch.no_grad():
-     image_embeddings = model(**batch_images)  # Shape: (num_images, embedding_dim)
-     query_embeddings = model(**batch_queries)  # Shape: (num_queries, embedding_dim)

  # Compute similarity scores using cosine similarity
  scores = processor.score(
@@ -85,9 +181,34 @@ for i, query in enumerate(queries):
      print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.4f})")
  ```

  ## Matryoshka Embeddings

- NetraEmbed supports three embedding dimensions:

  | Dimension | Use Case | Speed | Accuracy |
  |-----------|----------|-------|----------|
@@ -95,7 +216,7 @@ NetraEmbed supports three embedding dimensions:
  | 1536 | Balanced performance | ⚡⚡ | ⭐⭐⭐ |
  | 2560 | Maximum accuracy | ⚡ | ⭐⭐⭐⭐ |

- Choose the dimension that best fits your latency and accuracy requirements. You can even switch dimensions without retraining!

  ## Use Cases
@@ -106,42 +227,61 @@ Choose the dimension that best fits your latency and accuracy requirements. You

  ## Model Details

- - **Base Model:** Gemma3-2B
  - **Vision Encoder:** SigLIP
  - **Training Data:** Multilingual document datasets
  - **Embedding Strategy:** Single-vector (BiEncoder)
  - **Similarity Function:** Cosine similarity
  - **Matryoshka Dimensions:** 768, 1536, 2560

- ## Integration with Vector Databases
-
- NetraEmbed works seamlessly with popular vector databases:
-
- ```python
- import faiss
-
- # Create FAISS index
- dimension = 1536
- index = faiss.IndexFlatIP(dimension)  # Inner product over unit vectors = cosine similarity
-
- # Add image embeddings to the index
- embeddings_np = image_embeddings.float().cpu().numpy()
- faiss.normalize_L2(embeddings_np)  # L2-normalize so inner product equals cosine similarity
- index.add(embeddings_np)
-
- # Search (normalize the query the same way)
- query_np = query_embeddings[0:1].float().cpu().numpy()
- faiss.normalize_L2(query_np)
- k = 5  # Top 5 results
- distances, indices = index.search(query_np, k)
-
- print(f"Top {k} matches:", indices[0])
- print("Scores:", distances[0])
- ```
-
  ## Performance

- NetraEmbed achieves competitive performance on visual document retrieval benchmarks while being significantly faster than multi-vector approaches. See our [paper](https://arxiv.org/abs/2512.03514) for detailed evaluation.

  ## Citation

@@ -163,4 +303,6 @@ This model is released under the same license as the base Gemma3 model.

  ## Acknowledgments

- Built on top of the Gemma3 architecture with Matryoshka representation learning.
  ---
  language:
  - en
+ - es
+ - fr
+ - de
+ - it
+ - hi
+ - mr
+ - sa
+ - kn
+ - te
+ - ta
+ - ml
+ - zh
+ - ja
+ - ko
+ - ar
+ - bn
+ - gu
+ - or
+ - pa
+ - ru
+ - th
+ license: gemma
+ library_name: transformers
+ tags:
+ - vision-language
+ - retrieval
+ - dense vector
+ pipeline_tag: visual-document-retrieval
  base_model:
  - google/gemma-3-4b-it
+ model-index:
+ - name: NetraEmbed
+   results:
+   - task:
+       type: image-text-retrieval
+       name: Cross-Lingual Document Retrieval
+     dataset:
+       type: Cognitive-Lab/nayanair-bench
+       name: Nayana-IR Cross-Lingual
+       split: test
+     metrics:
+     - type: ndcg_at_5
+       value: 0.716
+       name: NDCG@5
+     - type: recall_at_10
+       value: 0.871
+       name: Recall@10
+     - type: map_at_10
+       value: 0.703
+       name: MAP@10
+     - type: mrr_at_10
+       value: 0.775
+       name: MRR@10
+   - task:
+       type: image-text-retrieval
+       name: Monolingual Document Retrieval
+     dataset:
+       type: Cognitive-Lab/nayanair-bench
+       name: Nayana-IR Monolingual
+       split: test
+     metrics:
+     - type: ndcg_at_5
+       value: 0.738
+       name: NDCG@5
+     - type: recall_at_10
+       value: 0.844
+       name: Recall@10
+     - type: map_at_10
+       value: 0.709
+       name: MAP@10
+     - type: mrr_at_10
+       value: 0.751
+       name: MRR@10
+   - task:
+       type: image-text-retrieval
+       name: English Document Retrieval
+     dataset:
+       type: vidore/vidore-benchmark
+       name: ViDoRe v2
+       split: test
+     metrics:
+     - type: ndcg_at_5
+       value: 0.554
+       name: NDCG@5
+     - type: recall_at_10
+       value: 0.637
+       name: Recall@10
+     - type: map_at_10
+       value: 0.437
+       name: MAP@10
+     - type: mrr_at_10
+       value: 0.647
+       name: MRR@10
  ---
 
  # NetraEmbed

+ ![NetraEmbed Banner](https://cdn-uploads.huggingface.co/production/uploads/6442d975ad54813badc1ddf7/wNumrelVx2ldL9VffaiGS.png)
+
+ [![Paper](https://img.shields.io/badge/arXiv-2512.03514-b31b1b.svg)](https://arxiv.org/abs/2512.03514)
+ [![GitHub](https://img.shields.io/badge/GitHub-colpali-181717?logo=github)](https://github.com/adithya-s-k/colpali)
+ [![Model](https://img.shields.io/badge/🤗%20HuggingFace-Model-yellow)](https://huggingface.co/Cognitive-Lab/NetraEmbed)
+ [![Blog](https://img.shields.io/badge/Blog-CognitiveLab-blue)](https://www.cognitivelab.in/blog/introducing-netraembed)
+ [![Demo](https://img.shields.io/badge/Demo-Try%20it%20out-green)](https://cloud.cognitivelab.in)
+
  **NetraEmbed** is a state-of-the-art multilingual multimodal embedding model for visual document retrieval with Matryoshka representation learning, powered by the Gemma3 backbone.

  ## Model Description
 
  # Load model and processor
  model_name = "Cognitive-Lab/NetraEmbed"

+ # Load model once (supports all Matryoshka dimensions)
  model = BiGemma3.from_pretrained(
      model_name,
+     torch_dtype=torch.bfloat16,
      device_map="cuda",
  )
  processor = BiGemmaProcessor3.from_pretrained(model_name)
 
 
  batch_images = processor.process_images(images).to(model.device)
  batch_queries = processor.process_texts(queries).to(model.device)

+ # Choose embedding dimension at inference time: 768, 1536, or 2560
+ # Use lower dims for faster search, higher dims for better accuracy
+ embedding_dim = 1536  # Balanced performance
+
  with torch.no_grad():
+     image_embeddings = model(**batch_images, embedding_dim=embedding_dim)  # Shape: (num_images, embedding_dim)
+     query_embeddings = model(**batch_queries, embedding_dim=embedding_dim)  # Shape: (num_queries, embedding_dim)

  # Compute similarity scores using cosine similarity
  scores = processor.score(

      print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.4f})")
  ```
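For intuition: scoring single-vector embeddings with cosine similarity reduces to a matrix product of L2-normalized vectors. A minimal NumPy sketch (illustrative only; the small hand-picked arrays stand in for real embeddings, and it is an assumption that `processor.score` behaves equivalently):

```python
import numpy as np

def cosine_scores(qs: np.ndarray, ps: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity: (num_queries, dim) x (num_docs, dim) -> (num_queries, num_docs)."""
    qs = qs / np.linalg.norm(qs, axis=-1, keepdims=True)
    ps = ps / np.linalg.norm(ps, axis=-1, keepdims=True)
    return qs @ ps.T

queries = np.array([[1.0, 0.0], [0.0, 2.0]])  # stand-in query embeddings
docs = np.array([[3.0, 0.0], [0.0, 0.5]])     # stand-in document embeddings
print(cosine_scores(queries, docs))           # [[1. 0.] [0. 1.]]
```

Because the normalization makes every vector unit-length, scores are scale-invariant: each query matches the document pointing in the same direction regardless of magnitude.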
 
+ ### Testing Multiple Dimensions
+
+ You can test different embedding dimensions without reloading the model:
+
+ ```python
+ # Load model once
+ model = BiGemma3.from_pretrained(
+     model_name,
+     torch_dtype=torch.bfloat16,
+     device_map="cuda",
+ )
+
+ # Test all Matryoshka dimensions
+ for embedding_dim in [768, 1536, 2560]:
+     print(f"\nTesting dimension: {embedding_dim}")
+
+     with torch.no_grad():
+         image_embeddings = model(**batch_images, embedding_dim=embedding_dim)
+         query_embeddings = model(**batch_queries, embedding_dim=embedding_dim)
+
+     scores = processor.score(qs=query_embeddings, ps=image_embeddings)
+     print(f"Scores shape: {scores.shape}")
+     print(f"Best match score: {scores.max().item():.4f}")
+ ```
  ## Matryoshka Embeddings

+ NetraEmbed supports three embedding dimensions that can be selected **at inference time**:

  | Dimension | Use Case | Speed | Accuracy |
  |-----------|----------|-------|----------|
  | 1536 | Balanced performance | ⚡⚡ | ⭐⭐⭐ |
  | 2560 | Maximum accuracy | ⚡ | ⭐⭐⭐⭐ |

+ **Key Advantage:** Load the model once and choose a dimension dynamically at inference time; there is no need to reload the model to test different dimensions or switch between accuracy/speed trade-offs.
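The mechanism behind this: Matryoshka training makes the leading coordinates of the full 2560-dim vector informative on their own, so a lower-dimensional embedding is a prefix of the full one, re-normalized to unit length. A minimal NumPy sketch of that truncate-and-renormalize step (illustrative; it is an assumption that the model's `embedding_dim` argument does the equivalent internally, and the random array stands in for real embeddings):

```python
import numpy as np

def matryoshka_truncate(vecs: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates and re-normalize each row to unit length."""
    head = vecs[..., :dim]
    return head / np.linalg.norm(head, axis=-1, keepdims=True)

full = np.random.default_rng(0).normal(size=(4, 2560))  # stand-in full-dim embeddings
for dim in (768, 1536, 2560):
    small = matryoshka_truncate(full, dim)
    print(dim, small.shape)  # every row of `small` is unit-norm
```

This is also why dimensions can be mixed and matched after indexing: a 2560-dim corpus can be truncated to 768 dims for a fast first-pass search without re-embedding any documents.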
 
  ## Use Cases

  ## Model Details

+ - **Base Model:** [Gemma3-4B-IT](https://huggingface.co/google/gemma-3-4b-it)
  - **Vision Encoder:** SigLIP
  - **Training Data:** Multilingual document datasets
  - **Embedding Strategy:** Single-vector (BiEncoder)
  - **Similarity Function:** Cosine similarity
  - **Matryoshka Dimensions:** 768, 1536, 2560
  ## Performance

+ NetraEmbed achieves state-of-the-art performance on multilingual document retrieval benchmarks. It is evaluated on [Nayana-IR Bench](https://huggingface.co/collections/Cognitive-Lab/nayanair-bench) (22 languages) and ViDoRe v2.
+
+ ### Benchmark Results
+
+ **Nayana-IR Cross-Lingual**
+
+ | Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
+ |-------|:------:|:---------:|:------:|:------:|
+ | **NetraEmbed** | **0.716** | **0.871** | **0.703** | **0.775** |
+ | Jina-Embeddings-v4 | 0.435 | 0.435 | 0.390 | 0.548 |
+ | ColNomic-Embed-3B | 0.315 | 0.320 | 0.267 | 0.444 |
+ | ColPali-v1.3 | 0.284 | 0.347 | 0.249 | 0.403 |
+ | GME-Qwen2-VL-2B | 0.235 | 0.308 | 0.209 | 0.314 |
+ | ColQwen2.5-v0.2 | 0.143 | 0.160 | 0.127 | 0.220 |
+ | ColQwen2-v1.0 | 0.050 | 0.065 | 0.038 | 0.109 |
+
+ **Nayana-IR Monolingual**
+
+ | Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
+ |-------|:------:|:---------:|:------:|:------:|
+ | **NetraEmbed** | **0.738** | **0.844** | **0.709** | **0.751** |
+ | ColNomic-Embed-3B | 0.534 | 0.603 | 0.515 | 0.546 |
+ | ColQwen2.5-v0.2 | 0.453 | 0.513 | 0.437 | 0.464 |
+ | GME-Qwen2-VL-2B | 0.444 | 0.525 | 0.426 | 0.452 |
+ | ColQwen2-v1.0 | 0.413 | 0.466 | 0.398 | 0.422 |
+ | ColPali-v1.3 | 0.410 | 0.484 | 0.393 | 0.422 |
+
+ **ViDoRe v2**
+
+ | Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
+ |-------|:------:|:---------:|:------:|:------:|
+ | ColQwen2.5-v0.2 | 0.592 | 0.664 | 0.484 | 0.711 |
+ | Jina-Embeddings-v4 | 0.576 | 0.686 | - | - |
+ | GME-Qwen2-VL-2B | 0.574 | 0.630 | 0.466 | 0.690 |
+ | ColNomic-Embed-3B | 0.556 | 0.633 | 0.451 | 0.672 |
+ | **NetraEmbed** | **0.554** | **0.637** | **0.437** | **0.647** |
+ | ColQwen2-v1.0 | 0.545 | 0.640 | 0.438 | 0.653 |
+ | ColPali-v1.3 | 0.538 | 0.627 | 0.436 | 0.644 |
+
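For reference, NDCG@k (the headline metric in these tables) discounts each relevant hit by the log of its rank and normalizes by the ideal ordering. A small self-contained sketch of the binary-relevance case (our own illustration, not the benchmark's evaluation code):

```python
import math

def ndcg_at_k(relevances, k=5):
    """NDCG@k for a ranked list of binary relevance labels (1 = relevant, 0 = not)."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([1, 0, 0, 0, 0]))  # relevant doc at rank 1 -> 1.0
print(ndcg_at_k([0, 1, 0, 0, 0]))  # relevant doc at rank 2 -> 1/log2(3), about 0.631
```

So a score like 0.716 NDCG@5 roughly corresponds to the correct document landing at or near the top of the ranking for most queries.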
+ **Key Results:**
+ - 🏆 **State-of-the-art** on multilingual retrieval (0.716 NDCG@5 cross-lingual)
+ - 📈 **152% improvement** over ColPali-v1.3 on cross-lingual tasks
+ - 🌍 Consistent performance across **22 languages** and diverse scripts
+ - ⚡ **~250x more storage-efficient** than multi-vector approaches (~10 KB vs ~2.5 MB per document)
+
+ See our [paper](https://arxiv.org/abs/2512.03514) for comprehensive evaluation and per-language analysis.
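The storage figure follows from simple arithmetic. A quick sketch using the quoted numbers (the float32 assumption for the single 2560-dim vector is ours; the multi-vector cost is taken as the ~2.5 MB quoted above rather than derived):

```python
# Per-document storage: one 2560-dim float32 vector vs a quoted multi-vector index
single_vector_bytes = 2560 * 4          # 10,240 bytes, about 10 KB
multi_vector_bytes = 2.5 * 1024 * 1024  # ~2.5 MB per document (quoted figure)

print(f"single-vector: {single_vector_bytes / 1024:.1f} KB")      # 10.0 KB
print(f"ratio: {multi_vector_bytes / single_vector_bytes:.0f}x")  # 256x, i.e. the ~250x figure
```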
 
  ## Citation

  ## Acknowledgments

+ Compute credits for training, inference, and evaluation were provided by [Modal](https://modal.com), this project's compute sponsor. Dataset curation and synthesis were supported by the [Meta LLaMA Impact Grant](https://about.fb.com/news/2025/04/llama-impact-grant-recipients/?utm_source=AIatMeta&utm_medium=organic_social&utm_content=image&utm_campaign=llamacon) through our [Nayana initiative](https://www.cognitivelab.in/nayana). We thank Meta for its continued support of our research at [CognitiveLab](https://www.cognitivelab.in).
+
+ Built on top of the Gemma3 architecture with Matryoshka representation learning.