Update README.md
---
language:
- en
- es
- fr
- de
- it
- hi
- mr
- sa
- kn
- te
- ta
- ml
- zh
- ja
- ko
- ar
- bn
- gu
- or
- pa
- ru
- th
license: gemma
library_name: transformers
tags:
- vision-language
- retrieval
- dense vector
pipeline_tag: visual-document-retrieval
base_model:
- google/gemma-3-4b-it
model-index:
- name: NetraEmbed
  results:
  - task:
      type: image-text-retrieval
      name: Cross-Lingual Document Retrieval
    dataset:
      type: Cognitive-Lab/nayanair-bench
      name: Nayana-IR Cross-Lingual
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.716
      name: NDCG@5
    - type: recall_at_10
      value: 0.871
      name: Recall@10
    - type: map_at_10
      value: 0.703
      name: MAP@10
    - type: mrr_at_10
      value: 0.775
      name: MRR@10
  - task:
      type: image-text-retrieval
      name: Monolingual Document Retrieval
    dataset:
      type: Cognitive-Lab/nayanair-bench
      name: Nayana-IR Monolingual
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.738
      name: NDCG@5
    - type: recall_at_10
      value: 0.844
      name: Recall@10
    - type: map_at_10
      value: 0.709
      name: MAP@10
    - type: mrr_at_10
      value: 0.751
      name: MRR@10
  - task:
      type: image-text-retrieval
      name: English Document Retrieval
    dataset:
      type: vidore/vidore-benchmark
      name: ViDoRe v2
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.554
      name: NDCG@5
    - type: recall_at_10
      value: 0.637
      name: Recall@10
    - type: map_at_10
      value: 0.437
      name: MAP@10
    - type: mrr_at_10
      value: 0.647
      name: MRR@10
---

# NetraEmbed



[arXiv Paper](https://arxiv.org/abs/2512.03514)
[GitHub](https://github.com/adithya-s-k/colpali)
[Hugging Face](https://huggingface.co/Cognitive-Lab/NetraEmbed)
[Blog](https://www.cognitivelab.in/blog/introducing-netraembed)
[CognitiveLab Cloud](https://cloud.cognitivelab.in)

**NetraEmbed** is a state-of-the-art multilingual multimodal embedding model for visual document retrieval with Matryoshka representation learning, powered by the Gemma3 backbone.

## Model Description

## Usage

```python
import torch
from colpali_engine.models import BiGemma3, BiGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/NetraEmbed"

# Load model once (supports all Matryoshka dimensions)
model = BiGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = BiGemmaProcessor3.from_pretrained(model_name)
# Document images (PIL images) and text queries to embed
images = [...]  # e.g. rendered document pages
queries = [
    ...
]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_texts(queries).to(model.device)

# Choose the embedding dimension at inference time: 768, 1536, or 2560
# Use lower dims for faster search, higher for better accuracy
embedding_dim = 1536  # Balanced performance

with torch.no_grad():
    image_embeddings = model(**batch_images, embedding_dim=embedding_dim)  # Shape: (num_images, embedding_dim)
    query_embeddings = model(**batch_queries, embedding_dim=embedding_dim)  # Shape: (num_queries, embedding_dim)

# Compute similarity scores using cosine similarity
scores = processor.score(
    qs=query_embeddings,
    ps=image_embeddings,
)

for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.4f})")
```

### Testing Multiple Dimensions

You can test different embedding dimensions without reloading the model:

```python
# Load model once
model = BiGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# Test all Matryoshka dimensions
for embedding_dim in [768, 1536, 2560]:
    print(f"\nTesting dimension: {embedding_dim}")

    with torch.no_grad():
        image_embeddings = model(**batch_images, embedding_dim=embedding_dim)
        query_embeddings = model(**batch_queries, embedding_dim=embedding_dim)

    scores = processor.score(qs=query_embeddings, ps=image_embeddings)
    print(f"Scores shape: {scores.shape}")
    print(f"Best match score: {scores.max().item():.4f}")
```

## Matryoshka Embeddings

NetraEmbed supports three embedding dimensions that can be selected **at inference time**:

| Dimension | Use Case | Speed | Accuracy |
|-----------|----------|-------|----------|
| 768 | Fastest search | β‘β‘β‘ | ββ |
| 1536 | Balanced performance | β‘β‘ | βββ |
| 2560 | Maximum accuracy | β‘ | ββββ |

**Key Advantage:** Load the model once and choose the dimension dynamically at inference time. There is no need to reload the model to test different dimensions or to switch between accuracy/speed trade-offs.
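For intuition, Matryoshka representations are typically consumed by keeping only the first `k` components of the full vector and re-normalizing. The sketch below is a minimal plain-PyTorch illustration of that idea, not NetraEmbed's internal implementation; the random `full_embeddings` tensor is a stand-in for real model outputs.

```python
import torch
import torch.nn.functional as F

def truncate_matryoshka(full_embeddings: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the first `dim` components, then re-normalize for cosine similarity."""
    return F.normalize(full_embeddings[:, :dim], p=2, dim=-1)

# Stand-in for full 2560-dim embeddings (illustrative random data)
full_embeddings = torch.randn(4, 2560)

for dim in (768, 1536, 2560):
    emb = truncate_matryoshka(full_embeddings, dim)
    print(dim, tuple(emb.shape))  # (4, 768), (4, 1536), (4, 2560), all unit-norm
```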
|
| 221 |
## Use Cases
|
| 222 |
|
|
|
|
| 227 |
|
| 228 |
## Model Details
|
| 229 |
|
| 230 |
+
- **Base Model:** [Gemma3-4B-IT](https://huggingface.co/google/gemma-3-4b-it)
|
| 231 |
- **Vision Encoder:** SigLIP
|
| 232 |
- **Training Data:** Multilingual document datasets
|
| 233 |
- **Embedding Strategy:** Single-vector (BiEncoder)
|
| 234 |
- **Similarity Function:** Cosine similarity
|
| 235 |
- **Matryoshka Dimensions:** 768, 1536, 2560
|
| 236 |
|
|
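This is a small illustrative sketch of that reduction; the random tensors stand in for NetraEmbed outputs and assume the embeddings are L2-normalized, as cosine similarity implies.

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins for NetraEmbed outputs: one vector per query/page
query_embeddings = F.normalize(torch.randn(2, 1536), dim=-1)
image_embeddings = F.normalize(torch.randn(5, 1536), dim=-1)

# Cosine similarity of unit vectors is a plain dot product
scores = query_embeddings @ image_embeddings.T  # shape: (num_queries, num_images)
print(scores.argmax(dim=-1))  # best-matching page index per query
```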

## Performance

NetraEmbed achieves state-of-the-art performance on multilingual document retrieval benchmarks. It is evaluated on [Nayana-IR Bench](https://huggingface.co/collections/Cognitive-Lab/nayanair-bench) (22 languages) and ViDoRe v2.

### Benchmark Results

**Nayana-IR Cross-Lingual**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| **NetraEmbed** | **0.716** | **0.871** | **0.703** | **0.775** |
| Jina-Embeddings-v4 | 0.435 | 0.435 | 0.390 | 0.548 |
| ColNomic-Embed-3B | 0.315 | 0.320 | 0.267 | 0.444 |
| ColPali-v1.3 | 0.284 | 0.347 | 0.249 | 0.403 |
| GME-Qwen2-VL-2B | 0.235 | 0.308 | 0.209 | 0.314 |
| ColQwen2.5-v0.2 | 0.143 | 0.160 | 0.127 | 0.220 |
| ColQwen2-v1.0 | 0.050 | 0.065 | 0.038 | 0.109 |

**Nayana-IR Monolingual**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| **NetraEmbed** | **0.738** | **0.844** | **0.709** | **0.751** |
| ColNomic-Embed-3B | 0.534 | 0.603 | 0.515 | 0.546 |
| ColQwen2.5-v0.2 | 0.453 | 0.513 | 0.437 | 0.464 |
| GME-Qwen2-VL-2B | 0.444 | 0.525 | 0.426 | 0.452 |
| ColQwen2-v1.0 | 0.413 | 0.466 | 0.398 | 0.422 |
| ColPali-v1.3 | 0.410 | 0.484 | 0.393 | 0.422 |

**ViDoRe v2**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| ColQwen2.5-v0.2 | 0.592 | 0.664 | 0.484 | 0.711 |
| Jina-Embeddings-v4 | 0.576 | 0.686 | - | - |
| GME-Qwen2-VL-2B | 0.574 | 0.630 | 0.466 | 0.690 |
| ColNomic-Embed-3B | 0.556 | 0.633 | 0.451 | 0.672 |
| **NetraEmbed** | **0.554** | **0.637** | **0.437** | **0.647** |
| ColQwen2-v1.0 | 0.545 | 0.640 | 0.438 | 0.653 |
| ColPali-v1.3 | 0.538 | 0.627 | 0.436 | 0.644 |

**Key Results:**

- π **State-of-the-art** on multilingual retrieval (0.716 NDCG@5 cross-lingual)
- π **152% improvement** over ColPali-v1.3 on cross-lingual tasks
- π Consistent performance across **22 languages** and diverse scripts
- β‘ **250x more efficient** than multi-vector approaches (~10KB vs ~2.5MB per document); see the quick size check below
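As a sanity check on that last point, the per-document footprint can be estimated with simple arithmetic. The multi-vector settings below (patch count, per-vector dimension) are illustrative assumptions in the style of late-interaction models, not measured values.

```python
BYTES_PER_FLOAT32 = 4

# Single-vector: one 2560-dim float32 embedding per document
single_vector_bytes = 2560 * BYTES_PER_FLOAT32       # 10,240 B ~ 10 KB

# Multi-vector (assumed): ~5,000 patch vectors x 128 dims per document
multi_vector_bytes = 5000 * 128 * BYTES_PER_FLOAT32  # 2,560,000 B ~ 2.5 MB

print(f"single-vector: ~{single_vector_bytes / 1e3:.0f} KB per document")
print(f"multi-vector:  ~{multi_vector_bytes / 1e6:.2f} MB per document")
print(f"ratio: ~{multi_vector_bytes / single_vector_bytes:.0f}x")  # ~250x
```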

See our [paper](https://arxiv.org/abs/2512.03514) for comprehensive evaluation and per-language analysis.

## Citation

## License

This model is released under the same license as the base Gemma3 model.

## Acknowledgments

This work benefited from compute credits for training, inference, and evaluation provided by [Modal](https://modal.com), acknowledged as a compute sponsor. Dataset curation and synthesis were supported by the [Meta LLaMA Impact Grant](https://about.fb.com/news/2025/04/llama-impact-grant-recipients/?utm_source=AIatMeta&utm_medium=organic_social&utm_content=image&utm_campaign=llamacon) through our [Nayana initiative](https://www.cognitivelab.in/nayana). We appreciate Meta's continued support of our research efforts at [CognitiveLab](https://www.cognitivelab.in).

Built on top of the Gemma3 architecture with Matryoshka representation learning.