Update README.md
---
language:
- en
- es
- fr
- de
- it
- hi
- mr
- sa
- kn
- te
- ta
- ml
- zh
- ja
- ko
- ar
- bn
- gu
- or
- pa
- ru
- th
license: gemma
library_name: transformers
tags:
- vision-language
- retrieval
- dense vector
pipeline_tag: visual-document-retrieval
base_model:
- google/gemma-3-4b-it
model-index:
- name: NetraEmbed
  results:
  - task:
      type: image-text-retrieval
      name: Cross-Lingual Document Retrieval
    dataset:
      type: Cognitive-Lab/nayanair-bench
      name: Nayana-IR Cross-Lingual
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.716
      name: NDCG@5
    - type: recall_at_10
      value: 0.871
      name: Recall@10
    - type: map_at_10
      value: 0.703
      name: MAP@10
    - type: mrr_at_10
      value: 0.775
      name: MRR@10
  - task:
      type: image-text-retrieval
      name: Monolingual Document Retrieval
    dataset:
      type: Cognitive-Lab/nayanair-bench
      name: Nayana-IR Monolingual
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.738
      name: NDCG@5
    - type: recall_at_10
      value: 0.844
      name: Recall@10
    - type: map_at_10
      value: 0.709
      name: MAP@10
    - type: mrr_at_10
      value: 0.751
      name: MRR@10
  - task:
      type: image-text-retrieval
      name: English Document Retrieval
    dataset:
      type: vidore/vidore-benchmark
      name: ViDoRe v2
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.554
      name: NDCG@5
    - type: recall_at_10
      value: 0.637
      name: Recall@10
    - type: map_at_10
      value: 0.437
      name: MAP@10
    - type: mrr_at_10
      value: 0.647
      name: MRR@10
---

# NetraEmbed



[arXiv Paper](https://arxiv.org/abs/2512.03514)
[GitHub](https://github.com/adithya-s-k/colpali)
[Hugging Face](https://huggingface.co/Cognitive-Lab/NetraEmbed)
[Blog](https://www.cognitivelab.in/blog/introducing-netraembed)
[CognitiveLab Cloud](https://cloud.cognitivelab.in)

**NetraEmbed** is a state-of-the-art multilingual multimodal embedding model for visual document retrieval with Matryoshka representation learning, powered by the Gemma3 backbone.

## Model Description

## Usage

```python
import torch
from colpali_engine.models import BiGemma3, BiGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/NetraEmbed"

# Load model once (supports all Matryoshka dimensions)
model = BiGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = BiGemmaProcessor3.from_pretrained(model_name)
# Document images (PIL images) and text queries to embed
images = [...]  # e.g. rendered document pages
queries = [
    ...
]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_texts(queries).to(model.device)

# Choose the embedding dimension at inference time: 768, 1536, or 2560
# Use lower dims for faster search, higher for better accuracy
embedding_dim = 1536  # Balanced performance

with torch.no_grad():
    image_embeddings = model(**batch_images, embedding_dim=embedding_dim)  # Shape: (num_images, embedding_dim)
    query_embeddings = model(**batch_queries, embedding_dim=embedding_dim)  # Shape: (num_queries, embedding_dim)

# Compute similarity scores using cosine similarity
scores = processor.score(
    qs=query_embeddings,
    ps=image_embeddings,
)

for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.4f})")
```

### Testing Multiple Dimensions

You can test different embedding dimensions without reloading the model:

```python
# Load model once
model = BiGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# Test all Matryoshka dimensions
for embedding_dim in [768, 1536, 2560]:
    print(f"\nTesting dimension: {embedding_dim}")

    with torch.no_grad():
        image_embeddings = model(**batch_images, embedding_dim=embedding_dim)
        query_embeddings = model(**batch_queries, embedding_dim=embedding_dim)

    scores = processor.score(qs=query_embeddings, ps=image_embeddings)
    print(f"Scores shape: {scores.shape}")
    print(f"Best match score: {scores.max().item():.4f}")
```

## Matryoshka Embeddings

NetraEmbed supports three embedding dimensions that can be selected **at inference time**:

| Dimension | Use Case | Speed | Accuracy |
|-----------|----------|-------|----------|
| 768 | Fastest search | β‘β‘β‘ | ββ |
| 1536 | Balanced performance | β‘β‘ | βββ |
| 2560 | Maximum accuracy | β‘ | ββββ |

**Key Advantage:** Load the model once and choose the dimension dynamically at inference time. There is no need to reload the model to test different dimensions or to switch between accuracy/speed trade-offs.
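For intuition, Matryoshka representations are typically consumed by keeping only the first `k` components of the full vector and re-normalizing. The sketch below is a minimal plain-PyTorch illustration of that idea, not NetraEmbed's internal implementation; the random `full_embeddings` tensor is a stand-in for real model outputs.

```python
import torch
import torch.nn.functional as F

def truncate_matryoshka(full_embeddings: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the first `dim` components, then re-normalize for cosine similarity."""
    return F.normalize(full_embeddings[:, :dim], p=2, dim=-1)

# Stand-in for full 2560-dim embeddings (illustrative random data)
full_embeddings = torch.randn(4, 2560)

for dim in (768, 1536, 2560):
    emb = truncate_matryoshka(full_embeddings, dim)
    print(dim, tuple(emb.shape))  # (4, 768), (4, 1536), (4, 2560), all unit-norm
```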
|
| 221 |
## Use Cases
|
| 222 |
|
|
|
|
| 227 |
|
| 228 |
## Model Details
|
| 229 |
|
| 230 |
+
- **Base Model:** [Gemma3-4B-IT](https://huggingface.co/google/gemma-3-4b-it)
|
| 231 |
- **Vision Encoder:** SigLIP
|
| 232 |
- **Training Data:** Multilingual document datasets
|
| 233 |
- **Embedding Strategy:** Single-vector (BiEncoder)
|
| 234 |
- **Similarity Function:** Cosine similarity
|
| 235 |
- **Matryoshka Dimensions:** 768, 1536, 2560
|
| 236 |
|
|
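This is a small illustrative sketch of that reduction; the random tensors stand in for NetraEmbed outputs and assume the embeddings are L2-normalized, as cosine similarity implies.

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins for NetraEmbed outputs: one vector per query/page
query_embeddings = F.normalize(torch.randn(2, 1536), dim=-1)
image_embeddings = F.normalize(torch.randn(5, 1536), dim=-1)

# Cosine similarity of unit vectors is a plain dot product
scores = query_embeddings @ image_embeddings.T  # shape: (num_queries, num_images)
print(scores.argmax(dim=-1))  # best-matching page index per query
```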

## Performance

NetraEmbed achieves state-of-the-art performance on multilingual document retrieval benchmarks. It is evaluated on [Nayana-IR Bench](https://huggingface.co/collections/Cognitive-Lab/nayanair-bench) (22 languages) and ViDoRe v2.

### Benchmark Results

**Nayana-IR Cross-Lingual**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| **NetraEmbed** | **0.716** | **0.871** | **0.703** | **0.775** |
| Jina-Embeddings-v4 | 0.435 | 0.435 | 0.390 | 0.548 |
| ColNomic-Embed-3B | 0.315 | 0.320 | 0.267 | 0.444 |
| ColPali-v1.3 | 0.284 | 0.347 | 0.249 | 0.403 |
| GME-Qwen2-VL-2B | 0.235 | 0.308 | 0.209 | 0.314 |
| ColQwen2.5-v0.2 | 0.143 | 0.160 | 0.127 | 0.220 |
| ColQwen2-v1.0 | 0.050 | 0.065 | 0.038 | 0.109 |

**Nayana-IR Monolingual**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| **NetraEmbed** | **0.738** | **0.844** | **0.709** | **0.751** |
| ColNomic-Embed-3B | 0.534 | 0.603 | 0.515 | 0.546 |
| ColQwen2.5-v0.2 | 0.453 | 0.513 | 0.437 | 0.464 |
| GME-Qwen2-VL-2B | 0.444 | 0.525 | 0.426 | 0.452 |
| ColQwen2-v1.0 | 0.413 | 0.466 | 0.398 | 0.422 |
| ColPali-v1.3 | 0.410 | 0.484 | 0.393 | 0.422 |

**ViDoRe v2**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| ColQwen2.5-v0.2 | 0.592 | 0.664 | 0.484 | 0.711 |
| Jina-Embeddings-v4 | 0.576 | 0.686 | - | - |
| GME-Qwen2-VL-2B | 0.574 | 0.630 | 0.466 | 0.690 |
| ColNomic-Embed-3B | 0.556 | 0.633 | 0.451 | 0.672 |
| **NetraEmbed** | **0.554** | **0.637** | **0.437** | **0.647** |
| ColQwen2-v1.0 | 0.545 | 0.640 | 0.438 | 0.653 |
| ColPali-v1.3 | 0.538 | 0.627 | 0.436 | 0.644 |

**Key Results:**

- π **State-of-the-art** on multilingual retrieval (0.716 NDCG@5 cross-lingual)
- π **152% improvement** over ColPali-v1.3 on cross-lingual tasks
- π Consistent performance across **22 languages** and diverse scripts
- β‘ **250x more efficient** than multi-vector approaches (~10KB vs ~2.5MB per document); see the quick size check below
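As a sanity check on that last point, the per-document footprint can be estimated with simple arithmetic. The multi-vector settings below (patch count, per-vector dimension) are illustrative assumptions in the style of late-interaction models, not measured values.

```python
BYTES_PER_FLOAT32 = 4

# Single-vector: one 2560-dim float32 embedding per document
single_vector_bytes = 2560 * BYTES_PER_FLOAT32       # 10,240 B ~ 10 KB

# Multi-vector (assumed): ~5,000 patch vectors x 128 dims per document
multi_vector_bytes = 5000 * 128 * BYTES_PER_FLOAT32  # 2,560,000 B ~ 2.5 MB

print(f"single-vector: ~{single_vector_bytes / 1e3:.0f} KB per document")
print(f"multi-vector:  ~{multi_vector_bytes / 1e6:.2f} MB per document")
print(f"ratio: ~{multi_vector_bytes / single_vector_bytes:.0f}x")  # ~250x
```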

See our [paper](https://arxiv.org/abs/2512.03514) for comprehensive evaluation and per-language analysis.

## Citation

## License

This model is released under the same license as the base Gemma3 model.

## Acknowledgments

This work benefited from compute credits for training, inference, and evaluation provided by [Modal](https://modal.com), acknowledged as a compute sponsor. Dataset curation and synthesis were supported by the [Meta LLaMA Impact Grant](https://about.fb.com/news/2025/04/llama-impact-grant-recipients/?utm_source=AIatMeta&utm_medium=organic_social&utm_content=image&utm_campaign=llamacon) through our [Nayana initiative](https://www.cognitivelab.in/nayana). We appreciate Meta's continued support of our research efforts at [CognitiveLab](https://www.cognitivelab.in).

Built on top of the Gemma3 architecture with Matryoshka representation learning.