---
pipeline_tag: visual-document-retrieval
library_name: transformers
---
# NetraEmbed

**NetraEmbed** is a state-of-the-art multilingual, multimodal embedding model for visual document retrieval. It is built on a Gemma3 backbone and uses Matryoshka representation learning to serve embeddings at multiple dimensions.

## Model Description

NetraEmbed encodes both visual documents and text queries into single dense vectors. It supports multiple languages and enables efficient similarity search at three embedding dimensions (768, 1536, and 2560) through Matryoshka representation learning.

- **Model Type:** Multilingual, multimodal embedding model with Matryoshka embeddings
- **Architecture:** BiEncoder with Gemma3-2B backbone
- **Embedding Dimensions:** 768, 1536, 2560 (Matryoshka)
- **Capabilities:** Multilingual, multimodal (vision + text)
- **Use Cases:** Visual document retrieval, multilingual semantic search, cross-lingual document understanding

## Paper

📄 **[M3DR: Towards Universal Multilingual Multimodal Document Retrieval](https://arxiv.org/abs/2512.03514)**

## Installation

```bash
pip install git+https://github.com/adithya-s-k/colpali.git
```

## Quick Start

```python
import torch
from PIL import Image
from colpali_engine.models import BiGemma3, BiGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/NetraEmbed"

# Choose embedding dimension: 768, 1536, or 2560
embedding_dim = 1536  # Use lower dims for faster search, higher for better accuracy

model = BiGemma3.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="cuda",
    embedding_dim=embedding_dim,  # Matryoshka dimension
)
processor = BiGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_texts(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)  # Shape: (num_images, embedding_dim)
    query_embeddings = model(**batch_queries)  # Shape: (num_queries, embedding_dim)

# Compute similarity scores using cosine similarity
scores = processor.score(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.4f})")
```
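
Under the hood, single-vector (BiEncoder) scoring is just a cosine-similarity matrix between the two sets of vectors. Here is a minimal equivalent sketch, assuming one vector per input and normalizing explicitly in case the raw outputs are not unit-length (`processor.score` remains the supported path):

```python
import torch.nn.functional as F

# L2-normalize both sides; a matrix product then yields cosine similarities.
q = F.normalize(query_embeddings.float(), dim=-1)  # (num_queries, dim)
p = F.normalize(image_embeddings.float(), dim=-1)  # (num_images, dim)
manual_scores = q @ p.T                            # (num_queries, num_images)
```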

## Matryoshka Embeddings

NetraEmbed supports three embedding dimensions:

| Dimension | Use Case | Speed | Accuracy |
|-----------|----------|-------|----------|
| 768 | Fast search, large-scale | ⚡⚡⚡ | ⭐⭐ |
| 1536 | Balanced performance | ⚡⚡ | ⭐⭐⭐ |
| 2560 | Maximum accuracy | ⚡ | ⭐⭐⭐⭐ |

Choose the dimension that best fits your latency and accuracy requirements. You can even switch dimensions without retraining, as sketched below.
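
If you already hold full-dimension vectors, a smaller Matryoshka dimension is obtained by truncating and re-normalizing. This is a sketch based on standard Matryoshka practice, not a documented NetraEmbed API; passing `embedding_dim` to `from_pretrained`, as in Quick Start, is the built-in route:

```python
import torch
import torch.nn.functional as F

def truncate_matryoshka(embeddings: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the first `dim` components and re-normalize to unit length."""
    assert dim in (768, 1536, 2560)
    return F.normalize(embeddings[..., :dim].float(), dim=-1)

# e.g. reuse 2560-dim vectors for a faster 768-dim index
small_image_embeddings = truncate_matryoshka(image_embeddings, 768)
small_query_embeddings = truncate_matryoshka(query_embeddings, 768)
```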

## Use Cases

- **Efficient Document Retrieval:** Fast search through millions of documents
- **Semantic Search:** Find visually similar documents
- **Scalable Vector Search:** Works with FAISS, Milvus, Pinecone, etc.
- **Cross-lingual Retrieval:** Multilingual visual document search

## Model Details

- **Base Model:** Gemma3-2B
- **Vision Encoder:** SigLIP
- **Training Data:** Multilingual document datasets
- **Embedding Strategy:** Single-vector (BiEncoder)
- **Similarity Function:** Cosine similarity
- **Matryoshka Dimensions:** 768, 1536, 2560

## Integration with Vector Databases

NetraEmbed works seamlessly with popular vector databases:

```python
import faiss

# Create FAISS index
dimension = 1536
index = faiss.IndexFlatIP(dimension)  # Inner product over unit vectors = cosine similarity

# Add image embeddings to the index (FAISS expects float32;
# bfloat16 tensors cannot be converted to NumPy directly)
embeddings_np = image_embeddings.cpu().float().numpy()
faiss.normalize_L2(embeddings_np)  # L2-normalize so inner product equals cosine similarity
index.add(embeddings_np)

# Search with the first query, normalized the same way
query_np = query_embeddings[0:1].cpu().float().numpy()
faiss.normalize_L2(query_np)
k = 5  # Top 5 results
distances, indices = index.search(query_np, k)

print(f"Top {k} matches:", indices[0])
print("Scores:", distances[0])
```
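
For corpora in the millions, an approximate index is usually preferable to the exact `IndexFlatIP` above. A hypothetical HNSW variant follows; the parameter values are illustrative assumptions, so tune them against the FAISS documentation for your corpus:

```python
# Reuses the float32, L2-normalized vectors from the previous block.
hnsw = faiss.IndexHNSWFlat(dimension, 32, faiss.METRIC_INNER_PRODUCT)
hnsw.hnsw.efSearch = 64  # higher = better recall, slower queries
hnsw.add(embeddings_np)
distances, indices = hnsw.search(query_np, k)
```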

## Performance

NetraEmbed achieves competitive performance on visual document retrieval benchmarks while being significantly faster than multi-vector approaches. See our [paper](https://arxiv.org/abs/2512.03514) for detailed evaluation.

## Citation

```bibtex
@misc{kolavi2025m3druniversalmultilingualmultimodal,
      title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval},
      author={Adithya S Kolavi and Vyoman Jain},
      year={2025},
      eprint={2512.03514},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2512.03514}
}
```

## License

This model is released under the same license as the base Gemma3 model.

## Acknowledgments

Built on top of the Gemma3 architecture with Matryoshka representation learning.