AdithyaSK committed on
Commit 7ea8536 · verified · 1 Parent(s): f73a239

Update README.md
Files changed (1): README.md (+156 −1)
README.md CHANGED
README.md
@@ -8,4 +8,159 @@ pipeline_tag: visual-document-retrieval
library_name: transformers
---

# NetraEmbed

**NetraEmbed** is a state-of-the-art multilingual multimodal embedding model for visual document retrieval with Matryoshka representation learning, powered by the Gemma3 backbone.

## Model Description

NetraEmbed encodes both visual documents and text queries into single dense vectors. It supports multiple languages and enables efficient similarity search at three embedding dimensions (768, 1536, and 2560) through Matryoshka representation learning.

- **Model Type:** Multilingual multimodal embedding model with Matryoshka embeddings
- **Architecture:** BiEncoder with a Gemma3-2B backbone
- **Embedding Dimensions:** 768, 1536, 2560 (Matryoshka)
- **Capabilities:** Multilingual, multimodal (vision + text)
- **Use Cases:** Visual document retrieval, multilingual semantic search, cross-lingual document understanding

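Because each query and document becomes a single L2-normalized vector scored by cosine similarity, retrieval reduces to one matrix product. A minimal NumPy sketch of this scoring rule (random vectors stand in for real model embeddings; this is illustrative, not part of the NetraEmbed API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoded queries and documents (real embeddings come from the model).
queries = rng.standard_normal((2, 768)).astype(np.float32)
docs = rng.standard_normal((5, 768)).astype(np.float32)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize each row so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Cosine similarity between every query and every document: one matrix product.
scores = l2_normalize(queries) @ l2_normalize(docs).T  # shape (2, 5)
best = scores.argmax(axis=1)  # best document index per query
```

This single-vector scoring is what lets NetraEmbed plug directly into standard vector databases.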
## Paper

📄 **[M3DR: Towards Universal Multilingual Multimodal Document Retrieval](https://arxiv.org/abs/2512.03514)**

## Installation

```bash
pip install git+https://github.com/adithya-s-k/colpali.git
```

## Quick Start

```python
import torch
from PIL import Image
from colpali_engine.models import BiGemma3, BiGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/NetraEmbed"

# Choose embedding dimension: 768, 1536, or 2560
embedding_dim = 1536  # Lower dims give faster search; higher dims give better accuracy

model = BiGemma3.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="cuda",
    embedding_dim=embedding_dim,  # Matryoshka dimension
)
processor = BiGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_texts(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)  # Shape: (num_images, embedding_dim)
    query_embeddings = model(**batch_queries)  # Shape: (num_queries, embedding_dim)

# Compute similarity scores using cosine similarity
scores = processor.score(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.4f})")
```

## Matryoshka Embeddings

NetraEmbed supports three embedding dimensions:

| Dimension | Use Case | Speed | Accuracy |
|-----------|----------|-------|----------|
| 768 | Fast search at large scale | ⚡⚡⚡ | ⭐⭐ |
| 1536 | Balanced performance | ⚡⚡ | ⭐⭐⭐ |
| 2560 | Maximum accuracy | ⚡ | ⭐⭐⭐⭐ |

Choose the dimension that best fits your latency and accuracy requirements. Because the dimensions are nested, you can switch between them without retraining: a smaller embedding is a truncated, re-normalized prefix of the larger one.

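The nested property can be sketched in NumPy. Here a random array stands in for a batch of full 2560-dimensional embeddings; the helper is illustrative, not part of the library API:

```python
import numpy as np

# Stand-in for a batch of full 2560-dim embeddings (random, for illustration).
full = np.random.randn(2, 2560).astype(np.float32)

def truncate_matryoshka(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and re-L2-normalize,
    so cosine similarity remains meaningful at the smaller dimension."""
    prefix = emb[:, :dim]
    return prefix / np.linalg.norm(prefix, axis=1, keepdims=True)

small = truncate_matryoshka(full, 768)
print(small.shape)  # (2, 768), each row has unit norm
```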
## Use Cases

- **Efficient Document Retrieval:** Fast search through millions of documents
- **Semantic Search:** Find visually similar documents
- **Scalable Vector Search:** Works with FAISS, Milvus, Pinecone, and other vector databases
- **Cross-lingual Retrieval:** Multilingual visual document search

## Model Details

- **Base Model:** Gemma3-2B
- **Vision Encoder:** SigLIP
- **Training Data:** Multilingual document datasets
- **Embedding Strategy:** Single-vector (BiEncoder)
- **Similarity Function:** Cosine similarity
- **Matryoshka Dimensions:** 768, 1536, 2560

## Integration with Vector Databases

NetraEmbed works seamlessly with popular vector databases:

```python
import faiss
import numpy as np

# Create FAISS index
dimension = 1536
index = faiss.IndexFlatIP(dimension)  # Inner product over normalized vectors = cosine similarity

# Add image embeddings to the index (cast bfloat16 to float32 before numpy conversion)
embeddings_np = image_embeddings.float().cpu().numpy()
faiss.normalize_L2(embeddings_np)  # L2-normalize so inner product equals cosine similarity
index.add(embeddings_np)

# Search (queries must be normalized the same way)
query_np = query_embeddings[0:1].float().cpu().numpy()
faiss.normalize_L2(query_np)
k = 5  # Top 5 results
distances, indices = index.search(query_np, k)

print(f"Top {k} matches:", indices[0])
print("Scores:", distances[0])
```

## Performance

NetraEmbed achieves competitive performance on visual document retrieval benchmarks while being significantly faster than multi-vector approaches. See our [paper](https://arxiv.org/abs/2512.03514) for detailed evaluation.
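
The speed gap comes from the scoring step: a single-vector bi-encoder scores each query-document pair with one dot product, while late-interaction multi-vector retrievers (ColBERT/ColPali-style MaxSim) compare every query token against every document patch. A rough NumPy sketch of the two scoring rules, with random stand-in vectors and illustrative token/patch counts:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128

# Single-vector scoring: one vector per side, one dot product per pair.
q_vec = rng.standard_normal(dim)
d_vec = rng.standard_normal(dim)
single_score = float(q_vec @ d_vec)

# Multi-vector (late interaction) scoring: for each of 32 query tokens,
# take the max similarity over 1024 document patches, then sum over tokens.
q_tokens = rng.standard_normal((32, dim))
d_patches = rng.standard_normal((1024, dim))
maxsim_score = float((q_tokens @ d_patches.T).max(axis=1).sum())

# The MaxSim rule costs 32 * 1024 dot products per pair versus 1,
# which is why single-vector search scales so much better.
```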

## Citation

```bibtex
@misc{kolavi2025m3druniversalmultilingualmultimodal,
      title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval},
      author={Adithya S Kolavi and Vyoman Jain},
      year={2025},
      eprint={2512.03514},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2512.03514}
}
```

## License

This model is released under the same license as the base Gemma3 model.

## Acknowledgments

Built on top of the Gemma3 architecture with Matryoshka representation learning.