---
language:
- en
license: gemma
library_name: transformers
tags:
- vision-language
- retrieval
- colbert
- late-interaction
pipeline_tag: visual-document-retrieval
base_model:
- google/gemma-3-4b-it
---

# ColNetraEmbed

**ColNetraEmbed** is a state-of-the-art multilingual, multimodal embedding model for visual document retrieval, built on a Gemma3 backbone and using ColBERT-style multi-vector representations.

## Model Description

ColNetraEmbed is a multilingual multimodal embedding model that encodes documents as multi-vector representations using the ColPali architecture. Each image patch is mapped to a contextualized embedding, enabling fine-grained matching between visual content and text queries through late interaction (MaxSim).

- **Model Type:** Multilingual multimodal embedding model with ColPali-style multi-vector representations
- **Architecture:** ColPali-style late interaction with a Gemma3-4B backbone
- **Embedding Dimension:** 128 per token
- **Capabilities:** Multilingual, Multimodal (Vision + Text), Multi-vector late interaction
- **Use Case:** Visual document retrieval, multilingual document understanding, fine-grained visual search
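
The late-interaction (MaxSim) scoring mentioned above can be sketched in a few lines of PyTorch. This is an illustrative helper showing what a function like `score_multi_vector` computes under the standard ColBERT formulation, not the library's implementation; `maxsim_scores` is a hypothetical name.

```python
import torch

def maxsim_scores(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) scoring.

    query_embs: (num_queries, num_query_tokens, dim)
    doc_embs:   (num_docs, num_doc_tokens, dim)
    Returns:    (num_queries, num_docs) similarity scores.
    """
    # Pairwise token similarities: (num_queries, num_docs, q_tokens, d_tokens)
    sim = torch.einsum("qnd,pmd->qpnm", query_embs, doc_embs)
    # For each query token, keep its best-matching document token, then sum
    # over query tokens to get one score per (query, document) pair.
    return sim.max(dim=-1).values.sum(dim=-1)
```

Because each query token is matched against every document patch independently, a query about "total revenue" can align with just the relevant table cells in a page image rather than a single pooled vector for the whole page.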

## Paper

📄 **[M3DR: Towards Universal Multilingual Multimodal Document Retrieval](https://arxiv.org/abs/2512.03514)**

## Installation

```bash
pip install git+https://github.com/adithya-s-k/colpali.git
```

## Quick Start

```python
import torch
from PIL import Image
from colpali_engine.models import ColGemma3, ColGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/ColNetraEmbed"
model = ColGemma3.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="cuda",
)
processor = ColGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)  # Shape: (num_images, num_patches, 128)
    query_embeddings = model(**batch_queries)  # Shape: (num_queries, num_tokens, 128)

# Compute similarity scores using MaxSim
scores = processor.score_multi_vector(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.2f})")
```

## Use Cases

- **Document Retrieval:** Search through large collections of visual documents
- **Visual Question Answering:** Answer questions about document content
- **Document Understanding:** Extract and match information from scanned documents
- **Cross-lingual Document Search:** Multilingual visual document retrieval

## Model Details

- **Base Model:** Gemma3-4B (`google/gemma-3-4b-it`)
- **Vision Encoder:** SigLIP
- **Training Data:** Multilingual document datasets
- **Embedding Strategy:** Multi-vector (Late Interaction)
- **Similarity Function:** MaxSim (Maximum Similarity)

## Performance

ColNetraEmbed achieves state-of-the-art results on visual document retrieval benchmarks. See our [paper](https://arxiv.org/abs/2512.03514) for detailed evaluation metrics.

## Citation

```bibtex
@misc{kolavi2025m3druniversalmultilingualmultimodal,
  title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval}, 
  author={Adithya S Kolavi and Vyoman Jain},
  year={2025},
  eprint={2512.03514},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2512.03514}
}
```

## License

This model is released under the same license as the base Gemma3 model.

## Acknowledgments

Built on top of the ColPali framework and Gemma3 architecture.