---
language:
- en
- es
- fr
- de
- it
- hi
- mr
- sa
- kn
- te
- ta
- ml
- zh
- ja
- ko
- ar
- bn
- gu
- or
- pa
- ru
- th
license: gemma
library_name: transformers
tags:
- vision-language
- retrieval
- colbert
- late-interaction
- multimodal
- multilingual
- document-retrieval
- 22-languages
pipeline_tag: visual-document-retrieval
base_model:
- google/gemma-3-4b-it
datasets:
- Cognitive-Lab/nayanair-bench
model-index:
- name: ColNetraEmbed
results:
- task:
type: image-text-retrieval
name: Cross-Lingual Document Retrieval
dataset:
type: Cognitive-Lab/nayanair-bench
name: Nayana-IR Cross-Lingual
split: test
metrics:
- type: ndcg_at_5
value: 0.637
name: NDCG@5
- type: recall_at_10
value: 0.700
name: Recall@10
- type: map_at_10
value: 0.610
name: MAP@10
- type: mrr_at_10
value: 0.610
name: MRR@10
- task:
type: image-text-retrieval
name: Monolingual Document Retrieval
dataset:
type: Cognitive-Lab/nayanair-bench
name: Nayana-IR Monolingual
split: test
metrics:
- type: ndcg_at_5
value: 0.670
name: NDCG@5
- type: recall_at_10
value: 0.764
name: Recall@10
- type: map_at_10
value: 0.645
name: MAP@10
- type: mrr_at_10
value: 0.686
name: MRR@10
- task:
type: image-text-retrieval
name: English Document Retrieval
dataset:
type: vidore/vidore-benchmark
name: ViDoRe v2
split: test
metrics:
- type: ndcg_at_5
value: 0.551
name: NDCG@5
- type: recall_at_10
value: 0.664
name: Recall@10
- type: map_at_10
value: 0.445
name: MAP@10
- type: mrr_at_10
value: 0.445
name: MRR@10
---
# ColNetraEmbed
![Group 54 (1)](https://cdn-uploads.huggingface.co/production/uploads/6442d975ad54813badc1ddf7/-fYMikXhSuqRqm-UIdulK.png)
[![Paper](https://img.shields.io/badge/arXiv-2512.03514-b31b1b.svg)](https://arxiv.org/abs/2512.03514)
[![GitHub](https://img.shields.io/badge/GitHub-colpali-181717?logo=github)](https://github.com/adithya-s-k/colpali)
[![Model](https://img.shields.io/badge/πŸ€—%20HuggingFace-Model-yellow)](https://huggingface.co/Cognitive-Lab/ColNetraEmbed)
[![Blog](https://img.shields.io/badge/Blog-CognitiveLab-blue)](https://www.cognitivelab.in/blog/introducing-netraembed)
[![Demo](https://img.shields.io/badge/Demo-Try%20it%20out-green)](https://huggingface.co/spaces/AdithyaSK/NetraEmbed)
[![Colab](https://img.shields.io/badge/Colab-Run%20Code-F9AB00?logo=googlecolab&logoColor=white)](https://huggingface.co/Cognitive-Lab/ColNetraEmbed/blob/main/ColNetraEmbed_InferenceDemo.ipynb)
[![Colab](https://img.shields.io/badge/Colab-Gradio%20Demo-F9AB00?logo=googlecolab&logoColor=white)](https://huggingface.co/Cognitive-Lab/NetraEmbed/blob/main/NetraEmbed_Gradio_Demo_final.ipynb)
**ColNetraEmbed** is a state-of-the-art multilingual, multimodal embedding model for visual document retrieval, built on a Gemma3 backbone and using ColBERT-style multi-vector representations.
## Model Description
ColNetraEmbed is a multilingual multimodal embedding model that encodes documents as multi-vector representations using the ColPali architecture. Each image patch is mapped to a contextualized embedding, enabling fine-grained matching between visual content and text queries through late interaction (MaxSim); a minimal sketch of this scoring step follows the list below.
- **Model Type:** Multilingual Multimodal Embedding Model with ColPali-style Multi-vector representations
- **Architecture:** ColPali with Gemma3-4B backbone
- **Embedding Dimension:** 128 per token
- **Capabilities:** Multilingual, Multimodal (Vision + Text), Multi-vector late interaction
- **Use Case:** Visual document retrieval, multilingual document understanding, fine-grained visual search
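To make the late-interaction step concrete, here is a minimal sketch of MaxSim scoring for a single query-document pair. It mirrors what `processor.score_multi_vector` computes in the Quick Start below; the standalone function and its shapes are illustrative, not part of the library API:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score for one query against one document.

    query_emb: (num_query_tokens, 128) contextualized query token embeddings
    doc_emb:   (num_doc_patches, 128) contextualized image patch embeddings
    """
    # Similarity of every query token against every document patch.
    sim = query_emb @ doc_emb.T  # (num_query_tokens, num_doc_patches)
    # Each query token keeps its best-matching patch; sum over query tokens.
    return sim.max(dim=1).values.sum()
```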
## Paper
πŸ“„ **[M3DR: Towards Universal Multilingual Multimodal Document Retrieval](https://arxiv.org/abs/2512.03514)**
## Installation
```bash
pip install git+https://github.com/adithya-s-k/colpali.git
```
## Quick Start
```python
import torch
from PIL import Image

from colpali_engine.models import ColGemma3, ColGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/ColNetraEmbed"
model = ColGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = ColGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)  # Shape: (num_images, num_patches, 128)
    query_embeddings = model(**batch_queries)  # Shape: (num_queries, num_tokens, 128)

# Compute similarity scores using MaxSim
scores = processor.score_multi_vector(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.2f})")
```
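For larger collections, encoding every page in a single forward pass will exhaust GPU memory. Below is a minimal batching sketch that reuses the `model` and `processor` loaded above; the folder path, glob pattern, and batch size are illustrative assumptions, not part of the library API:

```python
from pathlib import Path

import torch
from PIL import Image

def embed_corpus(image_paths, batch_size=4):
    """Encode page images in small batches, keeping embeddings on CPU."""
    page_embeddings = []
    for start in range(0, len(image_paths), batch_size):
        batch = [Image.open(p) for p in image_paths[start:start + batch_size]]
        inputs = processor.process_images(batch).to(model.device)
        with torch.no_grad():
            embeddings = model(**inputs)  # (batch, num_patches, 128)
        page_embeddings.extend(emb.cpu() for emb in embeddings)
    return page_embeddings

corpus_paths = sorted(Path("pages").glob("*.jpg"))  # hypothetical corpus folder
corpus_embeddings = embed_corpus(corpus_paths)
```

The resulting per-page tensors can then be passed as `ps` to `processor.score_multi_vector` at query time.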
## Use Cases
- **Document Retrieval:** Search through large collections of visual documents
- **Visual Question Answering:** Answer questions about document content
- **Document Understanding:** Extract and match information from scanned documents
- **Cross-lingual Document Search:** Multilingual visual document retrieval
## Model Details
- **Base Model:** [Gemma3-4B-IT](https://huggingface.co/google/gemma-3-4b-it)
- **Vision Encoder:** SigLIP
- **Training Data:** Multilingual document datasets
- **Embedding Strategy:** Multi-vector (Late Interaction)
- **Similarity Function:** MaxSim (maximum similarity; formalized below)
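Concretely, the MaxSim score between a query with m token embeddings and a document with n patch embeddings is the standard ColBERT late-interaction formulation (the same operation sketched in the Model Description above):

```latex
s(q, d) = \sum_{i=1}^{m} \max_{1 \le j \le n} \langle q_i, d_j \rangle
```

Each query token is matched to its most similar patch, and the per-token maxima are summed.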
## Performance
ColNetraEmbed achieves strong performance on multilingual document retrieval benchmarks, evaluated on [Nayana-IR Bench](https://huggingface.co/collections/Cognitive-Lab/nayanair-bench) (22 languages) and on ViDoRe v2.
### Benchmark Results
**Nayana-IR Cross-Lingual**
| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| **ColNetraEmbed** | **0.637** | **0.700** | **0.610** | **0.610** |
| Jina-Embeddings-v4 | 0.435 | 0.435 | 0.390 | 0.548 |
| ColNomic-Embed-3B | 0.315 | 0.320 | 0.267 | 0.444 |
| ColPali-v1.3 | 0.284 | 0.347 | 0.249 | 0.403 |
| GME-Qwen2-VL-2B | 0.235 | 0.308 | 0.209 | 0.314 |
| ColQwen2.5-v0.2 | 0.143 | 0.160 | 0.127 | 0.220 |
| ColQwen2-v1.0 | 0.050 | 0.065 | 0.038 | 0.109 |
**Nayana-IR Monolingual**
| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| **ColNetraEmbed** | **0.670** | **0.764** | **0.645** | **0.686** |
| ColNomic-Embed-3B | 0.534 | 0.603 | 0.515 | 0.546 |
| ColQwen2.5-v0.2 | 0.453 | 0.513 | 0.437 | 0.464 |
| GME-Qwen2-VL-2B | 0.444 | 0.525 | 0.426 | 0.452 |
| ColQwen2-v1.0 | 0.413 | 0.466 | 0.398 | 0.422 |
| ColPali-v1.3 | 0.410 | 0.484 | 0.393 | 0.422 |
**ViDoRe v2**
| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| ColQwen2.5-v0.2 | 0.592 | 0.664 | 0.484 | 0.711 |
| Jina-Embeddings-v4 | 0.576 | 0.686 | - | - |
| GME-Qwen2-VL-2B | 0.574 | 0.630 | 0.466 | 0.690 |
| ColNomic-Embed-3B | 0.556 | 0.633 | 0.451 | 0.672 |
| **ColNetraEmbed** | 0.551 | 0.664 | 0.445 | 0.445 |
| ColQwen2-v1.0 | 0.545 | 0.640 | 0.438 | 0.653 |
| ColPali-v1.3 | 0.538 | 0.627 | 0.436 | 0.644 |
**Key Results:**
- πŸ† **Strong multilingual performance** with ColBERT-style late interaction
- πŸ“ˆ **124% improvement** over ColPali-v1.3 on cross-lingual retrieval (NDCG@5: 0.637 vs 0.284)
- 🌍 Supports **22 languages** across diverse script families
- πŸ” **Fine-grained matching** through token-level MaxSim scoring
**Comparison: Multi-vector vs Single-vector**
- ColNetraEmbed (multi-vector): more interpretable, with token-level attribution via MaxSim
- NetraEmbed (single-vector): higher cross-lingual accuracy (0.716 vs 0.637 NDCG@5) and roughly 250x lower storage footprint (illustrated below)
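The ~250x figure follows from the representation sizes: a multi-vector index stores one 128-dim vector per image patch, while a single-vector index stores one fixed-size vector per page. A back-of-envelope sketch with hypothetical numbers (the average patch count and single-vector dimension below are assumptions chosen to show how a gap of this order arises, not values from the model card):

```python
# All numbers except the 128-dim patch embedding are hypothetical.
patches_per_page = 1000      # assumed average patch count per page image
multi_vector_dim = 128       # per-patch dimension (from the model card)
single_vector_dim = 512      # assumed single-vector embedding size

values_multi = patches_per_page * multi_vector_dim  # 128,000 values per page
values_single = single_vector_dim                   # 512 values per page
print(values_multi / values_single)                 # 250.0 -- a ~250x gap
```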
See our [paper](https://arxiv.org/abs/2512.03514) for comprehensive evaluation and architectural comparisons.
## Citation
```bibtex
@misc{kolavi2025m3druniversalmultilingualmultimodal,
title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval},
author={Adithya S Kolavi and Vyoman Jain},
year={2025},
eprint={2512.03514},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2512.03514}
}
```
## License
This model is released under the Gemma license, the same terms as the base Gemma3 model.
## Acknowledgments
Compute credits for training, inference, and evaluation were provided by [Modal](https://modal.com), our compute sponsor. Dataset curation and synthesis were supported by the [Meta LLaMA Impact Grant](https://about.fb.com/news/2025/04/llama-impact-grant-recipients/?utm_source=AIatMeta&utm_medium=organic_social&utm_content=image&utm_campaign=llamacon) through our [Nayana initiative](https://www.cognitivelab.in/nayana). We thank Meta for their continued support of our research at [CognitiveLab](https://www.cognitivelab.in).
Built on top of the ColPali framework and Gemma3 architecture.