---
license: mit
datasets:
- openbmb/VisRAG-Ret-Train-Synthetic-data
- openbmb/VisRAG-Ret-Train-In-domain-data
- Metric-AI/rag_docmatix_100k
- vidore/colpali_train_set
- llamaindex/vdr-multilingual-train
- Metric-AI/tabfquad_train_set
language:
- en
- fr
- es
- it
- de
base_model:
- Metric-AI/ColQwenStella-base-2b
- Qwen/Qwen2-VL-2B
- NovaSearch/stella_en_1.5B_v5
tags:
- vidore
- multimodal_embedding
- multilingual_embedding
- Text-to-Visual Document (T→VD) retrieval
library_name: peft
pipeline_tag: visual-document-retrieval
---
# ColQwenStella-2b-multilingual: Multilingual Visual Retriever based on the combination of the Qwen2 vision encoder and the stella_en_1.5B_v5 model

## Ranked #1 among models ≤ 2B parameters and #8 overall on the Vidore benchmark (as of February 11, 2025). The reported scores on the [Vidore Leaderboard](https://huggingface.co/spaces/vidore/vidore-leaderboard) correspond to checkpoint-1800.

### This is the base version, trained on 4xA100 80GB GPUs with per_device_batch_size=128 for 5 epochs.

The ColQwenStella-2b-multilingual architecture combines the vision component of the Qwen2 model with stella_en_1.5B_v5 as its embedding model. Training follows the [ColPali: Efficient Document Retrieval with Vision Language Models](https://arxiv.org/abs/2407.01449) recipe.

## Data
- **Synthetic data**: Selected and preprocessed from the `openbmb/VisRAG-Ret-Train-Synthetic-data` dataset.
- **In-domain VQA dataset**: Drawn from `openbmb/VisRAG-Ret-Train-In-domain-data`.
- **Docmatix dataset**: Extracted from the `Metric-AI/rag_docmatix_100k` dataset.
- **Colpali dataset**: Taken from `vidore/colpali_train_set`.
- **Multilingual dataset**: Taken from `llamaindex/vdr-multilingual-train` (see the loading sketch below).
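
To peek at any of these sources, here is a minimal sketch using the 🤗 `datasets` library (the `"train"` split name is an assumption; check each dataset card for the exact splits and schema):

```python
from datasets import load_dataset

# Stream one of the training sources to inspect its schema without
# downloading the full dataset. The "train" split name is an assumption.
ds = load_dataset("vidore/colpali_train_set", split="train", streaming=True)
print(next(iter(ds)).keys())
```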

## Model Training

### Parameters
We train the model with low-rank adapters ([LoRA](https://arxiv.org/abs/2106.09685)),
using `alpha=128` and `r=128` on the transformer layers of the language model and the `mlp` layers of `vision_model.merger`,
as well as on the final randomly initialized projection layer, with the `adamw` optimizer.
We train on a 4xA100 GPU setup with distributed data parallelism (via accelerate), a learning rate of 5e-4 with cosine decay and 100 warmup steps, and a per-device batch size of 128, in `bfloat16` format.
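
A minimal `peft` sketch of such an adapter configuration is below; the `target_modules` regex is an assumption for illustration, not the exact training script:

```python
from peft import LoraConfig

# Sketch of the reported LoRA hyperparameters (r=128, alpha=128).
# The target_modules regex is an assumption; the actual run may
# select the language-model and vision-merger layers differently.
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    target_modules=r".*(language_model.*(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)|vision_model\.merger\.mlp).*",
)
```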

## Installation

```bash
pip install "transformers>=4.46.3"
```

## Usage

```python
import torch
from PIL import Image

from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained(
    "Metric-AI/ColQwenStella-2b-multilingual",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    trust_remote_code=True,
).eval()
processor = AutoProcessor.from_pretrained("Metric-AI/ColQwenStella-2b-multilingual", trust_remote_code=True)

# Your inputs
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction similarity matrix of shape (num_queries, num_images)
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
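
`scores` is a tensor of shape `(num_queries, num_images)`; for example, `scores.argmax(dim=1)` returns the best-matching image index for each query. Following the ColPali recipe, `score_multi_vector` computes a MaxSim-style late-interaction score; a rough, unbatched sketch of that computation (assuming unpadded, equal-length embedding tensors) is:

```python
# Rough MaxSim sketch: for each query token, take its best-matching
# image-patch token, then sum those maxima over the query tokens.
# Assumes unpadded (batch, seq_len, dim) tensors; the real
# score_multi_vector also handles variable lengths and batching.
sim = torch.einsum("qnd,pmd->qpnm", query_embeddings, image_embeddings)
maxsim = sim.max(dim=-1).values.sum(dim=-1)  # (num_queries, num_images)
```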

## License

The adapters attached to the model are under the MIT license.

- **Developed by:** [Metric AI Research Lab](https://metric.am/)