|
|
---
license: apache-2.0
datasets:
- vidore/colpali_train_set
- openbmb/VisRAG-Ret-Train-Synthetic-data
- openbmb/VisRAG-Ret-Train-In-domain-data
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: visual-document-retrieval
---
|
|
# EvoQwen2.5-VL-Retriever-7B-v1
|
|
EvoQwen2.5-VL-Retriever-7B-v1 is a high-performance multimodal retrieval model built on the Qwen2.5-VL-7B-Instruct backbone, scoring query-document pairs with multi-vector late interaction. The model is fine-tuned with an evolutionary training framework (Evo-Retriever), enabling accurate retrieval over complex visual documents.
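
As background, late interaction scores a query against a document page by comparing token-level embeddings rather than single pooled vectors. The sketch below illustrates the ColBERT-style MaxSim scoring this model family uses; the function name and shapes are illustrative only, and in practice `processor.score_multi_vector` (see Usage) computes this for you.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance: for each query token, take the
    maximum similarity over all document tokens, then sum over query tokens.

    query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim).
    Embeddings are assumed L2-normalized, so dot products are cosine similarities.
    """
    sim = query_emb @ doc_emb.T            # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()     # scalar relevance score
```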
|
|
## Version Specificity

- Model: ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-7B-v1
- Base model: Qwen/Qwen2.5-VL-7B-Instruct
- Parameter size: 7 billion (7B)
- Features: the 7B version delivers the highest retrieval accuracy across all evaluation benchmarks, making it well suited to applications with stringent performance requirements.
|
|
|
|
|
## Performance

| Model | ViDoRe V2 (nDCG@5) | MMEB VisDoc (ndcg_linear@5) |
|-------|--------------------|------------------------------|
| ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-3B-v1 | 63.00 | 75.96 |
| ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-7B-v1 | 65.24 | 77.10 |
|
|
|
|
|
## Usage

Make sure you have installed `transformers`, `torch`, `pillow`, and `colpali-engine`.
|
|
|
|
|
```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

model_name = "ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-7B-v1"

model = ColQwen2_5.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = ColQwen2_5_Processor.from_pretrained(model_name)

# Your inputs
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```
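
`score_multi_vector` returns a tensor of shape `(num_queries, num_images)`. A minimal follow-up for picking the best-matching image per query:

```python
# Index of the highest-scoring image for each query.
best_matches = scores.argmax(dim=1)
print(best_matches)
```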
|
|
|
|
|
## Parameters

All models are fine-tuned with the Evo-Retriever paradigm using a two-stage training schedule (one epoch per stage). Unless otherwise noted, parameter-efficient fine-tuning uses low-rank adapters (LoRA) with rank 32 for both the 3B and 7B models. Training runs in bfloat16 precision with the paged_adamw_8bit optimizer on an 8-GPU H20 server under a data-parallel strategy, with a learning rate of 2e-5, cosine decay, 2% warm-up steps, and a batch size of 32.
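
As a rough illustration only, these hyperparameters map onto a PEFT/Transformers configuration like the sketch below; the `lora_alpha`, `target_modules`, and output path are assumptions, not values taken from the released training code.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter settings (rank 32 as reported; alpha/targets are assumptions).
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="FEATURE_EXTRACTION",
)

# Reported optimization settings for one training stage.
training_args = TrainingArguments(
    output_dir="evo-retriever-stage1",   # hypothetical path
    num_train_epochs=1,                  # one epoch per stage
    per_device_train_batch_size=4,       # 8 GPUs x 4 = global batch size 32
    bf16=True,
    optim="paged_adamw_8bit",
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.02,                   # 2% warm-up steps
)
```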
|
|
|