---
license: apache-2.0
datasets:
- vidore/colpali_train_set
- openbmb/VisRAG-Ret-Train-Synthetic-data
- openbmb/VisRAG-Ret-Train-In-domain-data
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: visual-document-retrieval
---

# Model Card for EvoQwen2.5-VL-Retriever-7B-v1

EvoQwen2.5-VL-Retriever-7B-v1 is a high-performance multimodal retrieval model built on the Qwen2.5-VL-7B-Instruct backbone and using multi-vector late interaction. The model is fine-tuned with the Evo-Retriever evolutionary training framework, enabling accurate retrieval over complex visual documents.

## Version Specificity

- Model: ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-7B-v1
- Parameter Size: 7 billion (7B)
- Features: The 7B version delivers the highest retrieval accuracy across all evaluation benchmarks, making it ideal for applications with stringent performance requirements.

## Performance
| Model | ViDoRe V2 (nDCG@5) | MMEB VisDoc (ndcg_linear@5) |
|---|---|---|
| ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-3B-v1 | 63.00 | 75.96 |
| ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-7B-v1 | 65.24 | 77.10 |
## Usage

Make sure that you have installed Transformers, Torch, Pillow, and colpali-engine.

```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

  model_name = "ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-3B-v1" 

  model = ColQwen2_5.from_pretrained(
      model_name,
      torch_dtype=torch.bfloat16,
      device_map="cuda:0",  # or "mps" if on Apple Silicon
      attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
  ).eval()

  processor = ColQwen2_5_Processor.from_pretrained(model_name)

  

# Your inputs
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```
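The `score_multi_vector` call above computes the late-interaction (MaxSim) score used by multi-vector retrievers: for every query token embedding, take the maximum similarity over all document patch embeddings, then sum these maxima over the query tokens. The sketch below is only a conceptual illustration of that scoring rule, not the library implementation; the helper name `late_interaction_score` is hypothetical, and it assumes each embedding is an unpadded `(num_tokens, dim)` tensor.

```python
import torch

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """MaxSim late interaction: for each query token, take the best-matching
    document token similarity, then sum over query tokens.

    query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim).
    Token embeddings are assumed to be normalized, as multi-vector retrievers
    typically normalize them before scoring.
    """
    sim = query_emb @ doc_emb.T            # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()     # scalar relevance score

# Hypothetical usage, mirroring the query-by-document score matrix
# returned by processor.score_multi_vector above:
# pairwise = torch.tensor([
#     [late_interaction_score(q, d) for d in image_embeddings]
#     for q in query_embeddings
# ])
```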
## Parameters

All models are fine-tuned with the Evo-Retriever paradigm using a two-stage training schedule (one epoch per stage). Unless otherwise noted, parameter-efficient fine-tuning uses low-rank adapters (LoRA) with rank 32 for both the 3B and 7B models. Training runs in bfloat16 precision with the paged_adamw_8bit optimizer on an 8-GPU H20 server under a data-parallel strategy, with a learning rate of 2e-5, cosine decay, 2% warm-up steps, and a batch size of 32.
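The Evo-Retriever training code is not included in this card, but as a rough orientation the stated hyperparameters could map to a peft/transformers configuration along the lines below. This is a hypothetical sketch, not the released training script: the LoRA target modules, the LoRA alpha, the output directory, and the interpretation of the batch size of 32 as 4 per device across 8 data-parallel GPUs are all assumptions.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapters with rank 32, as stated above (target modules and alpha are assumptions)
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,                                             # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumption
    task_type="FEATURE_EXTRACTION",
)

# One epoch per stage, bfloat16, paged 8-bit AdamW, lr 2e-5, cosine decay,
# 2% warm-up; batch size 32 interpreted as 4 per device on 8 GPUs (assumption).
training_args = TrainingArguments(
    output_dir="evo-retriever-stage1",   # hypothetical path
    num_train_epochs=1,
    per_device_train_batch_size=4,
    bf16=True,
    optim="paged_adamw_8bit",
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.02,
)
```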