|
|
---
license: apache-2.0
datasets:
- vidore/colpali_train_set
- openbmb/VisRAG-Ret-Train-Synthetic-data
- openbmb/VisRAG-Ret-Train-In-domain-data
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: visual-document-retrieval
---
|
|
# EvoQwen2.5-VL-Retriever-7B-v1
|
|
EvoQwen2.5-VL-Retriever-7B-v1 is a high-performance multimodal retrieval model built on the Qwen2.5-VL-7B-Instruct backbone, scoring query-document pairs with multi-vector late interaction. The model is fine-tuned with an evolutionary training framework (Evo-Retriever), enabling accurate retrieval over complex visual documents.
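
As background, late interaction scores a query against a document page by comparing token-level embeddings rather than single pooled vectors. The sketch below illustrates the ColBERT-style MaxSim scoring this model family uses; the function name and shapes are illustrative only, and in practice `processor.score_multi_vector` (see Usage) computes this for you.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance: for each query token, take the
    maximum similarity over all document tokens, then sum over query tokens.

    query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim).
    Embeddings are assumed L2-normalized, so dot products are cosine similarities.
    """
    sim = query_emb @ doc_emb.T            # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()     # scalar relevance score
```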
|
|
## Version Specificity

- Model: ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-7B-v1
- Base model: Qwen/Qwen2.5-VL-7B-Instruct
- Parameter size: 7 billion (7B)
- Features: the 7B version delivers the highest retrieval accuracy across all evaluation benchmarks, making it well suited to applications with stringent performance requirements.
|
|
|
|
|
## Performance

| Model | ViDoRe V2 (nDCG@5) | MMEB VisDoc (ndcg_linear@5) |
|-------|--------------------|------------------------------|
| ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-3B-v1 | 63.00 | 75.96 |
| ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-7B-v1 | 65.24 | 77.10 |
|
|
|
|
|
## Usage

Make sure you have installed `transformers`, `torch`, `pillow`, and `colpali-engine`.
|
|
|
|
|
```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

model_name = "ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-7B-v1"

model = ColQwen2_5.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = ColQwen2_5_Processor.from_pretrained(model_name)

# Your inputs
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```
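
`score_multi_vector` returns a tensor of shape `(num_queries, num_images)`. A minimal follow-up for picking the best-matching image per query:

```python
# Index of the highest-scoring image for each query.
best_matches = scores.argmax(dim=1)
print(best_matches)
```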
|
|
|
|
|
## Parameters

All models are fine-tuned with the Evo-Retriever paradigm using a two-stage training schedule (one epoch per stage). Unless otherwise noted, parameter-efficient fine-tuning uses low-rank adapters (LoRA) with rank 32 for both the 3B and 7B models. Training runs in bfloat16 precision with the paged_adamw_8bit optimizer on an 8-GPU H20 server under a data-parallel strategy, with a learning rate of 2e-5, cosine decay, 2% warm-up steps, and a batch size of 32.
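
As a rough illustration only, these hyperparameters map onto a PEFT/Transformers configuration like the sketch below; the `lora_alpha`, `target_modules`, and output path are assumptions, not values taken from the released training code.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter settings (rank 32 as reported; alpha/targets are assumptions).
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="FEATURE_EXTRACTION",
)

# Reported optimization settings for one training stage.
training_args = TrainingArguments(
    output_dir="evo-retriever-stage1",   # hypothetical path
    num_train_epochs=1,                  # one epoch per stage
    per_device_train_batch_size=4,       # 8 GPUs x 4 = global batch size 32
    bf16=True,
    optim="paged_adamw_8bit",
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.02,                   # 2% warm-up steps
)
```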
|
|
|