ColQwen3.5-2B-Embedding (LoRA Adapter)
A ColBERT-style multi-vector document retrieval model adapter fine-tuned on top of Qwen/Qwen3.5-2B-Base.
2B Parameters | LoRA Adapter (r=32, α=32) | Matryoshka Representation Learning
Description
Inspired by ColPali, this model encodes document page images into a sequence of contextualized patch embeddings and uses late interaction MaxSim scoring for retrieval.
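The late interaction scoring step can be sketched in a few lines. Below is a minimal NumPy version of MaxSim, assuming both inputs are already L2-normalized and there is no padding (the model's own `score_maxsim` additionally handles masks and batching):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction MaxSim: for each query token embedding, take the
    maximum cosine similarity over all document patch embeddings, then
    sum over query tokens. Assumes both inputs are L2-normalized, so the
    dot product equals cosine similarity."""
    sim = query_emb @ doc_emb.T        # (n_query_tokens, n_doc_patches)
    return float(sim.max(axis=1).sum())
```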
Trained with Matryoshka Representation Learning so embeddings can be truncated to [128, 256, 512, 1024, 2048] dims without retraining, enabling flexible accuracy/speed tradeoffs.
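Truncation itself is just a slice followed by re-normalization; a minimal sketch (NumPy assumed):

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Matryoshka truncation: keep the first `dim` dimensions and
    re-normalize. MRL training packs the most informative features
    into the leading dimensions, so most accuracy is preserved."""
    out = emb[..., :dim]
    norms = np.linalg.norm(out, axis=-1, keepdims=True)
    return out / np.clip(norms, 1e-12, None)
```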
Evaluations
All numbers are NDCG@5.
ViDoRe v1
Evaluated on the ViDoRe v1 benchmark (single relevant doc per query).
| Dataset | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|---|---|---|---|---|---|
| ArxivQA | 0.8634 | 0.8716 | 0.8767 | 0.8776 | 0.8847 |
| DocVQA | 0.5879 | 0.5921 | 0.6024 | 0.5993 | 0.6007 |
| InfoVQA | 0.9055 | 0.9104 | 0.9115 | 0.9170 | 0.9120 |
| Shift Project | 0.8427 | 0.8535 | 0.8420 | 0.8610 | 0.8657 |
| Synth AI | 0.9889 | 0.9926 | 0.9926 | 0.9926 | 0.9926 |
| Synth Energy | 0.9659 | 0.9702 | 0.9659 | 0.9682 | 0.9689 |
| Synth Gov | 0.9223 | 0.9180 | 0.9304 | 0.9441 | 0.9485 |
| Synth Health | 0.9776 | 0.9776 | 0.9802 | 0.9839 | 0.9839 |
| TabFQuAD | 0.8741 | 0.8820 | 0.8782 | 0.8839 | 0.8852 |
| TAT-DQA | 0.7601 | 0.7677 | 0.7700 | 0.7718 | 0.7732 |
| Average | 0.8688 | 0.8736 | 0.8750 | 0.8799 | 0.8815 |
ViDoRe v2
Evaluated on the ViDoRe v2 benchmark (BEIR format, multi-relevant graded qrels — harder than v1).
v2 differences: each query has ~3.2 relevant pages on average, corpus sizes are 5–30× larger (452–3076 docs), and relevance is graded (score ≥ 1 = relevant).
| Dataset | Corpus | Queries | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|---|---|---|---|---|---|---|---|
| Biomedical Lectures | 1016 | 640 | 0.5679 | 0.6011 | 0.6081 | 0.6083 | 0.6191 |
| Economics Reports | 452 | 232 | 0.5611 | 0.5724 | 0.5592 | 0.5659 | 0.5683 |
| ESG Reports | 1538 | 228 | 0.4816 | 0.4971 | 0.5256 | 0.5627 | 0.5647 |
| ESG Reports (Human) | 3076 | 104 | 0.4379 | 0.4384 | 0.4407 | 0.4457 | 0.4471 |
| Average | | | 0.5121 | 0.5273 | 0.5334 | 0.5457 | 0.5498 |
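For reference, NDCG with graded relevance (the metric reported above) can be sketched as follows. This is a minimal single-query version; the official ViDoRe evaluator may differ in details such as how the ideal ranking is built:

```python
import numpy as np

def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k for one query, given graded relevance scores of the
    retrieved documents in ranked order. Minimal sketch: the ideal DCG
    is computed from the same list, which assumes all relevant
    documents appear in it."""
    rel = np.asarray(ranked_relevances, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, min(k, rel.size) + 2))
    dcg = float(np.sum(rel[:k] * discounts))
    ideal = np.sort(rel)[::-1]
    idcg = float(np.sum(ideal[:k] * discounts))
    return dcg / idcg if idcg > 0 else 0.0
```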
Combined Average (v1 + v2 macro)
| | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|---|---|---|---|---|---|
| ViDoRe v1 avg | 0.8688 | 0.8736 | 0.8750 | 0.8799 | 0.8815 |
| ViDoRe v2 avg | 0.5121 | 0.5273 | 0.5334 | 0.5457 | 0.5498 |
| Overall avg | 0.6905 | 0.7005 | 0.7042 | 0.7128 | 0.7157 |
Comparison with 0.8B variant
| Model | Params | v1 avg (1024-dim) | v2 avg (1024-dim) |
|---|---|---|---|
| ColQwen3.5-0.8B | 874M | 0.8625 | 0.4806 |
| ColQwen3.5-2B | 2B | 0.8799 | 0.5457 |
| Δ | | +0.0174 | +0.0651 |
The 2B variant's gain is largest on v2 (the harder, multi-relevant benchmark), consistent with larger models being more robust in harder retrieval settings.
Limitations
- Training data: Fine-tuned on vidore/colpali_train_set for 1 epoch on a single A100-80GB. The set covers scientific papers, reports, and slides; real-world documents with complex layouts, handwriting, or non-English text may be out-of-distribution. No hard negatives were used.
- Language: Predominantly English training data. Performance on non-English documents is expected to degrade.
- LoRA adapter: Must be loaded on top of the base Qwen/Qwen3.5-2B-Base weights.
- Matryoshka tradeoff: Truncating to 128-dim costs ~1.3 NDCG@5 points vs 2048-dim on v1 (0.8688 vs 0.8815) and ~3.8 points on v2 (0.5121 vs 0.5498).
Usage
Requirements
```
pillow
transformers==5.3.0
peft==0.18.1
qwen-vl-utils>=0.0.14
torch==2.8.0
```
Example
```python
from embedder.colqwen3_5_embedder import ColQwen3_5Embedder

embedder = ColQwen3_5Embedder(
    model_name_or_path="Qwen/Qwen3.5-2B-Base",
    lora_checkpoint="leo-vnuuet/ColQwen3.5-2B-Embedding",
    embed_dim=128,
)

queries = [
    {"text": "What is the quarterly revenue breakdown?"},
]
documents = [
    {"image": "/path/to/document_page.png"},
]

# Encode queries and documents into multi-vector embeddings
qry_emb, qry_mask = embedder.process(queries, normalize=True, pooling=False)
doc_emb, doc_mask = embedder.process(documents, normalize=True, pooling=False)

# scores shape: (num_queries, num_docs)
scores = embedder.score_maxsim(qry_emb, doc_emb, qry_mask, doc_mask)

print("Relevance scores:")
for q_idx, query in enumerate(queries):
    for d_idx, doc in enumerate(documents):
        print(f"  Q{q_idx+1} vs D{d_idx+1}: {scores[q_idx, d_idx].item():.4f}")
```
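With more than a few documents, the score matrix can be turned into a per-query ranking by sorting each row. A small sketch with a stand-in scores array (NumPy assumed):

```python
import numpy as np

# Stand-in for the (num_queries, num_docs) matrix returned by score_maxsim
scores = np.array([[12.3, 18.7, 9.1],
                   [4.2, 3.9, 11.0]])

# Document indices per query, best MaxSim score first
ranking = np.argsort(-scores, axis=1)
print(ranking)  # row 0 -> [1 0 2], row 1 -> [2 0 1]
```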
Training Details
| Base model | Qwen/Qwen3.5-2B-Base |
| Training data | vidore/colpali_train_set (~118K pairs) |
| Epochs | 1 |
| Batch size | 8 × 4 grad accum = effective 32 |
| Learning rate | 5e-5 (cosine, 2.5% warmup) |
| Optimizer | paged_adamw_8bit |
| LoRA rank | r=32, α=32 |
| LoRA targets | All linear layers (attention + MLP + DeltaNet) |
| Loss | Matryoshka MaxSim (dims: 128, 256, 512, 1024, 2048 — equal weights) |
| Precision | bfloat16 |
| Hardware | 1× NVIDIA A100-SXM4-80GB |
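The Matryoshka MaxSim loss from the table above can be sketched conceptually: an in-batch contrastive objective (diagonal pairs as positives) evaluated at each truncated dimensionality, then averaged with equal weights. This is an illustrative reconstruction, not the exact training code, and it ignores padding masks:

```python
import numpy as np

def maxsim_matrix(Q, D):
    """All-pairs MaxSim: Q is (B, n_q, dim), D is (B, n_d, dim), both
    L2-normalized. Returns a (B, B) matrix of query-i vs document-j scores."""
    sim = np.einsum('iqd,jpd->ijqp', Q, D)   # token-level cosine similarities
    return sim.max(axis=3).sum(axis=2)       # max over patches, sum over query tokens

def matryoshka_maxsim_loss(Q, D, dims=(128, 256, 512, 1024, 2048)):
    """Equal-weight average of an in-batch contrastive loss (InfoNCE-style,
    diagonal pairs as positives) computed at each truncated dimensionality."""
    losses = []
    for d in dims:
        q = Q[..., :d] / np.linalg.norm(Q[..., :d], axis=-1, keepdims=True)
        c = D[..., :d] / np.linalg.norm(D[..., :d], axis=-1, keepdims=True)
        logits = maxsim_matrix(q, c)         # (B, B)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        losses.append(-np.mean(np.diag(log_probs)))
    return float(np.mean(losses))
```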
Further Contributions
Contributions, experiments, and extensions are welcome.
Disclaimer: While my core background is in Software and Systems Engineering, I am currently exploring the depths of training and fine-tuning Vision/Language Models. There is still much to master, and I highly welcome any constructive feedback or insights from the community! Thanks in advance 🤗🫶
License
Apache 2.0 (inherits from base model)