ColQwen3.5-2B-Embedding (LoRA Adapter)
A ColBERT-style multi-vector document retrieval model adapter fine-tuned on top of Qwen/Qwen3.5-2B-Base.
2B Parameters | LoRA Adapter (r=32, α=32) | Matryoshka Representation Learning
Description
Inspired by ColPali, this model encodes document page images into a sequence of contextualized patch embeddings and uses late interaction MaxSim scoring for retrieval.
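The late interaction scoring step can be sketched in a few lines. Below is a minimal NumPy version of MaxSim, assuming both inputs are already L2-normalized and there is no padding (the model's own `score_maxsim` additionally handles masks and batching):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction MaxSim: for each query token embedding, take the
    maximum cosine similarity over all document patch embeddings, then
    sum over query tokens. Assumes both inputs are L2-normalized, so the
    dot product equals cosine similarity."""
    sim = query_emb @ doc_emb.T        # (n_query_tokens, n_doc_patches)
    return float(sim.max(axis=1).sum())
```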
Trained with Matryoshka Representation Learning so embeddings can be truncated to [128, 256, 512, 1024, 2048] dims without retraining, enabling flexible accuracy/speed tradeoffs.
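Truncation itself is just a slice followed by re-normalization; a minimal sketch (NumPy assumed):

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Matryoshka truncation: keep the first `dim` dimensions and
    re-normalize. MRL training packs the most informative features
    into the leading dimensions, so most accuracy is preserved."""
    out = emb[..., :dim]
    norms = np.linalg.norm(out, axis=-1, keepdims=True)
    return out / np.clip(norms, 1e-12, None)
```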
Evaluations
All numbers are NDCG@5.
ViDoRe v1
Evaluated on the ViDoRe v1 benchmark (single relevant doc per query).
| Dataset | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|---|---|---|---|---|---|
| ArxivQA | 0.8634 | 0.8716 | 0.8767 | 0.8776 | 0.8847 |
| DocVQA | 0.5879 | 0.5921 | 0.6024 | 0.5993 | 0.6007 |
| InfoVQA | 0.9055 | 0.9104 | 0.9115 | 0.9170 | 0.9120 |
| Shift Project | 0.8427 | 0.8535 | 0.8420 | 0.8610 | 0.8657 |
| Synth AI | 0.9889 | 0.9926 | 0.9926 | 0.9926 | 0.9926 |
| Synth Energy | 0.9659 | 0.9702 | 0.9659 | 0.9682 | 0.9689 |
| Synth Gov | 0.9223 | 0.9180 | 0.9304 | 0.9441 | 0.9485 |
| Synth Health | 0.9776 | 0.9776 | 0.9802 | 0.9839 | 0.9839 |
| TabFQuAD | 0.8741 | 0.8820 | 0.8782 | 0.8839 | 0.8852 |
| TAT-DQA | 0.7601 | 0.7677 | 0.7700 | 0.7718 | 0.7732 |
| Average | 0.8688 | 0.8736 | 0.8750 | 0.8799 | 0.8815 |
ViDoRe v2
Evaluated on the ViDoRe v2 benchmark (BEIR format, multi-relevant graded qrels — harder than v1).
v2 differences: each query has ~3.2 relevant pages on average, corpus sizes are 5–30× larger (452–3076 docs), and relevance is graded (score ≥ 1 = relevant).
| Dataset | Corpus | Queries | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|---|---|---|---|---|---|---|---|
| Biomedical Lectures | 1016 | 640 | 0.5679 | 0.6011 | 0.6081 | 0.6083 | 0.6191 |
| Economics Reports | 452 | 232 | 0.5611 | 0.5724 | 0.5592 | 0.5659 | 0.5683 |
| ESG Reports | 1538 | 228 | 0.4816 | 0.4971 | 0.5256 | 0.5627 | 0.5647 |
| ESG Reports (Human) | 3076 | 104 | 0.4379 | 0.4384 | 0.4407 | 0.4457 | 0.4471 |
| Average | | | 0.5121 | 0.5273 | 0.5334 | 0.5457 | 0.5498 |
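For reference, NDCG with graded relevance (the metric reported above) can be sketched as follows. This is a minimal single-query version; the official ViDoRe evaluator may differ in details such as how the ideal ranking is built:

```python
import numpy as np

def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k for one query, given graded relevance scores of the
    retrieved documents in ranked order. Minimal sketch: the ideal DCG
    is computed from the same list, which assumes all relevant
    documents appear in it."""
    rel = np.asarray(ranked_relevances, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, min(k, rel.size) + 2))
    dcg = float(np.sum(rel[:k] * discounts))
    ideal = np.sort(rel)[::-1]
    idcg = float(np.sum(ideal[:k] * discounts))
    return dcg / idcg if idcg > 0 else 0.0
```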
Combined Average (v1 + v2 macro)
| | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|---|---|---|---|---|---|
| ViDoRe v1 avg | 0.8688 | 0.8736 | 0.8750 | 0.8799 | 0.8815 |
| ViDoRe v2 avg | 0.5121 | 0.5273 | 0.5334 | 0.5457 | 0.5498 |
| Overall avg | 0.6905 | 0.7005 | 0.7042 | 0.7128 | 0.7157 |
Comparison with 0.8B variant
| Model | Params | v1 avg (1024-dim) | v2 avg (1024-dim) |
|---|---|---|---|
| ColQwen3.5-0.8B | 874M | 0.8625 | 0.4806 |
| ColQwen3.5-2B | 2B | 0.8799 | 0.5457 |
| Δ | | +0.0174 | +0.0651 |
The 2B variant's gain is largest on v2 (the harder, multi-relevant benchmark), consistent with larger models being more robust in harder retrieval settings.
Limitations
- Training data: Fine-tuned on vidore/colpali_train_set for 1 epoch on a single A100-80GB. The set covers scientific papers, reports, and slides; real-world documents with complex layouts, handwriting, or non-English text may be out-of-distribution. No hard negatives were used.
- Language: Predominantly English training data. Performance on non-English documents is expected to degrade.
- LoRA adapter: Must be loaded on top of the base Qwen/Qwen3.5-2B-Base weights.
- Matryoshka tradeoff: Truncating to 128-dim costs ~1.3 NDCG@5 points vs 2048-dim on v1 (0.8688 vs 0.8815) and ~3.8 points on v2 (0.5121 vs 0.5498).
Usage
Requirements
```
pillow
transformers==5.3.0
peft==0.18.1
qwen-vl-utils>=0.0.14
torch==2.8.0
```
Example
```python
from embedder.colqwen3_5_embedder import ColQwen3_5Embedder

embedder = ColQwen3_5Embedder(
    model_name_or_path="Qwen/Qwen3.5-2B-Base",
    lora_checkpoint="leo-vnuuet/ColQwen3.5-2B-Embedding",
    embed_dim=128,
)

queries = [
    {"text": "What is the quarterly revenue breakdown?"},
]
documents = [
    {"image": "/path/to/document_page.png"},
]

# Encode queries and documents into multi-vector embeddings
qry_emb, qry_mask = embedder.process(queries, normalize=True, pooling=False)
doc_emb, doc_mask = embedder.process(documents, normalize=True, pooling=False)

# scores shape: (num_queries, num_docs)
scores = embedder.score_maxsim(qry_emb, doc_emb, qry_mask, doc_mask)

print("Relevance scores:")
for q_idx, query in enumerate(queries):
    for d_idx, doc in enumerate(documents):
        print(f"  Q{q_idx+1} vs D{d_idx+1}: {scores[q_idx, d_idx].item():.4f}")
```
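With more than a few documents, the score matrix can be turned into a per-query ranking by sorting each row. A small sketch with a stand-in scores array (NumPy assumed):

```python
import numpy as np

# Stand-in for the (num_queries, num_docs) matrix returned by score_maxsim
scores = np.array([[12.3, 18.7, 9.1],
                   [4.2, 3.9, 11.0]])

# Document indices per query, best MaxSim score first
ranking = np.argsort(-scores, axis=1)
print(ranking)  # row 0 -> [1 0 2], row 1 -> [2 0 1]
```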
Training Details
| Base model | Qwen/Qwen3.5-2B-Base |
| Training data | vidore/colpali_train_set (~118K pairs) |
| Epochs | 1 |
| Batch size | 8 × 4 grad accum = effective 32 |
| Learning rate | 5e-5 (cosine, 2.5% warmup) |
| Optimizer | paged_adamw_8bit |
| LoRA rank | r=32, α=32 |
| LoRA targets | All linear layers (attention + MLP + DeltaNet) |
| Loss | Matryoshka MaxSim (dims: 128, 256, 512, 1024, 2048 — equal weights) |
| Precision | bfloat16 |
| Hardware | 1× NVIDIA A100-SXM4-80GB |
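The Matryoshka MaxSim loss from the table above can be sketched conceptually: an in-batch contrastive objective (diagonal pairs as positives) evaluated at each truncated dimensionality, then averaged with equal weights. This is an illustrative reconstruction, not the exact training code, and it ignores padding masks:

```python
import numpy as np

def maxsim_matrix(Q, D):
    """All-pairs MaxSim: Q is (B, n_q, dim), D is (B, n_d, dim), both
    L2-normalized. Returns a (B, B) matrix of query-i vs document-j scores."""
    sim = np.einsum('iqd,jpd->ijqp', Q, D)   # token-level cosine similarities
    return sim.max(axis=3).sum(axis=2)       # max over patches, sum over query tokens

def matryoshka_maxsim_loss(Q, D, dims=(128, 256, 512, 1024, 2048)):
    """Equal-weight average of an in-batch contrastive loss (InfoNCE-style,
    diagonal pairs as positives) computed at each truncated dimensionality."""
    losses = []
    for d in dims:
        q = Q[..., :d] / np.linalg.norm(Q[..., :d], axis=-1, keepdims=True)
        c = D[..., :d] / np.linalg.norm(D[..., :d], axis=-1, keepdims=True)
        logits = maxsim_matrix(q, c)         # (B, B)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        losses.append(-np.mean(np.diag(log_probs)))
    return float(np.mean(losses))
```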
Further Contributions
Contributions, experiments, and extensions are welcome.
Disclaimer: While my core background is in Software and Systems Engineering, I am currently exploring the depths of training and fine-tuning Vision/Language Models. There is still much to master, and I highly welcome any constructive feedback or insights from the community! Thanks in advance 🤗🫶
License
Apache 2.0 (inherits from base model)