ColQwen3.5-2B-Embedding (LoRA Adapter)

A ColBERT-style multi-vector document retrieval adapter, fine-tuned as a LoRA on top of Qwen/Qwen3.5-2B-Base.

2B Parameters | LoRA Adapter (r=32, α=32) | Matryoshka Representation Learning

Description

Inspired by ColPali, this model encodes document page images into a sequence of contextualized patch embeddings and uses late interaction MaxSim scoring for retrieval.
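
For intuition, here is a minimal MaxSim sketch in plain PyTorch; the function name and shapes are illustrative and not part of this model's API:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction MaxSim between one query and one document page.

    query_emb: (num_query_tokens, dim), L2-normalized
    doc_emb:   (num_doc_patches, dim),  L2-normalized
    """
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_patches) cosine sims
    return sim.max(dim=1).values.sum()   # best patch per query token, summed over the query
```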

Trained with Matryoshka Representation Learning so embeddings can be truncated to [128, 256, 512, 1024, 2048] dims without retraining, enabling flexible accuracy/speed tradeoffs.
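
At inference time the truncation is just keeping the leading dimensions and re-normalizing; a minimal sketch (helper name is illustrative):

```python
import torch

def truncate_matryoshka(emb: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the first `dim` components of each vector and re-normalize,
    so MaxSim still operates on unit-length embeddings."""
    return torch.nn.functional.normalize(emb[..., :dim], p=2, dim=-1)

# e.g. reduce 2048-dim page embeddings to 128 dims for cheaper storage:
# page_emb_128 = truncate_matryoshka(page_emb_2048, 128)
```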

Evaluations

All numbers are NDCG@5.
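
For reference, one common formulation of NDCG@5 with graded relevance labels (illustrative only, not the benchmarks' official evaluation code):

```python
import math

def ndcg_at_k(ranked_doc_ids, qrels, k=5):
    """NDCG@k for a single query.

    ranked_doc_ids: doc ids sorted by descending retrieval score
    qrels:          dict doc_id -> graded relevance (missing ids count as 0)
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```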

ViDoRe v1

Evaluated on the ViDoRe v1 benchmark (single relevant doc per query).

| Dataset | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|---|---|---|---|---|---|
| ArxivQA | 0.8634 | 0.8716 | 0.8767 | 0.8776 | 0.8847 |
| DocVQA | 0.5879 | 0.5921 | 0.6024 | 0.5993 | 0.6007 |
| InfoVQA | 0.9055 | 0.9104 | 0.9115 | 0.9170 | 0.9120 |
| Shift Project | 0.8427 | 0.8535 | 0.8420 | 0.8610 | 0.8657 |
| Synth AI | 0.9889 | 0.9926 | 0.9926 | 0.9926 | 0.9926 |
| Synth Energy | 0.9659 | 0.9702 | 0.9659 | 0.9682 | 0.9689 |
| Synth Gov | 0.9223 | 0.9180 | 0.9304 | 0.9441 | 0.9485 |
| Synth Health | 0.9776 | 0.9776 | 0.9802 | 0.9839 | 0.9839 |
| TabFQuAD | 0.8741 | 0.8820 | 0.8782 | 0.8839 | 0.8852 |
| TAT-DQA | 0.7601 | 0.7677 | 0.7700 | 0.7718 | 0.7732 |
| Average | 0.8688 | 0.8736 | 0.8750 | 0.8799 | 0.8815 |

ViDoRe v2

Evaluated on the ViDoRe v2 benchmark (BEIR format, multi-relevant graded qrels — harder than v1).

v2 differences: each query has ~3.2 relevant pages on average, corpus sizes are 5–30× larger (452–3076 docs), and relevance is graded (score ≥ 1 = relevant).

| Dataset | Corpus | Queries | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|---|---|---|---|---|---|---|---|
| Biomedical Lectures | 1016 | 640 | 0.5679 | 0.6011 | 0.6081 | 0.6083 | 0.6191 |
| Economics Reports | 452 | 232 | 0.5611 | 0.5724 | 0.5592 | 0.5659 | 0.5683 |
| ESG Reports | 1538 | 228 | 0.4816 | 0.4971 | 0.5256 | 0.5627 | 0.5647 |
| ESG Reports (Human) | 3076 | 104 | 0.4379 | 0.4384 | 0.4407 | 0.4457 | 0.4471 |
| Average | | | 0.5121 | 0.5273 | 0.5334 | 0.5457 | 0.5498 |

Combined Average (v1 + v2 macro)

| | 128-dim | 256-dim | 512-dim | 1024-dim | 2048-dim |
|---|---|---|---|---|---|
| ViDoRe v1 avg | 0.8688 | 0.8736 | 0.8750 | 0.8799 | 0.8815 |
| ViDoRe v2 avg | 0.5121 | 0.5273 | 0.5334 | 0.5457 | 0.5498 |
| Overall avg | 0.6905 | 0.7005 | 0.7042 | 0.7128 | 0.7157 |

Comparison with 0.8B variant

| Model | Params | v1 avg (1024-dim) | v2 avg (1024-dim) |
|---|---|---|---|
| ColQwen3.5-0.8B | 874M | 0.8625 | 0.4806 |
| ColQwen3.5-2B | 2B | 0.8799 | 0.5457 |
| Δ | | +0.0174 | +0.0651 |

The 2B variant shows its largest gain over the 0.8B on v2 (the harder, multi-relevant setting), consistent with larger models being more robust in harder retrieval settings.

Limitations

  • Training data: Fine-tuned on vidore/colpali_train_set for 1 epoch on a single A100-80GB. The training set covers scientific papers, reports, and slides; real-world documents with complex layouts, handwriting, or non-English text may be out-of-distribution. No hard negatives were used.
  • Language: Predominantly English training data. Performance on non-English documents is expected to degrade.
  • LoRA adapter: Must be loaded on top of the base Qwen/Qwen3.5-2B-Base weights.
  • Matryoshka tradeoff: Truncating to 128-dim costs about 1.3 NDCG@5 points vs 2048-dim on v1 (0.8688 vs 0.8815) and about 3.8 points on v2 (0.5121 vs 0.5498).

Usage

Requirements

pillow
transformers==5.3.0
peft==0.18.1
qwen-vl-utils>=0.0.14
torch==2.8.0

Example

from embedder.colqwen3_5_embedder import ColQwen3_5Embedder

# Load the base model and apply this LoRA adapter.
# embed_dim selects the Matryoshka output dimension (128/256/512/1024/2048).
embedder = ColQwen3_5Embedder(
    model_name_or_path="Qwen/Qwen3.5-2B-Base",
    lora_checkpoint="leo-vnuuet/ColQwen3.5-2B-Embedding",
    embed_dim=128
)

queries = [
    {"text": "What is the quarterly revenue breakdown?"},
]

documents = [
    {"image": "/path/to/document_page.png"},
]

# pooling=False returns per-token multi-vector embeddings plus the
# attention masks needed for late-interaction MaxSim scoring.
qry_emb, qry_mask = embedder.process(queries, normalize=True, pooling=False)
doc_emb, doc_mask = embedder.process(documents, normalize=True, pooling=False)

# scores shape: (num_queries, num_docs)
scores = embedder.score_maxsim(qry_emb, doc_emb, qry_mask, doc_mask)

print("Relevance scores:")
for q_idx, query in enumerate(queries):
    for d_idx, doc in enumerate(documents):
        print(f"  Q{q_idx+1} vs D{d_idx+1}: {scores[q_idx, d_idx].item():.4f}")

Training Details

| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3.5-2B-Base |
| Training data | vidore/colpali_train_set (~118K pairs) |
| Epochs | 1 |
| Batch size | 8 × 4 grad accum = effective 32 |
| Learning rate | 5e-5 (cosine, 2.5% warmup) |
| Optimizer | paged_adamw_8bit |
| LoRA rank | r=32, α=32 |
| LoRA targets | All linear layers (attention + MLP + DeltaNet) |
| Loss | Matryoshka MaxSim (dims 128, 256, 512, 1024, 2048, equal weights) |
| Precision | bfloat16 |
| Hardware | 1× NVIDIA A100-SXM4-80GB |
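
As a rough illustration of the loss above, the sketch below shows an in-batch contrastive Matryoshka MaxSim objective averaged over the truncation dims with equal weights; function names, shapes, and the omitted temperature/negative-mining details are assumptions, not the actual training code:

```python
import torch
import torch.nn.functional as F

MATRYOSHKA_DIMS = (128, 256, 512, 1024, 2048)

def maxsim_matrix(q, d, q_mask, d_mask):
    """All-pairs MaxSim scores for a batch.

    q: (B, Tq, dim) query token embeddings, d: (B, Td, dim) page patch embeddings,
    masks: (B, T) with 1 for real tokens/patches and 0 for padding.
    """
    sim = torch.einsum("qnd,pmd->qpnm", q, d)             # (B, B, Tq, Td)
    sim = sim.masked_fill(d_mask[None, :, None, :] == 0, -1e4)
    sim = sim.max(dim=-1).values                          # best patch per query token
    sim = sim * q_mask[:, None, :]                        # ignore padded query tokens
    return sim.sum(dim=-1)                                # (B, B) query-vs-page scores

def matryoshka_maxsim_loss(q, d, q_mask, d_mask):
    """In-batch contrastive loss averaged over the Matryoshka dims (equal weights)."""
    labels = torch.arange(q.shape[0], device=q.device)    # positives on the diagonal
    losses = []
    for dim in MATRYOSHKA_DIMS:
        q_t = F.normalize(q[..., :dim], dim=-1)           # truncate, then re-normalize
        d_t = F.normalize(d[..., :dim], dim=-1)
        scores = maxsim_matrix(q_t, d_t, q_mask, d_mask)
        losses.append(F.cross_entropy(scores, labels))
    return torch.stack(losses).mean()
```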

Further Contributions

Contributions, experiments, and extensions are welcome.

Disclaimer: While my core background is in Software and Systems Engineering, I am currently exploring the depths of training and fine-tuning Vision/Language Models. There is still much to master, and I highly welcome any constructive feedback or insights from the community! Thanks in advance 🤗🫶

License

Apache 2.0 (inherits from base model)
