Argus-Colqwen3.5-2b-v0 · bf16 release
Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval
University of Innsbruck — Data Science group · 2026
DataScience-UIBK/Argus-Colqwen3.5-2b-v0-bf16 is the bfloat16 merged release of Argus-Colqwen3.5-2b. It is the exact same trained network as the fp32 sibling DataScience-UIBK/Argus-Colqwen3.5-2b-v0; only the LoRA-merge dtype differs.
The bf16 release is half the disk size (4.6 GB vs 9.3 GB), faster to download, and easier to deploy on memory-constrained GPUs.
| Model | Disk | V1 avg nDCG@5 | V2 avg nDCG@5 |
|---|---|---|---|
| fp32 sibling (-v0) | 9.3 GB | 0.9149 | 0.6152 |
| this release (-v0-bf16) | 4.6 GB | 0.9149 | 0.6152 |
| Δ vs fp32 | −4.7 GB | 0.0000 | 0.0000 |
At the 2B scale the merge-time bf16 cast has no measurable effect on nDCG@5 — every per-task score is identical to the fp32 sibling (see the per-task tables below). The bf16 release is therefore the strict winner for almost every deployment scenario at this size.
Use this bf16 release unless you specifically need fp32 for downstream merging / quantisation chains.
TL;DR — leaderboard standing
- Strong on the ViDoRe v1 leaderboard at the 2B scale (V1 = 0.9149) — competitive with nomic-ai/colnomic-embed-multimodal-3b (V1 = 0.916) at a smaller parameter count (2.3 B vs 3.0 B).
- Best 2B-class result on V2 (V2 = 0.6152), comfortably ahead of vidore/colpali-v1.3 and Metric-AI/colqwen2.5-3b-multilingual.
- 2.3 B parameters, 1024-d per-token embedding, ≤ 2048 visual tokens / page — fits on a single 8 GB GPU at bf16 inference.
- Apache 2.0; trained only on public ViDoRe + VDR-Multilingual subsets.
What is novel here
Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. Argus replaces that dense head with a sparse mixture in which the gates depend on both the visual token and a pooled query summary, so the same page gets routed differently for different queries:
- Region pooling — visual tokens are grouped into 4-token regions before routing.
- Query-conditioned latent gating (`GateScalars`) — router input is `region + region_coord_proj(coords) + query_context_proj(pooled_query)`. The query summary makes routing task-aware: a financial-numbers query routes through a different expert than a layout query on the same page.
- Sparse top-k routing — top-k=2 of 4 latent specialists, fused with an always-on shared dense expert via two learnable gating scalars.
- Region-aware load balancing — load balance + KL-uniform + 0.01·router-z² aux losses suppress routing collapse.
- 3-stage curriculum — dense baseline (teacher) → MoE balance warmup → joint retrieval with KL distillation.
Router sits at backbone layer −5.
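Below is a minimal sketch of the routing path described above: region pooling, coordinate and query projections into the router, top-k=2 over the 4 latent specialists, and fusion with the always-on shared expert via two gating scalars. Module and argument names are illustrative, not the identifiers in the released checkpoint; the auxiliary balancing losses are omitted and the exact fusion/normalisation may differ.

import torch
import torch.nn as nn

class RegionRouterSketch(nn.Module):
    # Illustrative module; hyperparameters follow the card (1024-d head, 4 latent experts, top-k=2, 4-token regions).
    def __init__(self, dim=1024, n_experts=4, top_k=2, region_size=4):
        super().__init__()
        self.region_size, self.top_k = region_size, top_k
        self.region_coord_proj = nn.Linear(2, dim)         # (x, y) region coordinates -> dim
        self.query_context_proj = nn.Linear(dim, dim)      # pooled query summary -> dim
        self.router = nn.Linear(dim, n_experts)            # router logits per region
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.shared_expert = nn.Linear(dim, dim)           # always-on dense expert
        self.gate_scalars = nn.Parameter(torch.zeros(2))   # two learnable fusion scalars

    def forward(self, visual_tokens, coords, pooled_query):
        # visual_tokens: (n_tokens, dim); coords: (n_regions, 2); pooled_query: (dim,)
        n_regions = visual_tokens.shape[0] // self.region_size
        regions = visual_tokens[: n_regions * self.region_size]
        regions = regions.view(n_regions, self.region_size, -1).mean(dim=1)   # region pooling

        # Query-conditioned gating: the router sees the region, its coordinates, and the query summary.
        router_in = regions + self.region_coord_proj(coords) + self.query_context_proj(pooled_query)
        weights, idx = self.router(router_in).softmax(dim=-1).topk(self.top_k, dim=-1)

        sparse_out = torch.zeros_like(regions)
        for e, expert in enumerate(self.experts):                             # sparse top-k mixture
            w = (weights * (idx == e)).sum(dim=-1, keepdim=True)              # 0 where expert e is not selected
            sparse_out = sparse_out + w * expert(regions)

        a, b = torch.sigmoid(self.gate_scalars)   # fusion via the two gating scalars (exact form may differ)
        return a * sparse_out + b * self.shared_expert(regions)

# Shape check: 2048 visual tokens -> 512 four-token regions, one routed embedding each.
router = RegionRouterSketch()
out = router(torch.randn(2048, 1024), torch.rand(512, 2), torch.randn(1024))
print(out.shape)  # torch.Size([512, 1024])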
Model details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-VL-2B-Instruct |
| Total parameters | 2.32 B |
| Per-token embedding dim | 1024 |
| Max visual tokens / page | 2048 |
| Max text tokens | 32 768 |
| Similarity function | MaxSim |
| MoE specialists | 4 latent + 1 shared dense |
| Top-k experts per token | 2 |
| Region size | 4 |
| Router placement | backbone layer −5 |
| Weight precision (this release) | bfloat16 |
| Adapted from | DataScience-UIBK/Argus-Colqwen3.5-2b-v0 (fp32 merge) |
| License | Apache 2.0 |
| Model size on disk | ~4.6 GB |
| VRAM @ bf16 inference | ~5.5 GB |
Why two dtypes (and why bf16 is essentially free at 2B)
Merging a LoRA into the base requires materialising (α/r)·A·B and adding it to the base weight matrix.
- In bf16, both the delta cast and the addition lose precision — the gap is small but irreversible.
- In fp32, both are exact.
For the 4B sibling, this merge cost shows up as ~0.1 pp on V1 and ~0.2 pp on V2. For the 2B model it does not — every per-task ViDoRe score is bit-identical to the fp32 sibling. We attribute this to the small 2B model not having "borderline" routing decisions that the bf16 cast can flip; the late-interaction MaxSim pooling further averages out the residual noise across many tokens.
So at the 2B scale the bf16 release is the strict winner: same scores, half the disk, faster I/O. We still publish fp32 for users who specifically need maximum-precision weights for downstream merging or quantisation.
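For concreteness, a minimal sketch of the merge step under both dtypes. The function name, the alpha/r values, and the random tensors are placeholders, not the project's actual merge script.

import torch

def merge_lora_delta(base_w, lora_A, lora_B, alpha, r):
    # Materialise the LoRA delta (alpha / r) * B @ A and add it to the base weight, in the base weight's dtype.
    # base_w: (out, in); lora_A: (r, in); lora_B: (out, r)
    delta = (alpha / r) * (lora_B.float() @ lora_A.float())
    # fp32 base: both the cast and the addition below are exact.
    # bf16 base: the delta cast and the addition each round to bf16; that rounding is the small, irreversible gap.
    return base_w + delta.to(base_w.dtype)

W = torch.randn(1024, 1024)
A, B = torch.randn(16, 1024), torch.randn(1024, 16)
merged_fp32 = merge_lora_delta(W.float(), A, B, alpha=32, r=16)
merged_bf16 = merge_lora_delta(W.bfloat16(), A, B, alpha=32, r=16)
print((merged_fp32 - merged_bf16.float()).abs().max())  # the rounding gap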
Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)
Per-task scores measured with mteb 2.12 on the published bf16 weights, side-by-side with the fp32 sibling and the 4B variants for transparency.
| Task | 2B fp32 | 2B bf16 (this) | 4B fp32 | 4B bf16 |
|---|---|---|---|---|
| ArxivQA | 0.9027 | 0.9027 | 0.9095 | 0.9126 |
| DocVQA | 0.6747 | 0.6747 | 0.6770 | 0.6779 |
| InfoVQA | 0.9497 | 0.9497 | 0.9463 | 0.9447 |
| ShiftProject | 0.9133 | 0.9133 | 0.9470 | 0.9346 |
| SyntheticDocQA-AI | 0.9963 | 0.9963 | 0.9963 | 0.9926 |
| SyntheticDocQA-Energy | 0.9726 | 0.9726 | 0.9789 | 0.9750 |
| SyntheticDocQA-Government | 0.9729 | 0.9729 | 0.9779 | 0.9779 |
| SyntheticDocQA-Healthcare | 0.9926 | 0.9926 | 0.9963 | 0.9963 |
| TabFQuAD | 0.9336 | 0.9336 | 0.9533 | 0.9544 |
| TatDQA | 0.8403 | 0.8403 | 0.8480 | 0.8485 |
| Average | 0.9149 | 0.9149 | 0.9230 | 0.9214 |
ViDoRe v1 — 2B / 3B-class leaderboard comparison
| Rank | Model | Params | dim | V1 avg |
|---|---|---|---|---|
| 1 | nomic-ai/colnomic-embed-multimodal-3b | 3.0 B | 128 | 0.916 |
| 2 | Argus-Colqwen3.5-2b-v0-bf16 (this) | 2.3 B | 1024 | 0.9149 |
| 3 | Metric-AI/colqwen2.5-3b-multilingual | 3.1 B | 128 | 0.892 |
| 4 | vidore/colpali-v1.3 | 2.9 B | 128 | 0.844 |
Argus effectively matches the nomic 3B result at a smaller scale and a wider per-token dim, and is the strongest sub-3B retriever published to date.
Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)
| Task | 2B fp32 | 2B bf16 (this) | 4B fp32 | 4B bf16 |
|---|---|---|---|---|
| BioMedicalLectures | 0.6499 | 0.6499 | 0.6438 | 0.6349 |
| ESGReports-HighLevel | 0.6936 | 0.6936 | 0.6991 | 0.7079 |
| ESGReports | 0.5988 | 0.5988 | 0.6218 | 0.6175 |
| EconomicsReports | 0.5186 | 0.5186 | 0.5980 | 0.5918 |
| Average | 0.6152 | 0.6152 | 0.6407 | 0.6380 |
ViDoRe v2 — 2B / 3B-class context
| Model | V2 avg |
|---|---|
| Argus-Colqwen3.5-2b-v0-bf16 | 0.6152 |
| nomic-ai/colnomic-embed-multimodal-3b | 0.616 |
| Metric-AI/colqwen2.5-3b-multilingual | 0.580 |
ViDoRe v3
Not yet evaluated for this release.
Storage cost
| Model | Tokens/page | Dim | Bytes/page (bf16 embeddings) |
|---|---|---|---|
| Ops-Colqwen3-4B | 1280 | 2560 | 6.6 MB |
| Argus-Colqwen3.5-4b-v0-bf16 | 2048 | 1024 | 4.2 MB |
| Argus-Colqwen3.5-2b-v0-bf16 | 2048 | 1024 | 4.2 MB |
| TomoroAI/tomoro-colqwen3-embed-4b | 1280 | 320 | 0.8 MB |
Per-page storage is identical between the 2B and 4B Argus releases — both use the same 1024-d head and same 2048-token visual budget. The choice between them is inference cost (2B is ~50 % faster than 4B), not corpus storage.
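The bytes/page column can be verified directly: tokens per page × embedding dim × 2 bytes per bf16 value.

def bytes_per_page(tokens, dim, bytes_per_value=2):
    # tokens/page x embedding dim x 2 bytes per bf16 value
    return tokens * dim * bytes_per_value

print(bytes_per_page(2048, 1024) / 1e6)  # ~4.2 MB, both Argus releases
print(bytes_per_page(1280, 2560) / 1e6)  # ~6.6 MB, Ops-Colqwen3-4B
print(bytes_per_page(1280, 320) / 1e6)   # ~0.8 MB, tomoro-colqwen3-embed-4b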
Installation
pip install "transformers>=5.0.0,<6.0.0"
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0,<6.0"
pip install flash-attn==2.6.3 --no-build-isolation # optional
rm -rf ~/.cache/huggingface/modules/transformers_modules
Usage
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor
MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-2b-v0-bf16"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(
MODEL_ID,
trust_remote_code=True,
torch_dtype=torch.bfloat16, # this release ships in bf16
attn_implementation="flash_attention_2",
device_map=DEVICE,
).eval()
processor = AutoProcessor.from_pretrained(
MODEL_ID,
trust_remote_code=True,
max_num_visual_tokens=2048,
)
queries = [
"What is the company's revenue in 2019?",
"How does the proposed model compare to baselines?",
]
documents = [
Image.open("page_a.png").convert("RGB"),
Image.open("page_b.png").convert("RGB"),
]
q_emb = model.encode_queries(processor, queries)    # per-token query embeddings
d_emb = model.encode_images(processor, documents)   # per-token page embeddings
scores = processor.score(q_emb, d_emb)              # MaxSim late-interaction score for every (query, page) pair
print(scores)
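For reference, a minimal sketch of the MaxSim similarity listed under Model details: every query token takes its best match over all page tokens, and the per-token maxima are summed. The actual processor.score call may differ in batching, padding, and normalisation details.

def maxsim(q_emb, d_emb):
    # q_emb: (n_query_tokens, dim); d_emb: (n_doc_tokens, dim)
    sim = q_emb @ d_emb.T               # token-to-token similarities
    return sim.max(dim=1).values.sum()  # best page token per query token, summed over query tokens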
Reproduce ViDoRe results with MTEB
import mteb
m = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-2b-v0-bf16")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 4})
Training
Same recipe as the fp32 sibling — see its card for full details. The only difference is the merge dtype.
When to use 2B-bf16 vs the other variants
| Use case | Recommendation |
|---|---|
| Smallest deployable artefact at the 2B scale | 2B bf16 (this) ← strict winner: same scores as fp32, half the disk |
| Maximum precision at 2B for downstream merging / quantisation | 2B fp32 |
| Maximum recall on document QA / leaderboard parity | 4B fp32 |
| Latency-sensitive retrieval at the 4B scale | 4B bf16 |
Limitations
- English-dominant.
- Merge-time bf16 cast is irreversible — you cannot recover fp32 numbers by upcasting after load. (At 2B this does not matter — the bf16 cast is numerically equivalent to fp32 on ViDoRe; it does matter at 4B.)
- ViDoRe v3 not yet evaluated.
License
Apache 2.0, inherited from Qwen3.5-VL-2B-Instruct.
Citation
@misc{argus2026,
title = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
author = {DataScience-UIBK team},
year = {2026},
url = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-2b-v0},
}
Contact
- Org: DataScience-UIBK, University of Innsbruck
- Issues: open one on this repo's Community tab.