Argus-Colqwen3.5-2b-v0 · bf16 release
Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval
University of Innsbruck — Data Science group · 2026
DataScience-UIBK/Argus-Colqwen3.5-2b-v0-bf16 is the bfloat16 merged release of Argus-Colqwen3.5-2b. It is the exact same trained network as the fp32 sibling DataScience-UIBK/Argus-Colqwen3.5-2b-v0; only the LoRA-merge dtype differs.
The bf16 release is half the disk size (4.6 GB vs 9.3 GB), faster to download, and easier to deploy on memory-constrained GPUs.
| Model | Disk | V1 avg nDCG@5 | V2 avg nDCG@5 |
|---|---|---|---|
| fp32 sibling (-v0) | 9.3 GB | 0.9149 | 0.6152 |
| this release (-v0-bf16) | 4.6 GB | 0.9149 | 0.6152 |
| Δ vs fp32 | −4.7 GB | 0.0000 | 0.0000 |
At the 2B scale the merge-time bf16 cast has no measurable effect on nDCG@5 — every per-task score is identical to the fp32 sibling (see the per-task tables below). The bf16 release is therefore the strict winner for almost every deployment scenario at this size.
Use this bf16 release unless you specifically need fp32 for downstream merging / quantisation chains.
TL;DR — leaderboard standing
- Strong on the ViDoRe v1 leaderboard at the 2B scale (V1 = 0.9149) — competitive with nomic-ai/colnomic-embed-multimodal-3b (V1 = 0.916) at a smaller parameter count (2.3 B vs 3.0 B).
- Best 2B-class result on V2 (V2 = 0.6152), comfortably ahead of vidore/colpali-v1.3 and Metric-AI/colqwen2.5-3b-multilingual.
- 2.3 B parameters, 1024-d per-token embedding, ≤ 2048 visual tokens / page — fits on a single 8 GB GPU at bf16 inference.
- Apache 2.0; trained only on public ViDoRe + VDR-Multilingual subsets.
What is novel here
Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. Argus replaces that dense head with a sparse mixture in which the gates depend on both the visual token and a pooled query summary, so the same page gets routed differently for different queries:
- Region pooling — visual tokens are grouped into 4-token regions before routing.
- Query-conditioned latent gating (`GateScalars`) — router input is `region + region_coord_proj(coords) + query_context_proj(pooled_query)`. The query summary makes routing task-aware: a financial-numbers query routes through a different expert than a layout query on the same page.
- Sparse top-k routing — top-k=2 of 4 latent specialists, fused with an always-on shared dense expert via two learnable gating scalars.
- Region-aware load balancing — load balance + KL-uniform + 0.01·router-z² aux losses suppress routing collapse.
- 3-stage curriculum — dense baseline (teacher) → MoE balance warmup → joint retrieval with KL distillation.
Router sits at backbone layer −5.
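Below is a minimal sketch of the routing path described above: region pooling, coordinate and query projections into the router, top-k=2 over the 4 latent specialists, and fusion with the always-on shared expert via two gating scalars. Module and argument names are illustrative, not the identifiers in the released checkpoint; the auxiliary balancing losses are omitted and the exact fusion/normalisation may differ.

import torch
import torch.nn as nn

class RegionRouterSketch(nn.Module):
    # Illustrative module; hyperparameters follow the card (1024-d head, 4 latent experts, top-k=2, 4-token regions).
    def __init__(self, dim=1024, n_experts=4, top_k=2, region_size=4):
        super().__init__()
        self.region_size, self.top_k = region_size, top_k
        self.region_coord_proj = nn.Linear(2, dim)         # (x, y) region coordinates -> dim
        self.query_context_proj = nn.Linear(dim, dim)      # pooled query summary -> dim
        self.router = nn.Linear(dim, n_experts)            # router logits per region
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.shared_expert = nn.Linear(dim, dim)           # always-on dense expert
        self.gate_scalars = nn.Parameter(torch.zeros(2))   # two learnable fusion scalars

    def forward(self, visual_tokens, coords, pooled_query):
        # visual_tokens: (n_tokens, dim); coords: (n_regions, 2); pooled_query: (dim,)
        n_regions = visual_tokens.shape[0] // self.region_size
        regions = visual_tokens[: n_regions * self.region_size]
        regions = regions.view(n_regions, self.region_size, -1).mean(dim=1)   # region pooling

        # Query-conditioned gating: the router sees the region, its coordinates, and the query summary.
        router_in = regions + self.region_coord_proj(coords) + self.query_context_proj(pooled_query)
        weights, idx = self.router(router_in).softmax(dim=-1).topk(self.top_k, dim=-1)

        sparse_out = torch.zeros_like(regions)
        for e, expert in enumerate(self.experts):                             # sparse top-k mixture
            w = (weights * (idx == e)).sum(dim=-1, keepdim=True)              # 0 where expert e is not selected
            sparse_out = sparse_out + w * expert(regions)

        a, b = torch.sigmoid(self.gate_scalars)   # fusion via the two gating scalars (exact form may differ)
        return a * sparse_out + b * self.shared_expert(regions)

# Shape check: 2048 visual tokens -> 512 four-token regions, one routed embedding each.
router = RegionRouterSketch()
out = router(torch.randn(2048, 1024), torch.rand(512, 2), torch.randn(1024))
print(out.shape)  # torch.Size([512, 1024])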
Model details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-VL-2B-Instruct |
| Total parameters | 2.32 B |
| Per-token embedding dim | 1024 |
| Max visual tokens / page | 2048 |
| Max text tokens | 32 768 |
| Similarity function | MaxSim |
| MoE specialists | 4 latent + 1 shared dense |
| Top-k experts per token | 2 |
| Region size | 4 |
| Router placement | backbone layer −5 |
| Weight precision (this release) | bfloat16 |
| Adapted from | DataScience-UIBK/Argus-Colqwen3.5-2b-v0 (fp32 merge) |
| License | Apache 2.0 |
| Model size on disk | ~4.6 GB |
| VRAM @ bf16 inference | ~5.5 GB |
Why two dtypes (and why bf16 is essentially free at 2B)
Merging a LoRA into the base requires materialising (α/r)·A·B and adding it to the base weight matrix.
- In bf16, both the delta cast and the addition lose precision — the gap is small but irreversible.
- In fp32, both are exact.
For the 4B sibling, this merge cost shows up as ~0.1 pp on V1 and ~0.2 pp on V2. For the 2B model it does not — every per-task ViDoRe score is bit-identical to the fp32 sibling. We attribute this to the small 2B model not having "borderline" routing decisions that the bf16 cast can flip; the late-interaction MaxSim pooling further averages out the residual noise across many tokens.
So at the 2B scale the bf16 release is the strict winner: same scores, half the disk, faster I/O. We still publish fp32 for users who specifically need maximum-precision weights for downstream merging or quantisation.
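For concreteness, a minimal sketch of the merge step under both dtypes. The function name, the alpha/r values, and the random tensors are placeholders, not the project's actual merge script.

import torch

def merge_lora_delta(base_w, lora_A, lora_B, alpha, r):
    # Materialise the LoRA delta (alpha / r) * B @ A and add it to the base weight, in the base weight's dtype.
    # base_w: (out, in); lora_A: (r, in); lora_B: (out, r)
    delta = (alpha / r) * (lora_B.float() @ lora_A.float())
    # fp32 base: both the cast and the addition below are exact.
    # bf16 base: the delta cast and the addition each round to bf16; that rounding is the small, irreversible gap.
    return base_w + delta.to(base_w.dtype)

W = torch.randn(1024, 1024)
A, B = torch.randn(16, 1024), torch.randn(1024, 16)
merged_fp32 = merge_lora_delta(W.float(), A, B, alpha=32, r=16)
merged_bf16 = merge_lora_delta(W.bfloat16(), A, B, alpha=32, r=16)
print((merged_fp32 - merged_bf16.float()).abs().max())  # the rounding gap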
Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)
Per-task scores measured with mteb 2.12 on the published bf16 weights, side-by-side with the fp32 sibling and the 4B variants for transparency.
| Task | 2B fp32 | 2B bf16 (this) | 4B fp32 | 4B bf16 |
|---|---|---|---|---|
| ArxivQA | 0.9027 | 0.9027 | 0.9095 | 0.9126 |
| DocVQA | 0.6747 | 0.6747 | 0.6770 | 0.6779 |
| InfoVQA | 0.9497 | 0.9497 | 0.9463 | 0.9447 |
| ShiftProject | 0.9133 | 0.9133 | 0.9470 | 0.9346 |
| SyntheticDocQA-AI | 0.9963 | 0.9963 | 0.9963 | 0.9926 |
| SyntheticDocQA-Energy | 0.9726 | 0.9726 | 0.9789 | 0.9750 |
| SyntheticDocQA-Government | 0.9729 | 0.9729 | 0.9779 | 0.9779 |
| SyntheticDocQA-Healthcare | 0.9926 | 0.9926 | 0.9963 | 0.9963 |
| TabFQuAD | 0.9336 | 0.9336 | 0.9533 | 0.9544 |
| TatDQA | 0.8403 | 0.8403 | 0.8480 | 0.8485 |
| Average | 0.9149 | 0.9149 | 0.9230 | 0.9214 |
ViDoRe v1 — 2B / 3B-class leaderboard comparison
| Rank | Model | Params | dim | V1 avg |
|---|---|---|---|---|
| 1 | nomic-ai/colnomic-embed-multimodal-3b | 3.0 B | 128 | 0.916 |
| 2 | Argus-Colqwen3.5-2b-v0-bf16 (this) | 2.3 B | 1024 | 0.9149 |
| 3 | Metric-AI/colqwen2.5-3b-multilingual | 3.1 B | 128 | 0.892 |
| 4 | vidore/colpali-v1.3 | 2.9 B | 128 | 0.844 |
Argus effectively matches the nomic 3B result at a smaller scale and a wider per-token dim, and is the strongest sub-3B retriever published to date.
Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)
| Task | 2B fp32 | 2B bf16 (this) | 4B fp32 | 4B bf16 |
|---|---|---|---|---|
| BioMedicalLectures | 0.6499 | 0.6499 | 0.6438 | 0.6349 |
| ESGReports-HighLevel | 0.6936 | 0.6936 | 0.6991 | 0.7079 |
| ESGReports | 0.5988 | 0.5988 | 0.6218 | 0.6175 |
| EconomicsReports | 0.5186 | 0.5186 | 0.5980 | 0.5918 |
| Average | 0.6152 | 0.6152 | 0.6407 | 0.6380 |
ViDoRe v2 — 2B / 3B-class context
| Model | V2 avg |
|---|---|
| Argus-Colqwen3.5-2b-v0-bf16 | 0.6152 |
| nomic-ai/colnomic-embed-multimodal-3b | 0.616 |
| Metric-AI/colqwen2.5-3b-multilingual | 0.580 |
ViDoRe v3
Not yet evaluated for this release.
Storage cost
| Model | Tokens/page | Dim | Bytes/page (bf16 embeddings) |
|---|---|---|---|
| Ops-Colqwen3-4B | 1280 | 2560 | 6.6 MB |
| Argus-Colqwen3.5-4b-v0-bf16 | 2048 | 1024 | 4.2 MB |
| Argus-Colqwen3.5-2b-v0-bf16 | 2048 | 1024 | 4.2 MB |
| TomoroAI/tomoro-colqwen3-embed-4b | 1280 | 320 | 0.8 MB |
Per-page storage is identical between the 2B and 4B Argus releases — both use the same 1024-d head and same 2048-token visual budget. The choice between them is inference cost (2B is ~50 % faster than 4B), not corpus storage.
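The bytes/page column can be verified directly: tokens per page × embedding dim × 2 bytes per bf16 value.

def bytes_per_page(tokens, dim, bytes_per_value=2):
    # tokens/page x embedding dim x 2 bytes per bf16 value
    return tokens * dim * bytes_per_value

print(bytes_per_page(2048, 1024) / 1e6)  # ~4.2 MB, both Argus releases
print(bytes_per_page(1280, 2560) / 1e6)  # ~6.6 MB, Ops-Colqwen3-4B
print(bytes_per_page(1280, 320) / 1e6)   # ~0.8 MB, tomoro-colqwen3-embed-4b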
Installation
pip install "transformers>=5.0.0,<6.0.0"
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0,<6.0"
pip install flash-attn==2.6.3 --no-build-isolation # optional
rm -rf ~/.cache/huggingface/modules/transformers_modules
Usage
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor
MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-2b-v0-bf16"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(
MODEL_ID,
trust_remote_code=True,
torch_dtype=torch.bfloat16, # this release ships in bf16
attn_implementation="flash_attention_2",
device_map=DEVICE,
).eval()
processor = AutoProcessor.from_pretrained(
MODEL_ID,
trust_remote_code=True,
max_num_visual_tokens=2048,
)
queries = [
"What is the company's revenue in 2019?",
"How does the proposed model compare to baselines?",
]
documents = [
Image.open("page_a.png").convert("RGB"),
Image.open("page_b.png").convert("RGB"),
]
q_emb = model.encode_queries(processor, queries)    # per-token query embeddings
d_emb = model.encode_images(processor, documents)   # per-token page embeddings
scores = processor.score(q_emb, d_emb)              # MaxSim late-interaction score for every (query, page) pair
print(scores)
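For reference, a minimal sketch of the MaxSim similarity listed under Model details: every query token takes its best match over all page tokens, and the per-token maxima are summed. The actual processor.score call may differ in batching, padding, and normalisation details.

def maxsim(q_emb, d_emb):
    # q_emb: (n_query_tokens, dim); d_emb: (n_doc_tokens, dim)
    sim = q_emb @ d_emb.T               # token-to-token similarities
    return sim.max(dim=1).values.sum()  # best page token per query token, summed over query tokens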
Reproduce ViDoRe results with MTEB
import mteb
m = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-2b-v0-bf16")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 4})
Training
Same recipe as the fp32 sibling — see its card for full details. The only difference is the merge dtype.
When to use 2B-bf16 vs the other variants
| Use case | Recommendation |
|---|---|
| Smallest deployable artefact at the 2B scale | 2B bf16 (this) ← strict winner: same scores as fp32, half the disk |
| Maximum precision at 2B for downstream merging / quantisation | 2B fp32 |
| Maximum recall on document QA / leaderboard parity | 4B fp32 |
| Latency-sensitive retrieval at the 4B scale | 4B bf16 |
Limitations
- English-dominant.
- Merge-time bf16 cast is irreversible — you cannot recover fp32 numbers by upcasting after load. (At 2B this does not matter — the bf16 cast is numerically equivalent to fp32 on ViDoRe; it does matter at 4B.)
- ViDoRe v3 not yet evaluated.
License
Apache 2.0, inherited from Qwen3.5-VL-2B-Instruct.
Citation
@misc{argus2026,
title = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
author = {DataScience-UIBK team},
year = {2026},
url = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-2b-v0},
}
Contact
- Org: DataScience-UIBK, University of Innsbruck
- Issues: open one on this repo's Community tab.