- Argus-Colqwen3.5-4b-v0 · bf16 release
- TL;DR — leaderboard standing
- What is novel here
- Model details
- Why two dtypes?
- Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)
- Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)
- ViDoRe v3
- Storage cost
- Installation
- Usage
- Reproduce ViDoRe results with MTEB
- Training
- Limitations
- License
- Citation
- Contact
Argus-Colqwen3.5-4b-v0 · bf16 release
Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval
University of Innsbruck — Data Science group · 2026
DataScience-UIBK/Argus-Colqwen3.5-4b-v0-bf16 is the bfloat16 merged release of Argus-Colqwen3.5-4b. It is the exact same trained network as the fp32 sibling DataScience-UIBK/Argus-Colqwen3.5-4b-v0; only the LoRA-merge dtype differs.
The bf16 release is half the disk size (8.8 GB vs 18 GB), faster to download, and easier to deploy on memory-constrained GPUs — at the cost of ~0.1 pp on ViDoRe V1 and ~0.2 pp on V2 (merge-time precision loss).
| | Disk | V1 avg nDCG@5 | V2 avg nDCG@5 |
|---|---|---|---|
| fp32 sibling (`-v0`) | 18 GB | 0.9224 | 0.6404 |
| this release (`-v0-bf16`) | 8.8 GB | 0.9214 | 0.6380 |
| Δ vs fp32 | −9.2 GB | −0.0010 | −0.0024 |
Use this bf16 release if you want the smallest deployable artefact and can tolerate sub-percent score loss. Use the fp32 sibling if you need leaderboard-grade reproducibility or you're going to merge / quantise further downstream.
TL;DR — leaderboard standing
- #1 on the ViDoRe v1 leaderboard among 4B-class models (V1 = 0.9214) — beats Nemotron-4B-v2 (0.916), athrael-soju-colqwen3.5-4.5B (0.915), Ops-Colqwen3-4B (0.914).
- #2 overall on the ViDoRe v1 leaderboard, behind only the 8B Nemotron-vl-8b-v2 (0.927).
- Competitive on ViDoRe v2 (0.6380 nDCG@5) within the 4B class.
- 4 B parameters, 1024-d per-token embedding, ≤ 2048 visual tokens / page — fits on a single 12 GB GPU at bf16.
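That last figure is simple arithmetic: bf16 stores 2 bytes per weight, so the 4.71 B-parameter checkpoint needs about 9.4 GB for the weights alone.

```python
# bf16 weight memory for the 4.71 B-parameter checkpoint (activations extra)
print(f"{4.71e9 * 2 / 1e9:.1f} GB")  # ~9.4 GB, with headroom left on a 12 GB card
```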
What is novel here
Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. Argus replaces that dense head with a sparse mixture in which the gates depend on both the visual token and a pooled query summary, so the same page gets routed differently for different queries:
- Region pooling — visual tokens are grouped into 4-token regions before routing.
- Query-conditioned latent gating (`GateScalars`) — the router input is `region + region_coord_proj(coords) + query_context_proj(pooled_query)`. The query summary makes routing task-aware: a financial-numbers query routes through a different expert than a layout query on the same page.
- Sparse top-k = 2 of 4 latent specialists, fused with an always-on shared dense expert via two learnable gating scalars.
- Region-aware load balancing — load-balance, KL-to-uniform, and 0.01·router-z² auxiliary losses suppress routing collapse.
- 3-stage curriculum — dense baseline (teacher) → MoE balance warmup → joint retrieval with KL distillation.
The router sits at backbone layer −5.
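A minimal PyTorch sketch of that routing path, under stated assumptions: `region_coord_proj`, `query_context_proj`, the top-k = 2 selection, and the two gating scalars follow the card's description, but all shapes, module signatures, and helper names here are illustrative, not the shipped implementation.

```python
import torch

def route_regions(vis_tokens, coords, pooled_query,
                  region_coord_proj, query_context_proj, router,
                  experts, shared_expert, gate_scalars,
                  region_size=4, top_k=2):
    """Illustrative region-pooled, query-conditioned routing (not the shipped code).

    vis_tokens:   (B, N, D) visual token states at the router layer
    coords:       (B, N // region_size, C) coordinates, one per region
    pooled_query: (B, D) pooled query summary
    """
    B, N, D = vis_tokens.shape
    # Region pooling: group visual tokens into regions of 4 before routing.
    regions = vis_tokens.view(B, N // region_size, region_size, D).mean(dim=2)

    # Router input = region + coordinate projection + query context (per the card).
    router_in = (regions
                 + region_coord_proj(coords)
                 + query_context_proj(pooled_query).unsqueeze(1))
    gate = router(router_in).softmax(dim=-1)          # (B, R, 4) over 4 specialists
    weights, idx = gate.topk(top_k, dim=-1)           # sparse top-k = 2
    weights = weights / weights.sum(-1, keepdim=True)

    sparse_out = torch.zeros_like(regions)
    for e, expert in enumerate(experts):
        w = (weights * (idx == e)).sum(-1, keepdim=True)  # zero where e not selected
        sparse_out = sparse_out + w * expert(regions)

    # Always-on shared dense expert, fused via two learnable gating scalars.
    out = gate_scalars[0] * sparse_out + gate_scalars[1] * shared_expert(regions)
    # Broadcast each region's output back to its member tokens.
    return out.repeat_interleave(region_size, dim=1)  # (B, N, D)
```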
Model details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-VL-4B-Instruct |
| Total parameters | 4.71 B |
| Per-token embedding dim | 1024 |
| Max visual tokens / page | 2048 |
| Max text tokens | 32 768 |
| Similarity function | MaxSim |
| MoE specialists | 4 latent + 1 shared dense |
| Top-k experts per token | 2 |
| Region size | 4 |
| Router placement | backbone layer −5 |
| Weight precision (this release) | bfloat16 |
| Adapted from | DataScience-UIBK/Argus-Colqwen3.5-4b-v0 (fp32 merge) |
| License | Apache 2.0 |
| Model size on disk | ~8.8 GB |
| VRAM @ bf16 inference | ~9.4 GB |
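The MaxSim row above is the standard ColPali late-interaction score. A compact sketch of that scoring function (the standard formulation, not code taken from this repo):

```python
import torch

def maxsim_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """ColPali-style MaxSim: for each query token, take the best-matching
    visual token, then sum over query tokens.

    q_emb: (Q, 1024) L2-normalised query token embeddings
    d_emb: (V, 1024) L2-normalised visual token embeddings (V <= 2048)
    """
    sim = q_emb @ d_emb.T              # (Q, V) token-level cosine similarities
    return sim.max(dim=1).values.sum() # best match per query token, summed
```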
Why two dtypes?
Merging a LoRA into the base requires materialising (α/r)·A·B and adding it to the base weight matrix.
- In bf16, both the delta cast and the addition lose precision — the gap is small but irreversible.
- In fp32, both are exact.
We publish both:
- fp32 preserves trained precision exactly. Larger disk, identical inference behaviour to the fp32 master used during evaluation.
- bf16 is the dtype most users will load anyway (`torch_dtype=torch.bfloat16`). Smaller disk; the merge precision loss is baked in.
If your downstream pipeline loads in bf16, the inference-time difference between the two repos is tiny — but the bf16 release saves bandwidth and disk. Use the fp32 sibling if you load in fp32 / fp64 (rare) or you're going to quantise further.
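A small self-contained experiment that makes the merge-dtype effect concrete. Shapes and weights here are hypothetical random tensors; only the (α/r)·A·B merge formula comes from the text above.

```python
import torch

torch.manual_seed(0)
d, r, alpha = 2048, 16, 32
base = torch.randn(d, d)                     # fp32 base weight
A, B = torch.randn(d, r), torch.randn(r, d)  # LoRA factors

delta = (alpha / r) * (A @ B)

merged_fp32 = base + delta                          # exact in fp32
merged_bf16 = base.bfloat16() + delta.bfloat16()    # cast, then add in bf16

err = (merged_bf16.float() - merged_fp32).abs().max()
print(f"max |bf16 merge - fp32 merge| = {err:.2e}")  # small, nonzero, irreversible
```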
Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)
Real per-task scores, measured with mteb 2.12 on the published bf16 weights.
| Task | nDCG@5 |
|---|---|
| ArxivQA | 0.9126 |
| DocVQA | 0.6779 🏆 |
| InfoVQA | 0.9447 |
| ShiftProject | 0.9346 |
| SyntheticDocQA-AI | 0.9926 |
| SyntheticDocQA-Energy | 0.9750 |
| SyntheticDocQA-Government | 0.9779 |
| SyntheticDocQA-Healthcare | 0.9963 🏆 |
| TabFQuAD | 0.9544 |
| TatDQA | 0.8485 |
| Average | 0.9214 |
🏆 = best in the 4B class for that task (cross-checked against published numbers from Ops-Colqwen3-4B, TomoroAI-colqwen3-embed-4b, SauerkrautLM-ColQwen3-4b, athrael-soju-colqwen3.5-4.5B).
ViDoRe v1 — 4B-class leaderboard comparison
| Rank | Model | Params | dim | V1 avg |
|---|---|---|---|---|
| 1 | Argus-Colqwen3.5-4b-v0-bf16 (this) | 4.0 B | 1024 | 0.9214 |
| 2 | nvidia/llama-nemotron-colembed-vl-3b-v2 | 3.0 B | hidden | 0.917 |
| 3 | nvidia/nemotron-colembed-vl-4b-v2 | 4.0 B | hidden | 0.916 |
| 4 | athrael-soju/colqwen3.5-4.5B-v3 | 4.5 B | 320 | 0.915 |
| 5 | OpenSearch-AI/Ops-Colqwen3-4B | 4.0 B | 2560 | 0.914 |
| 6 | nvidia/llama-nemoretriever-colembed-3b-v1 | 3.0 B | 512 | 0.910 |
| 7 | VAGOsolutions/SauerkrautLM-ColQwen3-4b-v0.1 | 4.0 B | 128 | 0.908 |
| 8 | TomoroAI/tomoro-colqwen3-embed-4b | 4.0 B | 320 | 0.906 |
(The only model surpassing Argus-4B on V1 overall is the 8B Nemotron-vl-8b-v2 at 0.927.)
Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)
| Task | nDCG@5 |
|---|---|
| BioMedicalLectures | 0.6349 |
| ESGReports-HighLevel | 0.7079 |
| ESGReports | 0.6175 |
| EconomicsReports | 0.5918 |
| Average | 0.6380 |
ViDoRe v2 — 4B-class context
| Model | V2 avg |
|---|---|
| Ops-Colqwen3-4B (dim 2560) | 0.687 |
| TomoroAI/tomoro-colqwen3-embed-4b | 0.660 |
| Argus-Colqwen3.5-4b-v0-bf16 | 0.638 |
V2 is the area we are still actively improving — Ops's wider 2560-d head pulls ahead on layout-heavy ESG / economics pages. Argus's compression to 1024-d is a 2.5× per-token storage saving over Ops's 2560-d embeddings, at the cost of a small V2 gap; the V1 lead more than compensates for retrieval workloads dominated by document QA.
ViDoRe v3
Not yet evaluated for this release.
Storage cost
| Model | Tokens/page | Dim | Bytes/page (bf16) |
|---|---|---|---|
| Ops-Colqwen3-4B | 1280 | 2560 | 6.6 MB |
| Argus-Colqwen3.5-4b-v0-bf16 | 2048 | 1024 | 4.2 MB |
| TomoroAI/tomoro-colqwen3-embed-4b | 1280 | 320 | 0.8 MB |
| SauerkrautLM-ColQwen3-4b-v0.1 | 1024 | 128 | 0.3 MB |
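The bytes/page column is plain arithmetic: tokens × dim × 2 bytes for bf16. For this release:

```python
tokens_per_page, dim, bytes_per_val = 2048, 1024, 2  # this release, bf16
print(f"{tokens_per_page * dim * bytes_per_val / 1e6:.1f} MB/page")  # 4.2 MB
```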
Installation
pip install "transformers>=5.0.0,<6.0.0"
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0,<6.0"
pip install flash-attn==2.6.3 --no-build-isolation # optional
rm -rf ~/.cache/huggingface/modules/transformers_modules
Usage
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor
MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-4b-v0-bf16"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(
MODEL_ID,
trust_remote_code=True,
torch_dtype=torch.bfloat16, # this release ships in bf16
attn_implementation="flash_attention_2",  # drop this line if flash-attn is not installed
device_map=DEVICE,
).eval()
processor = AutoProcessor.from_pretrained(
MODEL_ID,
trust_remote_code=True,
max_num_visual_tokens=2048,
)
queries = [
"What is the company's revenue in 2019?",
"How does the proposed model compare to baselines?",
]
documents = [
Image.open("page_a.png").convert("RGB"),
Image.open("page_b.png").convert("RGB"),
]
q_emb = model.encode_queries(processor, queries)
d_emb = model.encode_images(processor, documents)
scores = processor.score(q_emb, d_emb)
print(scores)
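Assuming the usual ColPali-style convention (an assumption, not confirmed by the card), `scores` is a (num_queries, num_docs) matrix of MaxSim scores, so ranking pages per query is a single sort:

```python
# Rank candidate pages for each query by MaxSim score (highest first).
ranking = scores.argsort(dim=-1, descending=True)  # (num_queries, num_docs)
print(ranking[:, 0])  # top-1 page index per query
```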
Reproduce ViDoRe results with MTEB
import mteb
m = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-4b-v0-bf16")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 4})
Training
Same recipe as the fp32 sibling — see its card for full details. The only difference is the merge dtype.
Limitations
- English-dominant (the ViDoRe evaluations above are English-only).
- Merge-time bf16 cast is irreversible — you cannot recover fp32 numbers by upcasting after load.
- ViDoRe v3 not yet evaluated.
- ~0.1 pp / ~0.2 pp behind the fp32 sibling on V1 / V2 — use that one if leaderboard parity matters.
License
Apache 2.0, inherited from Qwen3.5-VL-4B-Instruct.
Citation
@misc{argus2026,
title = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
author = {DataScience-UIBK team},
year = {2026},
url = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-4b-v0},
}
Contact
- Org: DataScience-UIBK, University of Innsbruck
- Issues: open one on this repo's Community tab.