Argus-Colqwen3.5-9b-v0 · bf16 release

Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval
University of Innsbruck, Data Science group · 2026

DataScience-UIBK/Argus-Colqwen3.5-9b-v0-bf16 is the bfloat16 merged release of Argus-Colqwen3.5-9b. It is the exact same trained network as the fp32 sibling DataScience-UIBK/Argus-Colqwen3.5-9b-v0; only the LoRA-merge dtype differs.

The bf16 release is half the disk size (17 GB vs 33 GB), faster to download, and easier to deploy on memory-constrained GPUs, at the cost of ~0.001 nDCG on V1 and ~0.002 on V2 (well within ViDoRe eval noise).

| Release | Disk | V1 avg nDCG@5 | V2 avg nDCG@5 |
|---|---|---|---|
| fp32 sibling (-v0) | 33 GB | 0.9267 | 0.6915 |
| this release (-v0-bf16) | 17 GB | 0.9259 | 0.6897 |
| Δ vs fp32 | −16 GB | −0.0008 | −0.0018 |

Use this bf16 release if you want a smaller deployable artefact and can tolerate sub-percent score differences (almost every deployment scenario). Use the fp32 sibling if you need leaderboard-grade reproducibility or you're going to merge / quantise further downstream.

TL;DR — leaderboard standing

  • V1 = 0.9259, within 0.002 of the leaderboard-topping 8B nemotron-vl-8b-v2 (0.927) and of its own fp32 sibling (0.9267), and ahead of every other public retriever.
  • V2 = 0.6897, the best Argus V2 score after the fp32 sibling (0.6915), roughly +0.05 over the 4B sibling.
  • 8.8 B parameters, 1024-d per-token embeddings, ≤ 2048 visual tokens per page; fits on a single 24 GB GPU at bf16 inference.

What is novel here

Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. Argus replaces that dense head with a sparse mixture in which the gates depend on both the visual token and a pooled query summary, so the same page gets routed differently for different queries:

  1. Region pooling — visual tokens are grouped into 4-token regions before routing.
  2. Query-conditioned latent gating (GateScalars) — router input is region + region_coord_proj(coords) + query_context_proj(pooled_query). The query summary makes routing task-aware: a financial-numbers query routes through a different expert than a layout query, on the same page.
  3. Sparse top-k=2 of 4 latent specialists, fused with an always-on shared dense expert via two learnable gating scalars.
  4. Region-aware load balancing — load balance + KL-uniform + 0.01·router-z² aux losses suppress routing collapse.
  5. 3-stage curriculum — dense baseline (teacher) → MoE balance warmup → joint retrieval with KL distillation. The 9B joint stage was extended on VDR1.5M + Docmatix (vdr_docmatix_full).

The router sits at backbone layer −5, i.e. the fifth layer from the end.
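
To make the gating in steps 2–4 concrete, here is a minimal sketch of a query-conditioned region router. All module names, shapes, and loss coefficients (except the 0.01 router-z weight stated above) are illustrative assumptions based on the description, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionQueryRouter(nn.Module):
    """Sketch of query-conditioned region routing (names/shapes assumed)."""

    def __init__(self, dim: int = 1024, n_experts: int = 4,
                 top_k: int = 2, region: int = 4):
        super().__init__()
        self.region, self.top_k, self.n_experts = region, top_k, n_experts
        self.region_coord_proj = nn.Linear(2, dim)     # (x, y) region centre
        self.query_context_proj = nn.Linear(dim, dim)  # pooled query summary
        self.router = nn.Linear(dim, n_experts)

    def forward(self, vis_tokens, coords, query_tokens):
        # vis_tokens:   (B, T, D), T divisible by the region size
        # coords:       (B, T // region, 2), normalised region centres
        # query_tokens: (B, Q, D)
        B, T, D = vis_tokens.shape
        regions = vis_tokens.view(B, T // self.region, self.region, D).mean(2)
        pooled_q = query_tokens.mean(dim=1, keepdim=True)      # (B, 1, D)
        router_in = (regions
                     + self.region_coord_proj(coords)
                     + self.query_context_proj(pooled_q))      # broadcasts over regions
        logits = self.router(router_in)                        # (B, R, E)
        gate_vals, expert_idx = logits.topk(self.top_k, dim=-1)
        gates = F.softmax(gate_vals, dim=-1)                   # sparse top-2 gates
        return gates, expert_idx, logits


def router_aux_losses(logits: torch.Tensor, expert_idx: torch.Tensor,
                      n_experts: int = 4) -> torch.Tensor:
    """Sketch of the three balance terms named in step 4."""
    flat = logits.reshape(-1, n_experts)
    probs = flat.softmax(-1)
    mean_p = probs.mean(0)                                     # avg router prob per expert
    frac = F.one_hot(expert_idx.reshape(-1), n_experts).float().mean(0)
    load_balance = n_experts * (mean_p * frac).sum()           # Switch-style balance
    kl_uniform = (mean_p * (mean_p * n_experts).log()).sum()   # KL(mean_p || uniform)
    router_z = 0.01 * flat.logsumexp(-1).pow(2).mean()         # 0.01 · router-z²
    return load_balance + kl_uniform + router_z
```

Each region is then processed by its top-2 latent experts and fused with the always-on shared dense expert through the two learnable gate scalars from step 3.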

Model details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-VL-9B-Instruct |
| Total parameters | 8.82 B |
| Per-token embedding dim | 1024 |
| Max visual tokens / page | 2048 |
| Max text tokens | 32 768 |
| Similarity function | MaxSim |
| MoE specialists | 4 latent + 1 shared dense |
| Top-k experts per token | 2 |
| Region size | 4 tokens |
| Router placement | backbone layer −5 |
| Weight precision (this release) | bfloat16 |
| Adapted from | DataScience-UIBK/Argus-Colqwen3.5-9b-v0 (fp32 merge) |
| License | Apache 2.0 |
| Model size on disk | ~17 GB |
| VRAM @ bf16 inference | ~17 GB |

Why two dtypes?

Merging a LoRA into the base model requires materialising the low-rank update ΔW = (α/r)·B·A (the standard LoRA parametrisation) and adding it to the base weight matrix.

  • In bf16, both the delta cast and the addition lose precision — the gap is small but irreversible.
  • In fp32, both are exact.

For the 9B release the merge cost is ~0.001 V1 / ~0.002 V2 — within the noise floor of a single ViDoRe run. We publish both:

  • fp32 preserves the trained precision exactly: larger on disk, but identical inference behaviour to the fp32 master used during evaluation.
  • bf16 is the dtype most users will load anyway: half the disk, with the merge precision loss baked in but immaterial in practice.

If your downstream pipeline loads in bf16, the bf16 release is the strict win — same scores within rounding, half the disk.
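
A toy illustration of why the two merges differ (random matrices stand in for the real checkpoint; the standard ΔW = (α/r)·B·A parametrisation is assumed):

```python
import torch

r, alpha = 16, 32
W = torch.randn(1024, 1024)           # base weight, fp32 master
A = torch.randn(r, 1024) * 0.01       # lora_A
B = torch.randn(1024, r) * 0.01       # lora_B

merged_fp32 = W + (alpha / r) * (B @ A)   # lossless w.r.t. the fp32 master

# bf16 merge: the cast and the addition both round; the error is baked in
merged_bf16 = W.bfloat16() + ((alpha / r) * (B @ A)).bfloat16()

# Upcasting afterwards cannot recover the lost mantissa bits
err = (merged_bf16.float() - merged_fp32).abs().max()
print(f"max |bf16 merge - fp32 merge| = {err:.2e}")
```

Once the sum has been rounded to bf16, upcasting to fp32 only pads the mantissa with zeros, which is why the cast is listed as irreversible in the limitations below.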

Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)

Per-task scores measured with mteb 2.12 on the published weights of each release, side by side for every Argus sibling.

| Task | 2B fp32 | 2B bf16 | 4B fp32 | 4B bf16 | 9B fp32 | 9B bf16 (this) |
|---|---|---|---|---|---|---|
| ArxivQA | 0.9027 | 0.9027 | 0.9095 | 0.9126 | 0.9228 | 0.9217 |
| DocVQA | 0.6747 | 0.6747 | 0.6770 | 0.6779 | 0.6809 | 0.6826 |
| InfoVQA | 0.9497 | 0.9497 | 0.9463 | 0.9447 | 0.9426 | 0.9449 |
| ShiftProject | 0.9133 | 0.9133 | 0.9470 | 0.9346 | 0.9365 | 0.9298 |
| SyntheticDocQA-AI | 0.9963 | 0.9963 | 0.9963 | 0.9926 | 0.9963 | 0.9926 |
| SyntheticDocQA-Energy | 0.9726 | 0.9726 | 0.9789 | 0.9750 | 0.9732 | 0.9769 |
| SyntheticDocQA-Government | 0.9729 | 0.9729 | 0.9779 | 0.9779 | 0.9889 | 0.9889 |
| SyntheticDocQA-Healthcare | 0.9926 | 0.9926 | 0.9963 | 0.9963 | 0.9963 | 0.9926 |
| TabFQuAD | 0.9336 | 0.9336 | 0.9533 | 0.9544 | 0.9750 | 0.9724 |
| TatDQA | 0.8403 | 0.8403 | 0.8480 | 0.8485 | 0.8545 | 0.8567 |
| Average | 0.9149 | 0.9149 | 0.9230 | 0.9214 | 0.9267 | 0.9259 |

ViDoRe v1 — overall leaderboard comparison

| Rank | Model | Params | Dim | V1 avg |
|---|---|---|---|---|
| 1 | nvidia/nemotron-vl-8b-v2 | 8.0 B | hidden | 0.927 |
| 1 | Argus-Colqwen3.5-9b-v0 (fp32 sibling) | 8.8 B | 1024 | 0.9267 |
| 3 | Argus-Colqwen3.5-9b-v0-bf16 (this) | 8.8 B | 1024 | 0.9259 |
| 4 | Argus-Colqwen3.5-4b-v0 (sibling, fp32) | 4.0 B | 1024 | 0.9230 |
| 5 | nvidia/llama-nemotron-colembed-vl-3b-v2 | 3.0 B | hidden | 0.917 |
| 6 | nvidia/nemotron-colembed-vl-4b-v2 | 4.0 B | hidden | 0.916 |
| 7 | athrael-soju/colqwen3.5-4.5B-v3 | 4.5 B | 320 | 0.915 |
| 8 | OpenSearch-AI/Ops-Colqwen3-4B | 4.0 B | 2560 | 0.914 |

Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)

| Task | 2B fp32 | 2B bf16 | 4B fp32 | 4B bf16 | 9B fp32 | 9B bf16 (this) |
|---|---|---|---|---|---|---|
| BioMedicalLectures | 0.6499 | 0.6499 | 0.6438 | 0.6349 | 0.6619 | 0.6633 |
| ESGReports-HighLevel | 0.6936 | 0.6936 | 0.6991 | 0.7079 | 0.7905 | 0.7912 |
| ESGReports | 0.5988 | 0.5988 | 0.6218 | 0.6175 | 0.6760 | 0.6764 |
| EconomicsReports | 0.5186 | 0.5186 | 0.5980 | 0.5918 | 0.6377 | 0.6278 |
| Average | 0.6152 | 0.6152 | 0.6407 | 0.6380 | 0.6915 | 0.6897 |

ViDoRe v2 — overall context

| Model | V2 avg |
|---|---|
| Argus-Colqwen3.5-9b-v0 (fp32 sibling) | 0.6915 |
| Argus-Colqwen3.5-9b-v0-bf16 (this) | 0.6897 |
| Ops-Colqwen3-4B (dim 2560) | 0.687 |
| TomoroAI/tomoro-colqwen3-embed-4b | 0.660 |
| Argus-Colqwen3.5-4b-v0 (sibling, fp32) | 0.6407 |

ViDoRe v3

Not yet evaluated for this release.

Storage cost

| Model | Tokens/page | Dim | Bytes/page (bf16) |
|---|---|---|---|
| Ops-Colqwen3-4B | 1280 | 2560 | 6.6 MB |
| Argus-Colqwen3.5-9b-v0-bf16 | 2048 | 1024 | 4.2 MB |
| Argus-Colqwen3.5-4b-v0-bf16 | 2048 | 1024 | 4.2 MB |
| TomoroAI/tomoro-colqwen3-embed-4b | 1280 | 320 | 0.8 MB |
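
The bytes-per-page column is plain arithmetic: tokens × dim × 2 bytes per bf16 value, with MB read as SI megabytes (10⁶ bytes):

```python
def bytes_per_page_mb(tokens: int, dim: int, bytes_per_val: int = 2) -> float:
    # bf16 stores 2 bytes per value; MB = 10**6 bytes, matching the table
    return tokens * dim * bytes_per_val / 1e6

print(bytes_per_page_mb(2048, 1024))  # 4.19 -> ~4.2 MB  (Argus, both sizes)
print(bytes_per_page_mb(1280, 2560))  # 6.55 -> ~6.6 MB  (Ops-Colqwen3-4B)
print(bytes_per_page_mb(1280, 320))   # 0.82 -> ~0.8 MB  (tomoro-colqwen3-embed-4b)
```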

Installation

pip install "transformers>=5.0.0,<6.0.0"
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0,<6.0"
pip install flash-attn==2.6.3 --no-build-isolation     # optional
rm -rf ~/.cache/huggingface/modules/transformers_modules

Usage

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-9b-v0-bf16"
DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,                  # this release ships in bf16
    attn_implementation="flash_attention_2",
    device_map=DEVICE,
).eval()

processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=2048,
)

queries = [
    "What is the company's revenue in 2019?",
    "How does the proposed model compare to baselines?",
]
documents = [
    Image.open("page_a.png").convert("RGB"),
    Image.open("page_b.png").convert("RGB"),
]

q_emb  = model.encode_queries(processor, queries)
d_emb  = model.encode_images(processor, documents)
scores = processor.score(q_emb, d_emb)
print(scores)
```
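
`processor.score` computes MaxSim late interaction (see the model details table). A reference version of that scoring rule, assuming L2-normalised per-token embeddings and ignoring padding, could look like this:

```python
import torch

def maxsim_scores(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """MaxSim: for each query token take its best-matching document token,
    then sum over query tokens.

    q_emb: (n_queries, q_len, dim); d_emb: (n_docs, d_len, dim).
    Returns an (n_queries, n_docs) score matrix.
    """
    sim = torch.einsum("qtd,psd->qpts", q_emb, d_emb)  # token-level similarities
    return sim.max(dim=-1).values.sum(dim=-1)          # max over doc tokens, sum over query
```

The shipped `processor.score` presumably also handles padding and batching; the sketch only shows the scoring rule itself.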

Reproduce ViDoRe results with MTEB

```python
import mteb

m  = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-9b-v0-bf16")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 2})
```

Training

Same recipe as the fp32 sibling; see its card for full details. The only difference is the merge dtype.

When to use which Argus variant

| Use case | Recommendation |
|---|---|
| Smallest deployable artefact at the 9B scale | 9B bf16 (this): strict winner over 9B fp32 at half the disk |
| Maximum precision at 9B for downstream merging / quantisation | 9B fp32 |
| Latency-sensitive retrieval / 24 GB GPU budget | 4B bf16 |
| Smallest model still in the 0.91+ V1 league | 2B bf16 (4.6 GB) |

Limitations

  • English-dominant.
  • Merge-time bf16 cast is irreversible — you cannot recover fp32 numbers by upcasting after load.
  • ViDoRe v3 not yet evaluated.
  • Needs ~17 GB VRAM at bf16 inference; single-GPU users need ≥ 24 GB.

License

Apache 2.0, inherited from Qwen3.5-VL-9B-Instruct.

Citation

```bibtex
@misc{argus2026,
  title  = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
  author = {DataScience-UIBK team},
  year   = {2026},
  url    = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-9b-v0},
}
```

Contact

  • Org: DataScience-UIBK, University of Innsbruck
  • Issues: open one on this repo's Community tab.