Argus-Colqwen3.5-9b-v0 · bf16 release

Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval
University of Innsbruck, Data Science group · 2026

DataScience-UIBK/Argus-Colqwen3.5-9b-v0-bf16 is the bfloat16 merged release of Argus-Colqwen3.5-9b. It is the exact same trained network as the fp32 sibling DataScience-UIBK/Argus-Colqwen3.5-9b-v0; only the LoRA-merge dtype differs.

The bf16 release is half the disk size (17 GB vs 33 GB), faster to download, and easier to deploy on memory-constrained GPUs, at the cost of ~0.001 nDCG on V1 and ~0.002 on V2 (well within ViDoRe eval noise).

| Release | Disk | V1 avg nDCG@5 | V2 avg nDCG@5 |
|---|---|---|---|
| fp32 sibling (-v0) | 33 GB | 0.9267 | 0.6915 |
| this release (-v0-bf16) | 17 GB | 0.9259 | 0.6897 |
| Δ vs fp32 | −16 GB | −0.0008 | −0.0018 |

Use this bf16 release if you want a smaller deployable artefact and can tolerate sub-percent score differences (almost every deployment scenario). Use the fp32 sibling if you need leaderboard-grade reproducibility or you're going to merge / quantise further downstream.

TL;DR — leaderboard standing

  • V1 = 0.9259, within 0.002 of the leaderboard-topping 8B nemotron-vl-8b-v2 (0.927) and of its own fp32 sibling (0.9267), and ahead of every other public retriever.
  • V2 = 0.6897, the best Argus V2 score after the fp32 sibling (0.6915), roughly +0.05 over the 4B sibling.
  • 8.8 B parameters, 1024-d per-token embeddings, ≤ 2048 visual tokens per page; fits on a single 24 GB GPU at bf16 inference.

What is novel here

Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. Argus replaces that dense head with a sparse mixture in which the gates depend on both the visual token and a pooled query summary, so the same page gets routed differently for different queries:

  1. Region pooling — visual tokens are grouped into 4-token regions before routing.
  2. Query-conditioned latent gating (GateScalars) — router input is region + region_coord_proj(coords) + query_context_proj(pooled_query). The query summary makes routing task-aware: a financial-numbers query routes through a different expert than a layout query, on the same page.
  3. Sparse top-k=2 of 4 latent specialists, fused with an always-on shared dense expert via two learnable gating scalars.
  4. Region-aware load balancing — load balance + KL-uniform + 0.01·router-z² aux losses suppress routing collapse.
  5. 3-stage curriculum — dense baseline (teacher) → MoE balance warmup → joint retrieval with KL distillation. The 9B joint stage was extended on VDR1.5M + Docmatix (vdr_docmatix_full).

The router sits at backbone layer −5, i.e. the fifth layer from the end.
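
To make the gating in steps 2–4 concrete, here is a minimal sketch of a query-conditioned region router. All module names, shapes, and loss coefficients (except the 0.01 router-z weight stated above) are illustrative assumptions based on the description, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionQueryRouter(nn.Module):
    """Sketch of query-conditioned region routing (names/shapes assumed)."""

    def __init__(self, dim: int = 1024, n_experts: int = 4,
                 top_k: int = 2, region: int = 4):
        super().__init__()
        self.region, self.top_k, self.n_experts = region, top_k, n_experts
        self.region_coord_proj = nn.Linear(2, dim)     # (x, y) region centre
        self.query_context_proj = nn.Linear(dim, dim)  # pooled query summary
        self.router = nn.Linear(dim, n_experts)

    def forward(self, vis_tokens, coords, query_tokens):
        # vis_tokens:   (B, T, D), T divisible by the region size
        # coords:       (B, T // region, 2), normalised region centres
        # query_tokens: (B, Q, D)
        B, T, D = vis_tokens.shape
        regions = vis_tokens.view(B, T // self.region, self.region, D).mean(2)
        pooled_q = query_tokens.mean(dim=1, keepdim=True)      # (B, 1, D)
        router_in = (regions
                     + self.region_coord_proj(coords)
                     + self.query_context_proj(pooled_q))      # broadcasts over regions
        logits = self.router(router_in)                        # (B, R, E)
        gate_vals, expert_idx = logits.topk(self.top_k, dim=-1)
        gates = F.softmax(gate_vals, dim=-1)                   # sparse top-2 gates
        return gates, expert_idx, logits


def router_aux_losses(logits: torch.Tensor, expert_idx: torch.Tensor,
                      n_experts: int = 4) -> torch.Tensor:
    """Sketch of the three balance terms named in step 4."""
    flat = logits.reshape(-1, n_experts)
    probs = flat.softmax(-1)
    mean_p = probs.mean(0)                                     # avg router prob per expert
    frac = F.one_hot(expert_idx.reshape(-1), n_experts).float().mean(0)
    load_balance = n_experts * (mean_p * frac).sum()           # Switch-style balance
    kl_uniform = (mean_p * (mean_p * n_experts).log()).sum()   # KL(mean_p || uniform)
    router_z = 0.01 * flat.logsumexp(-1).pow(2).mean()         # 0.01 · router-z²
    return load_balance + kl_uniform + router_z
```

Each region is then processed by its top-2 latent experts and fused with the always-on shared dense expert through the two learnable gate scalars from step 3.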

Model details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-VL-9B-Instruct |
| Total parameters | 8.82 B |
| Per-token embedding dim | 1024 |
| Max visual tokens / page | 2048 |
| Max text tokens | 32 768 |
| Similarity function | MaxSim |
| MoE specialists | 4 latent + 1 shared dense |
| Top-k experts per token | 2 |
| Region size | 4 tokens |
| Router placement | backbone layer −5 |
| Weight precision (this release) | bfloat16 |
| Adapted from | DataScience-UIBK/Argus-Colqwen3.5-9b-v0 (fp32 merge) |
| License | Apache 2.0 |
| Model size on disk | ~17 GB |
| VRAM @ bf16 inference | ~17 GB |

Why two dtypes?

Merging a LoRA into the base model requires materialising the low-rank update ΔW = (α/r)·B·A (the standard LoRA parametrisation) and adding it to the base weight matrix.

  • In bf16, both the delta cast and the addition lose precision — the gap is small but irreversible.
  • In fp32, both are exact.

For the 9B release the merge cost is ~0.001 V1 / ~0.002 V2 — within the noise floor of a single ViDoRe run. We publish both:

  • fp32 preserves the trained precision exactly: larger on disk, but identical inference behaviour to the fp32 master used during evaluation.
  • bf16 is the dtype most users will load anyway: half the disk, with the merge precision loss baked in but immaterial in practice.

If your downstream pipeline loads in bf16, the bf16 release is the strict win — same scores within rounding, half the disk.
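
A toy illustration of why the two merges differ (random matrices stand in for the real checkpoint; the standard ΔW = (α/r)·B·A parametrisation is assumed):

```python
import torch

r, alpha = 16, 32
W = torch.randn(1024, 1024)           # base weight, fp32 master
A = torch.randn(r, 1024) * 0.01       # lora_A
B = torch.randn(1024, r) * 0.01       # lora_B

merged_fp32 = W + (alpha / r) * (B @ A)   # lossless w.r.t. the fp32 master

# bf16 merge: the cast and the addition both round; the error is baked in
merged_bf16 = W.bfloat16() + ((alpha / r) * (B @ A)).bfloat16()

# Upcasting afterwards cannot recover the lost mantissa bits
err = (merged_bf16.float() - merged_fp32).abs().max()
print(f"max |bf16 merge - fp32 merge| = {err:.2e}")
```

Once the sum has been rounded to bf16, upcasting to fp32 only pads the mantissa with zeros, which is why the cast is listed as irreversible in the limitations below.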

Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)

Per-task scores measured with mteb 2.12 on the published weights of each release, side by side for every Argus sibling.

| Task | 2B fp32 | 2B bf16 | 4B fp32 | 4B bf16 | 9B fp32 | 9B bf16 (this) |
|---|---|---|---|---|---|---|
| ArxivQA | 0.9027 | 0.9027 | 0.9095 | 0.9126 | 0.9228 | 0.9217 |
| DocVQA | 0.6747 | 0.6747 | 0.6770 | 0.6779 | 0.6809 | 0.6826 |
| InfoVQA | 0.9497 | 0.9497 | 0.9463 | 0.9447 | 0.9426 | 0.9449 |
| ShiftProject | 0.9133 | 0.9133 | 0.9470 | 0.9346 | 0.9365 | 0.9298 |
| SyntheticDocQA-AI | 0.9963 | 0.9963 | 0.9963 | 0.9926 | 0.9963 | 0.9926 |
| SyntheticDocQA-Energy | 0.9726 | 0.9726 | 0.9789 | 0.9750 | 0.9732 | 0.9769 |
| SyntheticDocQA-Government | 0.9729 | 0.9729 | 0.9779 | 0.9779 | 0.9889 | 0.9889 |
| SyntheticDocQA-Healthcare | 0.9926 | 0.9926 | 0.9963 | 0.9963 | 0.9963 | 0.9926 |
| TabFQuAD | 0.9336 | 0.9336 | 0.9533 | 0.9544 | 0.9750 | 0.9724 |
| TatDQA | 0.8403 | 0.8403 | 0.8480 | 0.8485 | 0.8545 | 0.8567 |
| Average | 0.9149 | 0.9149 | 0.9230 | 0.9214 | 0.9267 | 0.9259 |

ViDoRe v1 — overall leaderboard comparison

| Rank | Model | Params | Dim | V1 avg |
|---|---|---|---|---|
| 1 | nvidia/nemotron-vl-8b-v2 | 8.0 B | hidden | 0.927 |
| 1 | Argus-Colqwen3.5-9b-v0 (fp32 sibling) | 8.8 B | 1024 | 0.9267 |
| 3 | Argus-Colqwen3.5-9b-v0-bf16 (this) | 8.8 B | 1024 | 0.9259 |
| 4 | Argus-Colqwen3.5-4b-v0 (sibling, fp32) | 4.0 B | 1024 | 0.9230 |
| 5 | nvidia/llama-nemotron-colembed-vl-3b-v2 | 3.0 B | hidden | 0.917 |
| 6 | nvidia/nemotron-colembed-vl-4b-v2 | 4.0 B | hidden | 0.916 |
| 7 | athrael-soju/colqwen3.5-4.5B-v3 | 4.5 B | 320 | 0.915 |
| 8 | OpenSearch-AI/Ops-Colqwen3-4B | 4.0 B | 2560 | 0.914 |

Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)

| Task | 2B fp32 | 2B bf16 | 4B fp32 | 4B bf16 | 9B fp32 | 9B bf16 (this) |
|---|---|---|---|---|---|---|
| BioMedicalLectures | 0.6499 | 0.6499 | 0.6438 | 0.6349 | 0.6619 | 0.6633 |
| ESGReports-HighLevel | 0.6936 | 0.6936 | 0.6991 | 0.7079 | 0.7905 | 0.7912 |
| ESGReports | 0.5988 | 0.5988 | 0.6218 | 0.6175 | 0.6760 | 0.6764 |
| EconomicsReports | 0.5186 | 0.5186 | 0.5980 | 0.5918 | 0.6377 | 0.6278 |
| Average | 0.6152 | 0.6152 | 0.6407 | 0.6380 | 0.6915 | 0.6897 |

ViDoRe v2 — overall context

| Model | V2 avg |
|---|---|
| Argus-Colqwen3.5-9b-v0 (fp32 sibling) | 0.6915 |
| Argus-Colqwen3.5-9b-v0-bf16 (this) | 0.6897 |
| Ops-Colqwen3-4B (dim 2560) | 0.687 |
| TomoroAI/tomoro-colqwen3-embed-4b | 0.660 |
| Argus-Colqwen3.5-4b-v0 (sibling, fp32) | 0.6407 |

ViDoRe v3

Not yet evaluated for this release.

Storage cost

| Model | Tokens/page | Dim | Bytes/page (bf16) |
|---|---|---|---|
| Ops-Colqwen3-4B | 1280 | 2560 | 6.6 MB |
| Argus-Colqwen3.5-9b-v0-bf16 | 2048 | 1024 | 4.2 MB |
| Argus-Colqwen3.5-4b-v0-bf16 | 2048 | 1024 | 4.2 MB |
| TomoroAI/tomoro-colqwen3-embed-4b | 1280 | 320 | 0.8 MB |
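
The bytes-per-page column is plain arithmetic: tokens × dim × 2 bytes per bf16 value, with MB read as SI megabytes (10⁶ bytes):

```python
def bytes_per_page_mb(tokens: int, dim: int, bytes_per_val: int = 2) -> float:
    # bf16 stores 2 bytes per value; MB = 10**6 bytes, matching the table
    return tokens * dim * bytes_per_val / 1e6

print(bytes_per_page_mb(2048, 1024))  # 4.19 -> ~4.2 MB  (Argus, both sizes)
print(bytes_per_page_mb(1280, 2560))  # 6.55 -> ~6.6 MB  (Ops-Colqwen3-4B)
print(bytes_per_page_mb(1280, 320))   # 0.82 -> ~0.8 MB  (tomoro-colqwen3-embed-4b)
```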

Installation

pip install "transformers>=5.0.0,<6.0.0"
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0,<6.0"
pip install flash-attn==2.6.3 --no-build-isolation     # optional
rm -rf ~/.cache/huggingface/modules/transformers_modules

Usage

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-9b-v0-bf16"
DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,                  # this release ships in bf16
    attn_implementation="flash_attention_2",
    device_map=DEVICE,
).eval()

processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=2048,
)

queries = [
    "What is the company's revenue in 2019?",
    "How does the proposed model compare to baselines?",
]
documents = [
    Image.open("page_a.png").convert("RGB"),
    Image.open("page_b.png").convert("RGB"),
]

q_emb  = model.encode_queries(processor, queries)
d_emb  = model.encode_images(processor, documents)
scores = processor.score(q_emb, d_emb)
print(scores)
```
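
`processor.score` computes MaxSim late interaction (see the model details table). A reference version of that scoring rule, assuming L2-normalised per-token embeddings and ignoring padding, could look like this:

```python
import torch

def maxsim_scores(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """MaxSim: for each query token take its best-matching document token,
    then sum over query tokens.

    q_emb: (n_queries, q_len, dim); d_emb: (n_docs, d_len, dim).
    Returns an (n_queries, n_docs) score matrix.
    """
    sim = torch.einsum("qtd,psd->qpts", q_emb, d_emb)  # token-level similarities
    return sim.max(dim=-1).values.sum(dim=-1)          # max over doc tokens, sum over query
```

The shipped `processor.score` presumably also handles padding and batching; the sketch only shows the scoring rule itself.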

Reproduce ViDoRe results with MTEB

```python
import mteb

m  = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-9b-v0-bf16")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 2})
```

Training

Same recipe as the fp32 sibling; see its card for full details. The only difference is the merge dtype.

When to use which Argus variant

| Use case | Recommendation |
|---|---|
| Smallest deployable artefact at the 9B scale | 9B bf16 (this): strict winner over 9B fp32 at half the disk |
| Maximum precision at 9B for downstream merging / quantisation | 9B fp32 |
| Latency-sensitive retrieval / 24 GB GPU budget | 4B bf16 |
| Smallest model still in the 0.91+ V1 league | 2B bf16 (4.6 GB) |

Limitations

  • English-dominant.
  • Merge-time bf16 cast is irreversible — you cannot recover fp32 numbers by upcasting after load.
  • ViDoRe v3 not yet evaluated.
  • Needs ~17 GB VRAM at bf16 inference; single-GPU users need ≥ 24 GB.

License

Apache 2.0, inherited from Qwen3.5-VL-9B-Instruct.

Citation

```bibtex
@misc{argus2026,
  title  = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
  author = {DataScience-UIBK team},
  year   = {2026},
  url    = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-9b-v0},
}
```

Contact

  • Org: DataScience-UIBK, University of Innsbruck
  • Issues: open one on this repo's Community tab.