Argus-Colqwen3.5-9b-v0 · fp32 release

Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval
University of Innsbruck — Data Science group · 2026

DataScience-UIBK/Argus-Colqwen3.5-9b-v0 is an 8.8-billion-parameter visual-document retriever built on Qwen3.5-VL-9B-Instruct. It keeps a ColPali-style multi-vector (MaxSim) late-interaction head but replaces the dense projection with a query-conditioned latent mixture of experts (MoE) that routes each region of visual tokens through its top-2 of four latent specialists, with routing conditioned on the query.

This is the fp32 merged release — the LoRA adapter is folded into the base in float32 to preserve trained precision. A bfloat16 companion lives at DataScience-UIBK/Argus-Colqwen3.5-9b-v0-bf16 for memory-constrained deployment. Smaller siblings: 4B fp32, 2B fp32.

TL;DR — leaderboard standing

  • Co-leads the ViDoRe v1 leaderboard at V1 = 0.9267 — tied with nvidia/nemotron-vl-8b-v2 (0.927) within rounding noise, ahead of every other public retriever.
  • Best Argus result on ViDoRe v2 (V2 = 0.6915), a +0.05 jump over the 4B sibling and ahead of the strongest 4B-class peer, Ops-Colqwen3-4B (0.687).
  • 8.8 B parameters, 1024-d per-token embedding, ≤ 2048 visual tokens / page — fits on a single 24 GB GPU at bf16 inference.
  • Apache 2.0, trained only on publicly available data: ViDoRe + VDR-Multilingual subsets plus Docmatix-IR.

What is novel here

Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. Argus replaces that dense head with a sparse mixture in which the gates depend on both the visual token and a pooled query summary, so the same page gets routed differently for different queries:

  1. Region pooling. Visual tokens from the backbone are grouped into 4-token regions, giving the router a coarser but spatially-aware view of the page.
  2. Query-conditioned latent gating (GateScalars). The router input is region + region_coord_proj(coords) + query_context_proj(pooled_query). The query summary makes routing task-aware — e.g. a financial-numbers query routes through a different expert than a layout query, even on the exact same page.
  3. Sparse top-k=2 of 4 latent specialists, fused with the always-on shared dense expert via two learnable gating scalars: final = base + sigmoid(g_s)·shared_out + sigmoid(g_e)·specialist_out (a minimal sketch of steps 1–3 follows this list).
  4. Region-aware load balancing. Auxiliary losses combine load balance + KL-uniform + 0.01·router-z² to keep all 4 experts useful and suppress routing collapse.
  5. 3-stage curriculum. (a) Dense baseline (no MoE, also serves as teacher) → (b) MoE balance warmup (gates frozen, no PEFT, just stop expert collapse) → (c) joint retrieval with KL distillation from the dense baseline (distillation_weight=0.5).
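
To make steps 1–3 concrete, here is a minimal PyTorch sketch of a query-conditioned latent MoE head of this shape. The hidden sizes, module layout and mean-pooled query summary are illustrative assumptions; only the region size (4), expert count (4), top-k (2) and the GateScalars fusion formula come from the description above.

import torch
import torch.nn as nn

class QueryConditionedMoEHead(nn.Module):
    # Illustrative sketch — not the released implementation.
    def __init__(self, hidden=2048, emb_dim=1024, n_experts=4, top_k=2, region=4):
        super().__init__()
        self.region, self.top_k = region, top_k
        self.base = nn.Linear(hidden, emb_dim)               # dense ColPali-style projection
        self.shared = nn.Linear(hidden, emb_dim)             # always-on shared expert
        self.experts = nn.ModuleList(nn.Linear(hidden, emb_dim) for _ in range(n_experts))
        self.region_coord_proj = nn.Linear(2, hidden)        # (x, y) centre of each region
        self.query_context_proj = nn.Linear(hidden, hidden)  # pooled query summary
        self.router = nn.Linear(hidden, n_experts)
        self.g_s = nn.Parameter(torch.zeros(()))              # GateScalar for the shared expert
        self.g_e = nn.Parameter(torch.zeros(()))              # GateScalar for the specialists

    def forward(self, vis, coords, query_tokens):
        # vis: [B, T, H] visual tokens (T divisible by the region size)
        # coords: [B, T // region, 2]; query_tokens: [B, Q, H]
        B, T, H = vis.shape
        regions = vis.view(B, T // self.region, self.region, H).mean(2)   # 1. region pooling
        pooled_query = query_tokens.mean(dim=1, keepdim=True)
        router_in = regions + self.region_coord_proj(coords) + self.query_context_proj(pooled_query)
        logits = self.router(router_in)                                   # 2. query-conditioned gating
        top_val, top_idx = logits.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(logits).scatter_(-1, top_idx, top_val.softmax(-1))
        gates = gates.repeat_interleave(self.region, dim=1)               # each token inherits its region's gate
        specialist_out = sum(gates[..., e:e + 1] * expert(vis) for e, expert in enumerate(self.experts))
        # 3. GateScalars fusion: base + sigmoid(g_s)·shared_out + sigmoid(g_e)·specialist_out
        return self.base(vis) + torch.sigmoid(self.g_s) * self.shared(vis) + torch.sigmoid(self.g_e) * specialist_out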

For the 9B release, the joint stage was extended on the larger VDR1.5M + Docmatix mixture (vdr_docmatix_full), giving the MoE more diverse layouts to specialise on.

The router sits near the top of the backbone (layer −5) so the gating decision is informed by deep visual semantics rather than raw patch features.
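
The auxiliary routing objective from item 4 can be written down compactly. A hedged sketch, assuming a Switch-style load-balance term and an unweighted KL-to-uniform term — only the 0.01 factor on the router z-loss is stated on this card:

import math
import torch
import torch.nn.functional as F

def routing_aux_loss(router_logits: torch.Tensor, z_weight: float = 0.01) -> torch.Tensor:
    # router_logits: [num_regions, num_experts] pre-softmax gate logits
    probs = router_logits.softmax(-1)
    n_experts = probs.shape[-1]

    # Load balance: fraction of regions whose top-1 expert is e, times mean routing prob for e.
    top1_frac = F.one_hot(probs.argmax(-1), n_experts).float().mean(0)
    load_balance = n_experts * (top1_frac * probs.mean(0)).sum()

    # KL(mean routing distribution || uniform) keeps average expert usage close to uniform.
    mean_probs = probs.mean(0).clamp_min(1e-9)
    kl_uniform = (mean_probs * (mean_probs.log() + math.log(n_experts))).sum()

    # Router z-loss: squared log-sum-exp of the logits, suppressing over-confident gates.
    router_z = router_logits.logsumexp(-1).pow(2).mean()

    return load_balance + kl_uniform + z_weight * router_z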

Model details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-VL-9B-Instruct |
| Total parameters | 8.82 B |
| Per-token embedding dim | 1024 |
| Max visual tokens / page | 2048 |
| Max text tokens | 32 768 |
| Similarity function | MaxSim (ColBERT / ColPali-style late interaction) |
| MoE specialists | 4 latent + 1 shared dense |
| Top-k experts per token | 2 |
| Region size (visual chunking) | 4 (each region = 4 visual tokens) |
| Router placement | backbone layer −5 |
| Routing aux losses | load balance + KL-uniform + 0.01 · router-z² |
| Weight precision (this release) | float32 |
| License | Apache 2.0 |
| Model size on disk | ~33 GB |
| VRAM @ bf16 inference | ~17 GB |
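
For reference, the MaxSim similarity listed above is the standard ColBERT / ColPali late-interaction score: each query token is matched to its best page token and the maxima are summed. A minimal single-pair sketch (shapes illustrative):

import torch

def maxsim_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    # q_emb: [n_query_tokens, 1024]; d_emb: [n_page_tokens, 1024] (≤ 2048 visual tokens)
    sim = q_emb @ d_emb.T                  # all token-to-token dot products
    return sim.max(dim=-1).values.sum()    # best page token per query token, summed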

Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)

Per-task scores measured with the official mteb 2.12 library on the published weights, side-by-side with every Argus sibling for transparency.

| Task | 2B fp32 | 2B bf16 | 4B fp32 | 4B bf16 | 9B fp32 (this) | 9B bf16 |
|---|---|---|---|---|---|---|
| ArxivQA | 0.9027 | 0.9027 | 0.9095 | 0.9126 | 0.9228 | 0.9217 |
| DocVQA | 0.6747 | 0.6747 | 0.6770 | 0.6779 | 0.6809 | 0.6826 |
| InfoVQA | 0.9497 | 0.9497 | 0.9463 | 0.9447 | 0.9426 | 0.9449 |
| ShiftProject | 0.9133 | 0.9133 | 0.9470 | 0.9346 | 0.9365 | 0.9298 |
| SyntheticDocQA-AI | 0.9963 | 0.9963 | 0.9963 | 0.9926 | 0.9963 | 0.9926 |
| SyntheticDocQA-Energy | 0.9726 | 0.9726 | 0.9789 | 0.9750 | 0.9732 | 0.9769 |
| SyntheticDocQA-Government | 0.9729 | 0.9729 | 0.9779 | 0.9779 | 0.9889 | 0.9889 |
| SyntheticDocQA-Healthcare | 0.9926 | 0.9926 | 0.9963 | 0.9963 | 0.9963 | 0.9926 |
| TabFQuAD | 0.9336 | 0.9336 | 0.9533 | 0.9544 | 0.9750 | 0.9724 |
| TatDQA | 0.8403 | 0.8403 | 0.8480 | 0.8485 | 0.8545 | 0.8567 |
| Average | 0.9149 | 0.9149 | 0.9230 | 0.9214 | 0.9267 | 0.9259 |

The 9B model leads outright on 5 of 10 V1 tasks and ties for the best score on 2 more. The 4B sibling still wins on ShiftProject and SyntheticDocQA-Energy (by ~0.006–0.011 — at noise level). The 2B sibling keeps a small edge on InfoVQA — likely a regularisation effect of the smaller backbone on layout-driven QA.

ViDoRe v1 — overall leaderboard comparison

| Rank | Model | Params | Dim | V1 avg |
|---|---|---|---|---|
| 1 | Argus-Colqwen3.5-9b-v0 (this, fp32) | 8.8 B | 1024 | 0.9267 |
| 1 | nvidia/nemotron-vl-8b-v2 | 8.0 B | hidden | 0.927 |
| 3 | Argus-Colqwen3.5-4b-v0 (sibling, fp32) | 4.0 B | 1024 | 0.9230 |
| 4 | nvidia/llama-nemotron-colembed-vl-3b-v2 | 3.0 B | hidden | 0.917 |
| 5 | nvidia/nemotron-colembed-vl-4b-v2 | 4.0 B | hidden | 0.916 |
| 6 | athrael-soju/colqwen3.5-4.5B-v3 | 4.5 B | 320 | 0.915 |
| 7 | OpenSearch-AI/Ops-Colqwen3-4B | 4.0 B | 2560 | 0.914 |
| 8 | Argus-Colqwen3.5-2b-v0 (sibling, fp32) | 2.3 B | 1024 | 0.9149 |

(The gap between 0.9267 and 0.927 is 0.0003 — within rounding/eval noise of a tie. Argus also leads by a clearer margin on V2; see below.)

Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)

| Task | 2B fp32 | 2B bf16 | 4B fp32 | 4B bf16 | 9B fp32 (this) | 9B bf16 |
|---|---|---|---|---|---|---|
| BioMedicalLectures | 0.6499 | 0.6499 | 0.6438 | 0.6349 | 0.6619 | 0.6633 |
| ESGReports-HighLevel | 0.6936 | 0.6936 | 0.6991 | 0.7079 | 0.7905 | 0.7912 |
| ESGReports | 0.5988 | 0.5988 | 0.6218 | 0.6175 | 0.6760 | 0.6764 |
| EconomicsReports | 0.5186 | 0.5186 | 0.5980 | 0.5918 | 0.6377 | 0.6278 |
| Average | 0.6152 | 0.6152 | 0.6407 | 0.6380 | 0.6915 | 0.6897 |

The V2 jump from 4B to 9B (+0.05 on average) is the largest improvement we see across the Argus family — the bigger backbone helps on layout-heavy ESG reports + dense numeric economics pages where the 4B was visibly behind Ops-Colqwen3-4B.

ViDoRe v2 — overall context

| Model | V2 avg |
|---|---|
| Argus-Colqwen3.5-9b-v0 (fp32, this) | 0.6915 |
| Ops-Colqwen3-4B (dim 2560) | 0.687 |
| TomoroAI/tomoro-colqwen3-embed-4b | 0.660 |
| Argus-Colqwen3.5-4b-v0 (sibling, fp32) | 0.6407 |
| Argus-Colqwen3.5-2b-v0 (sibling, fp32) | 0.6152 |

Argus 9B is the first sub-10B retriever to clear V2 = 0.69 while keeping the per-token embedding at 1024-d (vs Ops's 2560-d, a 2.5× storage cost).

ViDoRe v3

Not yet evaluated for this release. Numbers will be added in a follow-up commit once the v3 reproducer run completes.

Storage cost

Per-document storage for an indexed corpus, assuming bf16 token embeddings:

| Model | Tokens/page | Dim | Storage/page (bf16) |
|---|---|---|---|
| Ops-Colqwen3-4B | 1280 | 2560 | 6.6 MB |
| Argus-Colqwen3.5-9b-v0 | 2048 | 1024 | 4.2 MB |
| Argus-Colqwen3.5-4b-v0 | 2048 | 1024 | 4.2 MB |
| Argus-Colqwen3.5-2b-v0 | 2048 | 1024 | 4.2 MB |
| TomoroAI/tomoro-colqwen3-embed-4b | 1280 | 320 | 0.8 MB |

Per-page corpus storage is identical across the Argus family — the choice is inference cost (9B is the slowest) and GPU memory, not corpus size on disk.
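
The storage/page column is just tokens × dim × 2 bytes (bf16); a quick check in Python:

def mb_per_page(tokens: int, dim: int, bytes_per_value: int = 2) -> float:
    # Per-page index footprint in MB for multi-vector embeddings stored in bf16.
    return tokens * dim * bytes_per_value / 1e6

print(mb_per_page(2048, 1024))   # ≈ 4.19 MB — every Argus sibling
print(mb_per_page(1280, 2560))   # ≈ 6.55 MB — Ops-Colqwen3-4B
print(mb_per_page(1280, 320))    # ≈ 0.82 MB — tomoro-colqwen3-embed-4b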

Installation

# Qwen3.5-VL is only in transformers 5.x
pip install "transformers>=5.0.0,<6.0.0"

# MTEB 2.12 ships transformers 4.57.6 by default — upgrade explicitly afterwards
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0,<6.0"

# Optional: faster attention on Hopper / Ampere
pip install flash-attn==2.6.3 --no-build-isolation

After upgrading transformers, wipe the cached remote-code modules so the new ones load:

rm -rf ~/.cache/huggingface/modules/transformers_modules

Usage — text + image retrieval

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-9b-v0"
DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE    = torch.bfloat16    # or torch.float32 for max precision

model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=DTYPE,
    attn_implementation="flash_attention_2",   # or None / "sdpa"
    device_map=DEVICE,
).eval()

processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=2048,
)

queries = [
    "What is the company's revenue in 2019?",
    "How does the proposed model compare to baselines?",
]
documents = [
    Image.open("page_a.png").convert("RGB"),
    Image.open("page_b.png").convert("RGB"),
]

q_emb  = model.encode_queries(processor, queries)
d_emb  = model.encode_images(processor, documents)
scores = processor.score(q_emb, d_emb)
print(scores)
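
In other ColPali-style stacks the score call returns a query-by-document matrix; assuming the same convention here, ranking the candidate pages per query is a one-liner:

# Assumes `scores` is an [n_queries, n_documents] tensor (the usual ColPali convention).
ranking = scores.argsort(dim=-1, descending=True)
for qi, query in enumerate(queries):
    best = ranking[qi, 0].item()
    print(f"{query!r} -> page index {best} (score {scores[qi, best]:.4f})")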

Reproduce the leaderboard ViDoRe results with MTEB

import mteb

m  = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-9b-v0")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 2})

A single H100 80 GB completes the full V1 + V2 run in roughly 6–8 hours for the 9B fp32 (about 2× the 4B runtime). Use batch_size=2 for safety; 4 may OOM on 80 GB once activations + KV cache stack up.

Reproduce on the official ViDoRe-benchmark library

pip install vidore-benchmark
vidore-benchmark evaluate-retriever \
  --model-class colqwen2 \
  --model-name DataScience-UIBK/Argus-Colqwen3.5-9b-v0 \
  --collection-name vidore-v1

Training

| Setting | Value |
|---|---|
| Backbone | Qwen/Qwen3.5-VL-9B-Instruct (Apache-2.0) |
| Stage 1 — dense baseline | trains the standard ColPali head; serves as the teacher |
| Stage 2 — MoE balance warmup | gates frozen, no PEFT, short — only goal is to prevent expert collapse |
| Stage 3 — joint retrieval w/ distillation | PEFT on, gates trainable, KL distillation from the stage-1 teacher (distillation_weight=0.5); train mix = vdr_docmatix_full (VDR1.5M + Docmatix) |
| LoRA rank | 32 (folded into the base for this release via merge_and_unload() in fp32) |
| Datasets | vidore/colpali_train_set + llamaindex/vdr-multilingual-train (subsets) + Docmatix-IR (in-domain) |
| Hardware | 4 × NVIDIA H100 80 GB (zen4_0768_h100x4 partition, UIBK LEO5 cluster) |
| Optimiser | AdamW, lr = 5e-5 with linear warmup |
| Precision | bf16 forward / fp32 master + LoRA |
| Effective batch size | 64 |
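
To illustrate the stage-3 objective, here is a hedged sketch combining an in-batch contrastive retrieval loss with KL distillation from the frozen stage-1 dense teacher. Only distillation_weight=0.5 comes from this card; the in-batch formulation and the temperature are assumptions:

import torch
import torch.nn.functional as F

def stage3_loss(student_scores, teacher_scores, distillation_weight=0.5, tau=1.0):
    # *_scores: [batch, batch] MaxSim matrices (query i vs. page j); the diagonal holds the positives.
    targets = torch.arange(student_scores.size(0), device=student_scores.device)
    retrieval = F.cross_entropy(student_scores, targets)           # in-batch contrastive term
    distill = F.kl_div(
        F.log_softmax(student_scores / tau, dim=-1),
        F.softmax(teacher_scores / tau, dim=-1),
        reduction="batchmean",
    )
    return retrieval + distillation_weight * distill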

The merge step that produced this release was run in float32 throughout (merge_and_unload() on the LoRA adapter, then sharded to safetensors). The companion bf16 release ran the same merge in bfloat16 — see the bf16 sibling card.
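
The merge itself is the standard peft flow; a sketch under the assumption that the adapter is applied on top of the full, un-merged retriever — both paths below are hypothetical, since the intermediate checkpoints are not published:

import torch
from peft import PeftModel
from transformers import AutoModel

base = AutoModel.from_pretrained(
    "path/to/argus-9b-retriever-base",   # hypothetical: backbone + MoE head before merging
    trust_remote_code=True,
    torch_dtype=torch.float32,           # keep the whole merge in fp32
)
model = PeftModel.from_pretrained(base, "path/to/argus-9b-lora-adapter")  # hypothetical adapter path
model = model.merge_and_unload()         # fold the rank-32 LoRA deltas into the base weights
model.save_pretrained("Argus-Colqwen3.5-9b-v0", safe_serialization=True)  # sharded safetensors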

Limitations

  • English-dominant; the multilingual training subset is small and we omit multilingual eval from this release.
  • 4 experts × top-2 routing adds ~5 % to end-to-end inference latency versus the dense-head baseline (the LLM forward pass dominates total cost).
  • 9B at bf16 needs ~17 GB VRAM just for weights — single-GPU inference requires a ≥ 24 GB GPU.
  • ViDoRe v3 numbers are pending; will be added once the public reproducer run finishes.

License

Apache 2.0, inherited from Qwen3.5-VL-9B-Instruct. You may use, modify, and redistribute this model commercially, with attribution.

Citation

@misc{argus2026,
  title  = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
  author = {DataScience-UIBK team},
  year   = {2026},
  url    = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-9b-v0},
}

Contact

  • Org: DataScience-UIBK, University of Innsbruck
  • Issues: open one on this repo's Community tab.