Argus-Colqwen3.5-9b-v0 · fp32 release
Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval
University of Innsbruck — Data Science group · 2026
DataScience-UIBK/Argus-Colqwen3.5-9b-v0 is an 8.8-billion-parameter visual-document retriever built on Qwen3.5-VL-9B-Instruct. It keeps a ColPali-style multi-vector (MaxSim) late-interaction scoring head but replaces the dense projection with a query-conditioned latent mixture of experts (MoE) that routes regions of visual tokens through a sparse top-2 selection of four specialists, conditioned on the query.
This is the fp32 merged release — the LoRA adapter is folded into the base in float32 to preserve trained precision. A bfloat16 companion lives at DataScience-UIBK/Argus-Colqwen3.5-9b-v0-bf16 for memory-constrained deployment. Smaller siblings: 4B fp32, 2B fp32.
TL;DR — leaderboard standing
- Co-leads the ViDoRe v1 leaderboard at V1 = 0.9267 — tied with nvidia/nemotron-vl-8b-v2 (0.927) within rounding noise, ahead of every other public retriever.
- Best Argus result on ViDoRe v2 (V2 = 0.6915), a +0.05 jump over the 4B sibling and well ahead of the strongest 4B-class peers.
- 8.8 B parameters, 1024-d per-token embedding, ≤ 2048 visual tokens / page — fits on a single 24 GB GPU at bf16 inference.
- Apache 2.0, trained on public ViDoRe + VDR-Multilingual subsets only.
What is novel here
Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. Argus replaces that dense head with a sparse mixture in which the gates depend on both the visual token and a pooled query summary, so the same page gets routed differently for different queries:
- Region pooling. Visual tokens from the backbone are grouped into 4-token regions, giving the router a coarser but spatially-aware view of the page.
- Query-conditioned latent gating (`GateScalars`). The router input is `region + region_coord_proj(coords) + query_context_proj(pooled_query)`. The query summary makes routing task-aware — e.g. a financial-numbers query routes through a different expert than a layout query, even on the exact same page.
- Sparse top-k=2 of 4 latent specialists, fused with the always-on shared dense expert via two learnable gating scalars: `final = base + sigmoid(g_s)·shared_out + sigmoid(g_e)·specialist_out` (see the code sketch below).
- Region-aware load balancing. Auxiliary losses combine load balance + KL-uniform + 0.01·router-z² to keep all four experts useful and suppress routing collapse.
- 3-stage curriculum. (a) Dense baseline (no MoE; also serves as the teacher) → (b) MoE balance warmup (gates frozen, no PEFT; the only goal is to stop expert collapse) → (c) joint retrieval with KL distillation from the dense baseline (`distillation_weight=0.5`).
For the 9B release, the joint stage was extended on the larger VDR1.5M + Docmatix mixture (vdr_docmatix_full), giving the MoE more diverse layouts to specialise on.
The router sits near the top of the backbone (layer −5) so the gating decision is informed by deep visual semantics rather than raw patch features.
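A minimal PyTorch sketch of the head described above. The module names, dimensions, and mean-pooled regions are illustrative assumptions, not the released code; only the top-2-of-4 routing, the `GateScalars` fusion formula, and the aux-loss mix mirror the description.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ArgusMoEHeadSketch(nn.Module):
    """Illustrative sketch of the query-conditioned MoE head (not the released code)."""

    def __init__(self, hidden=2048, out_dim=1024, n_experts=4, top_k=2, region=4):
        super().__init__()
        self.region, self.top_k = region, top_k
        self.base = nn.Linear(hidden, out_dim)               # dense base projection
        self.shared = nn.Linear(hidden, out_dim)             # always-on shared expert
        self.experts = nn.ModuleList(nn.Linear(hidden, out_dim) for _ in range(n_experts))
        self.router = nn.Linear(hidden, n_experts)
        self.region_coord_proj = nn.Linear(2, hidden)        # normalised (x, y) per region
        self.query_context_proj = nn.Linear(hidden, hidden)  # pooled query summary
        self.g_s = nn.Parameter(torch.zeros(()))             # GateScalars: shared gate
        self.g_e = nn.Parameter(torch.zeros(()))             # GateScalars: specialist gate

    def forward(self, vis, coords, pooled_query):
        # vis: (B, T, H) visual tokens, T divisible by region; coords: (B, T//region, 2)
        B, T, H = vis.shape
        regions = vis.view(B, T // self.region, self.region, H).mean(2)  # region pooling
        router_in = (regions
                     + self.region_coord_proj(coords)
                     + self.query_context_proj(pooled_query)[:, None, :])
        logits = self.router(router_in)                      # (B, R, n_experts)
        probs = logits.softmax(-1)
        w, idx = probs.topk(self.top_k, dim=-1)              # sparse top-2 of 4
        spec = torch.zeros(B, regions.size(1), self.base.out_features,
                           device=vis.device, dtype=vis.dtype)
        for e, expert in enumerate(self.experts):
            gate_e = ((idx == e) * w).sum(-1, keepdim=True)  # routing weight for expert e
            spec = spec + gate_e * expert(router_in)         # dense compute, for clarity
        spec = spec.repeat_interleave(self.region, dim=1)    # back to per-token resolution
        out = (self.base(vis)
               + torch.sigmoid(self.g_s) * self.shared(vis)
               + torch.sigmoid(self.g_e) * spec)
        # Aux losses: load balance + KL-to-uniform + 0.01 * router-z^2 (simplified forms)
        load = probs.mean(dim=(0, 1))                        # mean routing mass per expert
        balance = logits.size(-1) * (load * load).sum()      # balance proxy
        kl_uniform = (load * (load.clamp_min(1e-9).log() + math.log(logits.size(-1)))).sum()
        z_loss = 0.01 * logits.logsumexp(dim=-1).pow(2).mean()
        return F.normalize(out, dim=-1), balance + kl_uniform + z_loss
```

The returned auxiliary term corresponds to the routing losses listed in the table below.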
Model details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-VL-9B-Instruct |
| Total parameters | 8.82 B |
| Per-token embedding dim | 1024 |
| Max visual tokens / page | 2048 |
| Max text tokens | 32 768 |
| Similarity function | MaxSim (ColBERT / ColPali-style late interaction) |
| MoE specialists | 4 latent + 1 shared dense |
| Top-k experts per token | 2 |
| Region size (visual chunking) | 4 (so each region = 4 visual tokens) |
| Router placement | backbone layer −5 |
| Routing aux losses | load balance + KL-uniform + 0.01 · router-z² |
| Weight precision (this release) | float32 |
| License | Apache 2.0 |
| Model size on disk | ~33 GB |
| VRAM @ bf16 inference | ~17 GB |
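The similarity row refers to standard ColBERT-style late interaction: every query-token embedding is matched against its best document-token embedding and the maxima are summed. A generic reference implementation (not code from this repo):

```python
import torch

def maxsim(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Late-interaction score: best document-token match per query token, summed.
    q: (n_query_tokens, 1024), d: (n_doc_tokens, 1024), both L2-normalised."""
    return (q @ d.T).max(dim=1).values.sum()
```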
Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)
Per-task scores measured with the official mteb 2.12 library on the published weights, side-by-side with every Argus sibling for transparency.
| Task | 2B fp32 | 2B bf16 | 4B fp32 | 4B bf16 | 9B fp32 (this) | 9B bf16 |
|---|---|---|---|---|---|---|
| ArxivQA | 0.9027 | 0.9027 | 0.9095 | 0.9126 | 0.9228 | 0.9217 |
| DocVQA | 0.6747 | 0.6747 | 0.6770 | 0.6779 | 0.6809 | 0.6826 |
| InfoVQA | 0.9497 | 0.9497 | 0.9463 | 0.9447 | 0.9426 | 0.9449 |
| ShiftProject | 0.9133 | 0.9133 | 0.9470 | 0.9346 | 0.9365 | 0.9298 |
| SyntheticDocQA-AI | 0.9963 | 0.9963 | 0.9963 | 0.9926 | 0.9963 | 0.9926 |
| SyntheticDocQA-Energy | 0.9726 | 0.9726 | 0.9789 | 0.9750 | 0.9732 | 0.9769 |
| SyntheticDocQA-Government | 0.9729 | 0.9729 | 0.9779 | 0.9779 | 0.9889 | 0.9889 |
| SyntheticDocQA-Healthcare | 0.9926 | 0.9926 | 0.9963 | 0.9963 | 0.9963 | 0.9926 |
| TabFQuAD | 0.9336 | 0.9336 | 0.9533 | 0.9544 | 0.9750 | 0.9724 |
| TatDQA | 0.8403 | 0.8403 | 0.8480 | 0.8485 | 0.8545 | 0.8567 |
| Average | 0.9149 | 0.9149 | 0.9230 | 0.9214 | 0.9267 | 0.9259 |
The 9B model leads on 6 of 10 V1 tasks and ties on most of the rest. The 4B sibling still wins on ShiftProject + SyntheticDocQA-Energy (~0.005–0.010 — at noise level). The 2B sibling has a small edge on InfoVQA — likely a regularisation effect on smaller backbones for layout-driven QA.
ViDoRe v1 — overall leaderboard comparison
| Rank | Model | Params | dim | V1 avg |
|---|---|---|---|---|
| 1 | Argus-Colqwen3.5-9b-v0 (this, fp32) | 8.8 B | 1024 | 0.9267 |
| 1 | nvidia/nemotron-vl-8b-v2 | 8.0 B | hidden | 0.927 |
| 3 | Argus-Colqwen3.5-4b-v0 (sibling, fp32) | 4.0 B | 1024 | 0.9230 |
| 4 | nvidia/llama-nemotron-colembed-vl-3b-v2 | 3.0 B | hidden | 0.917 |
| 5 | nvidia/nemotron-colembed-vl-4b-v2 | 4.0 B | hidden | 0.916 |
| 6 | athrael-soju/colqwen3.5-4.5B-v3 | 4.5 B | 320 | 0.915 |
| 7 | OpenSearch-AI/Ops-Colqwen3-4B | 4.0 B | 2560 | 0.914 |
| 8 | Argus-Colqwen3.5-2b-v0 (sibling, fp32) | 2.3 B | 1024 | 0.9149 |
(0.927 − 0.9267 is a gap of just 0.0003 — within rounding/eval noise of a tie. Argus also wins by a clearer margin on V2; see below.)
Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)
| Task | 2B fp32 | 2B bf16 | 4B fp32 | 4B bf16 | 9B fp32 (this) | 9B bf16 |
|---|---|---|---|---|---|---|
| BioMedicalLectures | 0.6499 | 0.6499 | 0.6438 | 0.6349 | 0.6619 | 0.6633 |
| ESGReports-HighLevel | 0.6936 | 0.6936 | 0.6991 | 0.7079 | 0.7905 | 0.7912 |
| ESGReports | 0.5988 | 0.5988 | 0.6218 | 0.6175 | 0.6760 | 0.6764 |
| EconomicsReports | 0.5186 | 0.5186 | 0.5980 | 0.5918 | 0.6377 | 0.6278 |
| Average | 0.6152 | 0.6152 | 0.6407 | 0.6380 | 0.6915 | 0.6897 |
The V2 jump from 4B to 9B (+0.05 on average) is the largest improvement we see across the Argus family — the bigger backbone helps on layout-heavy ESG reports + dense numeric economics pages where the 4B was visibly behind Ops-Colqwen3-4B.
ViDoRe v2 — overall context
| Model | V2 avg |
|---|---|
| Argus-Colqwen3.5-9b-v0 (fp32, this) | 0.6915 |
| Ops-Colqwen3-4B (dim 2560) | 0.687 |
| TomoroAI/tomoro-colqwen3-embed-4b | 0.660 |
| Argus-Colqwen3.5-4b-v0 (sibling, fp32) | 0.6407 |
| Argus-Colqwen3.5-2b-v0 (sibling, fp32) | 0.6152 |
Argus 9B is the first sub-10B retriever to clear V2 = 0.69 while keeping the per-token embedding at 1024-d (vs Ops's 2560-d, a 2.5× storage cost).
ViDoRe v3
Not yet evaluated for this release. Numbers will be added in a follow-up commit once the v3 reproducer run completes.
Storage cost
Per-document storage for an indexed corpus, assuming bf16 token embeddings:
| Model | Tokens/page | Dim | Bytes/page |
|---|---|---|---|
| Ops-Colqwen3-4B | 1280 | 2560 | 6.6 MB |
| Argus-Colqwen3.5-9b-v0 | 2048 | 1024 | 4.2 MB |
| Argus-Colqwen3.5-4b-v0 | 2048 | 1024 | 4.2 MB |
| Argus-Colqwen3.5-2b-v0 | 2048 | 1024 | 4.2 MB |
| TomoroAI/tomoro-colqwen3-embed-4b | 1280 | 320 | 0.8 MB |
Per-page corpus storage is identical across the Argus family, so the trade-off between siblings is inference cost (the 9B is the slowest) and GPU memory, not corpus size on disk.
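The bytes-per-page column is just tokens × dim × 2 bytes for bf16 values; a quick sanity check:

```python
def mb_per_page(tokens: int, dim: int, bytes_per_value: int = 2) -> float:
    """Per-page index size in decimal MB for bf16 multi-vector embeddings."""
    return tokens * dim * bytes_per_value / 1e6

print(f"{mb_per_page(2048, 1024):.1f}")  # 4.2 — all Argus models
print(f"{mb_per_page(1280, 2560):.1f}")  # 6.6 — Ops-Colqwen3-4B
```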
Installation
```bash
# Qwen3.5-VL is only in transformers 5.x
pip install "transformers>=5.0.0,<6.0.0"
# MTEB 2.12 ships transformers 4.57.6 by default — upgrade explicitly afterwards
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0,<6.0"
# Optional: faster attention on Hopper / Ampere
pip install flash-attn==2.6.3 --no-build-isolation
```
After upgrading transformers, wipe the cached remote-code modules so the new ones load:
```bash
rm -rf ~/.cache/huggingface/modules/transformers_modules
```
Usage — text + image retrieval
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-9b-v0"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16  # or torch.float32 for max precision

model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=DTYPE,
    attn_implementation="flash_attention_2",  # or None / "sdpa"
    device_map=DEVICE,
).eval()

processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=2048,
)

queries = [
    "What is the company's revenue in 2019?",
    "How does the proposed model compare to baselines?",
]
documents = [
    Image.open("page_a.png").convert("RGB"),
    Image.open("page_b.png").convert("RGB"),
]

q_emb = model.encode_queries(processor, queries)
d_emb = model.encode_images(processor, documents)
scores = processor.score(q_emb, d_emb)
print(scores)
```
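Assuming `processor.score` behaves like other ColPali-style scorers and returns an `(n_queries, n_documents)` tensor of MaxSim scores, ranking pages per query is a one-liner:

```python
# Post-processing sketch; the (n_queries, n_documents) shape is an assumption
ranking = scores.argsort(dim=-1, descending=True)
print(ranking[0].tolist())  # document indices for the first query, best match first
```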
Reproduce the leaderboard ViDoRe results with MTEB
```python
import mteb

m = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-9b-v0")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 2})
```
A single H100 80 GB completes the full V1 + V2 run in roughly 6–8 hours for the 9B fp32 (about 2× the 4B runtime). Use batch_size=2 for safety; 4 may OOM on 80 GB once activations + KV cache stack up.
Reproduce on the official ViDoRe-benchmark library
```bash
pip install vidore-benchmark

vidore-benchmark evaluate-retriever \
    --model-class colqwen2 \
    --model-name DataScience-UIBK/Argus-Colqwen3.5-9b-v0 \
    --collection-name vidore-v1
```
Training
| Setting | Value |
|---|---|
| Backbone | Qwen/Qwen3.5-VL-9B-Instruct (Apache-2.0) |
| Stage 1 — dense baseline | trains the standard ColPali head; serves as the teacher |
| Stage 2 — MoE balance warmup | gates frozen, no PEFT, short — only goal is to prevent expert collapse |
| Stage 3 — joint retrieval w/ distillation | PEFT on, gates trainable, KL distillation from stage-1 teacher (distillation_weight=0.5); train mix = vdr_docmatix_full (VDR1.5M + Docmatix) |
| LoRA rank | 32 (folded into base for this release via merge_and_unload() in fp32) |
| Datasets | vidore/colpali_train_set + llamaindex/vdr-multilingual-train (subsets) + Docmatix-IR (in-domain) |
| Hardware | 4 × NVIDIA H100 80 GB (zen4_0768_h100x4 partition, UIBK LEO5 cluster) |
| Optimiser | AdamW, lr = 5e-5 with linear warmup |
| Precision | bf16 forward / fp32 master + LoRA |
| Effective batch size | 64 |
The merge step that produced this release was run in float32 throughout (merge_and_unload() on the LoRA adapter, then sharded to safetensors). The companion bf16 release ran the same merge in bfloat16 — see the bf16 sibling card.
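For concreteness, a hedged sketch of the stage-3 objective implied by the table. The in-batch contrastive form and the unit weight on the aux term are assumptions; only `distillation_weight=0.5` comes from the card:

```python
import torch.nn.functional as F

def stage3_loss(student_scores, teacher_scores, labels, aux, distillation_weight=0.5):
    """Sketch: contrastive retrieval loss + KL distillation from the frozen
    stage-1 dense teacher + MoE routing auxiliary losses."""
    retrieval = F.cross_entropy(student_scores, labels)          # in-batch contrastive
    distill = F.kl_div(F.log_softmax(student_scores, dim=-1),
                       F.softmax(teacher_scores, dim=-1),
                       reduction="batchmean")
    return retrieval + distillation_weight * distill + aux
```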
Limitations
- English-dominant; the multilingual training subset is small and we omit multilingual eval from this release.
- 4 experts × top-2 routing adds ~5 % to total inference latency vs the dense backbone (the LLM dominates total cost).
- 9B at bf16 needs ~17 GB VRAM just for weights — single-GPU inference requires a ≥ 24 GB GPU.
- ViDoRe v3 numbers are pending; will be added once the public reproducer run finishes.
License
Apache 2.0, inherited from Qwen3.5-VL-9B-Instruct. You may use, modify, and redistribute this model commercially, with attribution.
Citation
```bibtex
@misc{argus2026,
  title  = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
  author = {DataScience-UIBK team},
  year   = {2026},
  url    = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-9b-v0},
}
```
Contact
- Org: DataScience-UIBK, University of Innsbruck
- Issues: open one on this repo's Community tab.