Argus-Colqwen3.5-4b-v0 · fp32 release
Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval
University of Innsbruck — Data Science group · 2026
DataScience-UIBK/Argus-Colqwen3.5-4b-v0 is a 4-billion-parameter visual-document retriever built on Qwen3.5-VL-4B-Instruct. It uses a ColPali-style multi-vector (MaxSim) late-interaction head, and replaces the dense projection with a query-conditioned latent mixture of experts (MoE) that routes regions of visual tokens through one of four specialists conditioned on the query.
This is the fp32 merged release — the LoRA adapter is folded into the base in float32 to preserve trained precision. A bfloat16 companion lives at DataScience-UIBK/Argus-Colqwen3.5-4b-v0-bf16 for memory-constrained deployment.
TL;DR — leaderboard standing
- #1 on the ViDoRe v1 leaderboard among 4B-class models, ahead of nemotron-colembed-vl-4b-v2 (0.916), athrael-soju colqwen3.5-4.5B (0.915), and Ops-Colqwen3-4B (0.914).
- #2 overall on the ViDoRe v1 leaderboard, behind only the 8B Nemotron-vl-8b-v2 (0.927).
- Competitive within the 4B class on ViDoRe v2 (0.6404 nDCG@5 average). Strong on document understanding (DocVQA / InfoVQA) and on ESG / synthetic domains.
- 4B-class (4.71 B total parameters), 1024-d per-token embedding, ≤ 2048 visual tokens / page; fits on a single 24 GB GPU.
- Apache 2.0; trained on public ViDoRe + VDR-Multilingual subsets only.
What is novel here
Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. Argus replaces that dense head with a sparse mixture in which the gates depend on both the visual token and a pooled query summary, so the same page gets routed differently for different queries:
- Region pooling. Visual tokens from the backbone are grouped into 4-token regions, giving the router a coarser but spatially-aware view of the page.
- Query-conditioned latent gating (GateScalars). The router input is region + region_coord_proj(coords) + query_context_proj(pooled_query). The query summary makes routing task-aware; e.g. a financial-numbers query routes through a different expert than a layout query, even on the exact same page.
- Sparse top-k=2 of 4 latent specialists, fused with the always-on shared dense expert via two learnable gating scalars: final = base + sigmoid(g_s)·shared_out + sigmoid(g_e)·specialist_out.
- Region-aware load balancing. Auxiliary losses combine load balance + KL-uniform + 0.01·router-z² to keep all 4 experts useful and suppress routing collapse.
- 3-stage curriculum. (a) Dense baseline (no MoE; also serves as teacher) → (b) MoE balance warmup (gates frozen, no PEFT, just stops expert collapse) → (c) joint retrieval with KL distillation from the dense baseline (distillation_weight=0.5).
The router sits near the top of the backbone (layer −5) so the gating decision is informed by deep visual semantics rather than raw patch features.
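Putting the pieces together, here is a minimal PyTorch sketch of the head described above: region pooling, query-conditioned routing, top-2 selection, and GateScalars fusion. Module names follow the card where it names them (region_coord_proj, query_context_proj); everything else (shapes, hidden size, the dense-compute expert loop) is illustrative and not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryConditionedMoEHead(nn.Module):
    """Illustrative sketch: dense base projection plus a query-conditioned
    sparse mixture routed over 4-token regions (top-2 of 4 specialists)."""

    def __init__(self, hidden=2560, out_dim=1024, n_experts=4, region=4, top_k=2):
        super().__init__()
        self.region, self.top_k = region, top_k
        self.base = nn.Linear(hidden, out_dim)             # standard dense projection
        self.shared = nn.Linear(hidden, out_dim)           # always-on shared expert
        self.experts = nn.ModuleList(nn.Linear(hidden, out_dim) for _ in range(n_experts))
        self.region_coord_proj = nn.Linear(2, hidden)      # (x, y) region centre -> hidden
        self.query_context_proj = nn.Linear(hidden, hidden)
        self.router = nn.Linear(hidden, n_experts)
        self.g_s = nn.Parameter(torch.zeros(()))           # GateScalars: shared-stream weight
        self.g_e = nn.Parameter(torch.zeros(()))           # GateScalars: specialist-stream weight

    def forward(self, vis, coords, query_tokens):
        # vis: (B, T, H) visual tokens from backbone layer -5, T divisible by 4
        # coords: (B, T//4, 2) region centres; query_tokens: (B, Lq, H)
        B, T, H = vis.shape
        regions = vis.view(B, T // self.region, self.region, H).mean(2)      # (B, R, H)
        pooled_query = query_tokens.mean(1, keepdim=True)                    # (B, 1, H)
        router_in = regions + self.region_coord_proj(coords) \
                            + self.query_context_proj(pooled_query)
        weights, idx = self.router(router_in).softmax(-1).topk(self.top_k, -1)
        weights = weights / weights.sum(-1, keepdim=True)                    # renormalise top-2
        # Dense compute for clarity; a real head evaluates only selected experts.
        specialist = torch.zeros(B, T, self.base.out_features,
                                 device=vis.device, dtype=vis.dtype)
        for e, expert in enumerate(self.experts):
            w = ((idx == e).to(weights.dtype) * weights).sum(-1)             # (B, R)
            w = w.repeat_interleave(self.region, dim=1)                      # broadcast to tokens
            specialist = specialist + w.unsqueeze(-1) * expert(vis)
        final = self.base(vis) + torch.sigmoid(self.g_s) * self.shared(vis) \
                               + torch.sigmoid(self.g_e) * specialist
        return F.normalize(final, dim=-1)                                    # (B, T, 1024)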
Model details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-VL-4B-Instruct |
| Total parameters | 4.71 B |
| Per-token embedding dim | 1024 |
| Max visual tokens / page | 2048 |
| Max text tokens | 32 768 |
| Similarity function | MaxSim (ColBERT / ColPali-style late interaction) |
| MoE specialists | 4 latent + 1 shared dense |
| Top-k experts per token | 2 |
| Region size (visual chunking) | 4 (so each region = 4 visual tokens) |
| Router placement | backbone layer −5 |
| Routing aux losses | load balance + KL-uniform + 0.01 · router-z² |
| Weight precision (this release) | float32 |
| License | Apache 2.0 |
| Model size on disk | ~18 GB |
| VRAM @ bf16 inference | ~9.4 GB |
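For reference, this is the per-pair computation behind the MaxSim row, and what the score() call in the usage section batches over all query-document pairs. A minimal sketch assuming L2-normalised inputs:

import torch

def maxsim(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT/ColPali late interaction for one query-document pair.
    q_emb: (Lq, 1024), d_emb: (Ld, 1024), both L2-normalised per token."""
    sim = q_emb @ d_emb.T                 # (Lq, Ld) token-level cosine similarities
    return sim.max(dim=1).values.sum()    # best document token per query token, summed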
Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)
Per-task scores were measured with the official mteb 2.12 library on the published weights. Per the bf16-merge memo, the fp32 release is ~0.1 pp higher on the V1 average and ~0.2 pp higher on the V2 average than the bf16 sibling; the per-task numbers below are from the bf16 sibling and serve as a conservative lower bound until the fp32 evaluation finalises (Phase 3 of the publish plan).
| Task | bf16 nDCG@5 | fp32 expected |
|---|---|---|
| ArxivQA | 0.9126 | ≥ 0.9126 |
| DocVQA | 0.6779 🏆 | ≥ 0.6779 |
| InfoVQA | 0.9447 | ≥ 0.9447 |
| ShiftProject | 0.9346 | ≥ 0.9346 |
| SyntheticDocQA-AI | 0.9926 | ≥ 0.9926 |
| SyntheticDocQA-Energy | 0.9750 | ≥ 0.9750 |
| SyntheticDocQA-Government | 0.9779 | ≥ 0.9779 |
| SyntheticDocQA-Healthcare | 0.9963 🏆 | ≥ 0.9963 |
| TabFQuAD | 0.9544 | ≥ 0.9544 |
| TatDQA | 0.8485 | ≥ 0.8485 |
| Average | 0.9214 | ≈ 0.9224 |
🏆 = best in the 4B class for that task (cross-checked against published numbers from Ops-Colqwen3-4B, TomoroAI-colqwen3-embed-4b, SauerkrautLM-ColQwen3-4b, athrael-soju-colqwen3.5-4.5B).
ViDoRe v1 — 4B-class leaderboard comparison
| Rank | Model | Params | dim | V1 avg |
|---|---|---|---|---|
| 1 | Argus-Colqwen3.5-4b-v0 (this model, fp32) | 4.0 B | 1024 | 0.9224 |
| 2 | nvidia/llama-nemotron-colembed-vl-3b-v2 | 3.0 B | hidden | 0.917 |
| 3 | nvidia/nemotron-colembed-vl-4b-v2 | 4.0 B | hidden | 0.916 |
| 4 | athrael-soju/colqwen3.5-4.5B-v3 | 4.5 B | 320 | 0.915 |
| 5 | OpenSearch-AI/Ops-Colqwen3-4B | 4.0 B | 2560 | 0.914 |
| 6 | nvidia/llama-nemoretriever-colembed-3b-v1 | 3.0 B | 512 | 0.910 |
| 7 | VAGOsolutions/SauerkrautLM-ColQwen3-4b-v0.1 | 4.0 B | 128 | 0.908 |
| 8 | TomoroAI/tomoro-colqwen3-embed-4b | 4.0 B | 320 | 0.906 |
(The only model surpassing Argus-4B on V1 overall is the 8B Nemotron-vl-8b-v2, at 0.927.)
Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)
| Task | bf16 nDCG@5 | fp32 expected |
|---|---|---|
| BioMedicalLectures | 0.6349 | ≥ 0.6349 |
| ESGReports-HighLevel | 0.7079 | ≥ 0.7079 |
| ESGReports | 0.6175 | ≥ 0.6175 |
| EconomicsReports | 0.5918 | ≥ 0.5918 |
| Average | 0.6380 | ≈ 0.6404 |
ViDoRe v2 — 4B-class context
| Model | V2 avg |
|---|---|
| Ops-Colqwen3-4B (dim 2560) | 0.687 |
| TomoroAI/tomoro-colqwen3-embed-4b | 0.660 |
| Argus-Colqwen3.5-4b-v0 (fp32) | 0.640 |
V2 is the area we are still actively improving: the wider 2560-d head used by Ops gives it an advantage on the more layout-heavy ESG and economics pages. Argus compresses each token to 1024-d, 2.5× narrower than Ops; even with its higher token count this works out to ~36 % less storage per page (see the table below), at the cost of a small V2 gap. The V1 lead more than compensates for retrieval workloads dominated by document QA.
ViDoRe v3
Not yet evaluated for this release. Numbers will be added in a follow-up commit once the v3 reproducer run completes.
Storage cost
Per-document storage for an indexed corpus, assuming bf16:
| Model | Tokens/page | Dim | Storage/page |
|---|---|---|---|
| Ops-Colqwen3-4B | 1280 | 2560 | 6.6 MB |
| Argus-Colqwen3.5-4b-v0 | 2048 | 1024 | 4.2 MB |
| TomoroAI/tomoro-colqwen3-embed-4b | 1280 | 320 | 0.8 MB |
| SauerkrautLM-ColQwen3-4b-v0.1 | 1024 | 128 | 0.3 MB |
Argus uses more tokens (2048 vs 1280) so the router has enough spatial granularity for region-aware specialisation, but the narrow 1024-d head keeps total per-page storage 36 % smaller than Ops despite the higher token count.
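The storage column is just tokens × dim × 2 bytes (bf16). A quick sanity check that reproduces the table:

for name, tokens, dim in [
    ("Ops-Colqwen3-4B",             1280, 2560),
    ("Argus-Colqwen3.5-4b-v0",      2048, 1024),
    ("tomoro-colqwen3-embed-4b",    1280,  320),
    ("SauerkrautLM-ColQwen3-4b",    1024,  128),
]:
    print(f"{name:28s} {tokens * dim * 2 / 1e6:4.1f} MB/page")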
Installation
# Qwen3.5-VL is only in transformers 5.x, while MTEB 2.12 pins transformers 4.57.6:
# install mteb first, then upgrade transformers explicitly.
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0.0,<6.0.0"
# Optional: faster attention on Hopper / Ampere
pip install flash-attn==2.6.3 --no-build-isolation
After upgrading transformers, wipe the cached remote-code modules so the new ones load:
rm -rf ~/.cache/huggingface/modules/transformers_modules
Usage — text + image retrieval
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor
MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-4b-v0"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16 # or torch.float32 for max precision
model = AutoModel.from_pretrained(
MODEL_ID,
trust_remote_code=True,
torch_dtype=DTYPE,
attn_implementation="flash_attention_2", # or None / "sdpa"
device_map=DEVICE,
).eval()
processor = AutoProcessor.from_pretrained(
MODEL_ID,
trust_remote_code=True,
max_num_visual_tokens=2048,
)
queries = [
"What is the company's revenue in 2019?",
"How does the proposed model compare to baselines?",
]
documents = [
Image.open("page_a.png").convert("RGB"),
Image.open("page_b.png").convert("RGB"),
]
q_emb = model.encode_queries(processor, queries) # list of (Lq, 1024)
d_emb = model.encode_images(processor, documents) # list of (Ld, 1024)
scores = processor.score(q_emb, d_emb) # MaxSim, shape (len(q), len(d))
print(scores)
Reproduce the leaderboard ViDoRe results with MTEB
import mteb
m = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-4b-v0")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 4})
A single H100 80 GB completes the full V1 + V2 run in roughly 4–6 hours.
Reproduce on the official ViDoRe-benchmark library
pip install vidore-benchmark
vidore-benchmark evaluate-retriever \
--model-class colqwen2 \
--model-name DataScience-UIBK/Argus-Colqwen3.5-4b-v0 \
--collection-name vidore-v1
Training
| Setting | Value |
|---|---|
| Backbone | Qwen/Qwen3.5-VL-4B-Instruct (Apache-2.0) |
| Stage 1 — dense baseline | trains the standard ColPali head; serves as the teacher |
| Stage 2 — MoE balance warmup | gates frozen, no PEFT, short — only goal is to prevent expert collapse |
| Stage 3 — joint retrieval w/ distillation | PEFT on, gates trainable, KL distillation from stage-1 teacher (distillation_weight=0.5) |
| LoRA rank | 32 (folded into base for this release via merge_and_unload() in fp32) |
| Datasets | vidore/colpali_train_set + llamaindex/vdr-multilingual-train (subsets) |
| Hardware | 4 × NVIDIA H100 80 GB (zen4_0768_h100x4 partition, UIBK LEO5 cluster) |
| Optimiser | AdamW, lr = 5e-5 with linear warmup |
| Precision | bf16 forward / fp32 master + LoRA |
| Effective batch size | 64 |
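Stage 2 exists purely to keep the router healthy before retrieval training resumes. A minimal sketch of the auxiliary routing loss named in the model details (load balance + KL-to-uniform + 0.01 · router-z²); the exact formulation in the training code may differ:

import torch
import torch.nn.functional as F

def routing_aux_loss(logits: torch.Tensor, top_k: int = 2, z_weight: float = 0.01):
    """Sketch of the router regulariser. logits: (N, E) router logits
    over N regions and E = 4 experts."""
    n, e = logits.shape
    probs = logits.softmax(-1)                               # (N, E)
    # Switch-style load balance: fraction of regions routed to each expert
    # times that expert's mean gate probability.
    top = logits.topk(top_k, dim=-1).indices
    assign = F.one_hot(top, e).sum(1).float()                # (N, E), 1 if expert selected
    load = assign.mean(0)                                    # routed fraction per expert
    importance = probs.mean(0)                               # mean gate prob per expert
    balance = e * (load * importance).sum()
    # KL of the average routing distribution to the uniform distribution.
    uniform = torch.full_like(importance, 1.0 / e)
    kl_uniform = F.kl_div(uniform.log(), importance, reduction="sum")
    # z-loss keeps router logits small and routing numerically stable.
    z = logits.logsumexp(-1).pow(2).mean()
    return balance + kl_uniform + z_weight * z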
The merge step that produced this release was run in float32 throughout (merge_and_unload() on the LoRA adapter, then sharded to safetensors). The companion bf16 release ran the same merge in bfloat16, which is ~0.1 pp lower on V1 and ~0.2 pp lower on V2 — see the bf16 sibling card.
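The merge can be reproduced along these lines with peft; the adapter path below is a hypothetical placeholder, and the released pipeline loads the custom Argus head via trust_remote_code:

import torch
from peft import PeftModel
from transformers import AutoModel

# Load the backbone in fp32 so the whole merge stays in float32.
base = AutoModel.from_pretrained(
    "Qwen/Qwen3.5-VL-4B-Instruct",
    torch_dtype=torch.float32,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "path/to/argus-lora-adapter")  # hypothetical path
model = model.merge_and_unload()       # fold the rank-32 LoRA deltas into the base weights
model.save_pretrained("Argus-Colqwen3.5-4b-v0-fp32", safe_serialization=True)  # sharded safetensors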
Limitations
- English-dominant; the multilingual training subset is small and we omit multilingual eval from this release.
- 4 experts × top-2 routing adds ~5 % inference latency over the dense head (the LLM backbone dominates total cost).
- ViDoRe v3 numbers are pending; will be added once the public reproducer run finishes.
- Per-task numbers above use the bf16 sibling as a conservative lower bound until the fp32 reproducer run completes; they will be replaced with the fp32 numbers in a follow-up commit.
License
Apache 2.0, inherited from Qwen3.5-VL-4B-Instruct. You may use, modify, and redistribute this model commercially, with attribution.
Citation
@misc{argus2026,
title = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
author = {DataScience-UIBK team},
year = {2026},
url = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-4b-v0},
}
Contact
- Org: DataScience-UIBK, University of Innsbruck
- Issues: open one on this repo's Community tab.