Argus-Colqwen3.5-4b-v0 · fp32 release
Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval
University of Innsbruck — Data Science group · 2026
DataScience-UIBK/Argus-Colqwen3.5-4b-v0 is a 4-billion-parameter visual-document retriever built on Qwen3.5-VL-4B-Instruct. It uses a ColPali-style multi-vector (MaxSim) late-interaction head, and replaces the dense projection with a query-conditioned latent mixture of experts (MoE) that routes regions of visual tokens through one of four specialists conditioned on the query.
This is the fp32 merged release — the LoRA adapter is folded into the base in float32 to preserve trained precision. A bfloat16 companion lives at DataScience-UIBK/Argus-Colqwen3.5-4b-v0-bf16 for memory-constrained deployment.
TL;DR — leaderboard standing
- #1 on the ViDoRe v1 leaderboard among 4B-class models, ahead of nemotron-colembed-vl-4b-v2 (0.916), athrael-soju colqwen3.5-4.5B (0.915), and Ops-Colqwen3-4B (0.914).
- #2 overall on the ViDoRe v1 leaderboard, behind only the 8B Nemotron-vl-8b-v2 (0.927).
- Competitive within the 4B class on ViDoRe v2 (0.6404 nDCG@5 average). Strong on document understanding (DocVQA / InfoVQA) and on ESG / synthetic domains.
- 4B-class (4.71 B total parameters), 1024-d per-token embedding, ≤ 2048 visual tokens / page; fits on a single 24 GB GPU.
- Apache 2.0; trained on public ViDoRe + VDR-Multilingual subsets only.
What is novel here
Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. Argus replaces that dense head with a sparse mixture in which the gates depend on both the visual token and a pooled query summary, so the same page gets routed differently for different queries:
- Region pooling. Visual tokens from the backbone are grouped into 4-token regions, giving the router a coarser but spatially-aware view of the page.
- Query-conditioned latent gating (GateScalars). The router input is region + region_coord_proj(coords) + query_context_proj(pooled_query). The query summary makes routing task-aware; e.g. a financial-numbers query routes through a different expert than a layout query, even on the exact same page.
- Sparse top-k=2 of 4 latent specialists, fused with the always-on shared dense expert via two learnable gating scalars: final = base + sigmoid(g_s)·shared_out + sigmoid(g_e)·specialist_out.
- Region-aware load balancing. Auxiliary losses combine load balance + KL-uniform + 0.01·router-z² to keep all 4 experts useful and suppress routing collapse.
- 3-stage curriculum. (a) Dense baseline (no MoE; also serves as teacher) → (b) MoE balance warmup (gates frozen, no PEFT, just stops expert collapse) → (c) joint retrieval with KL distillation from the dense baseline (distillation_weight=0.5).
The router sits near the top of the backbone (layer −5) so the gating decision is informed by deep visual semantics rather than raw patch features.
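Putting the pieces together, here is a minimal PyTorch sketch of the head described above: region pooling, query-conditioned routing, top-2 selection, and GateScalars fusion. Module names follow the card where it names them (region_coord_proj, query_context_proj); everything else (shapes, hidden size, the dense-compute expert loop) is illustrative and not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryConditionedMoEHead(nn.Module):
    """Illustrative sketch: dense base projection plus a query-conditioned
    sparse mixture routed over 4-token regions (top-2 of 4 specialists)."""

    def __init__(self, hidden=2560, out_dim=1024, n_experts=4, region=4, top_k=2):
        super().__init__()
        self.region, self.top_k = region, top_k
        self.base = nn.Linear(hidden, out_dim)             # standard dense projection
        self.shared = nn.Linear(hidden, out_dim)           # always-on shared expert
        self.experts = nn.ModuleList(nn.Linear(hidden, out_dim) for _ in range(n_experts))
        self.region_coord_proj = nn.Linear(2, hidden)      # (x, y) region centre -> hidden
        self.query_context_proj = nn.Linear(hidden, hidden)
        self.router = nn.Linear(hidden, n_experts)
        self.g_s = nn.Parameter(torch.zeros(()))           # GateScalars: shared-stream weight
        self.g_e = nn.Parameter(torch.zeros(()))           # GateScalars: specialist-stream weight

    def forward(self, vis, coords, query_tokens):
        # vis: (B, T, H) visual tokens from backbone layer -5, T divisible by 4
        # coords: (B, T//4, 2) region centres; query_tokens: (B, Lq, H)
        B, T, H = vis.shape
        regions = vis.view(B, T // self.region, self.region, H).mean(2)      # (B, R, H)
        pooled_query = query_tokens.mean(1, keepdim=True)                    # (B, 1, H)
        router_in = regions + self.region_coord_proj(coords) \
                            + self.query_context_proj(pooled_query)
        weights, idx = self.router(router_in).softmax(-1).topk(self.top_k, -1)
        weights = weights / weights.sum(-1, keepdim=True)                    # renormalise top-2
        # Dense compute for clarity; a real head evaluates only selected experts.
        specialist = torch.zeros(B, T, self.base.out_features,
                                 device=vis.device, dtype=vis.dtype)
        for e, expert in enumerate(self.experts):
            w = ((idx == e).to(weights.dtype) * weights).sum(-1)             # (B, R)
            w = w.repeat_interleave(self.region, dim=1)                      # broadcast to tokens
            specialist = specialist + w.unsqueeze(-1) * expert(vis)
        final = self.base(vis) + torch.sigmoid(self.g_s) * self.shared(vis) \
                               + torch.sigmoid(self.g_e) * specialist
        return F.normalize(final, dim=-1)                                    # (B, T, 1024)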
Model details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-VL-4B-Instruct |
| Total parameters | 4.71 B |
| Per-token embedding dim | 1024 |
| Max visual tokens / page | 2048 |
| Max text tokens | 32 768 |
| Similarity function | MaxSim (ColBERT / ColPali-style late interaction) |
| MoE specialists | 4 latent + 1 shared dense |
| Top-k experts per token | 2 |
| Region size (visual chunking) | 4 (so each region = 4 visual tokens) |
| Router placement | backbone layer −5 |
| Routing aux losses | load balance + KL-uniform + 0.01 · router-z² |
| Weight precision (this release) | float32 |
| License | Apache 2.0 |
| Model size on disk | ~18 GB |
| VRAM @ bf16 inference | ~9.4 GB |
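For reference, this is the per-pair computation behind the MaxSim row, and what the score() call in the usage section batches over all query-document pairs. A minimal sketch assuming L2-normalised inputs:

import torch

def maxsim(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT/ColPali late interaction for one query-document pair.
    q_emb: (Lq, 1024), d_emb: (Ld, 1024), both L2-normalised per token."""
    sim = q_emb @ d_emb.T                 # (Lq, Ld) token-level cosine similarities
    return sim.max(dim=1).values.sum()    # best document token per query token, summed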
Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)
Per-task scores were measured with the official mteb 2.12 library on the published weights. Per the bf16-merge memo, the fp32 release is ~0.1 pp higher on the V1 average and ~0.2 pp higher on the V2 average than the bf16 sibling; the per-task numbers below are from the bf16 sibling and serve as a conservative lower bound until the fp32 evaluation finalises (Phase 3 of the publish plan).
| Task | bf16 nDCG@5 | fp32 expected |
|---|---|---|
| ArxivQA | 0.9126 | ≥ 0.9126 |
| DocVQA | 0.6779 🏆 | ≥ 0.6779 |
| InfoVQA | 0.9447 | ≥ 0.9447 |
| ShiftProject | 0.9346 | ≥ 0.9346 |
| SyntheticDocQA-AI | 0.9926 | ≥ 0.9926 |
| SyntheticDocQA-Energy | 0.9750 | ≥ 0.9750 |
| SyntheticDocQA-Government | 0.9779 | ≥ 0.9779 |
| SyntheticDocQA-Healthcare | 0.9963 🏆 | ≥ 0.9963 |
| TabFQuAD | 0.9544 | ≥ 0.9544 |
| TatDQA | 0.8485 | ≥ 0.8485 |
| Average | 0.9214 | ≈ 0.9224 |
🏆 = best in the 4B class for that task (cross-checked against published numbers from Ops-Colqwen3-4B, TomoroAI-colqwen3-embed-4b, SauerkrautLM-ColQwen3-4b, athrael-soju-colqwen3.5-4.5B).
ViDoRe v1 — 4B-class leaderboard comparison
| Rank | Model | Params | dim | V1 avg |
|---|---|---|---|---|
| 1 | Argus-Colqwen3.5-4b-v0 (this model, fp32) | 4.0 B | 1024 | 0.9224 |
| 2 | nvidia/llama-nemotron-colembed-vl-3b-v2 | 3.0 B | hidden | 0.917 |
| 3 | nvidia/nemotron-colembed-vl-4b-v2 | 4.0 B | hidden | 0.916 |
| 4 | athrael-soju/colqwen3.5-4.5B-v3 | 4.5 B | 320 | 0.915 |
| 5 | OpenSearch-AI/Ops-Colqwen3-4B | 4.0 B | 2560 | 0.914 |
| 6 | nvidia/llama-nemoretriever-colembed-3b-v1 | 3.0 B | 512 | 0.910 |
| 7 | VAGOsolutions/SauerkrautLM-ColQwen3-4b-v0.1 | 4.0 B | 128 | 0.908 |
| 8 | TomoroAI/tomoro-colqwen3-embed-4b | 4.0 B | 320 | 0.906 |
(The only model surpassing Argus-4B on V1 overall is the 8B Nemotron-vl-8b-v2, at 0.927.)
Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)
| Task | bf16 nDCG@5 | fp32 expected |
|---|---|---|
| BioMedicalLectures | 0.6349 | ≥ 0.6349 |
| ESGReports-HighLevel | 0.7079 | ≥ 0.7079 |
| ESGReports | 0.6175 | ≥ 0.6175 |
| EconomicsReports | 0.5918 | ≥ 0.5918 |
| Average | 0.6380 | ≈ 0.6404 |
ViDoRe v2 — 4B-class context
| Model | V2 avg |
|---|---|
| Ops-Colqwen3-4B (dim 2560) | 0.687 |
| TomoroAI/tomoro-colqwen3-embed-4b | 0.660 |
| Argus-Colqwen3.5-4b-v0 (fp32) | 0.640 |
V2 is the area we are still actively improving: the wider 2560-d head used by Ops gives it an advantage on the more layout-heavy ESG and economics pages. Argus compresses each token to 1024-d, 2.5× narrower than Ops; even with its higher token count this works out to ~36 % less storage per page (see the table below), at the cost of a small V2 gap. The V1 lead more than compensates for retrieval workloads dominated by document QA.
ViDoRe v3
Not yet evaluated for this release. Numbers will be added in a follow-up commit once the v3 reproducer run completes.
Storage cost
Per-document storage for an indexed corpus, assuming bf16:
| Model | Tokens/page | Dim | Storage/page |
|---|---|---|---|
| Ops-Colqwen3-4B | 1280 | 2560 | 6.6 MB |
| Argus-Colqwen3.5-4b-v0 | 2048 | 1024 | 4.2 MB |
| TomoroAI/tomoro-colqwen3-embed-4b | 1280 | 320 | 0.8 MB |
| SauerkrautLM-ColQwen3-4b-v0.1 | 1024 | 128 | 0.3 MB |
Argus uses more tokens (2048 vs 1280) so the router has enough spatial granularity for region-aware specialisation, but the narrow 1024-d head keeps total per-page storage 36 % smaller than Ops despite the higher token count.
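The storage column is just tokens × dim × 2 bytes (bf16). A quick sanity check that reproduces the table:

for name, tokens, dim in [
    ("Ops-Colqwen3-4B",             1280, 2560),
    ("Argus-Colqwen3.5-4b-v0",      2048, 1024),
    ("tomoro-colqwen3-embed-4b",    1280,  320),
    ("SauerkrautLM-ColQwen3-4b",    1024,  128),
]:
    print(f"{name:28s} {tokens * dim * 2 / 1e6:4.1f} MB/page")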
Installation
# Qwen3.5-VL is only in transformers 5.x, while MTEB 2.12 pins transformers 4.57.6:
# install mteb first, then upgrade transformers explicitly.
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0.0,<6.0.0"
# Optional: faster attention on Hopper / Ampere
pip install flash-attn==2.6.3 --no-build-isolation
After upgrading transformers, wipe the cached remote-code modules so the new ones load:
rm -rf ~/.cache/huggingface/modules/transformers_modules
Usage — text + image retrieval
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor
MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-4b-v0"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16 # or torch.float32 for max precision
model = AutoModel.from_pretrained(
MODEL_ID,
trust_remote_code=True,
torch_dtype=DTYPE,
attn_implementation="flash_attention_2", # or None / "sdpa"
device_map=DEVICE,
).eval()
processor = AutoProcessor.from_pretrained(
MODEL_ID,
trust_remote_code=True,
max_num_visual_tokens=2048,
)
queries = [
"What is the company's revenue in 2019?",
"How does the proposed model compare to baselines?",
]
documents = [
Image.open("page_a.png").convert("RGB"),
Image.open("page_b.png").convert("RGB"),
]
q_emb = model.encode_queries(processor, queries) # list of (Lq, 1024)
d_emb = model.encode_images(processor, documents) # list of (Ld, 1024)
scores = processor.score(q_emb, d_emb) # MaxSim, shape (len(q), len(d))
print(scores)
Reproduce the leaderboard ViDoRe results with MTEB
import mteb
m = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-4b-v0")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 4})
A single H100 80 GB completes the full V1 + V2 run in roughly 4–6 hours.
Reproduce on the official ViDoRe-benchmark library
pip install vidore-benchmark
vidore-benchmark evaluate-retriever \
--model-class colqwen2 \
--model-name DataScience-UIBK/Argus-Colqwen3.5-4b-v0 \
--collection-name vidore-v1
Training
| Setting | Value |
|---|---|
| Backbone | Qwen/Qwen3.5-VL-4B-Instruct (Apache-2.0) |
| Stage 1 — dense baseline | trains the standard ColPali head; serves as the teacher |
| Stage 2 — MoE balance warmup | gates frozen, no PEFT, short — only goal is to prevent expert collapse |
| Stage 3 — joint retrieval w/ distillation | PEFT on, gates trainable, KL distillation from stage-1 teacher (distillation_weight=0.5) |
| LoRA rank | 32 (folded into base for this release via merge_and_unload() in fp32) |
| Datasets | vidore/colpali_train_set + llamaindex/vdr-multilingual-train (subsets) |
| Hardware | 4 × NVIDIA H100 80 GB (zen4_0768_h100x4 partition, UIBK LEO5 cluster) |
| Optimiser | AdamW, lr = 5e-5 with linear warmup |
| Precision | bf16 forward / fp32 master + LoRA |
| Effective batch size | 64 |
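Stage 2 exists purely to keep the router healthy before retrieval training resumes. A minimal sketch of the auxiliary routing loss named in the model details (load balance + KL-to-uniform + 0.01 · router-z²); the exact formulation in the training code may differ:

import torch
import torch.nn.functional as F

def routing_aux_loss(logits: torch.Tensor, top_k: int = 2, z_weight: float = 0.01):
    """Sketch of the router regulariser. logits: (N, E) router logits
    over N regions and E = 4 experts."""
    n, e = logits.shape
    probs = logits.softmax(-1)                               # (N, E)
    # Switch-style load balance: fraction of regions routed to each expert
    # times that expert's mean gate probability.
    top = logits.topk(top_k, dim=-1).indices
    assign = F.one_hot(top, e).sum(1).float()                # (N, E), 1 if expert selected
    load = assign.mean(0)                                    # routed fraction per expert
    importance = probs.mean(0)                               # mean gate prob per expert
    balance = e * (load * importance).sum()
    # KL of the average routing distribution to the uniform distribution.
    uniform = torch.full_like(importance, 1.0 / e)
    kl_uniform = F.kl_div(uniform.log(), importance, reduction="sum")
    # z-loss keeps router logits small and routing numerically stable.
    z = logits.logsumexp(-1).pow(2).mean()
    return balance + kl_uniform + z_weight * z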
The merge step that produced this release was run in float32 throughout (merge_and_unload() on the LoRA adapter, then sharded to safetensors). The companion bf16 release ran the same merge in bfloat16, which is ~0.1 pp lower on V1 and ~0.2 pp lower on V2 — see the bf16 sibling card.
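The merge can be reproduced along these lines with peft; the adapter path below is a hypothetical placeholder, and the released pipeline loads the custom Argus head via trust_remote_code:

import torch
from peft import PeftModel
from transformers import AutoModel

# Load the backbone in fp32 so the whole merge stays in float32.
base = AutoModel.from_pretrained(
    "Qwen/Qwen3.5-VL-4B-Instruct",
    torch_dtype=torch.float32,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "path/to/argus-lora-adapter")  # hypothetical path
model = model.merge_and_unload()       # fold the rank-32 LoRA deltas into the base weights
model.save_pretrained("Argus-Colqwen3.5-4b-v0-fp32", safe_serialization=True)  # sharded safetensors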
Limitations
- English-dominant; the multilingual training subset is small and we omit multilingual eval from this release.
- 4 experts × top-2 routing adds ~5 % inference latency over the dense head (the LLM backbone dominates total cost).
- ViDoRe v3 numbers are pending; will be added once the public reproducer run finishes.
- Per-task numbers above use the bf16 sibling as a conservative lower bound until the fp32 reproducer run completes; they will be replaced with the fp32 numbers in a follow-up commit.
License
Apache 2.0, inherited from Qwen3.5-VL-4B-Instruct. You may use, modify, and redistribute this model commercially, with attribution.
Citation
@misc{argus2026,
title = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
author = {DataScience-UIBK team},
year = {2026},
url = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-4b-v0},
}
Contact
- Org: DataScience-UIBK, University of Innsbruck
- Issues: open one on this repo's Community tab.