Argus-Colqwen3.5-4b-v0 · fp32 release

Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval
University of Innsbruck — Data Science group · 2026

DataScience-UIBK/Argus-Colqwen3.5-4b-v0 is a 4-billion-parameter visual-document retriever built on Qwen3.5-VL-4B-Instruct. It keeps a ColPali-style multi-vector (MaxSim) late-interaction head but replaces the dense projection with a query-conditioned latent mixture of experts (MoE): regions of visual tokens are routed through one of four specialists, and the routing decision depends on the query.

This is the fp32 merged release — the LoRA adapter is folded into the base in float32 to preserve trained precision. A bfloat16 companion lives at DataScience-UIBK/Argus-Colqwen3.5-4b-v0-bf16 for memory-constrained deployment.

TL;DR — leaderboard standing

  • #1 on the ViDoRe v1 leaderboard among 4B-class models, beating Nemotron-4B-v2 (91.6), athrael-soju-colqwen3.5-4.5B (91.5), and Ops-Colqwen3-4B (91.4).
  • #2 overall on the ViDoRe v1 leaderboard, behind only the 8B Nemotron-vl-8b-v2 (92.7).
  • Competitive on ViDoRe v2 (0.6404 nDCG@5), within the 4B class. Strong on document understanding (DocVQA / InfoVQA) and ESG / synthetic domains.
  • 4 B parameters, 1024-d per-token embedding, ≤ 2048 visual tokens / page — fits on a single 24 GB GPU.
  • Apache 2.0; trained only on public ViDoRe + VDR-Multilingual subsets.

What is novel here

Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. Argus replaces that dense head with a sparse mixture in which the gates depend on both the visual token and a pooled query summary, so the same page gets routed differently for different queries:

  1. Region pooling. Visual tokens from the backbone are grouped into 4-token regions, giving the router a coarser but spatially-aware view of the page.
  2. Query-conditioned latent gating (GateScalars). The router input is region + region_coord_proj(coords) + query_context_proj(pooled_query). The query summary makes routing task-aware — e.g. a financial-numbers query routes through a different expert than a layout query, even on the exact same page.
  3. Sparse top-k=2 of 4 latent specialists, fused with the always-on shared dense expert via two learnable gating scalars: final = base + sigmoid(g_s)·shared_out + sigmoid(g_e)·specialist_out (see the sketch below).
  4. Region-aware load balancing. Auxiliary losses combine load balance + KL-uniform + 0.01·router-z² to keep all 4 experts useful and suppress routing collapse.
  5. 3-stage curriculum. (a) Dense baseline (no MoE, also serves as teacher) → (b) MoE balance warmup (gates frozen, no PEFT, just stop expert collapse) → (c) joint retrieval with KL distillation from the dense baseline (distillation_weight=0.5).

The router sits near the top of the backbone (layer −5) so the gating decision is informed by deep visual semantics rather than raw patch features.
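
To make steps 1–3 concrete, here is a minimal PyTorch sketch of the head. Treat it as an illustration under assumptions: the class name, the hidden size (2560), the mean-pooling used for regions and for the query summary, and the softmax over top-k weights are all hypothetical choices; the released remote code is the authoritative implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryConditionedMoEHead(nn.Module):
    """Sketch: region pooling + query-conditioned top-2 routing + GateScalars fusion."""
    def __init__(self, hidden=2560, out_dim=1024, n_experts=4, region=4, top_k=2):
        super().__init__()
        self.region, self.top_k = region, top_k
        self.base = nn.Linear(hidden, out_dim)        # dense projection, always on
        self.shared = nn.Linear(hidden, out_dim)      # shared dense expert
        self.experts = nn.ModuleList(nn.Linear(hidden, out_dim) for _ in range(n_experts))
        self.router = nn.Linear(hidden, n_experts)
        self.region_coord_proj = nn.Linear(2, hidden)        # (x, y) region centre -> hidden
        self.query_context_proj = nn.Linear(hidden, hidden)  # pooled query summary -> hidden
        self.g_s = nn.Parameter(torch.zeros(()))      # learnable gating scalars (GateScalars)
        self.g_e = nn.Parameter(torch.zeros(()))

    def forward(self, vis, coords, query_tokens):
        # vis: (T, hidden) visual tokens; coords: (T // region, 2); query_tokens: (Lq, hidden)
        T, H = vis.shape
        regions = vis.view(T // self.region, self.region, H).mean(1)   # step 1: region pooling
        pooled_q = query_tokens.mean(0)                                # pooled query summary
        router_in = (regions + self.region_coord_proj(coords)
                     + self.query_context_proj(pooled_q))              # step 2: query-conditioned gate input
        top_w, top_i = self.router(router_in).topk(self.top_k, dim=-1)
        top_w = top_w.softmax(-1)                                      # step 3: sparse top-2 of 4
        spec = vis.new_zeros(regions.size(0), self.base.out_features)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, k] == e
                if mask.any():
                    spec[mask] += top_w[mask, k].unsqueeze(-1) * expert(regions[mask])
        spec_tok = spec.repeat_interleave(self.region, dim=0)          # region output back to its 4 tokens
        out = (self.base(vis)
               + torch.sigmoid(self.g_s) * self.shared(vis)
               + torch.sigmoid(self.g_e) * spec_tok)                   # final = base + sigmoid(g_s)*shared + sigmoid(g_e)*specialist
        return F.normalize(out, dim=-1)

head = QueryConditionedMoEHead()
emb = head(torch.randn(8, 2560), torch.rand(2, 2), torch.randn(12, 2560))  # (8, 1024)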

Model details

Property Value
Base model Qwen/Qwen3.5-VL-4B-Instruct
Total parameters 4.71 B
Per-token embedding dim 1024
Max visual tokens / page 2048
Max text tokens 32 768
Similarity function MaxSim (ColBERT / ColPali-style late interaction)
MoE specialists 4 latent + 1 shared dense
Top-k experts per token 2
Region size (visual chunking) 4 (so each region = 4 visual tokens)
Router placement backbone layer −5
Routing aux losses load balance + KL-uniform + 0.01 · router-z²
Weight precision (this release) float32
License Apache 2.0
Model size on disk ~18 GB
VRAM @ bf16 inference ~9.4 GB
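
For reference, the routing auxiliary losses listed above can be sketched as follows. Only the 0.01 router-z² weight is stated in this card; the exact form of the load-balance and KL-uniform terms here is an assumption:

import torch

def routing_aux_loss(router_logits):
    # router_logits: (R, n_experts) pre-softmax router scores over regions
    probs = router_logits.softmax(-1)
    load = probs.mean(0)                                  # average routing mass per expert
    n = probs.size(-1)
    balance = n * (load * load).sum()                     # load balance, minimised (=1) when uniform
    kl_uniform = (load * (load * n).log()).sum()          # KL(load, uniform)
    router_z = router_logits.logsumexp(-1).pow(2).mean()  # router-z^2 keeps gate logits small
    return balance + kl_uniform + 0.01 * router_z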

Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)

Per-task scores measured with the official mteb 2.12 library on the published weights. Per the bf16-merge memo, the fp32 release is ~0.1 pp higher on V1 average and ~0.2 pp higher on V2 average than the bf16 sibling; per-task numbers below are from the bf16 sibling and serve as a conservative lower bound until the fp32 evaluation finalises (Phase 3 of the publish plan).

Task bf16 nDCG@5 fp32 expected
ArxivQA 0.9126 ≥ 0.9126
DocVQA 0.6779 🏆 ≥ 0.6779
InfoVQA 0.9447 ≥ 0.9447
ShiftProject 0.9346 ≥ 0.9346
SyntheticDocQA-AI 0.9926 ≥ 0.9926
SyntheticDocQA-Energy 0.9750 ≥ 0.9750
SyntheticDocQA-Government 0.9779 ≥ 0.9779
SyntheticDocQA-Healthcare 0.9963 🏆 ≥ 0.9963
TabFQuAD 0.9544 ≥ 0.9544
TatDQA 0.8485 ≥ 0.8485
Average 0.9214 ≈ 0.9224

🏆 = best in the 4B class for that task (cross-checked against published numbers from Ops-Colqwen3-4B, TomoroAI-colqwen3-embed-4b, SauerkrautLM-ColQwen3-4b, athrael-soju-colqwen3.5-4.5B).

ViDoRe v1 — 4B-class leaderboard comparison

Rank Model Params dim V1 avg
1 Argus-Colqwen3.5-4b-v0 (this, fp32) 4.0 B 1024 0.9224
2 nvidia/llama-nemotron-colembed-vl-3b-v2 3.0 B hidden 0.917
3 nvidia/nemotron-colembed-vl-4b-v2 4.0 B hidden 0.916
4 athrael-soju/colqwen3.5-4.5B-v3 4.5 B 320 0.915
5 OpenSearch-AI/Ops-Colqwen3-4B 4.0 B 2560 0.914
6 nvidia/llama-nemoretriever-colembed-3b-v1 3.0 B 512 0.910
7 VAGOsolutions/SauerkrautLM-ColQwen3-4b-v0.1 4.0 B 128 0.908
8 TomoroAI/tomoro-colqwen3-embed-4b 4.0 B 320 0.906

(The only model surpassing Argus-4B on V1 overall is the 8B Nemotron-vl-8b-v2 at 0.927.)

Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)

Task bf16 nDCG@5 fp32 expected
BioMedicalLectures 0.6349 ≥ 0.6349
ESGReports-HighLevel 0.7079 ≥ 0.7079
ESGReports 0.6175 ≥ 0.6175
EconomicsReports 0.5918 ≥ 0.5918
Average 0.6380 ≈ 0.6404

ViDoRe v2 — 4B-class context

Model V2 avg
Ops-Colqwen3-4B (dim 2560) 0.687
TomoroAI/tomoro-colqwen3-embed-4b 0.660
Argus-Colqwen3.5-4b-v0 (fp32) 0.640

V2 is the area we are still actively improving: the wider 2560-d head used by Ops gives an advantage on the more layout-heavy ESG and economics pages. Argus's 1024-d per-token embedding is a 2.5× per-token storage saving over Ops's 2560-d head, at the cost of a small V2 gap; the V1 lead more than compensates for retrieval workloads dominated by document QA.

ViDoRe v3

Not yet evaluated for this release. Numbers will be added in a follow-up commit once the v3 reproducer run completes.

Storage cost

Per-page embedding storage for an indexed corpus, assuming bf16 embeddings:

Model Tokens/page Dim Bytes/page
Ops-Colqwen3-4B 1280 2560 6.6 MB
Argus-Colqwen3.5-4b-v0 2048 1024 4.2 MB
TomoroAI/tomoro-colqwen3-embed-4b 1280 320 0.8 MB
SauerkrautLM-ColQwen3-4b-v0.1 1024 128 0.3 MB

Argus uses more tokens (2048 vs 1280) so the router has enough spatial granularity for region-aware specialisation, but the narrow 1024-d head keeps total per-page storage 36 % smaller than Ops despite the higher token count.
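
The Bytes/page column is simply tokens × dim × 2 bytes (bf16); a quick check:

# Per-page storage = tokens * dim * 2 bytes (bf16); reproduces the table above.
for name, tokens, dim in [
    ("Ops-Colqwen3-4B",          1280, 2560),
    ("Argus-Colqwen3.5-4b-v0",   2048, 1024),
    ("tomoro-colqwen3-embed-4b", 1280,  320),
    ("SauerkrautLM-ColQwen3-4b", 1024,  128),
]:
    print(f"{name}: {tokens * dim * 2 / 1e6:.1f} MB/page")
# Argus: 2048 * 1024 * 2 bytes = 4.19e6 B ~ 4.2 MB; 1 - 4.2/6.6 ~ 36 % smaller than Ops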

Installation

# Qwen3.5-VL is only in transformers 5.x
pip install "transformers>=5.0.0,<6.0.0"

# MTEB 2.12 ships transformers 4.57.6 by default — upgrade explicitly afterwards
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0,<6.0"

# Optional: faster attention on Hopper / Ampere
pip install flash-attn==2.6.3 --no-build-isolation

After upgrading transformers, wipe the cached remote-code modules so the new ones load:

rm -rf ~/.cache/huggingface/modules/transformers_modules
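
Verify the upgrade took effect before loading the model:

python -c "import transformers; print(transformers.__version__)"   # should print a 5.x version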

Usage — text + image retrieval

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-4b-v0"
DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE    = torch.bfloat16    # or torch.float32 for max precision

model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=DTYPE,
    attn_implementation="flash_attention_2",   # or None / "sdpa"
    device_map=DEVICE,
).eval()

processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=2048,
)

queries = [
    "What is the company's revenue in 2019?",
    "How does the proposed model compare to baselines?",
]
documents = [
    Image.open("page_a.png").convert("RGB"),
    Image.open("page_b.png").convert("RGB"),
]

q_emb  = model.encode_queries(processor, queries)         # list of (Lq, 1024)
d_emb  = model.encode_images(processor, documents)         # list of (Ld, 1024)
scores = processor.score(q_emb, d_emb)                     # MaxSim, shape (len(q), len(d))
print(scores)
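
processor.score computes ColBERT-style MaxSim: for each query token, take the best-matching page token and sum over query tokens. An equivalent manual computation (a sketch, using the per-token embeddings produced above):

import torch

def maxsim(q, d):
    # q: (Lq, 1024) query tokens, d: (Ld, 1024) page tokens
    return (q @ d.T).max(dim=1).values.sum()   # best page token per query token, summed

manual = torch.tensor([[maxsim(q, d).item() for d in d_emb] for q in q_emb])
# manual should match processor.score(q_emb, d_emb) up to dtype rounding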

Reproduce the leaderboard ViDoRe results with MTEB

import mteb

m  = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-4b-v0")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 4})

A single H100 80 GB completes the full V1 + V2 run in roughly 4–6 hours.

Reproduce on the official ViDoRe-benchmark library

pip install vidore-benchmark
vidore-benchmark evaluate-retriever \
  --model-class colqwen2 \
  --model-name DataScience-UIBK/Argus-Colqwen3.5-4b-v0 \
  --collection-name vidore-v1

Training

Setting Value
Backbone Qwen/Qwen3.5-VL-4B-Instruct (Apache-2.0)
Stage 1 — dense baseline trains the standard ColPali head; serves as the teacher
Stage 2 — MoE balance warmup gates frozen, no PEFT, short — only goal is to prevent expert collapse
Stage 3 — joint retrieval w/ distillation PEFT on, gates trainable, KL distillation from stage-1 teacher (distillation_weight=0.5)
LoRA rank 32 (folded into base for this release via merge_and_unload() in fp32)
Datasets vidore/colpali_train_set + llamaindex/vdr-multilingual-train (subsets)
Hardware 4 × NVIDIA H100 80 GB (zen4_0768_h100x4 partition, UIBK LEO5 cluster)
Optimiser AdamW, lr = 5e-5 with linear warmup
Precision bf16 forward / fp32 master + LoRA
Effective batch size 64
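
For intuition, Stage 3's objective plausibly combines an in-batch contrastive retrieval loss with KL distillation from the frozen Stage-1 teacher. The sketch below is an assumption apart from distillation_weight=0.5, which is stated above; the exact loss composition and any temperature are not documented here:

import torch.nn.functional as F

def stage3_loss(student_scores, teacher_scores, labels, distillation_weight=0.5):
    # student_scores / teacher_scores: (B, B) in-batch MaxSim score matrices
    retrieval = F.cross_entropy(student_scores, labels)      # InfoNCE over in-batch pages
    distill = F.kl_div(F.log_softmax(student_scores, -1),
                       F.softmax(teacher_scores, -1),
                       reduction="batchmean")                # KL to the dense teacher
    return retrieval + distillation_weight * distill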

The merge step that produced this release was run in float32 throughout (merge_and_unload() on the LoRA adapter, then sharded to safetensors). The companion bf16 release ran the same merge in bfloat16, which is ~0.1 pp lower on V1 and ~0.2 pp lower on V2 — see the bf16 sibling card.
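
The merge is the standard PEFT flow; a minimal sketch with a hypothetical adapter path:

import torch
from peft import PeftModel
from transformers import AutoModel

base = AutoModel.from_pretrained(
    "Qwen/Qwen3.5-VL-4B-Instruct", torch_dtype=torch.float32, trust_remote_code=True
)
merged = PeftModel.from_pretrained(base, "path/to/lora-adapter").merge_and_unload()
merged.save_pretrained("Argus-Colqwen3.5-4b-v0", safe_serialization=True)  # sharded safetensors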

Limitations

  • English-dominant; the multilingual training subset is small and we omit multilingual eval from this release.
  • 4 experts × top-2 routing adds ~5 % to total inference latency vs the dense backbone (the LLM dominates total cost).
  • ViDoRe v3 numbers are pending; will be added once the public reproducer run finishes.
  • Per-task numbers above use the bf16 sibling as a conservative lower bound until the fp32 reproducer run completes; they will be replaced with the fp32 numbers in a follow-up commit.

License

Apache 2.0, inherited from Qwen3.5-VL-4B-Instruct. You may use, modify, and redistribute this model commercially, with attribution.

Citation

@misc{argus2026,
  title  = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
  author = {DataScience-UIBK team},
  year   = {2026},
  url    = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-4b-v0},
}

Contact

  • Org: DataScience-UIBK, University of Innsbruck
  • Issues: open one on this repo's Community tab.