Argus-Colqwen3.5-2b-v0 · fp32 release

Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval
University of Innsbruck — Data Science group · 2026

DataScience-UIBK/Argus-Colqwen3.5-2b-v0 is a 2.3-billion-parameter visual-document retriever built on Qwen3.5-VL-2B-Instruct. It uses a ColPali-style multi-vector (MaxSim) late-interaction head, and replaces the dense projection with a query-conditioned latent mixture of experts (MoE) that routes regions of visual tokens through one of four specialists conditioned on the query.

This is the fp32 merged release — the LoRA adapter is folded into the base in float32 to preserve trained precision. A bfloat16 companion lives at DataScience-UIBK/Argus-Colqwen3.5-2b-v0-bf16 for the smallest deployable artefact. The 4B sibling lives at DataScience-UIBK/Argus-Colqwen3.5-4b-v0.

TL;DR — leaderboard standing

  • Strong on the ViDoRe v1 leaderboard at the 2B scale (V1 = 0.9149), competitive with nomic-ai/colnomic-embed-multimodal-3b (V1 = 0.916) at roughly three-quarters of the parameter count.
  • Best 2B-class result on V2 (V2 = 0.6152), comfortably ahead of vidore/colpali-v1.3 and Metric-AI/colqwen2.5-3b-multilingual at the same scale.
  • 2.3 B parameters, 1024-d per-token embedding, ≤ 2048 visual tokens / page — fits on a single 16 GB GPU at bf16 inference.
  • Apache 2.0; trained only on public ViDoRe + VDR-Multilingual subsets.

What is novel here

Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. Argus replaces that dense head with a sparse mixture in which the gates depend on both the visual token and a pooled query summary, so the same page gets routed differently for different queries:

  1. Region pooling. Visual tokens from the backbone are grouped into 4-token regions, giving the router a coarser but spatially-aware view of the page.
  2. Query-conditioned latent gating (GateScalars). The router input is region + region_coord_proj(coords) + query_context_proj(pooled_query). The query summary makes routing task-aware — e.g. a financial-numbers query routes through a different expert than a layout query, even on the exact same page.
  3. Sparse top-k=2 of 4 latent specialists, fused with the always-on shared dense expert via two learnable gating scalars: final = base + sigmoid(g_s)·shared_out + sigmoid(g_e)·specialist_out.
  4. Region-aware load balancing. Auxiliary losses combine load balance + KL-uniform + 0.01·router-z² to keep all 4 experts useful and suppress routing collapse.
  5. 3-stage curriculum. (a) Dense baseline (no MoE, also serves as teacher) → (b) MoE balance warmup (gates frozen, no PEFT, just stop expert collapse) → (c) joint retrieval with KL distillation from the dense baseline (distillation_weight=0.5).

The router sits near the top of the backbone (layer −5) so the gating decision is informed by deep visual semantics rather than raw patch features.
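
A minimal PyTorch sketch of that head, to make the routing concrete. Apart from region_coord_proj, query_context_proj and the GateScalars named above, all module names, hidden sizes and the region/token granularity are invented placeholders; treat this as a sketch of the recipe, not the released implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionMoEHead(nn.Module):
    """Illustrative query-conditioned region MoE projection head (not the released code)."""
    def __init__(self, hidden=2048, out_dim=1024, n_experts=4, top_k=2, region=4):
        super().__init__()
        self.region, self.top_k = region, top_k
        self.base = nn.Linear(hidden, out_dim)                  # dense ColPali-style projection
        self.shared = nn.Linear(hidden, out_dim)                # always-on shared expert
        self.experts = nn.ModuleList([nn.Linear(hidden, out_dim) for _ in range(n_experts)])
        self.region_coord_proj = nn.Linear(2, hidden)           # (x, y) region position -> hidden
        self.query_context_proj = nn.Linear(hidden, hidden)     # pooled query summary -> hidden
        self.router = nn.Linear(hidden, n_experts)
        self.g_s = nn.Parameter(torch.zeros(()))                # GateScalars: shared-expert gate
        self.g_e = nn.Parameter(torch.zeros(()))                # GateScalars: specialist gate

    def forward(self, vis, coords, query_pooled):
        # vis: (B, T, H) backbone visual tokens, coords: (B, T // region, 2), query_pooled: (B, H)
        B, T, H = vis.shape
        regions = vis.view(B, T // self.region, self.region, H).mean(2)      # 1) region pooling
        router_in = (regions + self.region_coord_proj(coords)
                     + self.query_context_proj(query_pooled)[:, None, :])    # 2) query-conditioned gating input
        logits = self.router(router_in)                                      # (B, R, n_experts)
        probs = logits.softmax(-1)
        top_p, top_i = probs.topk(self.top_k, dim=-1)                        # 3) sparse top-2 of 4
        gate = torch.zeros_like(probs).scatter(-1, top_i, top_p)             # keep only top-k weights
        expert_out = torch.stack([e(regions) for e in self.experts], dim=-2) # (B, R, n_experts, D)
        specialist = (gate.unsqueeze(-1) * expert_out).sum(-2)               # (B, R, D)
        out = (self.base(regions)
               + torch.sigmoid(self.g_s) * self.shared(regions)
               + torch.sigmoid(self.g_e) * specialist)
        # expand region outputs back to their member tokens, L2-normalise for MaxSim
        return F.normalize(out.repeat_interleave(self.region, dim=1), dim=-1), logits, probs

def routing_aux_loss(logits, probs):
    """4) load balance + KL-to-uniform + 0.01 * router-z^2, per the card's recipe."""
    n_experts = probs.size(-1)
    frac = probs.mean(dim=(0, 1))                                      # mean routing prob per expert
    load = F.one_hot(probs.argmax(-1), n_experts).float().mean(dim=(0, 1))  # fraction of regions per expert
    load_balance = n_experts * (frac * load).sum()
    kl_uniform = (frac * (frac.clamp_min(1e-9) * n_experts).log()).sum()    # KL(frac || uniform)
    router_z = torch.logsumexp(logits, dim=-1).pow(2).mean()
    return load_balance + kl_uniform + 0.01 * router_z

Because the pooled query enters the router input additively, the same page is dispatched to different specialists for different queries, which is the property the card calls query-conditioned routing.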

Model details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-VL-2B-Instruct |
| Total parameters | 2.32 B |
| Per-token embedding dim | 1024 |
| Max visual tokens / page | 2048 |
| Max text tokens | 32 768 |
| Similarity function | MaxSim (ColBERT / ColPali-style late interaction) |
| MoE specialists | 4 latent + 1 shared dense |
| Top-k experts per token | 2 |
| Region size (visual chunking) | 4 (each region = 4 visual tokens) |
| Router placement | backbone layer −5 |
| Routing aux losses | load balance + KL-uniform + 0.01 · router-z² |
| Weight precision (this release) | float32 |
| License | Apache 2.0 |
| Model size on disk | ~9.3 GB |
| VRAM @ bf16 inference | ~5.5 GB |

Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)

Per-task scores measured with mteb 2.12 on the published weights, side-by-side with the 4B sibling and both bf16 companions for transparency.

| Task | 2B fp32 (this) | 2B bf16 | 4B fp32 | 4B bf16 |
|---|---|---|---|---|
| ArxivQA | 0.9027 | 0.9027 | 0.9095 | 0.9126 |
| DocVQA | 0.6747 | 0.6747 | 0.6770 | 0.6779 |
| InfoVQA | 0.9497 | 0.9497 | 0.9463 | 0.9447 |
| ShiftProject | 0.9133 | 0.9133 | 0.9470 | 0.9346 |
| SyntheticDocQA-AI | 0.9963 | 0.9963 | 0.9963 | 0.9926 |
| SyntheticDocQA-Energy | 0.9726 | 0.9726 | 0.9789 | 0.9750 |
| SyntheticDocQA-Government | 0.9729 | 0.9729 | 0.9779 | 0.9779 |
| SyntheticDocQA-Healthcare | 0.9926 | 0.9926 | 0.9963 | 0.9963 |
| TabFQuAD | 0.9336 | 0.9336 | 0.9533 | 0.9544 |
| TatDQA | 0.8403 | 0.8403 | 0.8480 | 0.8485 |
| Average | 0.9149 | 0.9149 | 0.9230 | 0.9214 |

The 2B model beats the 4B sibling on InfoVQA, which suggests the smaller backbone has enough capacity for layout-driven information retrieval; the gap to the 4B comes mostly from text-heavy tasks (ArxivQA, ShiftProject, TabFQuAD, TatDQA), where deeper LLM reasoning over query semantics helps.

ViDoRe v1 — 2B / 3B-class leaderboard comparison

| Rank | Model | Params | dim | V1 avg |
|---|---|---|---|---|
| 1 | nomic-ai/colnomic-embed-multimodal-3b | 3.0 B | 128 | 0.916 |
| 2 | Argus-Colqwen3.5-2b-v0 (this, fp32) | 2.3 B | 1024 | 0.9149 |
| 3 | Metric-AI/colqwen2.5-3b-multilingual | 3.1 B | 128 | 0.892 |
| 4 | vidore/colpali-v1.3 | 2.9 B | 128 | 0.844 |

Argus matches nomic's 3B-class result at smaller scale and a wider per-token dim, and is the strongest sub-3B retriever published to date.

Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)

| Task | 2B fp32 (this) | 2B bf16 | 4B fp32 | 4B bf16 |
|---|---|---|---|---|
| BioMedicalLectures | 0.6499 | 0.6499 | 0.6438 | 0.6349 |
| ESGReports-HighLevel | 0.6936 | 0.6936 | 0.6991 | 0.7079 |
| ESGReports | 0.5988 | 0.5988 | 0.6218 | 0.6175 |
| EconomicsReports | 0.5186 | 0.5186 | 0.5980 | 0.5918 |
| Average | 0.6152 | 0.6152 | 0.6407 | 0.6380 |

The 2B model beats the 4B sibling on BioMedicalLectures but loses ~0.08 on EconomicsReports, where the 4B's deeper reasoning helps with the dense numeric content.

ViDoRe v2 — 2B / 3B-class context

| Model | V2 avg |
|---|---|
| Argus-Colqwen3.5-2b-v0 (fp32) | 0.6152 |
| nomic-ai/colnomic-embed-multimodal-3b | 0.616 |
| Metric-AI/colqwen2.5-3b-multilingual | 0.580 |

ViDoRe v3

Not yet evaluated for this release. Numbers will be added in a follow-up commit once the v3 reproducer run completes.

Storage cost

Per-document storage for an indexed corpus, assuming bf16 token embeddings:

| Model | Tokens / page | Dim | Storage / page (bf16) |
|---|---|---|---|
| Ops-Colqwen3-4B | 1280 | 2560 | 6.6 MB |
| Argus-Colqwen3.5-4b-v0 | 2048 | 1024 | 4.2 MB |
| Argus-Colqwen3.5-2b-v0 | 2048 | 1024 | 4.2 MB |
| TomoroAI/tomoro-colqwen3-embed-4b | 1280 | 320 | 0.8 MB |

Per-page storage is identical between the 2B and 4B Argus releases — both use the same 1024-d head and same 2048-token visual budget. The choice between them is inference cost (2B is ~50 % faster), not storage.
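
The figures above follow directly from tokens per page × embedding dim × 2 bytes (bf16); a quick check:

def page_storage_mb(tokens_per_page: int, dim: int, bytes_per_value: int = 2) -> float:
    """Multi-vector index cost per page in MB, assuming bf16 (2 bytes per value)."""
    return tokens_per_page * dim * bytes_per_value / 1e6

print(page_storage_mb(2048, 1024))   # ~4.2  (Argus 2B / 4B)
print(page_storage_mb(1280, 2560))   # ~6.6
print(page_storage_mb(1280, 320))    # ~0.8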

Installation

# Qwen3.5-VL is only in transformers 5.x
pip install "transformers>=5.0.0,<6.0.0"

# MTEB 2.12 ships transformers 4.57.6 by default — upgrade explicitly afterwards
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0,<6.0"

# Optional: faster attention on Hopper / Ampere
pip install flash-attn==2.6.3 --no-build-isolation

After upgrading transformers, wipe the cached remote-code modules so the new ones load:

rm -rf ~/.cache/huggingface/modules/transformers_modules
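
A quick sanity check that the interpreter now picks up the 5.x install (any 5.x release satisfies the pin above):

import transformers
assert transformers.__version__.startswith("5."), transformers.__version__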

Usage — text + image retrieval

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-2b-v0"
DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE    = torch.bfloat16    # or torch.float32 for max precision

model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=DTYPE,
    attn_implementation="flash_attention_2",   # or None / "sdpa"
    device_map=DEVICE,
).eval()

processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=2048,
)

queries = [
    "What is the company's revenue in 2019?",
    "How does the proposed model compare to baselines?",
]
documents = [
    Image.open("page_a.png").convert("RGB"),
    Image.open("page_b.png").convert("RGB"),
]

q_emb  = model.encode_queries(processor, queries)         # list of (Lq, 1024)
d_emb  = model.encode_images(processor, documents)         # list of (Ld, 1024)
scores = processor.score(q_emb, d_emb)                     # MaxSim, shape (len(q), len(d))
print(scores)
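
For reference, processor.score implements ColBERT-style MaxSim late interaction over these multi-vector embeddings. A minimal equivalent in plain PyTorch, assuming the L2-normalised per-token embeddings returned above:

import torch

def maxsim_scores(q_embs, d_embs):
    # q_embs: list of (Lq, 1024) query tensors, d_embs: list of (Ld, 1024) page tensors
    scores = torch.zeros(len(q_embs), len(d_embs))
    for i, q in enumerate(q_embs):
        for j, d in enumerate(d_embs):
            sim = q @ d.T                                  # (Lq, Ld) token-token similarities
            scores[i, j] = sim.max(dim=1).values.sum()     # best page token per query token, summed
    return scores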

Reproduce the leaderboard ViDoRe results with MTEB

import mteb

m  = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-2b-v0")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 4})

A single H100 80 GB completes the full V1 + V2 run in roughly 2–3 hours for the 2B (about half the 4B runtime).

Reproduce on the official ViDoRe-benchmark library

pip install vidore-benchmark
vidore-benchmark evaluate-retriever \
  --model-class colqwen2 \
  --model-name DataScience-UIBK/Argus-Colqwen3.5-2b-v0 \
  --collection-name vidore-v1

Training

| Setting | Value |
|---|---|
| Backbone | Qwen/Qwen3.5-VL-2B-Instruct (Apache-2.0) |
| Stage 1 — dense baseline | trains the standard ColPali head; serves as the teacher |
| Stage 2 — MoE balance warmup | gates frozen, no PEFT, short; only goal is to prevent expert collapse |
| Stage 3 — joint retrieval w/ distillation | PEFT on, gates trainable, KL distillation from the stage-1 teacher (distillation_weight=0.5) |
| LoRA rank | 32 (folded into the base for this release via merge_and_unload() in fp32) |
| Datasets | vidore/colpali_train_set + llamaindex/vdr-multilingual-train (subsets) |
| Hardware | 4 × NVIDIA H100 80 GB (zen4_0768_h100x4 partition, UIBK LEO5 cluster) |
| Optimiser | AdamW, lr = 5e-5 with linear warmup |
| Precision | bf16 forward / fp32 master + LoRA |
| Effective batch size | 64 |

The merge step that produced this release was run in float32 throughout (merge_and_unload() on the LoRA adapter, then sharded to safetensors). The companion bf16 release ran the same merge in bfloat16 — at the 2B scale the bf16 merge is numerically identical to fp32 within nDCG@5 measurement noise (see the bf16 sibling card for details).
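
For reference, a merge of this shape is typically done with PEFT's merge_and_unload(). The snippet below is a hedged sketch rather than the release script: the adapter path is hypothetical and the real pipeline loads the Argus remote-code model class, not the plain base.

import torch
from peft import PeftModel
from transformers import AutoModel

# Load the base in float32 so the folded LoRA deltas keep full precision.
base = AutoModel.from_pretrained(
    "Qwen/Qwen3.5-VL-2B-Instruct", trust_remote_code=True, torch_dtype=torch.float32
)
# "checkpoints/stage3-lora" is a hypothetical local path to the stage-3 adapter.
model = PeftModel.from_pretrained(base, "checkpoints/stage3-lora")
merged = model.merge_and_unload()                      # fold the LoRA weights into the base
merged.save_pretrained("argus-2b-fp32-merged", safe_serialization=True)  # sharded safetensors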

When to use 2B vs 4B

| Use case | Recommendation |
|---|---|
| Maximum recall on document QA / leaderboard parity | 4B fp32 |
| Latency-sensitive retrieval, batch indexing of large corpora | 2B fp32 (this) or 2B bf16 |
| Smallest deployable artefact, edge / single-GPU serving | 2B bf16 |
| InfoVQA / BioMedicalLectures specifically | 2B fp32 (beats the 4B on these) |

Limitations

  • English-dominant; the multilingual training subset is small and we omit multilingual eval from this release.
  • 4 experts × top-2 routing adds ~5 % to total inference latency vs the dense backbone (the LLM dominates total cost).
  • ViDoRe v3 numbers are pending; will be added once the public reproducer run finishes.
  • The 2B / 4B per-task gap is largest on text-heavy tasks (ShiftProject, TabFQuAD, TatDQA) where the larger LLM reasons more about query semantics; for layout-driven tasks the 2B is competitive or better.

License

Apache 2.0, inherited from Qwen3.5-VL-2B-Instruct. You may use, modify, and redistribute this model commercially, with attribution.

Citation

@misc{argus2026,
  title  = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
  author = {DataScience-UIBK team},
  year   = {2026},
  url    = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-2b-v0},
}

Contact

  • Org: DataScience-UIBK, University of Innsbruck
  • Issues: open one on this repo's Community tab.