Argus-Colqwen3.5-4b-v0 · bf16 release

Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval
University of Innsbruck — Data Science group · 2026

DataScience-UIBK/Argus-Colqwen3.5-4b-v0-bf16 is the bfloat16 merged release of Argus-Colqwen3.5-4b. It is the exact same trained network as the fp32 sibling DataScience-UIBK/Argus-Colqwen3.5-4b-v0; only the LoRA-merge dtype differs.

The bf16 release is half the disk size (8.8 GB vs 18 GB), faster to download, and easier to deploy on memory-constrained GPUs — at the cost of ~0.1 pp on ViDoRe V1 and ~0.2 pp on V2 (merge-time precision loss).

| Release | Disk | V1 avg nDCG@5 | V2 avg nDCG@5 |
|---|---|---|---|
| fp32 sibling (-v0) | 18 GB | 0.9224 | 0.6404 |
| this release (-v0-bf16) | 8.8 GB | 0.9214 | 0.6380 |
| Δ vs fp32 | −9.2 GB | −0.0010 | −0.0024 |

Use this bf16 release if you want the smallest deployable artefact and can tolerate sub-percent score loss. Use the fp32 sibling if you need leaderboard-grade reproducibility or you're going to merge / quantise further downstream.

TL;DR — leaderboard standing

  • #1 on the ViDoRe v1 leaderboard among 4B-class models (V1 = 0.9214) — beats Nemotron-4B-v2 (0.916), athrael-soju-colqwen3.5-4.5B (0.915), Ops-Colqwen3-4B (0.914).
  • #2 overall on the ViDoRe v1 leaderboard, behind only the 8B Nemotron-vl-8b-v2 (0.927).
  • Competitive on ViDoRe v2 (0.6380 nDCG@5) within the 4B class.
  • 4 B parameters, 1024-d per-token embedding, ≤ 2048 visual tokens / page — fits on a single 12 GB GPU at bf16.

What is novel here

Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. Argus replaces that dense head with a sparse mixture in which the gates depend on both the visual token and a pooled query summary, so the same page gets routed differently for different queries:

  1. Region pooling — visual tokens are grouped into 4-token regions before routing.
  2. Query-conditioned latent gating (GateScalars) — router input is region + region_coord_proj(coords) + query_context_proj(pooled_query). The query summary makes routing task-aware: a financial-numbers query routes through a different expert than a layout query, on the same page.
  3. Sparse top-k=2 of 4 latent specialists, fused with an always-on shared dense expert via two learnable gating scalars.
  4. Region-aware load balancing — load balance + KL-uniform + 0.01·router-z² aux losses suppress routing collapse.
  5. 3-stage curriculum — dense baseline (teacher) → MoE balance warmup → joint retrieval with KL distillation.

The router sits at backbone layer −5 (five layers before the output).
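To make the routing path concrete, here is a minimal PyTorch sketch of steps 1–3. The module names (ArgusRouterSketch, region_coord_proj, query_context_proj, gate_scalars) and the default dimensions are illustrative placeholders, not the repository's actual implementation; the auxiliary losses (step 4) and the curriculum (step 5) are omitted.

import torch
import torch.nn as nn

class ArgusRouterSketch(nn.Module):
    """Illustrative sketch of the region-aware, query-conditioned gate (not the shipped code)."""

    def __init__(self, d_model=2560, n_experts=4, top_k=2, region_size=4, out_dim=1024):
        super().__init__()
        self.region_size = region_size
        self.top_k = top_k
        self.region_coord_proj = nn.Linear(2, d_model)          # normalised (row, col) of each region
        self.query_context_proj = nn.Linear(d_model, d_model)   # pooled query summary -> routing bias
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d_model, out_dim) for _ in range(n_experts)])
        self.shared_expert = nn.Linear(d_model, out_dim)         # always-on dense expert
        self.gate_scalars = nn.Parameter(torch.zeros(2))         # learnable fusion of shared vs. sparse paths

    def forward(self, visual_tokens, region_coords, pooled_query):
        # visual_tokens: [n_tokens, d_model] with n_tokens divisible by region_size
        # region_coords: [n_regions, 2], pooled_query: [d_model]
        d = visual_tokens.size(-1)
        regions = visual_tokens.view(-1, self.region_size, d).mean(dim=1)            # 1. region pooling
        router_in = (regions
                     + self.region_coord_proj(region_coords)
                     + self.query_context_proj(pooled_query))                        # 2. query-conditioned gate input
        weights, idx = self.router(router_in).softmax(dim=-1).topk(self.top_k, dim=-1)
        sparse = regions.new_zeros(regions.size(0), self.shared_expert.out_features)
        for k in range(self.top_k):                                                  # 3. top-k=2 of 4 latent specialists
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    sparse[mask] += weights[mask, k].unsqueeze(-1) * expert(regions[mask])
        g = self.gate_scalars.softmax(dim=-1)                                        # the two gating scalars
        return g[0] * self.shared_expert(regions) + g[1] * sparse

The load-balance, KL-uniform and router-z² terms from step 4 are training-time regularisers on the router logits and do not change the forward pass at inference.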

Model details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-VL-4B-Instruct |
| Total parameters | 4.71 B |
| Per-token embedding dim | 1024 |
| Max visual tokens / page | 2048 |
| Max text tokens | 32 768 |
| Similarity function | MaxSim |
| MoE specialists | 4 latent + 1 shared dense |
| Top-k experts per token | 2 |
| Region size | 4 |
| Router placement | backbone layer −5 |
| Weight precision (this release) | bfloat16 |
| Adapted from | DataScience-UIBK/Argus-Colqwen3.5-4b-v0 (fp32 merge) |
| License | Apache 2.0 |
| Model size on disk | ~8.8 GB |
| VRAM @ bf16 inference | ~9.4 GB |

Why two dtypes?

Merging a LoRA into the base requires materialising (α/r)·A·B and adding it to the base weight matrix.

  • In bf16, both the delta cast and the addition lose precision — the gap is small but irreversible.
  • In fp32, both are exact.
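A toy numeric illustration of that difference, using generic shapes and the card's (α/r)·A·B convention (none of these tensors correspond to the actual checkpoint):

import torch

out_dim, in_dim, r, alpha = 4096, 4096, 32, 64
W = torch.randn(out_dim, in_dim)            # base weight, fp32 master
A = torch.randn(out_dim, r) * 0.01          # LoRA factors as trained (fp32)
B = torch.randn(r, in_dim) * 0.01

delta = (alpha / r) * (A @ B)               # materialised LoRA delta

merged_fp32 = W + delta                                         # exact
merged_bf16 = (W.bfloat16() + delta.bfloat16()).float()         # cast then add: rounding happens here

print((merged_fp32 - merged_bf16).abs().max())   # small but non-zero, and not recoverable by upcasting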

We publish both:

  • fp32 preserves trained precision exactly. Larger disk, identical inference behaviour to the fp32 master used during evaluation.
  • bf16 is the dtype most users will load anyway (torch_dtype=torch.bfloat16). Smaller disk, the merge precision loss is baked in.

If your downstream pipeline loads in bf16, the inference-time difference between the two repos is tiny — but the bf16 release saves bandwidth and disk. Use the fp32 sibling if you load in fp32 / fp64 (rare) or you're going to quantise further.

Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)

Real per-task scores, measured with mteb 2.12 on the published bf16 weights.

| Task | nDCG@5 |
|---|---|
| ArxivQA | 0.9126 |
| DocVQA | 0.6779 🏆 |
| InfoVQA | 0.9447 |
| ShiftProject | 0.9346 |
| SyntheticDocQA-AI | 0.9926 |
| SyntheticDocQA-Energy | 0.9750 |
| SyntheticDocQA-Government | 0.9779 |
| SyntheticDocQA-Healthcare | 0.9963 🏆 |
| TabFQuAD | 0.9544 |
| TatDQA | 0.8485 |
| Average | 0.9214 |

🏆 = best in the 4B class for that task (cross-checked against published numbers from Ops-Colqwen3-4B, TomoroAI-colqwen3-embed-4b, SauerkrautLM-ColQwen3-4b, athrael-soju-colqwen3.5-4.5B).

ViDoRe v1 — 4B-class leaderboard comparison

| Rank | Model | Params | Dim | V1 avg |
|---|---|---|---|---|
| 1 | Argus-Colqwen3.5-4b-v0-bf16 (this) | 4.0 B | 1024 | 0.9214 |
| 2 | nvidia/llama-nemotron-colembed-vl-3b-v2 | 3.0 B | hidden | 0.917 |
| 3 | nvidia/nemotron-colembed-vl-4b-v2 | 4.0 B | hidden | 0.916 |
| 4 | athrael-soju/colqwen3.5-4.5B-v3 | 4.5 B | 320 | 0.915 |
| 5 | OpenSearch-AI/Ops-Colqwen3-4B | 4.0 B | 2560 | 0.914 |
| 6 | nvidia/llama-nemoretriever-colembed-3b-v1 | 3.0 B | 512 | 0.910 |
| 7 | VAGOsolutions/SauerkrautLM-ColQwen3-4b-v0.1 | 4.0 B | 128 | 0.908 |
| 8 | TomoroAI/tomoro-colqwen3-embed-4b | 4.0 B | 320 | 0.906 |

(The only model surpassing Argus-4B on V1 overall is the 8B Nemotron-vl-8b-v2 at 0.927.)

Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)

| Task | nDCG@5 |
|---|---|
| BioMedicalLectures | 0.6349 |
| ESGReports-HighLevel | 0.7079 |
| ESGReports | 0.6175 |
| EconomicsReports | 0.5918 |
| Average | 0.6380 |

ViDoRe v2 — 4B-class context

| Model | V2 avg |
|---|---|
| Ops-Colqwen3-4B (dim 2560) | 0.687 |
| TomoroAI/tomoro-colqwen3-embed-4b | 0.660 |
| Argus-Colqwen3.5-4b-v0-bf16 | 0.638 |

V2 is the area we are still actively improving — Ops's wider 2560-d head pulls ahead on layout-heavy ESG / economics pages. Argus's per-token compression to 1024-d stores 2.5× less per token than Ops (roughly 1.6× less per page once token counts are factored in; see Storage cost below), at the cost of a small V2 gap; the V1 lead more than compensates for retrieval workloads dominated by document QA.

ViDoRe v3

Not yet evaluated for this release.

Storage cost

| Model | Tokens/page | Dim | Bytes/page (bf16) |
|---|---|---|---|
| Ops-Colqwen3-4B | 1280 | 2560 | 6.6 MB |
| Argus-Colqwen3.5-4b-v0-bf16 | 2048 | 1024 | 4.2 MB |
| TomoroAI/tomoro-colqwen3-embed-4b | 1280 | 320 | 0.8 MB |
| SauerkrautLM-ColQwen3-4b-v0.1 | 1024 | 128 | 0.3 MB |
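The bytes/page column is simply tokens × dim × 2 bytes per bf16 value (reported in decimal MB); a quick sanity check:

def page_mb_bf16(tokens_per_page: int, dim: int) -> float:
    return tokens_per_page * dim * 2 / 1e6    # 2 bytes per bf16 value, decimal MB

for name, toks, dim in [("Ops-Colqwen3-4B", 1280, 2560),
                        ("Argus-Colqwen3.5-4b-v0-bf16", 2048, 1024),
                        ("tomoro-colqwen3-embed-4b", 1280, 320),
                        ("SauerkrautLM-ColQwen3-4b-v0.1", 1024, 128)]:
    print(f"{name}: {page_mb_bf16(toks, dim):.1f} MB/page")   # 6.6, 4.2, 0.8, 0.3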

Installation

pip install "transformers>=5.0.0,<6.0.0"
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0,<6.0"
pip install flash-attn==2.6.3 --no-build-isolation     # optional
rm -rf ~/.cache/huggingface/modules/transformers_modules

Usage

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-4b-v0-bf16"
DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,                  # this release ships in bf16
    attn_implementation="flash_attention_2",     # requires flash-attn; remove if not installed
    device_map=DEVICE,
).eval()

processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=2048,
)

queries = [
    "What is the company's revenue in 2019?",
    "How does the proposed model compare to baselines?",
]
documents = [
    Image.open("page_a.png").convert("RGB"),
    Image.open("page_b.png").convert("RGB"),
]

q_emb  = model.encode_queries(processor, queries)     # per-token query embeddings
d_emb  = model.encode_images(processor, documents)    # per-token page embeddings
scores = processor.score(q_emb, d_emb)                # query × document MaxSim score matrix
print(scores)
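processor.score implements the MaxSim late-interaction scoring listed under Model details. Conceptually it reduces to the sketch below (assuming the per-token embeddings are stacked into dense, same-length tensors; the shipped scorer additionally handles padding and batching):

import torch

def maxsim(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    # q_emb: [n_queries, n_query_tokens, dim], d_emb: [n_docs, n_page_tokens, dim]
    sims = torch.einsum("qnd,pmd->qpnm", q_emb, d_emb)   # every query-token × page-token dot product
    return sims.max(dim=-1).values.sum(dim=-1)           # max over page tokens, summed over query tokens;
                                                         # result is a [n_queries, n_docs] score matrix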

Reproduce ViDoRe results with MTEB

import mteb

m  = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-4b-v0-bf16")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 4})

Training

Same recipe as the fp32 sibling — see its card for full details. The only difference is the merge dtype.

Limitations

  • English-dominant.
  • Merge-time bf16 cast is irreversible — you cannot recover fp32 numbers by upcasting after load.
  • ViDoRe v3 not yet evaluated.
  • ~0.1 pp / ~0.2 pp behind the fp32 sibling on V1 / V2 — use that one if leaderboard parity matters.

License

Apache 2.0, inherited from Qwen3.5-VL-4B-Instruct.

Citation

@misc{argus2026,
  title  = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
  author = {DataScience-UIBK team},
  year   = {2026},
  url    = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-4b-v0},
}

Contact

  • Org: DataScience-UIBK, University of Innsbruck
  • Issues: open one on this repo's Community tab.