- Argus-Colqwen3.5-4b-v0 · bf16 release
- TL;DR — leaderboard standing
- What is novel here
- Model details
- Why two dtypes?
- Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)
- Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)
- ViDoRe v3
- Storage cost
- Installation
- Usage
- Reproduce ViDoRe results with MTEB
- Training
- Limitations
- License
- Citation
- Contact
Argus-Colqwen3.5-4b-v0 · bf16 release
Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval
University of Innsbruck — Data Science group · 2026
DataScience-UIBK/Argus-Colqwen3.5-4b-v0-bf16 is the bfloat16 merged release of Argus-Colqwen3.5-4b. It is the exact same trained network as the fp32 sibling DataScience-UIBK/Argus-Colqwen3.5-4b-v0; only the LoRA-merge dtype differs.
The bf16 release is half the disk size (8.8 GB vs 18 GB), faster to download, and easier to deploy on memory-constrained GPUs — at the cost of ~0.1 pp on ViDoRe V1 and ~0.2 pp on V2 (merge-time precision loss).
| | Disk | V1 avg nDCG@5 | V2 avg nDCG@5 |
|---|---|---|---|
| fp32 sibling (`-v0`) | 18 GB | 0.9224 | 0.6404 |
| this release (`-v0-bf16`) | 8.8 GB | 0.9214 | 0.6380 |
| Δ vs fp32 | −9.2 GB | −0.0010 | −0.0024 |
Use this bf16 release if you want the smallest deployable artefact and can tolerate sub-percent score loss. Use the fp32 sibling if you need leaderboard-grade reproducibility or you're going to merge / quantise further downstream.
TL;DR — leaderboard standing
- #1 on the ViDoRe v1 leaderboard among 4B-class models (V1 = 0.9214) — beats Nemotron-4B-v2 (0.916), athrael-soju-colqwen3.5-4.5B (0.915), Ops-Colqwen3-4B (0.914).
- #2 overall on the ViDoRe v1 leaderboard, behind only the 8B Nemotron-vl-8b-v2 (0.927).
- Competitive on ViDoRe v2 (0.6380 nDCG@5) within the 4B class.
- 4 B parameters, 1024-d per-token embedding, ≤ 2048 visual tokens / page — fits on a single 12 GB GPU at bf16.
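That last figure is simple arithmetic: bf16 stores 2 bytes per weight, so the 4.71 B-parameter checkpoint needs about 9.4 GB for the weights alone.

```python
# bf16 weight memory for the 4.71 B-parameter checkpoint (activations extra)
print(f"{4.71e9 * 2 / 1e9:.1f} GB")  # ~9.4 GB, with headroom left on a 12 GB card
```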
What is novel here
Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. Argus replaces that dense head with a sparse mixture in which the gates depend on both the visual token and a pooled query summary, so the same page gets routed differently for different queries:
- Region pooling — visual tokens are grouped into 4-token regions before routing.
- Query-conditioned latent gating (`GateScalars`) — the router input is `region + region_coord_proj(coords) + query_context_proj(pooled_query)`. The query summary makes routing task-aware: a financial-numbers query routes through a different expert than a layout query on the same page.
- Sparse top-k = 2 of 4 latent specialists, fused with an always-on shared dense expert via two learnable gating scalars.
- Region-aware load balancing — load-balance, KL-to-uniform, and 0.01·router-z² auxiliary losses suppress routing collapse.
- 3-stage curriculum — dense baseline (teacher) → MoE balance warmup → joint retrieval with KL distillation.
The router sits at backbone layer −5.
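A minimal PyTorch sketch of that routing path, under stated assumptions: `region_coord_proj`, `query_context_proj`, the top-k = 2 selection, and the two gating scalars follow the card's description, but all shapes, module signatures, and helper names here are illustrative, not the shipped implementation.

```python
import torch

def route_regions(vis_tokens, coords, pooled_query,
                  region_coord_proj, query_context_proj, router,
                  experts, shared_expert, gate_scalars,
                  region_size=4, top_k=2):
    """Illustrative region-pooled, query-conditioned routing (not the shipped code).

    vis_tokens:   (B, N, D) visual token states at the router layer
    coords:       (B, N // region_size, C) coordinates, one per region
    pooled_query: (B, D) pooled query summary
    """
    B, N, D = vis_tokens.shape
    # Region pooling: group visual tokens into regions of 4 before routing.
    regions = vis_tokens.view(B, N // region_size, region_size, D).mean(dim=2)

    # Router input = region + coordinate projection + query context (per the card).
    router_in = (regions
                 + region_coord_proj(coords)
                 + query_context_proj(pooled_query).unsqueeze(1))
    gate = router(router_in).softmax(dim=-1)          # (B, R, 4) over 4 specialists
    weights, idx = gate.topk(top_k, dim=-1)           # sparse top-k = 2
    weights = weights / weights.sum(-1, keepdim=True)

    sparse_out = torch.zeros_like(regions)
    for e, expert in enumerate(experts):
        w = (weights * (idx == e)).sum(-1, keepdim=True)  # zero where e not selected
        sparse_out = sparse_out + w * expert(regions)

    # Always-on shared dense expert, fused via two learnable gating scalars.
    out = gate_scalars[0] * sparse_out + gate_scalars[1] * shared_expert(regions)
    # Broadcast each region's output back to its member tokens.
    return out.repeat_interleave(region_size, dim=1)  # (B, N, D)
```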
Model details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-VL-4B-Instruct |
| Total parameters | 4.71 B |
| Per-token embedding dim | 1024 |
| Max visual tokens / page | 2048 |
| Max text tokens | 32 768 |
| Similarity function | MaxSim |
| MoE specialists | 4 latent + 1 shared dense |
| Top-k experts per token | 2 |
| Region size | 4 |
| Router placement | backbone layer −5 |
| Weight precision (this release) | bfloat16 |
| Adapted from | DataScience-UIBK/Argus-Colqwen3.5-4b-v0 (fp32 merge) |
| License | Apache 2.0 |
| Model size on disk | ~8.8 GB |
| VRAM @ bf16 inference | ~9.4 GB |
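The MaxSim row above is the standard ColPali late-interaction score. A compact sketch of that scoring function (the standard formulation, not code taken from this repo):

```python
import torch

def maxsim_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """ColPali-style MaxSim: for each query token, take the best-matching
    visual token, then sum over query tokens.

    q_emb: (Q, 1024) L2-normalised query token embeddings
    d_emb: (V, 1024) L2-normalised visual token embeddings (V <= 2048)
    """
    sim = q_emb @ d_emb.T              # (Q, V) token-level cosine similarities
    return sim.max(dim=1).values.sum() # best match per query token, summed
```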
Why two dtypes?
Merging a LoRA into the base requires materialising (α/r)·A·B and adding it to the base weight matrix.
- In bf16, both the delta cast and the addition lose precision — the gap is small but irreversible.
- In fp32, both are exact.
We publish both:
- fp32 preserves trained precision exactly. Larger disk, identical inference behaviour to the fp32 master used during evaluation.
- bf16 is the dtype most users will load anyway (`torch_dtype=torch.bfloat16`). Smaller disk; the merge precision loss is baked in.
If your downstream pipeline loads in bf16, the inference-time difference between the two repos is tiny — but the bf16 release saves bandwidth and disk. Use the fp32 sibling if you load in fp32 / fp64 (rare) or you're going to quantise further.
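A small self-contained experiment that makes the merge-dtype effect concrete. Shapes and weights here are hypothetical random tensors; only the (α/r)·A·B merge formula comes from the text above.

```python
import torch

torch.manual_seed(0)
d, r, alpha = 2048, 16, 32
base = torch.randn(d, d)                     # fp32 base weight
A, B = torch.randn(d, r), torch.randn(r, d)  # LoRA factors

delta = (alpha / r) * (A @ B)

merged_fp32 = base + delta                          # exact in fp32
merged_bf16 = base.bfloat16() + delta.bfloat16()    # cast, then add in bf16

err = (merged_bf16.float() - merged_fp32).abs().max()
print(f"max |bf16 merge - fp32 merge| = {err:.2e}")  # small, nonzero, irreversible
```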
Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)
Real per-task scores, measured with mteb 2.12 on the published bf16 weights.
| Task | nDCG@5 |
|---|---|
| ArxivQA | 0.9126 |
| DocVQA | 0.6779 🏆 |
| InfoVQA | 0.9447 |
| ShiftProject | 0.9346 |
| SyntheticDocQA-AI | 0.9926 |
| SyntheticDocQA-Energy | 0.9750 |
| SyntheticDocQA-Government | 0.9779 |
| SyntheticDocQA-Healthcare | 0.9963 🏆 |
| TabFQuAD | 0.9544 |
| TatDQA | 0.8485 |
| Average | 0.9214 |
🏆 = best in the 4B class for that task (cross-checked against published numbers from Ops-Colqwen3-4B, TomoroAI-colqwen3-embed-4b, SauerkrautLM-ColQwen3-4b, athrael-soju-colqwen3.5-4.5B).
ViDoRe v1 — 4B-class leaderboard comparison
| Rank | Model | Params | dim | V1 avg |
|---|---|---|---|---|
| 1 | Argus-Colqwen3.5-4b-v0-bf16 (this) | 4.0 B | 1024 | 0.9214 |
| 2 | nvidia/llama-nemotron-colembed-vl-3b-v2 | 3.0 B | hidden | 0.917 |
| 3 | nvidia/nemotron-colembed-vl-4b-v2 | 4.0 B | hidden | 0.916 |
| 4 | athrael-soju/colqwen3.5-4.5B-v3 | 4.5 B | 320 | 0.915 |
| 5 | OpenSearch-AI/Ops-Colqwen3-4B | 4.0 B | 2560 | 0.914 |
| 6 | nvidia/llama-nemoretriever-colembed-3b-v1 | 3.0 B | 512 | 0.910 |
| 7 | VAGOsolutions/SauerkrautLM-ColQwen3-4b-v0.1 | 4.0 B | 128 | 0.908 |
| 8 | TomoroAI/tomoro-colqwen3-embed-4b | 4.0 B | 320 | 0.906 |
(The only model surpassing Argus-4B on V1 overall is the 8B Nemotron-vl-8b-v2 at 0.927.)
Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)
| Task | nDCG@5 |
|---|---|
| BioMedicalLectures | 0.6349 |
| ESGReports-HighLevel | 0.7079 |
| ESGReports | 0.6175 |
| EconomicsReports | 0.5918 |
| Average | 0.6380 |
ViDoRe v2 — 4B-class context
| Model | V2 avg |
|---|---|
| Ops-Colqwen3-4B (dim 2560) | 0.687 |
| TomoroAI/tomoro-colqwen3-embed-4b | 0.660 |
| Argus-Colqwen3.5-4b-v0-bf16 | 0.638 |
V2 is the area we are still actively improving — Ops's wider 2560-d head pulls ahead on layout-heavy ESG / economics pages. Argus's compression to 1024-d is a 2.5× per-token storage saving over Ops's 2560-d embeddings, at the cost of a small V2 gap; the V1 lead more than compensates for retrieval workloads dominated by document QA.
ViDoRe v3
Not yet evaluated for this release.
Storage cost
| Model | Tokens/page | Dim | Bytes/page (bf16) |
|---|---|---|---|
| Ops-Colqwen3-4B | 1280 | 2560 | 6.6 MB |
| Argus-Colqwen3.5-4b-v0-bf16 | 2048 | 1024 | 4.2 MB |
| TomoroAI/tomoro-colqwen3-embed-4b | 1280 | 320 | 0.8 MB |
| SauerkrautLM-ColQwen3-4b-v0.1 | 1024 | 128 | 0.3 MB |
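The bytes/page column is plain arithmetic: tokens × dim × 2 bytes for bf16. For this release:

```python
tokens_per_page, dim, bytes_per_val = 2048, 1024, 2  # this release, bf16
print(f"{tokens_per_page * dim * bytes_per_val / 1e6:.1f} MB/page")  # 4.2 MB
```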
Installation
pip install "transformers>=5.0.0,<6.0.0"
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0,<6.0"
pip install flash-attn==2.6.3 --no-build-isolation # optional
rm -rf ~/.cache/huggingface/modules/transformers_modules
Usage
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor
MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-4b-v0-bf16"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(
MODEL_ID,
trust_remote_code=True,
torch_dtype=torch.bfloat16, # this release ships in bf16
attn_implementation="flash_attention_2",  # drop this line if flash-attn is not installed
device_map=DEVICE,
).eval()
processor = AutoProcessor.from_pretrained(
MODEL_ID,
trust_remote_code=True,
max_num_visual_tokens=2048,
)
queries = [
"What is the company's revenue in 2019?",
"How does the proposed model compare to baselines?",
]
documents = [
Image.open("page_a.png").convert("RGB"),
Image.open("page_b.png").convert("RGB"),
]
q_emb = model.encode_queries(processor, queries)
d_emb = model.encode_images(processor, documents)
scores = processor.score(q_emb, d_emb)
print(scores)
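Assuming the usual ColPali-style convention (an assumption, not confirmed by the card), `scores` is a (num_queries, num_docs) matrix of MaxSim scores, so ranking pages per query is a single sort:

```python
# Rank candidate pages for each query by MaxSim score (highest first).
ranking = scores.argsort(dim=-1, descending=True)  # (num_queries, num_docs)
print(ranking[:, 0])  # top-1 page index per query
```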
Reproduce ViDoRe results with MTEB
import mteb
m = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-4b-v0-bf16")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 4})
Training
Same recipe as the fp32 sibling — see its card for full details. The only difference is the merge dtype.
Limitations
- English-dominant (the ViDoRe evaluations above are English-only).
- Merge-time bf16 cast is irreversible — you cannot recover fp32 numbers by upcasting after load.
- ViDoRe v3 not yet evaluated.
- ~0.1 pp / ~0.2 pp behind the fp32 sibling on V1 / V2 — use that one if leaderboard parity matters.
License
Apache 2.0, inherited from Qwen3.5-VL-4B-Instruct.
Citation
@misc{argus2026,
title = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
author = {DataScience-UIBK team},
year = {2026},
url = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-4b-v0},
}
Contact
- Org: DataScience-UIBK, University of Innsbruck
- Issues: open one on this repo's Community tab.