- Argus-Colqwen3.5-9b-v0 · bf16 release
- TL;DR — leaderboard standing
- What is novel here
- Model details
- Why two dtypes?
- Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)
- Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)
- ViDoRe v3
- Storage cost
- Installation
- Usage
- Reproduce ViDoRe results with MTEB
- Training
- When to use which Argus variant
- Limitations
- License
- Citation
- Contact
Argus-Colqwen3.5-9b-v0 · bf16 release
Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval
University of Innsbruck — Data Science group · 2026
DataScience-UIBK/Argus-Colqwen3.5-9b-v0-bf16 is the bfloat16 merged release of Argus-Colqwen3.5-9b. It is the exact same trained network as the fp32 sibling DataScience-UIBK/Argus-Colqwen3.5-9b-v0; only the LoRA-merge dtype differs.
The bf16 release is half the disk size (17 GB vs 33 GB), faster to download, and easier to deploy on memory-constrained GPUs — at the cost of ~0.001 nDCG@5 on V1 and ~0.002 on V2 (well within ViDoRe eval noise).
| | Disk | V1 avg nDCG@5 | V2 avg nDCG@5 |
|---|---|---|---|
| fp32 sibling (-v0) | 33 GB | 0.9267 | 0.6915 |
| this release (-v0-bf16) | 17 GB | 0.9259 | 0.6897 |
| Δ vs fp32 | −16 GB | −0.0008 | −0.0018 |
Use this bf16 release if you want a smaller deployable artefact and can tolerate sub-percent score differences (almost every deployment scenario). Use the fp32 sibling if you need leaderboard-grade reproducibility or you're going to merge / quantise further downstream.
TL;DR — leaderboard standing
- Co-leads the ViDoRe v1 leaderboard: V1 = 0.9259, within ~0.001 of the 8B Nemotron-vl-8b-v2 (0.927) and ahead of every other public retriever (the fp32 sibling ties the lead at 0.9267).
- V2 = 0.6897 — the best Argus V2 result after its own fp32 sibling (0.6915), and +0.05 over the 4B sibling.
- 8.8 B parameters, 1024-d per-token embedding, ≤ 2048 visual tokens / page — fits on a single 24 GB GPU at bf16 inference.
What is novel here
Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. Argus replaces that dense head with a sparse mixture in which the gates depend on both the visual token and a pooled query summary, so the same page gets routed differently for different queries:
- Region pooling — visual tokens are grouped into 4-token regions before routing.
- Query-conditioned latent gating (`GateScalars`) — router input is `region + region_coord_proj(coords) + query_context_proj(pooled_query)`. The query summary makes routing task-aware: a financial-numbers query routes through a different expert than a layout query on the same page.
- Sparse top-k=2 of 4 latent specialists, fused with an always-on shared dense expert via two learnable gating scalars.
- Region-aware load balancing — load-balance, KL-uniform, and 0.01·router-z² auxiliary losses suppress routing collapse.
- 3-stage curriculum — dense baseline (teacher) → MoE balance warmup → joint retrieval with KL distillation. The 9B joint stage was extended on VDR1.5M + Docmatix (`vdr_docmatix_full`).
The router sits at backbone layer −5.
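To make the routing concrete, here is a minimal PyTorch sketch of the mechanism the bullets describe. It is an illustration, not the shipped code: apart from `region_coord_proj`, `query_context_proj`, and the two gating scalars, every name and shape is a hypothetical stand-in, and the auxiliary losses are omitted.

```python
import torch
import torch.nn as nn

class RegionRouterSketch(nn.Module):
    # Hypothetical stand-in for the Argus router; only region_coord_proj,
    # query_context_proj, and the gating scalars are named after the card.
    def __init__(self, dim=1024, n_experts=4, top_k=2, region_size=4):
        super().__init__()
        self.region_size, self.top_k = region_size, top_k
        self.region_coord_proj = nn.Linear(2, dim)       # (x, y) region coordinates
        self.query_context_proj = nn.Linear(dim, dim)    # pooled query summary
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.shared_expert = nn.Linear(dim, dim)         # always-on dense path
        self.gate_scalars = nn.Parameter(torch.ones(2))  # fuse sparse + shared outputs

    def forward(self, visual_tokens, coords, pooled_query):
        # visual_tokens: (B, T, D); coords: (B, T // region_size, 2); pooled_query: (B, D)
        B, T, D = visual_tokens.shape
        # region pooling: group visual tokens into 4-token regions before routing
        regions = visual_tokens.view(B, T // self.region_size, self.region_size, D).mean(2)
        # query-conditioned router input: region + coord projection + pooled-query projection
        router_in = (regions
                     + self.region_coord_proj(coords)
                     + self.query_context_proj(pooled_query)[:, None, :])
        weights, idx = self.router(router_in).softmax(-1).topk(self.top_k, dim=-1)
        # sparse top-k=2 combine, computed densely here for clarity
        sparse = torch.zeros_like(regions)
        for e, expert in enumerate(self.experts):
            w = (weights * (idx == e)).sum(-1, keepdim=True)  # 0 where expert e not selected
            sparse = sparse + w * expert(regions)
        # two learnable gating scalars fuse the sparse mixture with the shared expert
        return self.gate_scalars[0] * sparse + self.gate_scalars[1] * self.shared_expert(regions)
```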
Model details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-VL-9B-Instruct |
| Total parameters | 8.82 B |
| Per-token embedding dim | 1024 |
| Max visual tokens / page | 2048 |
| Max text tokens | 32 768 |
| Similarity function | MaxSim |
| MoE specialists | 4 latent + 1 shared dense |
| Top-k experts per token | 2 |
| Region size | 4 |
| Router placement | backbone layer −5 |
| Weight precision (this release) | bfloat16 |
| Adapted from | DataScience-UIBK/Argus-Colqwen3.5-9b-v0 (fp32 merge) |
| License | Apache 2.0 |
| Model size on disk | ~17 GB |
| VRAM @ bf16 inference | ~17 GB |
Why two dtypes?
Merging a LoRA into the base requires materialising (α/r)·A·B and adding it to the base weight matrix.
- In bf16, both the delta cast and the addition lose precision — the gap is small but irreversible.
- In fp32, both are exact.
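A minimal sketch of that trade-off below; α, r, and the shapes are made up for illustration, and the real merge applies this per adapted weight matrix.

```python
import torch

# Illustrative LoRA merge; alpha, r, and all shapes are invented for the example.
d_out, d_in, r, alpha = 1024, 1024, 16, 32
W = torch.randn(d_out, d_in)       # base weight (fp32)
A = torch.randn(d_out, r) * 0.01   # LoRA factors
B = torch.randn(r, d_in) * 0.01

delta = (alpha / r) * (A @ B)      # materialise the (α/r)·A·B term

merged_fp32 = W + delta                        # exact in fp32
merged_bf16 = W.bfloat16() + delta.bfloat16()  # both the cast and the add round here

# The rounding is baked in: upcasting afterwards does not recover the fp32 numbers.
print((merged_fp32 - merged_bf16.float()).abs().max())
```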
For the 9B release the merge cost is ~0.001 V1 / ~0.002 V2 — within the noise floor of a single ViDoRe run. We publish both:
- fp32 preserves trained precision exactly. Larger disk, identical inference behaviour to the fp32 master used during evaluation.
- bf16 is the dtype most users will load anyway. Half the disk, the merge precision loss is baked in but immaterial in practice.
If your downstream pipeline loads in bf16, the bf16 release is the strict win — same scores within rounding, half the disk.
Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)
Per-task scores, measured with mteb 2.12 on the published weights, side-by-side with every Argus sibling.
| Task | 2B fp32 | 2B bf16 | 4B fp32 | 4B bf16 | 9B fp32 | 9B bf16 (this) |
|---|---|---|---|---|---|---|
| ArxivQA | 0.9027 | 0.9027 | 0.9095 | 0.9126 | 0.9228 | 0.9217 |
| DocVQA | 0.6747 | 0.6747 | 0.6770 | 0.6779 | 0.6809 | 0.6826 |
| InfoVQA | 0.9497 | 0.9497 | 0.9463 | 0.9447 | 0.9426 | 0.9449 |
| ShiftProject | 0.9133 | 0.9133 | 0.9470 | 0.9346 | 0.9365 | 0.9298 |
| SyntheticDocQA-AI | 0.9963 | 0.9963 | 0.9963 | 0.9926 | 0.9963 | 0.9926 |
| SyntheticDocQA-Energy | 0.9726 | 0.9726 | 0.9789 | 0.9750 | 0.9732 | 0.9769 |
| SyntheticDocQA-Government | 0.9729 | 0.9729 | 0.9779 | 0.9779 | 0.9889 | 0.9889 |
| SyntheticDocQA-Healthcare | 0.9926 | 0.9926 | 0.9963 | 0.9963 | 0.9963 | 0.9926 |
| TabFQuAD | 0.9336 | 0.9336 | 0.9533 | 0.9544 | 0.9750 | 0.9724 |
| TatDQA | 0.8403 | 0.8403 | 0.8480 | 0.8485 | 0.8545 | 0.8567 |
| Average | 0.9149 | 0.9149 | 0.9230 | 0.9214 | 0.9267 | 0.9259 |
ViDoRe v1 — overall leaderboard comparison
| Rank | Model | Params | dim | V1 avg |
|---|---|---|---|---|
| 1 | nvidia/nemotron-vl-8b-v2 | 8.0 B | hidden | 0.927 |
| 1 | Argus-Colqwen3.5-9b-v0 (fp32 sibling) | 8.8 B | 1024 | 0.9267 |
| 3 | Argus-Colqwen3.5-9b-v0-bf16 (this) | 8.8 B | 1024 | 0.9259 |
| 4 | Argus-Colqwen3.5-4b-v0 (sibling, fp32) | 4.0 B | 1024 | 0.9230 |
| 5 | nvidia/llama-nemotron-colembed-vl-3b-v2 | 3.0 B | hidden | 0.917 |
| 6 | nvidia/nemotron-colembed-vl-4b-v2 | 4.0 B | hidden | 0.916 |
| 7 | athrael-soju/colqwen3.5-4.5B-v3 | 4.5 B | 320 | 0.915 |
| 8 | OpenSearch-AI/Ops-Colqwen3-4B | 4.0 B | 2560 | 0.914 |
Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)
| Task | 2B fp32 | 2B bf16 | 4B fp32 | 4B bf16 | 9B fp32 | 9B bf16 (this) |
|---|---|---|---|---|---|---|
| BioMedicalLectures | 0.6499 | 0.6499 | 0.6438 | 0.6349 | 0.6619 | 0.6633 |
| ESGReports-HighLevel | 0.6936 | 0.6936 | 0.6991 | 0.7079 | 0.7905 | 0.7912 |
| ESGReports | 0.5988 | 0.5988 | 0.6218 | 0.6175 | 0.6760 | 0.6764 |
| EconomicsReports | 0.5186 | 0.5186 | 0.5980 | 0.5918 | 0.6377 | 0.6278 |
| Average | 0.6152 | 0.6152 | 0.6407 | 0.6380 | 0.6915 | 0.6897 |
ViDoRe v2 — overall context
| Model | V2 avg |
|---|---|
| Argus-Colqwen3.5-9b-v0 (fp32 sibling) | 0.6915 |
| Argus-Colqwen3.5-9b-v0-bf16 (this) | 0.6897 |
| Ops-Colqwen3-4B (dim 2560) | 0.687 |
| TomoroAI/tomoro-colqwen3-embed-4b | 0.660 |
| Argus-Colqwen3.5-4b-v0 (sibling, fp32) | 0.6407 |
ViDoRe v3
Not yet evaluated for this release.
Storage cost
| Model | Tokens/page | Dim | Bytes/page (bf16) |
|---|---|---|---|
| Ops-Colqwen3-4B | 1280 | 2560 | 6.6 MB |
| Argus-Colqwen3.5-9b-v0-bf16 | 2048 | 1024 | 4.2 MB |
| Argus-Colqwen3.5-4b-v0-bf16 | 2048 | 1024 | 4.2 MB |
| TomoroAI/tomoro-colqwen3-embed-4b | 1280 | 320 | 0.8 MB |
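Bytes/page follows tokens × dim × 2 bytes (bf16); for the Argus releases, 2048 × 1024 × 2 B = 4,194,304 B ≈ 4.2 MB.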
Installation
pip install "transformers>=5.0.0,<6.0.0"
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0,<6.0"
pip install flash-attn==2.6.3 --no-build-isolation # optional
rm -rf ~/.cache/huggingface/modules/transformers_modules
Usage
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-9b-v0-bf16"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # this release ships in bf16
    attn_implementation="flash_attention_2",  # requires flash-attn; remove to use the default attention
    device_map=DEVICE,
).eval()

processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=2048,
)

queries = [
    "What is the company's revenue in 2019?",
    "How does the proposed model compare to baselines?",
]
documents = [
    Image.open("page_a.png").convert("RGB"),
    Image.open("page_b.png").convert("RGB"),
]

q_emb = model.encode_queries(processor, queries)    # per-token query embeddings
d_emb = model.encode_images(processor, documents)   # per-token page embeddings
scores = processor.score(q_emb, d_emb)              # MaxSim scores, shape (n_queries, n_pages)
print(scores)
```
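For reference, the similarity function is MaxSim (see Model details). A minimal sketch of that late-interaction scoring, assuming both inputs are already-normalised per-token embedding tensors and ignoring the padding handling the shipped scorer does:

```python
import torch

def maxsim_scores(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    # q_emb: (n_queries, q_tokens, 1024); d_emb: (n_pages, page_tokens, 1024)
    sim = torch.einsum("qtd,psd->qpts", q_emb, d_emb)  # all query-token / page-token similarities
    # each query token keeps its best-matching page token, then sum over query tokens
    return sim.max(dim=-1).values.sum(dim=-1)          # (n_queries, n_pages)
```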
Reproduce ViDoRe results with MTEB
```python
import mteb

m = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-9b-v0-bf16")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 2})  # lower batch_size if you run out of VRAM
```
Training
Same recipe as the fp32 sibling — see its card for full details. The only difference is the merge dtype.
When to use which Argus variant
| Use case | Recommendation |
|---|---|
| Smallest deployable artefact at the 9B scale | 9B bf16 (this) — strict winner over 9B fp32 at half the disk |
| Maximum precision at 9B for downstream merging / quantisation | 9B fp32 |
| Latency-sensitive retrieval / 24 GB GPU budget | 4B bf16 |
| Smallest model still in the 0.91+ V1 league | 2B bf16 (4.6 GB) |
Limitations
- English-dominant.
- Merge-time bf16 cast is irreversible — you cannot recover fp32 numbers by upcasting after load.
- ViDoRe v3 not yet evaluated.
- Needs ~17 GB VRAM at bf16 inference; single-GPU users need ≥ 24 GB.
License
Apache 2.0, inherited from Qwen3.5-VL-9B-Instruct.
Citation
```bibtex
@misc{argus2026,
  title  = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
  author = {DataScience-UIBK team},
  year   = {2026},
  url    = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-9b-v0},
}
```
Contact
- Org: DataScience-UIBK, University of Innsbruck
- Issues: open one on this repo's Community tab.