Daily Model Scout Report — 2026-04-29
Scan window: 2026-04-22 → 2026-04-29 (last 7 days).
Filter: image-text-to-text VLMs that could improve garment classification, eval'd against our current best models on the 3,500-sample hard eval set (weighted_score metric):
| Current best | Score |
|---|---|
| qwen3-vl-8b-sft+grpo | 0.9131 |
| qwen3-vl-2b-sft-grpo-v9 | 0.8948 |
| qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 |
| qwen35-2b-base | 0.8437 |
(Note: granite4-vision-sft shows 1.0144 in the JSON — bad data point, excluded.)
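The scan step itself is mechanical. A minimal sketch of the candidate-gathering pass, assuming the standard `huggingface_hub` client (whether `created_at` is populated on the list endpoint depends on the client version):

```python
# Hedged sketch of the daily scan: list image-text-to-text models created inside
# the 7-day window and rank by downloads. Window dates mirror this report;
# created_at availability depends on the huggingface_hub version installed.
from datetime import datetime, timezone
from huggingface_hub import HfApi

WINDOW_START = datetime(2026, 4, 22, tzinfo=timezone.utc)
WINDOW_END = datetime(2026, 4, 29, tzinfo=timezone.utc)

api = HfApi()
candidates = [
    m for m in api.list_models(
        pipeline_tag="image-text-to-text",
        sort="downloads",
        direction=-1,
        limit=500,
    )
    if m.created_at and WINDOW_START <= m.created_at <= WINDOW_END
]

for m in candidates:
    print(f"{m.id:<60} downloads={m.downloads} likes={m.likes}")
```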
High Priority — Benchmark Immediately
1. Qwen3.6-27B (dense) — Apache 2.0
- HF: https://huggingface.co/Qwen/Qwen3.6-27B
- FP8 quant: https://huggingface.co/Qwen/Qwen3.6-27B-FP8
- 27B dense, native VLM (image + video + text), 262K context (1M with YaRN)
- Released 2026-04-21. Already 508K downloads, 991 likes in 8 days.
- Vision benchmarks (vendor-reported): MMMU 81.7, RealWorldQA 85.3, MMBench 92.8, VideoMMU 83.7
- Why it could beat our 0.9131:
- 3.4× larger than our top Qwen3-VL-8B base, same architecture family — our existing SFT+GRPO recipe should port over with minimal changes.
- Dense (not MoE) means GRPO/GTPO trains cleanly without expert-routing complications.
- FP8 weights ≈ 27 GB — fits comfortably on the RTX PRO 6000 98GB with headroom for KV cache + LoRA training.
- Action: Run zero-shot eval of `Qwen3.6-27B-FP8` on the 3.5K hard set this week to set a baseline; if ≥0.85, kick off SFT.
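That zero-shot baseline could look like the sketch below, assuming the Qwen3.6 checkpoints keep the current Qwen3-VL processor/chat-template interface in transformers; the dataset path, record fields, and prompt text are placeholders for our actual harness:

```python
# Hedged sketch: zero-shot baseline of Qwen3.6-27B-FP8 on the 3.5K hard set.
# Assumes the checkpoint loads through the generic image-text-to-text auto classes
# the way current Qwen3-VL models do; paths, field names, and prompt are placeholders.
import json
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "Qwen/Qwen3.6-27B-FP8"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)

def classify(image_path: str, prompt: str) -> str:
    """Run one zero-shot garment-classification query and return the raw model text."""
    image = Image.open(image_path)
    messages = [{
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": prompt}],
    }]
    text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)

# Usage: dump raw predictions for the weighted_score scorer.
with open("hard_eval_3500.jsonl") as f:  # placeholder path
    for line in f:
        sample = json.loads(line)
        pred = classify(sample["image_path"], sample["prompt"])
        print(json.dumps({"id": sample["id"], "prediction": pred}))
```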
2. Qwen3.6-35B-A3B (MoE) — Apache 2.0
- HF: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
- FP8 quant: https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8
- 35B total / 3B active MoE (256 experts, 8 routed + 1 shared), native VLM, hybrid Gated DeltaNet + Gated Attention layers
- Released 2026-04-15. 1.5M downloads, 1507 likes — most downloaded recent VLM by a wide margin.
- Vision benchmarks (vendor-reported) roughly comparable to the 27B dense model
- Why it could beat our 0.9131:
- Inference cost ≈ 3B-active model → potentially faster than our 8B at higher quality.
- Apache-2.0, FP8 already published, plus community NVFP4/MLX quants exist (e.g. `igf-oeaw/Qwen3.6-27B-NVFP4A16-VL-MTP`).
- Caveats:
- MoE PEFT/GRPO is trickier (expert balance loss, router stability). Worth piloting only after the 27B dense run shows lift.
- 35B FP8 ≈ 35GB still fits on 98GB but leaves less headroom for 32K image-text training contexts.
- Action: Defer until 27B dense is benchmarked. If 27B shows lift, attempt MoE run with FP8 + LoRA on routed experts only.
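If that MoE run does go ahead, a hedged starting point for "LoRA on routed experts only" with peft is sketched below; the expert module names in the regex are assumptions that need checking against the real Qwen3.6-35B-A3B module tree, and FP8 checkpoint loading in transformers also needs to be confirmed:

```python
# Hedged sketch: LoRA restricted to the routed experts' MLP projections on
# Qwen3.6-35B-A3B, keeping router (gating) and attention weights frozen so GRPO
# cannot destabilize expert routing. Module names in the regex are assumptions and
# must be verified with model.named_modules() on the real checkpoint.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B-FP8", device_map="auto", torch_dtype="auto"
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Regex target: only expert MLP projections, never the router or attention.
    target_modules=r".*\.experts\.\d+\.(gate_proj|up_proj|down_proj)",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check before wiring into the GRPO trainer
```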
3. Granite Vision 4.1 4B — Apache 2.0
- HF: https://huggingface.co/ibm-granite/granite-vision-4.1-4b
- 4B (3.4B LLM + 0.6B vision), SigLIP2-SO400M-patch16-384 encoder + LoRA adapters, 8 vision-to-LLM injection points (LayerDeepstack + SpatialDeepstack)
- Released 2026-04-29 (today). Direct successor to `granite-4.0-3b-vision`, the base of our current top 100-eval model (`Granite4-Vision-SFT` @ 88.25%).
- Vendor benchmarks emphasize structured extraction: 94.4% zero-shot KVP exact-match on VAREX — directly analogous to our 9-field JSON schema task.
- Why it could beat our 0.9131:
- Backward compatible with 4.0 — drop-in retrain of our existing SFT recipe.
- SpatialDeepstack injection points are designed for fine-grained visual feature retention, helpful for pattern/closure/sleeve discrimination where our Qwen models trail (Qwen3-VL-8B SFT+GRPO: 62% pattern, 42% closure on 100-eval).
- Tiny footprint (4B) → fast iteration; could SFT in <2h.
- Action: Highest-leverage candidate. Re-run our existing Granite-4 SFT pipeline against the new 4.1-4b base this week and compare to `Granite4-Vision-SFT`.
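For that comparison, the per-field breakdown quoted above (62% pattern, 42% closure for the Qwen baseline) is worth reproducing for both Granite checkpoints. A minimal scorer sketch, assuming predictions and gold labels are parallel JSONL files of flat dicts with shared field names (file names are placeholders):

```python
# Minimal sketch: per-field exact-match accuracy on the 100-sample eval, to compare
# a retrained granite-vision-4.1-4b against the current Granite4-Vision-SFT beyond a
# single aggregate number. Field names such as "pattern", "closure", "sleeve" come
# from this report; the remaining schema fields and file paths are placeholders.
import json
from collections import defaultdict

def per_field_accuracy(pred_path: str, gold_path: str) -> dict[str, float]:
    with open(pred_path) as f:
        preds = [json.loads(line) for line in f]
    with open(gold_path) as f:
        golds = [json.loads(line) for line in f]

    hits, totals = defaultdict(int), defaultdict(int)
    for pred, gold in zip(preds, golds):
        for field, gold_value in gold.items():
            totals[field] += 1
            if str(pred.get(field, "")).strip().lower() == str(gold_value).strip().lower():
                hits[field] += 1
    return {field: hits[field] / totals[field] for field in sorted(totals)}

# Usage: run once per model and diff the dicts field by field (pattern, closure, ...).
print(per_field_accuracy("granite41_sft_preds.jsonl", "eval100_gold.jsonl"))
```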
Medium Priority — Worth Watching
4. NVIDIA Nemotron-3-Nano-Omni 30B-A3B Reasoning — NVIDIA Open Model License
- HF: https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning
- 30B MoE / 3B active, hybrid Mamba/Transformer (NemotronH), any-to-any modality pipeline tag
- Released 2026-04-20. 9.8K downloads, 130 likes.
- Architecture is novel (Mamba backbone) — would require vLLM/transformers branch checks.
- License is "other" (NVIDIA Open Model License) — needs legal review before any production use.
- Action: Track for 1–2 weeks until community quants and serving recipes mature. Don't invest training time yet.
Low Priority — Tangential
5. Hcompany/Holotron-3-Nano (2026-04-27)
- HF: https://huggingface.co/Hcompany/Holotron-3-Nano
- 33B post-train of NVIDIA Nemotron-3-Nano-Omni, specialized for web/computer-use agents — not aimed at static image classification.
- Same NVIDIA Open Model License gating as #4. Skip for our use case.
6. DINOv3 LVD-1689M finetunes (canvit/*, 2026-04-25)
- Pure vision encoders (linear classifier probes on ImageNet1K). Not VLMs — would need pairing with an LLM head. Tangential to the JSON-extraction objective.
7. mistralai/Mistral-Small-4-119B-2603-eagle (2026-04-27)
- No vision pipeline tag, no vision-language tags in the model card. Text-only LLM. Skip.
Summary
Three concrete, actionable candidates dropped in the last 14 days:
| Rank | Candidate | Size | Effort | Risk | Why it matters here |
|---|---|---|---|---|---|
| 1 | ibm-granite/granite-vision-4.1-4b | 4B | Low (drop-in for existing Granite4 recipe) | Low | Direct upgrade to our current best 100-eval model |
| 2 | Qwen/Qwen3.6-27B-FP8 | 27B dense | Medium (port SFT+GRPO recipe) | Low | Direct architectural successor to our best 3.5K-eval model |
| 3 | Qwen/Qwen3.6-35B-A3B-FP8 | 35B MoE | High (MoE PEFT/GRPO complications) | Medium | Best raw vision benchmarks of the week, fast inference |
Recommended sequence: Granite 4.1-4b → Qwen3.6-27B-FP8 → Qwen3.6-35B-A3B-FP8.