Daily Model Scout Report — 2026-04-29

Scan window: 2026-04-22 → 2026-04-29 (last 7 days).
Filter: image-text-to-text VLMs that could improve garment classification, evaluated against our current best models on the 3,500-sample hard eval set (weighted_score metric):

Model                      | Score
qwen3-vl-8b-sft+grpo       | 0.9131
qwen3-vl-2b-sft-grpo-v9    | 0.8948
qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945
qwen35-2b-base             | 0.8437

(Note: granite4-vision-sft shows 1.0144 in the JSON — bad data point, excluded.)
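
The scan itself is reproducible in a few lines against the Hub API. A minimal sketch (illustrative, not the actual scout job; assumes huggingface_hub is installed):

```python
# Minimal sketch of the daily scan: list image-text-to-text models created
# in the last 7 days, newest first (illustrative, not the production job).
from datetime import datetime, timedelta, timezone

from huggingface_hub import HfApi

api = HfApi()
cutoff = datetime.now(timezone.utc) - timedelta(days=7)

for model in api.list_models(
    pipeline_tag="image-text-to-text",  # same filter as this report
    sort="createdAt",                   # newest releases first
    direction=-1,
    limit=200,
):
    if model.created_at and model.created_at >= cutoff:
        print(f"{model.id}  downloads={model.downloads}  likes={model.likes}")
```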


High Priority — Benchmark Immediately

1. Qwen3.6-27B (dense) — Apache 2.0

  • HF: https://huggingface.co/Qwen/Qwen3.6-27B
  • FP8 quant: https://huggingface.co/Qwen/Qwen3.6-27B-FP8
  • 27B dense, native VLM (image + video + text), 262K context (1M with YaRN)
  • Released 2026-04-21. Already 508K downloads, 991 likes in 8 days.
  • Vision benchmarks (vendor-reported): MMMU 81.7, RealWorldQA 85.3, MMBench 92.8, VideoMMU 83.7
  • Why it could beat our 0.9131:
    • 3.4× larger than our top Qwen3-VL-8B base, same architecture family — our existing SFT+GRPO recipe should port over with minimal changes.
    • Dense (not MoE) means GRPO/GTPO trains cleanly without expert-routing complications.
    • FP8 weights ≈ 27 GB — fits comfortably on the RTX PRO 6000 98GB with headroom for KV cache + LoRA training.
  • Action: Run a zero-shot eval of Qwen3.6-27B-FP8 on the 3.5K hard set this week to set a baseline (see the sketch below); if it scores ≥0.85, kick off SFT.
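
A minimal sketch of that baseline run, assuming the checkpoint is served behind an OpenAI-compatible endpoint (e.g. `vllm serve Qwen/Qwen3.6-27B-FP8`); `load_hard_set()` and `weighted_score()` are hypothetical stand-ins for our eval harness:

```python
# Sketch of the zero-shot baseline, assuming an OpenAI-compatible server
# (e.g. `vllm serve Qwen/Qwen3.6-27B-FP8` on localhost:8000).
# load_hard_set() and weighted_score() are stand-ins for our eval harness.
import base64
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def classify(image_path: str, prompt: str) -> dict:
    """Send one garment image plus the schema prompt, parse the JSON reply."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="Qwen/Qwen3.6-27B-FP8",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        temperature=0.0,  # deterministic for a baseline
    )
    return json.loads(resp.choices[0].message.content)  # assumes bare-JSON replies

predictions = [classify(s.image, s.prompt) for s in load_hard_set()]  # 3,500 samples
print("weighted_score:", weighted_score(predictions))
```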

2. Qwen3.6-35B-A3B (MoE) — Apache 2.0

  • HF: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
  • FP8 quant: https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8
  • 35B total / 3B active MoE (256 experts, 8 routed + 1 shared), native VLM, hybrid Gated DeltaNet + Gated Attention layers
  • Released 2026-04-15. 1.5M downloads, 1507 likes — most downloaded recent VLM by a wide margin.
  • Vendor-reported vision benchmarks comparable to the 27B dense model
  • Why it could beat our 0.9131:
    • Inference cost ≈ 3B-active model → potentially faster than our 8B at higher quality.
    • Apache-2.0, FP8 already published, plus community NVFP4/MLX quants exist (e.g. igf-oeaw/Qwen3.6-27B-NVFP4A16-VL-MTP).
  • Caveats:
    • MoE PEFT/GRPO is trickier (expert balance loss, router stability). Worth piloting only after the 27B dense run shows lift.
    • 35B FP8 ≈ 35GB still fits on 98GB but leaves less headroom for 32K image-text training contexts.
  • Action: Defer until the 27B dense run is benchmarked. If it shows lift, attempt an MoE run with FP8 + LoRA on the routed experts only (see the sketch below).
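
If it comes to that, the adapter setup could look like the following, assuming PEFT; the expert module-name regex is an assumption about the 3.6 MoE layout, so inspect model.named_modules() first:

```python
# Sketch of "LoRA on routed experts only", assuming peft + transformers.
# The target_modules regex is an assumption about the 3.6 MoE layout;
# verify against model.named_modules() before training.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B", torch_dtype="auto", device_map="auto"
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    # Adapt only the routed experts' MLP projections; leave the router
    # (gate) and the shared expert frozen for stability.
    target_modules=r".*mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # sanity check: experts only, router frozen
```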

3. Granite Vision 4.1 4B — Apache 2.0

  • HF: https://huggingface.co/ibm-granite/granite-vision-4.1-4b
  • 4B (3.4B LLM + 0.6B vision), SigLIP2-SO400M-patch16-384 encoder + LoRA adapters, 8 vision-to-LLM injection points (LayerDeepstack + SpatialDeepstack)
  • Released 2026-04-29 (today). Direct successor to granite-4.0-3b-vision, the base of our current top 100-eval model (Granite4-Vision-SFT @ 88.25%).
  • Vendor benchmarks emphasize structured extraction: 94.4% zero-shot KVP exact-match on VAREX — directly analogous to our 9-field JSON schema task.
  • Why it could beat our 0.9131:
    • Backward compatible with 4.0 — drop-in retrain of our existing SFT recipe.
    • SpatialDeepstack injection points are designed for fine-grained visual feature retention, helpful for pattern/closure/sleeve discrimination where our Qwen models trail (Qwen3-VL-8B SFT+GRPO: 62% pattern, 42% closure on 100-eval).
    • Tiny footprint (4B) → fast iteration; could SFT in <2h.
  • Action: Highest-leverage candidate. Re-run our existing Granite-4 SFT pipeline against the new 4.1-4b base this week (see the sketch below) and compare against Granite4-Vision-SFT.
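
Since 4.1 is advertised as backward compatible, the retrain should reduce to a base-checkpoint swap. A rough sketch assuming a TRL SFTTrainer-style pipeline; `sft_args` and `build_garment_dataset()` are stand-ins for the existing recipe, not shown here:

```python
# Sketch of the drop-in retrain: only the base checkpoint changes; everything
# else reuses the Granite-4.0 recipe. sft_args and build_garment_dataset()
# are stand-ins for our existing pipeline (assumed, not shown here).
from transformers import AutoModelForImageTextToText, AutoProcessor
from trl import SFTTrainer

BASE = "ibm-granite/granite-vision-4.1-4b"  # was ibm-granite/granite-4.0-3b-vision

processor = AutoProcessor.from_pretrained(BASE)
model = AutoModelForImageTextToText.from_pretrained(
    BASE, torch_dtype="auto", device_map="auto"
)

trainer = SFTTrainer(
    model=model,
    args=sft_args,                                   # unchanged from the 4.0 run
    train_dataset=build_garment_dataset(processor),  # same 9-field JSON targets
    processing_class=processor,
)
trainer.train()
```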

Medium Priority — Worth Watching

4. NVIDIA Nemotron-3-Nano-Omni 30B-A3B Reasoning — NVIDIA Open Model License

  • HF: https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning
  • 30B MoE / 3B active, hybrid Mamba/Transformer (NemotronH), any-to-any modality pipeline tag
  • Released 2026-04-20. 9.8K downloads, 130 likes.
  • Architecture is novel (Mamba backbone), so it would need vLLM/transformers support checks before anything else (see the sketch below).
  • License is "other" (NVIDIA Open Model License) — needs legal review before any production use.
  • Action: Track for 1–2 weeks until community quants and serving recipes mature. Don't invest training time yet.
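
A minimal sketch of that support check (assumes transformers and huggingface_hub; the repo may require accepting the NVIDIA license before download):

```python
# Quick check: does the installed transformers build know this architecture
# natively, or would it need trust_remote_code / a branch build?
# Note: downloading config.json may require accepting the model license.
import json

from huggingface_hub import hf_hub_download
from transformers import CONFIG_MAPPING

cfg_path = hf_hub_download(
    "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning", "config.json"
)
with open(cfg_path) as f:
    model_type = json.load(f).get("model_type")

print(f"model_type={model_type!r}, "
      f"native transformers support: {model_type in CONFIG_MAPPING}")
```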

Low Priority — Tangential

5. Hcompany/Holotron-3-Nano (2026-04-27)

  • HF: https://huggingface.co/Hcompany/Holotron-3-Nano
  • 33B post-train of NVIDIA Nemotron-3-Nano-Omni, specialized for web/computer-use agents — not aimed at static image classification.
  • Same NVIDIA Open Model License gating as #4. Skip for our use case.

6. DINOv3 LVD-1689M finetunes (canvit/*, 2026-04-25)

  • Pure vision encoders (linear classifier probes on ImageNet1K). Not VLMs — would need pairing with an LLM head. Tangential to the JSON-extraction objective.

7. mistralai/Mistral-Small-4-119B-2603-eagle (2026-04-27)

  • No vision pipeline tag, no vision-language tags in the model card. Text-only LLM. Skip.

Summary

Three concrete, actionable candidates dropped in the last 14 days:

Rank | Candidate                         | Size      | Effort                                     | Risk   | Why it matters here
1    | ibm-granite/granite-vision-4.1-4b | 4B        | Low (drop-in for existing Granite4 recipe) | Low    | Direct upgrade to our current best 100-eval model
2    | Qwen/Qwen3.6-27B-FP8              | 27B dense | Medium (port SFT+GRPO recipe)              | Low    | Direct architectural successor to our best 3.5K-eval model
3    | Qwen/Qwen3.6-35B-A3B-FP8          | 35B MoE   | High (MoE PEFT/GRPO complications)         | Medium | Best raw vision benchmarks of the week, fast inference

Recommended sequence: Granite 4.1-4b → Qwen3.6-27B-FP8 → Qwen3.6-35B-A3B-FP8.
