Daily Model Scout Report — 2026-04-29
Scan window: 2026-04-22 → 2026-04-29 (last 7 days).
Filter: image-text-to-text VLMs that could improve garment classification, eval'd against our current best models on the 3,500-sample hard eval set (weighted_score metric):
| Current best | Score |
|---|---|
| qwen3-vl-8b-sft+grpo | 0.9131 |
| qwen3-vl-2b-sft-grpo-v9 | 0.8948 |
| qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 |
| qwen35-2b-base | 0.8437 |
(Note: granite4-vision-sft shows 1.0144 in the JSON — bad data point, excluded.)
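The scan step itself is mechanical. A minimal sketch of the candidate-gathering pass, assuming the standard `huggingface_hub` client (whether `created_at` is populated on the list endpoint depends on the client version):

```python
# Hedged sketch of the daily scan: list image-text-to-text models created inside
# the 7-day window and rank by downloads. Window dates mirror this report;
# created_at availability depends on the huggingface_hub version installed.
from datetime import datetime, timezone
from huggingface_hub import HfApi

WINDOW_START = datetime(2026, 4, 22, tzinfo=timezone.utc)
WINDOW_END = datetime(2026, 4, 29, tzinfo=timezone.utc)

api = HfApi()
candidates = [
    m for m in api.list_models(
        pipeline_tag="image-text-to-text",
        sort="downloads",
        direction=-1,
        limit=500,
    )
    if m.created_at and WINDOW_START <= m.created_at <= WINDOW_END
]

for m in candidates:
    print(f"{m.id:<60} downloads={m.downloads} likes={m.likes}")
```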
High Priority — Benchmark Immediately
1. Qwen3.6-27B (dense) — Apache 2.0
- HF: https://huggingface.co/Qwen/Qwen3.6-27B
- FP8 quant: https://huggingface.co/Qwen/Qwen3.6-27B-FP8
- 27B dense, native VLM (image + video + text), 262K context (1M with YaRN)
- Released 2026-04-21. Already 508K downloads, 991 likes in 8 days.
- Vision benchmarks (vendor-reported): MMMU 81.7, RealWorldQA 85.3, MMBench 92.8, VideoMMU 83.7
- Why it could beat our 0.9131:
- 3.4× larger than our top Qwen3-VL-8B base, same architecture family — our existing SFT+GRPO recipe should port over with minimal changes.
- Dense (not MoE) means GRPO/GTPO trains cleanly without expert-routing complications.
- FP8 weights ≈ 27 GB — fits comfortably on the RTX PRO 6000 98GB with headroom for KV cache + LoRA training.
- Action: Run zero-shot eval of `Qwen3.6-27B-FP8` on the 3.5K hard set this week to set a baseline; if ≥0.85, kick off SFT.
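That zero-shot baseline could look like the sketch below, assuming the Qwen3.6 checkpoints keep the current Qwen3-VL processor/chat-template interface in transformers; the dataset path, record fields, and prompt text are placeholders for our actual harness:

```python
# Hedged sketch: zero-shot baseline of Qwen3.6-27B-FP8 on the 3.5K hard set.
# Assumes the checkpoint loads through the generic image-text-to-text auto classes
# the way current Qwen3-VL models do; paths, field names, and prompt are placeholders.
import json
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "Qwen/Qwen3.6-27B-FP8"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)

def classify(image_path: str, prompt: str) -> str:
    """Run one zero-shot garment-classification query and return the raw model text."""
    image = Image.open(image_path)
    messages = [{
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": prompt}],
    }]
    text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)

# Usage: dump raw predictions for the weighted_score scorer.
with open("hard_eval_3500.jsonl") as f:  # placeholder path
    for line in f:
        sample = json.loads(line)
        pred = classify(sample["image_path"], sample["prompt"])
        print(json.dumps({"id": sample["id"], "prediction": pred}))
```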
2. Qwen3.6-35B-A3B (MoE) — Apache 2.0
- HF: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
- FP8 quant: https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8
- 35B total / 3B active MoE (256 experts, 8 routed + 1 shared), native VLM, hybrid Gated DeltaNet + Gated Attention layers
- Released 2026-04-15. 1.5M downloads, 1507 likes — most downloaded recent VLM by a wide margin.
- Vision benchmarks (vendor-reported) roughly comparable to the 27B dense model
- Why it could beat our 0.9131:
- Inference cost ≈ 3B-active model → potentially faster than our 8B at higher quality.
- Apache-2.0, FP8 already published, plus community NVFP4/MLX quants exist (e.g. `igf-oeaw/Qwen3.6-27B-NVFP4A16-VL-MTP`).
- Caveats:
- MoE PEFT/GRPO is trickier (expert balance loss, router stability). Worth piloting only after the 27B dense run shows lift.
- 35B FP8 ≈ 35GB still fits on 98GB but leaves less headroom for 32K image-text training contexts.
- Action: Defer until 27B dense is benchmarked. If 27B shows lift, attempt MoE run with FP8 + LoRA on routed experts only.
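If that MoE run does go ahead, a hedged starting point for "LoRA on routed experts only" with peft is sketched below; the expert module names in the regex are assumptions that need checking against the real Qwen3.6-35B-A3B module tree, and FP8 checkpoint loading in transformers also needs to be confirmed:

```python
# Hedged sketch: LoRA restricted to the routed experts' MLP projections on
# Qwen3.6-35B-A3B, keeping router (gating) and attention weights frozen so GRPO
# cannot destabilize expert routing. Module names in the regex are assumptions and
# must be verified with model.named_modules() on the real checkpoint.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B-FP8", device_map="auto", torch_dtype="auto"
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Regex target: only expert MLP projections, never the router or attention.
    target_modules=r".*\.experts\.\d+\.(gate_proj|up_proj|down_proj)",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check before wiring into the GRPO trainer
```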
3. Granite Vision 4.1 4B — Apache 2.0
- HF: https://huggingface.co/ibm-granite/granite-vision-4.1-4b
- 4B (3.4B LLM + 0.6B vision), SigLIP2-SO400M-patch16-384 encoder + LoRA adapters, 8 vision-to-LLM injection points (LayerDeepstack + SpatialDeepstack)
- Released 2026-04-29 (today). Direct successor to `granite-4.0-3b-vision`, the base of our current top 100-eval model (`Granite4-Vision-SFT` @ 88.25%).
- Vendor benchmarks emphasize structured extraction: 94.4% zero-shot KVP exact-match on VAREX — directly analogous to our 9-field JSON schema task.
- Why it could beat our 0.9131:
- Backward compatible with 4.0 — drop-in retrain of our existing SFT recipe.
- SpatialDeepstack injection points are designed for fine-grained visual feature retention, helpful for pattern/closure/sleeve discrimination where our Qwen models trail (Qwen3-VL-8B SFT+GRPO: 62% pattern, 42% closure on 100-eval).
- Tiny footprint (4B) → fast iteration; could SFT in <2h.
- Action: Highest-leverage candidate. Re-run our existing Granite-4 SFT pipeline against the new 4.1-4b base this week and compare to `Granite4-Vision-SFT`.
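For that comparison, the per-field breakdown quoted above (62% pattern, 42% closure for the Qwen baseline) is worth reproducing for both Granite checkpoints. A minimal scorer sketch, assuming predictions and gold labels are parallel JSONL files of flat dicts with shared field names (file names are placeholders):

```python
# Minimal sketch: per-field exact-match accuracy on the 100-sample eval, to compare
# a retrained granite-vision-4.1-4b against the current Granite4-Vision-SFT beyond a
# single aggregate number. Field names such as "pattern", "closure", "sleeve" come
# from this report; the remaining schema fields and file paths are placeholders.
import json
from collections import defaultdict

def per_field_accuracy(pred_path: str, gold_path: str) -> dict[str, float]:
    with open(pred_path) as f:
        preds = [json.loads(line) for line in f]
    with open(gold_path) as f:
        golds = [json.loads(line) for line in f]

    hits, totals = defaultdict(int), defaultdict(int)
    for pred, gold in zip(preds, golds):
        for field, gold_value in gold.items():
            totals[field] += 1
            if str(pred.get(field, "")).strip().lower() == str(gold_value).strip().lower():
                hits[field] += 1
    return {field: hits[field] / totals[field] for field in sorted(totals)}

# Usage: run once per model and diff the dicts field by field (pattern, closure, ...).
print(per_field_accuracy("granite41_sft_preds.jsonl", "eval100_gold.jsonl"))
```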
Medium Priority — Worth Watching
4. NVIDIA Nemotron-3-Nano-Omni 30B-A3B Reasoning — NVIDIA Open Model License
- HF: https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning
- 30B MoE / 3B active, hybrid Mamba/Transformer (NemotronH), any-to-any modality pipeline tag
- Released 2026-04-20. 9.8K downloads, 130 likes.
- Architecture is novel (Mamba backbone) — would require vLLM/transformers branch checks.
- License is "other" (NVIDIA Open Model License) — needs legal review before any production use.
- Action: Track for 1–2 weeks until community quants and serving recipes mature. Don't invest training time yet.
Low Priority — Tangential
5. Hcompany/Holotron-3-Nano (2026-04-27)
- HF: https://huggingface.co/Hcompany/Holotron-3-Nano
- 33B post-train of NVIDIA Nemotron-3-Nano-Omni, specialized for web/computer-use agents — not aimed at static image classification.
- Same NVIDIA Open Model License gating as #4. Skip for our use case.
6. DINOv3 LVD-1689M finetunes (canvit/*, 2026-04-25)
- Pure vision encoders (linear classifier probes on ImageNet1K). Not VLMs — would need pairing with an LLM head. Tangential to the JSON-extraction objective.
7. mistralai/Mistral-Small-4-119B-2603-eagle (2026-04-27)
- No vision pipeline tag, no vision-language tags in the model card. Text-only LLM. Skip.
Summary
Three concrete, actionable candidates dropped in the last 14 days:
| Rank | Candidate | Size | Effort | Risk | Why it matters here |
|---|---|---|---|---|---|
| 1 | ibm-granite/granite-vision-4.1-4b | 4B | Low (drop-in for existing Granite4 recipe) | Low | Direct upgrade to our current best 100-eval model |
| 2 | Qwen/Qwen3.6-27B-FP8 | 27B dense | Medium (port SFT+GRPO recipe) | Low | Direct architectural successor to our best 3.5K-eval model |
| 3 | Qwen/Qwen3.6-35B-A3B-FP8 | 35B MoE | High (MoE PEFT/GRPO complications) | Medium | Best raw vision benchmarks of the week, fast inference |
Recommended sequence: Granite 4.1-4b → Qwen3.6-27B-FP8 → Qwen3.6-35B-A3B-FP8.