Daily Model Scout Report — 2026-04-30
Survey of new VLM releases on HuggingFace from 2026-04-23 → 2026-04-30 (last 7 days), evaluated for relevance to our 9-field garment attribute classification task. Compared against the current best models on the 3,500-sample hard eval set (weighted score; a scoring sketch follows the table):
| Current best | Weighted score |
|---|---|
| qwen3-vl-8b-sft+grpo | 0.9131 |
| qwen3-vl-2b-sft-grpo-v9 | 0.8948 |
| qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 |
| qwen35-2b-base | 0.8437 |
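For context on how candidates get ranked against these numbers, the sketch below shows the shape of the scoring: a field-weighted exact-match accuracy over the hard set. The field names and weights are illustrative placeholders, not the production configuration.

```python
# Minimal sketch of the weighted eval score. Field names and weights are
# illustrative placeholders, NOT the production configuration.
from typing import Dict, List

FIELD_WEIGHTS: Dict[str, float] = {
    "category": 1.0, "color": 1.0, "pattern": 1.0, "material": 1.0,
    "sleeve": 1.0, "neckline": 1.0, "closure": 1.0, "fit": 1.0, "length": 1.0,
}

def weighted_score(preds: List[Dict[str, str]], golds: List[Dict[str, str]]) -> float:
    """Field-weighted exact-match accuracy over the hard eval set."""
    total_weight = sum(FIELD_WEIGHTS.values()) * len(golds)
    earned = 0.0
    for pred, gold in zip(preds, golds):
        for field, weight in FIELD_WEIGHTS.items():
            if pred.get(field) == gold.get(field):
                earned += weight
    return earned / total_weight
```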
HIGH relevance — benchmark immediately
1. Qwen/Qwen3.6-27B
- Link: https://huggingface.co/Qwen/Qwen3.6-27B
- Architecture: Dense Qwen3.6-VL, 27B params, hybrid Gated DeltaNet + Gated Attention, vision encoder, 256K native context (1M with YaRN)
- License: Apache 2.0 — released 2026-04-21
- Why it may beat current best:
- First official open-weight Qwen3.6 release (direct successor to the Qwen3.5/Qwen3-VL families that produce all four of our top scores)
- Reports MMMU 82.9, RefCOCO avg 92.5, V* 94.7 — meaningfully above Qwen3-VL-8B baselines
- 27B dense fits on RTX PRO 6000 98GB at BF16 (~54GB) with room for SFT/GRPO
- Same Qwen-VL processor → minimal pipeline plumbing to swap in (load sketch after this item)
- Risk: ~3.4× larger than current production 8B → slower training and inference; quantization (NVFP4 / FP8) likely required for serving
- Action: SFT on our 7,672-row apparel-capture-8k → eval on 3,500 hard set
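To make the swap-in claim concrete, a minimal load-and-infer sketch follows, assuming the 27B checkpoint loads through the generic AutoProcessor / AutoModelForImageTextToText path like the existing Qwen-VL models; exact class names and content keys for the Qwen3.6 release should be verified against its model card.

```python
# Sketch: swap the 27B checkpoint into the existing Qwen-VL inference path.
# Assumes the repo loads via the generic Auto* classes; verify the chat-template
# content keys against the model card before wiring into the pipeline.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "Qwen/Qwen3.6-27B"  # from this report

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("garment.jpg")  # placeholder sample
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Return the 9 garment attributes as JSON."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```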
2. Qwen/Qwen3.6-35B-A3B
- Link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
- Architecture: MoE Qwen3.6-VL, 35B total / 3B active, 256 experts (8 routed + 1 shared), hybrid Mamba-style + attention, vision encoder
- License: Apache 2.0 — released 2026-04-15
- Why it may beat current best:
- 3B active means inference cost similar to our 2B class while drawing on 35B capacity
- Same Qwen3.6 vision stack as #1 — best-in-class vision benchmarks
- Excellent fit for the 98GB GPU (full BF16 ~70GB; NVFP4 ~22GB)
- Community has already shipped FP8 / NVFP4 / GPTQ-Int4 / MLX-VL variants in the past 7 days — vLLM serving path is unblocked
- Risk: MoE + LoRA SFT is fiddlier than dense; routing may interact poorly with our narrow JSON-output task
- Action: SFT-then-GRPO at small scale; if competitive with qwen3-vl-8b-sft+grpo on hard eval, scale up
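Given that quantized variants and a vLLM path already exist, the small-scale probe could start from offline vLLM inference. A sketch follows, with a hypothetical quant repo name standing in for whichever community variant validates first.

```python
# Sketch: zero-shot probe of a quantized Qwen3.6-35B-A3B via vLLM offline
# inference. The repo name is a hypothetical placeholder; substitute the
# community FP8/NVFP4 quant that actually validates.
import base64
import io

from PIL import Image
from vllm import LLM, SamplingParams

MODEL_ID = "some-org/Qwen3.6-35B-A3B-NVFP4"  # hypothetical placeholder

def to_data_uri(path: str) -> str:
    """Encode a local garment image as a data URI for the chat API."""
    buf = io.BytesIO()
    Image.open(path).convert("RGB").save(buf, format="JPEG")
    return "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()

llm = LLM(model=MODEL_ID, max_model_len=8192, trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=256)

for path in ["sample_001.jpg"]:  # extend to the 100-sample probe set
    messages = [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": to_data_uri(path)}},
            {"type": "text", "text": "Return the 9 garment attributes as JSON."},
        ],
    }]
    out = llm.chat(messages, params)
    print(path, out[0].outputs[0].text)
```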
MEDIUM relevance — worth watching / spot-test
3. nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning
- Link: https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 (also FP8 / NVFP4)
- Architecture: Mamba2-Transformer Hybrid MoE, 31B total / 3B active, CRADIO-v4-H vision encoder, Parakeet audio encoder
- License: NVIDIA Open Model Agreement (commercial OK) — released 2026-04-28
- Why interesting:
- Native JSON output + tool-calling + reasoning mode — could match our structured 9-field extraction task very directly (schema-validation sketch after this item)
- Reasoning lifts hard-sample accuracy on similar tasks (CharXiv +35%, OCRBenchV2 +18% over predecessor)
- NVFP4 fits in ~21GB — leaves room for bigger batches than our current 8B FP8 setup
- Risk: Heavier omni-modal pretrain may not transfer to a narrow vision-only task; non-Apache license; Mamba/MoE training recipe is less battle-tested in our pipeline
- Action: Zero-shot eval on 100-sample first to gauge baseline before committing to SFT
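Since the probe should check schema compliance as well as accuracy, the sketch below validates raw model output against the 9-field contract before scoring. The field names are hypothetical stand-ins for the real schema.

```python
# Sketch: validate that zero-shot outputs parse into the 9-field schema before
# scoring. Field names here are hypothetical stand-ins for the real schema.
import json

from pydantic import BaseModel, ValidationError

class GarmentAttributes(BaseModel):
    category: str
    color: str
    pattern: str
    material: str
    sleeve: str
    neckline: str
    closure: str
    fit: str
    length: str

def parse_prediction(raw: str) -> GarmentAttributes | None:
    """Return a validated record, or None if the model broke the JSON contract."""
    try:
        return GarmentAttributes.model_validate(json.loads(raw))
    except (json.JSONDecodeError, ValidationError):
        return None
```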
4. ibm-granite/granite-4.0-3b-vision
- Link: https://huggingface.co/ibm-granite/granite-4.0-3b-vision
- Architecture: Granite 4.0 Micro 3.5B + 0.5B LoRA + SigLIP2 vision encoder + Window Q-Former w/ 4× compression
- License: Apache 2.0 — refreshed 2026-04-30
- Why interesting:
- Specifically designed for structured extraction (chart→CSV, table→JSON, KVP extraction) — same shape as our task
- Only ~4B params → cheap SFT, fast inference
- This is the upstream of our existing granite4-vision-sft (which has the suspicious 1.0144 score that should be re-validated). Re-baselining against the official upstream will tell us whether the in-house variant truly outperforms or whether the eval is broken
- Risk: Skewed toward documents/charts; garment imagery may be out-of-distribution for the SigLIP2 encoder's fine-tuning
- Action: Re-run eval on stock Granite 4.0 3B Vision to validate our in-house granite4-vision-sft score
5. nvidia/Cosmos-Reason2-8B
- Link: https://huggingface.co/nvidia/Cosmos-Reason2-8B
- Architecture: Built on top of Qwen3-VL-8B-Instruct; ViT + dense LLM, 8.7B params
- License: NVIDIA Open Model License (Apache-2.0-derived, commercial OK) — refreshed 2026-04-30
- Why interesting:
- Same backbone as our current best (qwen3-vl-8b-sft+grpo @ 0.9131) → drop-in replacement starting point
- NVIDIA reports +1.75 / +3.82 / +21.5 / +27.3 pts over Qwen3-VL-8B on physical-AI categories — improvements likely come from better spatial/object reasoning, which could transfer to closure/sleeve/neckline fields where we still have headroom
- Risk: Optimized for video/embodied reasoning, so the spatial gains may not transfer to our text-attribute extraction. A 2B variant is also available
- Action: Quick zero-shot 100-sample probe; SFT only if delta vs Qwen3-VL-8B base is positive
6. google/gemma-4-E4B-it (and gemma-4-31B-it)
- Link: https://huggingface.co/google/gemma-4-E4B-it , https://huggingface.co/google/gemma-4-31B-it
- Architecture: PLE (per-layer embeddings), hybrid local/global attention, ~150M vision encoder; E4B = 4.5B effective / 8B total; 31B dense variant also available
- License: Apache 2.0 — refreshed 2026-04-28 (originally Mar 2026)
- Why interesting:
- E4B at 4.5B-effective could match or beat our 2B-class model with less inference cost than qwen3-vl-8b
- Native multilingual support (140+ languages) — useful if Nike ReLo expands to non-English brand text
- Edge-optimized variants (E2B at 2.3B effective) for future on-device deployment
- Risk: Less prior art on JSON-extraction fine-tuning vs Qwen-VL; may need more SFT data to stabilize structured output
- Action: Lower priority than Qwen3.6 path; revisit if Qwen3.6-VL-8B/4B doesn't ship in the next 1-2 weeks
LOW relevance
| Model | Reason |
|---|---|
| nvidia/Cosmos-Reason2-2B | Physical-AI specialization; 2B-class already covered by qwen3-vl-2b-sft-grpo-v9 (0.8948) |
| nvidia/nemotron-ocr-v2 | OCR-only specialist; not a general VLM |
| Qwen3.6 community quants/distills (NVFP4, MLX, AWQ, abliterated, REAP-pruned variants from RedHatAI, deepsweet, wangkezun, nightmedia, froggeric, etc.) | Derivative repackages of #1/#2 — useful only after we've validated the base model |
Notable absences (checked, not yet released)
- Qwen3.6-VL-8B / 4B / 2B — only 27B dense and 35B-A3B MoE are published as of 2026-04-30. Smaller VL variants are the obvious next drop and would be the highest-priority candidates when they appear.
- InternVL4 / InternVL3.5 — no new public releases this week.
- PaliGemma 3 / Florence-3 / SmolVLM 3 / MiniCPM-V-4 — no new public releases this week.
- Pixtral / Llama-4-Vision — no new public releases this week.
Recommended next actions (ranked)
- SFT Qwen3.6-27B on apparel-capture-8k → eval on 3,500 hard set (LoRA setup sketch after this list). Highest probability of beating 0.9131.
- Zero-shot Qwen3.6-35B-A3B on 100-sample → if competitive, run SFT+GRPO. Could match #1 at lower inference cost.
- Zero-shot Cosmos-Reason2-8B on 100-sample → cheap probe; same backbone as our current best.
- Re-eval stock granite-4.0-3b-vision on the 3,500-sample set to validate the suspicious 1.0144 score on our in-house granite4-vision-sft.
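For action #1, a minimal LoRA setup sketch is below, assuming a PEFT adapter over the attention projections as in typical Qwen-VL recipes. Target modules and hyperparameters are placeholders, and the hybrid Gated DeltaNet layers in Qwen3.6 may expose different module names; check against the recipe used for qwen3-vl-8b-sft+grpo.

```python
# Sketch: LoRA setup for the Qwen3.6-27B SFT run. Target modules and
# hyperparameters are placeholders, not the production recipe.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.6-27B", torch_dtype=torch.bfloat16, device_map="auto"
)
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention-only; placeholder
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # sanity check before launching the SFT run
```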
Report generated 2026-04-30. Search covered HF API across major VLM orgs (Qwen, Google, Microsoft, IBM, NVIDIA, Allen AI, OpenGVLab, HuggingFaceTB, OpenBMB, THUDM, Moonshot, DeepSeek, Apple, Meta, Mistral, Salesforce, Stepfun, Vikhyatk, Rhymes-AI) plus targeted name searches across 25+ VLM family keywords.