Daily Model Scout Report — 2026-04-30
Survey of new VLM releases on HuggingFace from 2026-04-23 → 2026-04-30 (last 7 days), evaluated for relevance to our 9-field garment attribute classification task. Compared against the current best models on the 3,500-sample hard eval set (weighted score; a scoring sketch follows the table):
| Current best | Weighted score |
|---|---|
| qwen3-vl-8b-sft+grpo | 0.9131 |
| qwen3-vl-2b-sft-grpo-v9 | 0.8948 |
| qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 |
| qwen35-2b-base | 0.8437 |
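For context on how candidates get ranked against these numbers, the sketch below shows the shape of the scoring: a field-weighted exact-match accuracy over the hard set. The field names and weights are illustrative placeholders, not the production configuration.

```python
# Minimal sketch of the weighted eval score. Field names and weights are
# illustrative placeholders, NOT the production configuration.
from typing import Dict, List

FIELD_WEIGHTS: Dict[str, float] = {
    "category": 1.0, "color": 1.0, "pattern": 1.0, "material": 1.0,
    "sleeve": 1.0, "neckline": 1.0, "closure": 1.0, "fit": 1.0, "length": 1.0,
}

def weighted_score(preds: List[Dict[str, str]], golds: List[Dict[str, str]]) -> float:
    """Field-weighted exact-match accuracy over the hard eval set."""
    total_weight = sum(FIELD_WEIGHTS.values()) * len(golds)
    earned = 0.0
    for pred, gold in zip(preds, golds):
        for field, weight in FIELD_WEIGHTS.items():
            if pred.get(field) == gold.get(field):
                earned += weight
    return earned / total_weight
```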
HIGH relevance — benchmark immediately
1. Qwen/Qwen3.6-27B
- Link: https://huggingface.co/Qwen/Qwen3.6-27B
- Architecture: Dense Qwen3.6-VL, 27B params, hybrid Gated DeltaNet + Gated Attention, vision encoder, 256K native context (1M with YaRN)
- License: Apache 2.0 — released 2026-04-21
- Why it may beat current best:
- First official open-weight Qwen3.6 release (direct successor to the Qwen3.5/Qwen3-VL families that produce all four of our top scores)
- Reports MMMU 82.9, RefCOCO avg 92.5, V* 94.7 — meaningfully above Qwen3-VL-8B baselines
- 27B dense fits on RTX PRO 6000 98GB at BF16 (~54GB) with room for SFT/GRPO
- Same Qwen-VL processor → minimal pipeline plumbing to swap in (load sketch after this item)
- Risk: ~3.4× larger than current production 8B → slower training and inference; quantization (NVFP4 / FP8) likely required for serving
- Action: SFT on our 7,672-row apparel-capture-8k → eval on 3,500 hard set
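To make the swap-in claim concrete, a minimal load-and-infer sketch follows, assuming the 27B checkpoint loads through the generic AutoProcessor / AutoModelForImageTextToText path like the existing Qwen-VL models; exact class names and content keys for the Qwen3.6 release should be verified against its model card.

```python
# Sketch: swap the 27B checkpoint into the existing Qwen-VL inference path.
# Assumes the repo loads via the generic Auto* classes; verify the chat-template
# content keys against the model card before wiring into the pipeline.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "Qwen/Qwen3.6-27B"  # from this report

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("garment.jpg")  # placeholder sample
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Return the 9 garment attributes as JSON."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```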
2. Qwen/Qwen3.6-35B-A3B
- Link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
- Architecture: MoE Qwen3.6-VL, 35B total / 3B active, 256 experts (8 routed + 1 shared), hybrid Mamba-style + attention, vision encoder
- License: Apache 2.0 — released 2026-04-15
- Why it may beat current best:
- 3B active means inference cost similar to our 2B class while drawing on 35B capacity
- Same Qwen3.6 vision stack as #1 — best-in-class vision benchmarks
- Excellent fit for the 98GB GPU (full BF16 ~70GB; NVFP4 ~22GB)
- Community has already shipped FP8 / NVFP4 / GPTQ-Int4 / MLX-VL variants in the past 7 days — vLLM serving path is unblocked
- Risk: MoE + LoRA SFT is fiddlier than dense; routing may interact poorly with our narrow JSON-output task
- Action: SFT-then-GRPO at small scale; if competitive with qwen3-vl-8b-sft+grpo on hard eval, scale up
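Given that quantized variants and a vLLM path already exist, the small-scale probe could start from offline vLLM inference. A sketch follows, with a hypothetical quant repo name standing in for whichever community variant validates first.

```python
# Sketch: zero-shot probe of a quantized Qwen3.6-35B-A3B via vLLM offline
# inference. The repo name is a hypothetical placeholder; substitute the
# community FP8/NVFP4 quant that actually validates.
import base64
import io

from PIL import Image
from vllm import LLM, SamplingParams

MODEL_ID = "some-org/Qwen3.6-35B-A3B-NVFP4"  # hypothetical placeholder

def to_data_uri(path: str) -> str:
    """Encode a local garment image as a data URI for the chat API."""
    buf = io.BytesIO()
    Image.open(path).convert("RGB").save(buf, format="JPEG")
    return "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()

llm = LLM(model=MODEL_ID, max_model_len=8192, trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=256)

for path in ["sample_001.jpg"]:  # extend to the 100-sample probe set
    messages = [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": to_data_uri(path)}},
            {"type": "text", "text": "Return the 9 garment attributes as JSON."},
        ],
    }]
    out = llm.chat(messages, params)
    print(path, out[0].outputs[0].text)
```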
MEDIUM relevance — worth watching / spot-test
3. nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning
- Link: https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 (also FP8 / NVFP4)
- Architecture: Mamba2-Transformer Hybrid MoE, 31B total / 3B active, CRADIO-v4-H vision encoder, Parakeet audio encoder
- License: NVIDIA Open Model Agreement (commercial OK) — released 2026-04-28
- Why interesting:
- Native JSON output + tool-calling + reasoning mode — could match our structured 9-field extraction task very directly (schema-validation sketch after this item)
- Reasoning lifts hard-sample accuracy on similar tasks (CharXiv +35%, OCRBenchV2 +18% over predecessor)
- NVFP4 fits in ~21GB — leaves room for bigger batches than our current 8B FP8 setup
- Risk: Heavier omni-modal pretrain may not transfer to a narrow vision-only task; non-Apache license; Mamba/MoE training recipe is less battle-tested in our pipeline
- Action: Zero-shot eval on 100-sample first to gauge baseline before committing to SFT
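Since the probe should check schema compliance as well as accuracy, the sketch below validates raw model output against the 9-field contract before scoring. The field names are hypothetical stand-ins for the real schema.

```python
# Sketch: validate that zero-shot outputs parse into the 9-field schema before
# scoring. Field names here are hypothetical stand-ins for the real schema.
import json

from pydantic import BaseModel, ValidationError

class GarmentAttributes(BaseModel):
    category: str
    color: str
    pattern: str
    material: str
    sleeve: str
    neckline: str
    closure: str
    fit: str
    length: str

def parse_prediction(raw: str) -> GarmentAttributes | None:
    """Return a validated record, or None if the model broke the JSON contract."""
    try:
        return GarmentAttributes.model_validate(json.loads(raw))
    except (json.JSONDecodeError, ValidationError):
        return None
```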
4. ibm-granite/granite-4.0-3b-vision
- Link: https://huggingface.co/ibm-granite/granite-4.0-3b-vision
- Architecture: Granite 4.0 Micro 3.5B + 0.5B LoRA + SigLIP2 vision encoder + Window Q-Former w/ 4× compression
- License: Apache 2.0 — refreshed 2026-04-30
- Why interesting:
- Specifically designed for structured extraction (chart→CSV, table→JSON, KVP extraction) — same shape as our task
- Only ~4B params → cheap SFT, fast inference
- This is the upstream of our existing granite4-vision-sft (which has the suspicious 1.0144 score that should be re-validated). Re-baselining against the official upstream will tell us whether the in-house variant truly outperforms or whether the eval is broken
- Risk: Skewed toward documents/charts; garment imagery may be out-of-distribution for the SigLIP2 encoder's fine-tuning
- Action: Re-run eval on stock Granite 4.0 3B Vision to validate our in-house granite4-vision-sft score
5. nvidia/Cosmos-Reason2-8B
- Link: https://huggingface.co/nvidia/Cosmos-Reason2-8B
- Architecture: Built on top of Qwen3-VL-8B-Instruct; ViT + dense LLM, 8.7B params
- License: NVIDIA Open Model License (Apache-2.0-derived, commercial OK) — refreshed 2026-04-30
- Why interesting:
- Same backbone as our current best (qwen3-vl-8b-sft+grpo @ 0.9131) → drop-in replacement starting point
- NVIDIA reports +1.75 / +3.82 / +21.5 / +27.3 pts over Qwen3-VL-8B on physical-AI categories — improvements likely come from better spatial/object reasoning, which could transfer to closure/sleeve/neckline fields where we still have headroom
- Risk: Optimized for video/embodied reasoning, so the spatial gains may not transfer to our text-attribute extraction. A 2B variant is also available
- Action: Quick zero-shot 100-sample probe; SFT only if delta vs Qwen3-VL-8B base is positive
6. google/gemma-4-E4B-it (and gemma-4-31B-it)
- Link: https://huggingface.co/google/gemma-4-E4B-it , https://huggingface.co/google/gemma-4-31B-it
- Architecture: PLE (per-layer embeddings), hybrid local/global attention, ~150M vision encoder; E4B = 4.5B effective / 8B total; 31B dense variant also available
- License: Apache 2.0 — refreshed 2026-04-28 (originally Mar 2026)
- Why interesting:
- E4B at 4.5B-effective could match or beat our 2B-class model with less inference cost than qwen3-vl-8b
- Native multilingual support (140+ languages) — useful if Nike ReLo expands to non-English brand text
- Edge-optimized variants (E2B at 2.3B effective) for future on-device deployment
- Risk: Less prior art on JSON-extraction fine-tuning vs Qwen-VL; may need more SFT data to stabilize structured output
- Action: Lower priority than Qwen3.6 path; revisit if Qwen3.6-VL-8B/4B doesn't ship in the next 1-2 weeks
LOW relevance
| Model | Reason |
|---|---|
| nvidia/Cosmos-Reason2-2B | Physical-AI specialization; 2B-class already covered by qwen3-vl-2b-sft-grpo-v9 (0.8948) |
| nvidia/nemotron-ocr-v2 | OCR-only specialist; not a general VLM |
| Qwen3.6 community quants/distills (NVFP4, MLX, AWQ, abliterated, REAP-pruned variants from RedHatAI, deepsweet, wangkezun, nightmedia, froggeric, etc.) | Derivative repackages of #1/#2 — useful only after we've validated the base model |
Notable absences (checked, not yet released)
- Qwen3.6-VL-8B / 4B / 2B — only 27B dense and 35B-A3B MoE are published as of 2026-04-30. Smaller VL variants are the obvious next drop and would be the highest-priority candidates when they appear.
- InternVL4 / InternVL3.5 — no new public releases this week.
- PaliGemma 3 / Florence-3 / SmolVLM 3 / MiniCPM-V-4 — no new public releases this week.
- Pixtral / Llama-4-Vision — no new public releases this week.
Recommended next actions (ranked)
- SFT Qwen3.6-27B on apparel-capture-8k → eval on 3,500 hard set (LoRA setup sketch after this list). Highest probability of beating 0.9131.
- Zero-shot Qwen3.6-35B-A3B on 100-sample → if competitive, run SFT+GRPO. Could match #1 at lower inference cost.
- Zero-shot Cosmos-Reason2-8B on 100-sample → cheap probe; same backbone as our current best.
- Re-eval stock granite-4.0-3b-vision on the 3,500-sample set to validate the suspicious 1.0144 score on our in-house granite4-vision-sft.
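For action #1, a minimal LoRA setup sketch is below, assuming a PEFT adapter over the attention projections as in typical Qwen-VL recipes. Target modules and hyperparameters are placeholders, and the hybrid Gated DeltaNet layers in Qwen3.6 may expose different module names; check against the recipe used for qwen3-vl-8b-sft+grpo.

```python
# Sketch: LoRA setup for the Qwen3.6-27B SFT run. Target modules and
# hyperparameters are placeholders, not the production recipe.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.6-27B", torch_dtype=torch.bfloat16, device_map="auto"
)
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention-only; placeholder
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # sanity check before launching the SFT run
```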
Report generated 2026-04-30. Search covered HF API across major VLM orgs (Qwen, Google, Microsoft, IBM, NVIDIA, Allen AI, OpenGVLab, HuggingFaceTB, OpenBMB, THUDM, Moonshot, DeepSeek, Apple, Meta, Mistral, Salesforce, Stepfun, Vikhyatk, Rhymes-AI) plus targeted name searches across 25+ VLM family keywords.