Daily Model Scout Report — 2026-04-30

by msudharsanan (Denali Advanced Integration org)

Survey of new VLM releases on HuggingFace from 2026-04-23 → 2026-04-30 (last 7 days), evaluated for relevance to our 9-field garment attribute classification task. Compared against current best models on the 3,500-sample hard eval set (weighted score):

Current best                    Weighted score
qwen3-vl-8b-sft+grpo            0.9131
qwen3-vl-2b-sft-grpo-v9         0.8948
qwen3-vl-8b-sft-grpo-nvfp4      0.8945
qwen35-2b-base                  0.8437
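For context on how candidates are ranked, a minimal sketch of a weighted score over the 9 garment attribute fields. The field names and weights below are invented for illustration; the real ones live in the eval harness.

```python
# Hypothetical sketch of the weighted score: per-field accuracy over the
# 9 garment attributes, combined with (assumed) per-field weights.

FIELDS = ["category", "sleeve", "neckline", "closure", "fit",
          "pattern", "material", "color", "gender"]  # assumed field names

def weighted_score(preds, labels, weights):
    """preds/labels: lists of dicts keyed by field; weights: field -> float."""
    total_w = sum(weights[f] for f in FIELDS)
    score = 0.0
    for f in FIELDS:
        correct = sum(p[f] == l[f] for p, l in zip(preds, labels))
        score += weights[f] * (correct / len(labels))
    return score / total_w  # normalized, so the result always lands in [0, 1]
```

Because the score is normalized by the weight sum, any reading above 1.0 (see the granite note below) indicates a harness bug rather than a strong model.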

HIGH relevance — benchmark immediately

1. Qwen/Qwen3.6-27B

  • Link: https://huggingface.co/Qwen/Qwen3.6-27B
  • Architecture: Dense Qwen3.6-VL, 27B params, hybrid Gated DeltaNet + Gated Attention, vision encoder, 256K native context (1M with YaRN)
  • License: Apache 2.0 — released 2026-04-21
  • Why it may beat current best:
    • First official open-weight Qwen3.6 release (direct successor to the Qwen3.5/Qwen3-VL families that produce all four of our top scores)
    • Reports MMMU 82.9, RefCOCO avg 92.5, V* 94.7 — meaningfully above Qwen3-VL-8B baselines
    • 27B dense fits on RTX PRO 6000 98GB at BF16 (~54GB) with room for SFT/GRPO
    • Same Qwen-VL processor → minimal pipeline plumbing to swap in
  • Risk: ~3.4× larger than the current production 8B → slower training and inference; quantization (NVFP4 / FP8) likely required for serving
  • Action: SFT on our 7,672-row apparel-capture-8k → eval on 3,500 hard set

2. Qwen/Qwen3.6-35B-A3B

  • Link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
  • Architecture: MoE Qwen3.6-VL, 35B total / 3B active, 256 experts (8 routed + 1 shared), hybrid Mamba-style + attention, vision encoder
  • License: Apache 2.0 — released 2026-04-15
  • Why it may beat current best:
    • 3B active means inference cost similar to our 2B class while drawing on 35B capacity
    • Same Qwen3.6 vision stack as #1 — best-in-class vision benchmarks
    • Excellent fit for the 98GB GPU (full BF16 ~70GB; NVFP4 ~22GB)
    • Community has already shipped FP8 / NVFP4 / GPTQ-Int4 / MLX-VL variants in the past 7 days — vLLM serving path is unblocked
  • Risk: MoE + LoRA SFT is fiddlier than dense; routing may interact poorly with our narrow JSON-output task
  • Action: SFT-then-GRPO at small scale; if competitive with qwen3-vl-8b-sft+grpo on hard eval, scale up
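The memory figures quoted for #1 and #2 follow from bytes-per-parameter arithmetic; a quick back-of-envelope helper (weights only — KV cache, activations, and optimizer state for SFT come on top, which is why the report's ~22GB NVFP4 figure sits above the raw weight size):

```python
# Weight-memory estimate behind "27B at BF16 ~54GB" and "35B NVFP4 ~22GB".
# Weights only; runtime overhead (KV cache, activations) is not included.

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}

def weight_gb(params_b: float, dtype: str) -> float:
    """params_b: parameter count in billions -> weight footprint in GB."""
    return params_b * BYTES_PER_PARAM[dtype]

print(weight_gb(27, "bf16"))   # 54.0 -> fits the 98GB card with SFT headroom
print(weight_gb(35, "nvfp4"))  # 17.5 -> the quoted ~22GB includes overhead
```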

MEDIUM relevance — worth watching / spot-test

3. nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning

  • Link: https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 (also FP8 / NVFP4)
  • Architecture: Mamba2-Transformer Hybrid MoE, 31B total / 3B active, CRADIO-v4-H vision encoder, Parakeet audio encoder
  • License: NVIDIA Open Model Agreement (commercial OK) — released 2026-04-28
  • Why interesting:
    • Native JSON output + tool-calling + reasoning mode — could match our structured 9-field extraction task very directly
    • Reasoning lifts hard-sample accuracy on similar tasks (Charxiv +35%, OCRBenchV2 +18% over predecessor)
    • NVFP4 fits in ~21GB — leaves room for bigger batches than our current 8B FP8 setup
  • Risk: Heavier omni-modal pretrain may not transfer to a narrow vision-only task; non-Apache license; Mamba/MoE training recipe is less battle-tested in our pipeline
  • Action: Zero-shot eval on 100-sample first to gauge baseline before committing to SFT
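The 100-sample probe recommended above can be model-agnostic; a sketch, where `predict` is a stand-in for whatever generate-then-decode call the candidate exposes (the only assumption is that it returns a JSON string with the 9 fields):

```python
import json

# Hypothetical zero-shot probe harness: per-field agreement plus a JSON
# parse rate, so "can't emit valid JSON" is visible separately from
# "emits JSON but gets fields wrong".

def zero_shot_probe(predict, samples, fields):
    """samples: list of (image, label_dict); returns (field_acc, parse_rate)."""
    parsed, hits = 0, {f: 0 for f in fields}
    for image, label in samples:
        try:
            out = json.loads(predict(image))
        except (json.JSONDecodeError, TypeError):
            continue  # malformed output counts against parse_rate only
        parsed += 1
        for f in fields:
            hits[f] += int(out.get(f) == label.get(f))
    n = max(parsed, 1)
    return {f: hits[f] / n for f in fields}, parsed / len(samples)
```

Reporting accuracy over parsed samples only is a deliberate choice here: a low parse rate with high field accuracy suggests a prompting fix, not a capability gap.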

4. ibm-granite/granite-4.0-3b-vision

  • Link: https://huggingface.co/ibm-granite/granite-4.0-3b-vision
  • Architecture: Granite 4.0 Micro 3.5B + 0.5B LoRA + SigLIP2 vision encoder + Window Q-Former w/ 4× compression
  • License: Apache 2.0 — refreshed 2026-04-30
  • Why interesting:
    • Specifically designed for structured extraction (chart→CSV, table→JSON, KVP extraction) — same shape as our task
    • Only ~4B params → cheap SFT, fast inference
    • This is the upstream of our existing granite4-vision-sft, whose suspicious 1.0144 score still needs re-validation. Re-baselining against the official upstream will tell us whether the in-house variant truly outperforms or whether the eval is broken
  • Risk: Skewed toward documents/charts; garment imagery may be out-of-distribution for the SigLIP2 encoder's fine-tuning
  • Action: Re-run eval on stock Granite 4.0 3B Vision to validate our in-house granite4-vision-sft score
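A normalized weighted accuracy can never exceed 1.0, so the 1.0144 reading can be caught mechanically rather than by eyeball. A hypothetical guard for the eval harness's reporting step:

```python
# Hypothetical range guard: any weighted score outside [0, 1] points at a
# harness bug (unnormalized weights, duplicated eval rows) rather than a
# strong model.

def validate_score(name: str, score: float) -> float:
    if not (0.0 <= score <= 1.0):
        raise ValueError(
            f"{name}: weighted score {score} outside [0, 1]; "
            "check weight normalization and dedup of eval rows")
    return score

validate_score("qwen3-vl-8b-sft+grpo", 0.9131)   # passes
# validate_score("granite4-vision-sft", 1.0144)  # would raise ValueError
```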

5. nvidia/Cosmos-Reason2-8B

  • Link: https://huggingface.co/nvidia/Cosmos-Reason2-8B
  • Architecture: Built on top of Qwen3-VL-8B-Instruct; ViT + dense LLM, 8.7B params
  • License: NVIDIA Open Model License (Apache-2.0-derived, commercial OK) — refreshed 2026-04-30
  • Why interesting:
    • Same backbone as our current best (qwen3-vl-8b-sft+grpo @ 0.9131) → drop-in replacement starting point
    • NVIDIA reports +1.75 / +3.82 / +21.5 / +27.3 pts over Qwen3-VL-8B on physical-AI categories — improvements likely come from better spatial/object reasoning, which could transfer to closure/sleeve/neckline fields where we still have headroom
  • Risk: Optimized for video/embodied reasoning, so the spatial gains may not lift our text-attribute extraction. A 2B variant is also available
  • Action: Quick zero-shot 100-sample probe; SFT only if delta vs Qwen3-VL-8B base is positive

6. google/gemma-4-E4B-it (and gemma-4-31B-it)

  • Link: https://huggingface.co/google/gemma-4-E4B-it , https://huggingface.co/google/gemma-4-31B-it
  • Architecture: PLE (per-layer embeddings), hybrid local/global attention, ~150M vision encoder; E4B = 4.5B effective / 8B total; 31B dense variant also available
  • License: Apache 2.0 — refreshed 2026-04-28 (originally Mar 2026)
  • Why interesting:
    • E4B at 4.5B-effective could match or beat our 2B-class model with less inference cost than qwen3-vl-8b
    • Native multilingual (140+) — useful if Nike ReLo expands to non-English brand text
    • Edge-optimized variants (E2B at 2.3B effective) for future on-device deployment
  • Risk: Less prior art on JSON-extraction fine-tuning vs Qwen-VL; may need more SFT data to stabilize structured output
  • Action: Lower priority than Qwen3.6 path; revisit if Qwen3.6-VL-8B/4B doesn't ship in the next 1-2 weeks

LOW relevance

  • nvidia/Cosmos-Reason2-2B: Physical-AI specialization; 2B-class already covered by qwen3-vl-2b-sft-grpo-v9 (0.8948)
  • nvidia/nemotron-ocr-v2: OCR-only specialist; not a general VLM
  • Qwen3.6 community quants/distills (NVFP4, MLX, AWQ, abliterated, REAP-pruned variants from RedHatAI, deepsweet, wangkezun, nightmedia, froggeric, etc.): derivative repackages of #1/#2; useful only after we've validated the base model

Notable absences (checked, not yet released)

  • Qwen3.6-VL-8B / 4B / 2B — only 27B dense and 35B-A3B MoE are published as of 2026-04-30. Smaller VL variants are the obvious next drop and would be the highest-priority candidates when they appear.
  • InternVL4 / InternVL3.5 — no new public releases this week.
  • PaliGemma 3 / Florence-3 / SmolVLM 3 / MiniCPM-V-4 — no new public releases this week.
  • Pixtral / Llama-4-Vision — no new public releases this week.

Recommended next actions (ranked)

  1. SFT Qwen3.6-27B on apparel-capture-8k → eval on 3,500 hard set. Highest probability of beating 0.9131.
  2. Zero-shot Qwen3.6-35B-A3B on 100-sample → if competitive, run SFT+GRPO. Could match #1 at lower inference cost.
  3. Zero-shot Cosmos-Reason2-8B on 100-sample → cheap probe; same backbone as our current best.
  4. Re-eval stock granite-4.0-3b-vision on the 3,500-sample set to validate the suspicious 1.0144 score on our in-house granite4-vision-sft.

Report generated 2026-04-30. Search covered HF API across major VLM orgs (Qwen, Google, Microsoft, IBM, NVIDIA, Allen AI, OpenGVLab, HuggingFaceTB, OpenBMB, THUDM, Moonshot, DeepSeek, Apple, Meta, Mistral, Salesforce, Stepfun, Vikhyatk, Rhymes-AI) plus targeted name searches across 25+ VLM family keywords.
