Daily Model Scout Report -- 2026-04-04

Discussion #5 · by msudharsanan (Denali Advanced Integration org)

Current Best Models (Denali-AI Eval, 3,500-sample garment classification)

| Rank | Model | weighted_score | Notes |
|------|-------|----------------|-------|
| 1 | granite4-vision-sft | 1.0144 | Best overall (Granite 4, custom SFT) |
| 2 | qwen3-vl-8b-sft+grpo | 0.9131 | Best Qwen model |
| 3 | qwen3-vl-2b-sft-grpo-v9 | 0.8948 | Best small Qwen |
| 4 | qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 | Best quantized |
| 5 | qwen3-vl-8b-instruct-base | 0.8751 | Qwen3-VL 8B base |
| 6 | qwen35-2b-base | 0.8437 | Qwen3.5 2B base |

(Ranks 3 and 4 are reordered from the raw dump so the table is sorted by weighted_score.)
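For readers outside the eval team: weighted_score aggregates per-field accuracies over the 3,500 samples. The exact field weights are internal; the sketch below is a hypothetical illustration of the aggregation shape only (field names and weights are invented). Note the weights are evidently not normalized to sum to 1, which is consistent with scores above 1.0 in the table.

```python
# Hypothetical sketch of a weighted_score aggregation over garment fields.
# Field names and weights are illustrative; the real eval config differs.
def weighted_score(field_accuracy: dict[str, float],
                   weights: dict[str, float]) -> float:
    """Weighted sum of per-field accuracies (weights need not sum to 1)."""
    return sum(w * field_accuracy.get(field, 0.0)
               for field, w in weights.items())

accs = {"category": 0.95, "color": 0.90, "closure": 0.70}
weights = {"category": 0.40, "color": 0.35, "closure": 0.30}
score = weighted_score(accs, weights)  # 0.4*0.95 + 0.35*0.90 + 0.3*0.70 = 0.905
```

Because the weights can sum above 1, a model that is strong on every field can exceed 1.0 overall, as granite4-vision-sft does.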

New Models Released (Mar 28 -- Apr 4, 2026)

1. Google Gemma 4 (Released Apr 2, 2026) -- HIGH RELEVANCE

Variants:

  • Gemma 4 E2B (2.3B effective / 5.1B total) -- image+text+audio, 128K ctx
  • Gemma 4 E4B (4.5B effective / 8B total) -- image+text+audio, 128K ctx
  • Gemma 4 26B-A4B MoE (4B active / 26B total) -- image+text+video, 256K ctx
  • Gemma 4 31B Dense -- image+text+video, 256K ctx

HuggingFace: google/gemma-4-E2B-it, google/gemma-4-E4B-it, google/gemma-4-26B-A4B-it, google/gemma-4-31B-it

Why it matters:

  • Apache 2.0 license (fully open, unlike Qwen's custom license)
  • ALL variants are natively multimodal (vision built into architecture, not bolted on)
  • MMMU Pro: 76.9%, MATH-Vision: 85.6% (nearly 2x Gemma 3)
  • E4B (8B total, 4.5B active) is an ideal candidate for our task -- MoE efficiency with strong vision
  • 26B-A4B fits comfortably on our 98GB RTX PRO 6000 and could rival Qwen3-VL-8B
  • Community reports it ties or beats Qwen 3.5 27B on vision tasks
  • Variable image resolution with configurable token budgets (70-1120 tokens per image)
  • Fully supported in TRL for fine-tuning (SFT, DPO, GRPO)

Recommendation: Evaluate Gemma-4-E4B-it and Gemma-4-26B-A4B-it as base models. The MoE architecture means the 26B model activates only 4B params per token -- fast inference, strong accuracy.
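The MoE claim above can be sanity-checked with back-of-envelope arithmetic: a decoder forward pass costs roughly 2 FLOPs per active parameter per generated token, so the 26B-A4B model should sit near a 4B dense model in per-token compute. This is a rough sketch only; real throughput also depends on memory bandwidth, router overhead, and KV-cache size.

```python
# Rough per-token inference cost: ~2 FLOPs per active parameter.
# Ignores attention-length terms, router cost, and memory effects.
def per_token_flops(active_params: float) -> float:
    return 2.0 * active_params

dense_31b = per_token_flops(31e9)  # Gemma 4 31B dense: all params active
moe_26b = per_token_flops(4e9)     # Gemma 4 26B-A4B: 4B active per token
speedup = dense_31b / moe_26b      # ~7.75x fewer FLOPs per token
```

By this estimate the 26B-A4B variant decodes with roughly the compute of a 4B dense model while retaining 26B parameters of capacity, which is why it is worth evaluating despite its total size.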


2. IBM Granite 4.0 3B Vision (Released Mar 27, 2026) -- HIGH RELEVANCE

HuggingFace: ibm-granite/granite-4.0-3b-vision

Architecture: LoRA adapter (~0.5B) on Granite 4.0 Micro (3.5B dense LLM), Apache 2.0

Why it matters:

  • We already have granite4-vision-sft at 1.0144 weighted_score -- THE BEST model in our entire eval!
  • 85.5% exact-match accuracy on VAREX (structured form extraction), #3 among 2-4B models
  • Purpose-built for structured JSON/HTML extraction from documents
  • Very small footprint (3.5B params) -- can run multiple instances on our GPU
  • The vLLM-served versions (granite4-vision-sft-vllm) scored 0.4286, suggesting a serving/prompt issue, NOT a model quality issue

Recommendation: HIGH PRIORITY -- Debug the vLLM serving issue for granite4-vision-sft. This model already dominates our eval.
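A cheap first step for the debugging task is to run the same small sample batch through both serving paths and diff the outputs per field before touching the model. A minimal sketch (the prediction-dict format and field names here are hypothetical, not our actual harness):

```python
from collections import Counter

# Count, per field, how often two serving paths disagree on the same samples.
# Prediction format (one dict of field -> value per sample) is hypothetical.
def field_disagreements(preds_a: list[dict], preds_b: list[dict]) -> Counter:
    diffs = Counter()
    for a, b in zip(preds_a, preds_b):
        for field in a.keys() | b.keys():
            if a.get(field) != b.get(field):
                diffs[field] += 1
    return diffs

direct = [{"category": "dress", "closure": "zip"},
          {"category": "shirt", "closure": "button"}]
vllm = [{"category": "dress", "closure": None},
        {"category": "shirt", "closure": None}]
print(field_disagreements(direct, vllm))  # Counter({'closure': 2})
```

If disagreements cluster in specific fields, or the vLLM path returns malformed/empty values uniformly, that points at a chat-template, stop-token, or output-parsing mismatch in the serving layer rather than the weights, which is what the 1.0144 vs 0.4286 gap already suggests.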


3. Microsoft Phi-4-reasoning-vision-15B (Released Mar 4, 2026) -- MEDIUM RELEVANCE

HuggingFace: microsoft/Phi-4-reasoning-vision-15B

Architecture: 15B params, Phi-4-Reasoning backbone + SigLIP-2 vision encoder, mid-fusion, 16K context

Why it matters:

  • Built specifically for visual reasoning with chain-of-thought
  • Our Phi-4-multimodal-sft scored only 0.6513, but this is a fundamentally different (and much better) model
  • 15B fits on our 98GB GPU easily
  • Could be strong on structured attribute extraction with reasoning

Recommendation: Worth evaluating as a base for SFT. The reasoning capabilities could help with harder fields like closure.


4. GLM-5V-Turbo by Z.ai/Zhipu (Released Apr 1, 2026) -- LOW RELEVANCE

HuggingFace: zai-org/GLM-5 (base model only, MIT license)

Architecture: 744B total / 40B active MoE, CogViT vision encoder, native multimodal

Recommendation: Monitor but do not prioritize. Too large for efficient garment classification.


5. Moondream 3 (Preview, ongoing 2026) -- MEDIUM RELEVANCE

HuggingFace: moondream/moondream3-preview

Architecture: 9B total / 2B active MoE, 32K context, SuperBPE tokenizer

Why it matters:

  • Our moondream2-base scored 0.6979
  • Moondream 3 is a major upgrade with MoE (9B total but only 2B active)
  • Extremely efficient -- 2B active params means very fast inference

Recommendation: Re-evaluate once Moondream 3 exits preview.


6. NVIDIA Llama Nemotron Nano VL 8B (Earlier 2026) -- MEDIUM RELEVANCE

HuggingFace: nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1

Architecture: 8B, Llama-3.1-8B-Instruct + C-RADIOv2-VLM-H vision encoder

Why it matters:

  • SOTA on OCRBench v2
  • FP4 quantized version available
  • Optimized for structured extraction from documents

Recommendation: Worth a base eval. OCR strength could help with brand and size fields.


Summary and Priority Actions

| Priority | Action | Expected Impact |
|----------|--------|-----------------|
| P0 | Debug granite4-vision-sft vLLM serving | Unlock our BEST model (1.0144) for production |
| P1 | Evaluate Gemma 4 E4B-it and 26B-A4B-it as base | New architecture, Apache 2.0, strong vision benchmarks |
| P1 | Evaluate Gemma 4 E4B-it with SFT pipeline | MoE efficiency could match Qwen3-VL-8B at lower compute |
| P2 | Evaluate Phi-4-reasoning-vision-15B as base | Reasoning-focused model may help on harder fields |
| P2 | Evaluate Llama Nemotron Nano VL 8B as base | OCR strength for brand/size extraction |
| P3 | Monitor Moondream 3 for final release | Efficient 2B-active MoE for high-throughput inference |

Key Takeaway: The biggest news this week is Gemma 4 (April 2), plus the realization that our Granite 4 Vision SFT model already scores 1.0144 but is crippled by a vLLM serving bug. Fixing that serving issue is the single highest-ROI action available right now.


Generated by HF Model Scout -- 2026-04-04