Daily Model Scout Report -- 2026-04-04
Current Best Models (Denali-AI Eval, 3,500-sample garment classification)
| Rank | Model | weighted_score | Notes |
|---|---|---|---|
| 1 | granite4-vision-sft | 1.0144 | Best overall (Granite 4, custom SFT) |
| 2 | qwen3-vl-8b-sft+grpo | 0.9131 | Best Qwen model |
| 3 | qwen3-vl-2b-sft-grpo-v9 | 0.8948 | Best small Qwen |
| 4 | qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 | Best quantized |
| 5 | qwen3-vl-8b-instruct-base | 0.8751 | Qwen3-VL 8B base |
| 6 | qwen35-2b-base | 0.8437 | Qwen3.5 2B base |
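As a quick sanity check, the leaderboard can be re-ranked in a few lines (scores copied verbatim from the table above; ranks should follow descending weighted_score):

```python
# Re-derive leaderboard ranks from the raw weighted_score values.
scores = {
    "granite4-vision-sft": 1.0144,
    "qwen3-vl-8b-sft+grpo": 0.9131,
    "qwen3-vl-2b-sft-grpo-v9": 0.8948,
    "qwen3-vl-8b-sft-grpo-nvfp4": 0.8945,
    "qwen3-vl-8b-instruct-base": 0.8751,
    "qwen35-2b-base": 0.8437,
}

ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, score) in enumerate(ranked, start=1):
    print(f"{rank}. {model}: {score:.4f}")
```

Note that by raw score the 2B SFT+GRPO model (0.8948) edges out the NVFP4-quantized 8B build (0.8945), so quantization costs the 8B model its margin over the small model on this eval.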
New Models Released (Mar 28 -- Apr 4, 2026)
1. Google Gemma 4 (Released Apr 2, 2026) -- HIGH RELEVANCE
Variants:
- Gemma 4 E2B (2.3B effective / 5.1B total) -- image+text+audio, 128K ctx
- Gemma 4 E4B (4.5B effective / 8B total) -- image+text+audio, 128K ctx
- Gemma 4 26B-A4B MoE (4B active / 26B total) -- image+text+video, 256K ctx
- Gemma 4 31B Dense -- image+text+video, 256K ctx
HuggingFace: google/gemma-4-E2B-it, google/gemma-4-E4B-it, google/gemma-4-26B-A4B-it, google/gemma-4-31B-it
Why it matters:
- Apache 2.0 license (fully open, unlike Qwen's custom license)
- ALL variants are natively multimodal (vision built into architecture, not bolted on)
- MMMU Pro: 76.9%, MATH-Vision: 85.6% (nearly 2x Gemma 3)
- E4B (8B total, 4.5B active) is an ideal candidate for our task -- MoE efficiency with strong vision
- 26B-A4B fits comfortably on our 98GB RTX PRO 6000 and could rival Qwen3-VL-8B
- Community reports it ties or beats Qwen 3.5 27B on vision tasks
- Variable image resolution with configurable token budgets (70-1120 tokens per image)
- Fully supported in TRL for fine-tuning (SFT, DPO, GRPO)
Recommendation: Evaluate Gemma-4-E4B-it and Gemma-4-26B-A4B-it as base models. The MoE architecture means the 26B model activates only 4B params per token -- fast inference, strong accuracy.
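To back up the "fits comfortably" claim, here is a weights-only VRAM estimate, assuming bf16 (2 bytes/param) and ignoring KV cache, activations, and framework overhead -- so these are lower bounds, not serving requirements:

```python
# Back-of-envelope VRAM estimate for model weights only, assuming bf16.
# KV cache, activations, and serving overhead are deliberately ignored.
BYTES_PER_PARAM_BF16 = 2

def weight_gb(total_params_billion: float) -> float:
    """Weights-only memory in GB for a model with the given total param count."""
    return total_params_billion * 1e9 * BYTES_PER_PARAM_BF16 / 1e9

for name, total in [("Gemma 4 E4B", 8), ("Gemma 4 26B-A4B", 26), ("Gemma 4 31B", 31)]:
    print(f"{name}: ~{weight_gb(total):.0f} GB bf16 weights")
```

One MoE caveat: all 26B parameters must be resident in VRAM even though only ~4B activate per token, so the MoE design reduces per-token compute, not memory footprint. At ~52 GB of bf16 weights, the 26B-A4B still leaves headroom on the 98GB card.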
2. IBM Granite 4.0 3B Vision (Released Mar 27, 2026) -- HIGH RELEVANCE
HuggingFace: ibm-granite/granite-4.0-3b-vision
Architecture: LoRA adapter (~0.5B) on Granite 4.0 Micro (3.5B dense LLM), Apache 2.0
Why it matters:
- We already have granite4-vision-sft at 1.0144 weighted_score -- THE BEST model in our entire eval!
- 85.5% exact-match accuracy on VAREX (structured form extraction), #3 among 2-4B models
- Purpose-built for structured JSON/HTML extraction from documents
- Very small footprint (3.5B params) -- can run multiple instances on our GPU
- The vLLM-served versions (granite4-vision-sft-vllm) scored 0.4286, suggesting a serving/prompt issue, NOT a model quality issue
Recommendation: HIGH PRIORITY -- Debug the vLLM serving issue for granite4-vision-sft. This model already dominates our eval.
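A score collapse from 1.0144 to 0.4286 with identical weights usually points at prompt formatting (chat template) or image preprocessing differences between serving stacks, not the model itself. One debugging path is to serve the merged checkpoint with the chat template pinned and diff a single response against the transformers baseline. This is a sketch, not a verified fix: the checkpoint path, template filename, and port are assumptions.

```shell
# Sketch of a serving repro -- paths and port are placeholders.
# Serve the merged SFT checkpoint (LoRA merged into the base), pinning the
# chat template so vLLM and transformers format prompts identically.
vllm serve /checkpoints/granite4-vision-sft-merged \
  --port 8000 \
  --trust-remote-code \
  --chat-template /checkpoints/granite4-vision-sft-merged/chat_template.jinja

# Send one eval prompt and diff the response against the transformers baseline.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/checkpoints/granite4-vision-sft-merged",
       "messages": [{"role": "user", "content": "Classify the garment in the attached image."}]}'
```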
3. Microsoft Phi-4-reasoning-vision-15B (Released Mar 4, 2026) -- MEDIUM RELEVANCE
HuggingFace: microsoft/Phi-4-reasoning-vision-15B
Architecture: 15B params, Phi-4-Reasoning backbone + SigLIP-2 vision encoder, mid-fusion, 16K context
Why it matters:
- Built specifically for visual reasoning with chain-of-thought
- Our Phi-4-multimodal-sft scored only 0.6513, but this is a fundamentally different model (Phi-4-Reasoning backbone, SigLIP-2 encoder), so that score is not a reliable predictor
- 15B fits on our 98GB GPU easily
- Could be strong on structured attribute extraction with reasoning
Recommendation: Worth evaluating as a base for SFT. The reasoning capabilities could help with harder fields like closure.
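One practical consequence of a reasoning model: it emits chain-of-thought before the answer, so the eval harness must pull the final JSON out of the transcript rather than `json.loads()` the whole response. A minimal sketch (the helper and the sample transcript are invented for illustration; it does not handle braces inside JSON string values):

```python
import json

def extract_last_json(text: str) -> dict:
    """Return the last balanced top-level {...} object in a model transcript.

    Reasoning models typically emit chain-of-thought before the final answer,
    so we scan for the last parseable JSON object instead of assuming the
    whole response is JSON.
    """
    depth, start, last = 0, None, None
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                try:
                    last = json.loads(text[start : i + 1])
                except json.JSONDecodeError:
                    pass
    if last is None:
        raise ValueError("no JSON object found in transcript")
    return last

# Invented transcript for illustration:
reply = (
    "The zipper runs the full length of the front, so the closure is a zip.\n"
    '{"category": "jacket", "closure": "zipper"}'
)
print(extract_last_json(reply))
```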
4. GLM-5V-Turbo by Z.ai/Zhipu (Released Apr 1, 2026) -- LOW RELEVANCE
HuggingFace: zai-org/GLM-5 (base model only, MIT license)
Architecture: 744B total / 40B active MoE, CogViT vision encoder, native multimodal
Recommendation: Monitor but do not prioritize. Too large for efficient garment classification.
5. Moondream 3 (Preview, ongoing 2026) -- MEDIUM RELEVANCE
HuggingFace: moondream/moondream3-preview
Architecture: 9B total / 2B active MoE, 32K context, SuperBPE tokenizer
Why it matters:
- Our moondream2-base scored 0.6979
- Moondream 3 is a major upgrade with MoE (9B total but only 2B active)
- Extremely efficient -- 2B active params means very fast inference
Recommendation: Re-evaluate once Moondream 3 exits preview.
6. NVIDIA Llama Nemotron Nano VL 8B (Earlier 2026) -- MEDIUM RELEVANCE
HuggingFace: nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1
Architecture: 8B, Llama-3.1-8B-Instruct + C-RADIOv2-VLM-H vision encoder
Why it matters:
- SOTA on OCRBench v2
- FP4 quantized version available
- Optimized for structured extraction from documents
Recommendation: Worth a base eval. OCR strength could help with brand and size fields.
Summary and Priority Actions
| Priority | Action | Expected Impact |
|---|---|---|
| P0 | Debug granite4-vision-sft vLLM serving | Unlock our BEST model (1.0144) for production |
| P1 | Evaluate Gemma 4 E4B-it and 26B-A4B-it as base | New architecture, Apache 2.0, strong vision benchmarks |
| P1 | Evaluate Gemma 4 E4B-it with SFT pipeline | MoE efficiency could match Qwen3-VL-8B at lower compute |
| P2 | Evaluate Phi-4-reasoning-vision-15B as base | Reasoning-focused model may help on harder fields |
| P2 | Evaluate Llama Nemotron Nano VL 8B as base | OCR strength for brand/size extraction |
| P3 | Monitor Moondream 3 for final release | Efficient 2B-active MoE for high-throughput inference |
Key Takeaway: The biggest news this week is Gemma 4 (April 2) and the confirmation that granite4-vision-sft already scores 1.0144 under transformers while its vLLM-served variant collapses to 0.4286. Fixing that serving path is the single highest-ROI action available right now.
Generated by HF Model Scout -- 2026-04-04