Daily Model Scout Report -- 2026-04-04
Current Best Models (Denali-AI Eval, 3,500-sample garment classification)
| Rank | Model | weighted_score | Notes |
|---|---|---|---|
| 1 | granite4-vision-sft | 1.0144 | Best overall (Granite 4, custom SFT) |
| 2 | qwen3-vl-8b-sft+grpo | 0.9131 | Best Qwen model |
| 3 | qwen3-vl-2b-sft-grpo-v9 | 0.8948 | Best small Qwen |
| 4 | qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 | Best quantized |
| 5 | qwen3-vl-8b-instruct-base | 0.8751 | Qwen3-VL 8B base |
| 6 | qwen35-2b-base | 0.8437 | Qwen3.5 2B base |
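As a quick sanity check, the leaderboard can be re-ranked in a few lines (scores copied verbatim from the table above; ranks should follow descending weighted_score):

```python
# Re-derive leaderboard ranks from the raw weighted_score values.
scores = {
    "granite4-vision-sft": 1.0144,
    "qwen3-vl-8b-sft+grpo": 0.9131,
    "qwen3-vl-2b-sft-grpo-v9": 0.8948,
    "qwen3-vl-8b-sft-grpo-nvfp4": 0.8945,
    "qwen3-vl-8b-instruct-base": 0.8751,
    "qwen35-2b-base": 0.8437,
}

ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, score) in enumerate(ranked, start=1):
    print(f"{rank}. {model}: {score:.4f}")
```

Note that by raw score the 2B SFT+GRPO model (0.8948) edges out the NVFP4-quantized 8B build (0.8945), so quantization costs the 8B model its margin over the small model on this eval.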
New Models Released (Mar 28 -- Apr 4, 2026)
1. Google Gemma 4 (Released Apr 2, 2026) -- HIGH RELEVANCE
Variants:
- Gemma 4 E2B (2.3B effective / 5.1B total) -- image+text+audio, 128K ctx
- Gemma 4 E4B (4.5B effective / 8B total) -- image+text+audio, 128K ctx
- Gemma 4 26B-A4B MoE (4B active / 26B total) -- image+text+video, 256K ctx
- Gemma 4 31B Dense -- image+text+video, 256K ctx
HuggingFace: google/gemma-4-E2B-it, google/gemma-4-E4B-it, google/gemma-4-26B-A4B-it, google/gemma-4-31B-it
Why it matters:
- Apache 2.0 license (fully open, unlike Qwen's custom license)
- ALL variants are natively multimodal (vision built into architecture, not bolted on)
- MMMU Pro: 76.9%, MATH-Vision: 85.6% (nearly 2x Gemma 3)
- E4B (8B total, 4.5B active) is an ideal candidate for our task -- MoE efficiency with strong vision
- 26B-A4B fits comfortably on our 98GB RTX PRO 6000 and could rival Qwen3-VL-8B
- Community reports it ties or beats Qwen 3.5 27B on vision tasks
- Variable image resolution with configurable token budgets (70-1120 tokens per image)
- Fully supported in TRL for fine-tuning (SFT, DPO, GRPO)
Recommendation: Evaluate Gemma-4-E4B-it and Gemma-4-26B-A4B-it as base models. The MoE architecture means the 26B model activates only 4B params per token -- fast inference, strong accuracy.
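To back up the "fits comfortably" claim, here is a weights-only VRAM estimate, assuming bf16 (2 bytes/param) and ignoring KV cache, activations, and framework overhead -- so these are lower bounds, not serving requirements:

```python
# Back-of-envelope VRAM estimate for model weights only, assuming bf16.
# KV cache, activations, and serving overhead are deliberately ignored.
BYTES_PER_PARAM_BF16 = 2

def weight_gb(total_params_billion: float) -> float:
    """Weights-only memory in GB for a model with the given total param count."""
    return total_params_billion * 1e9 * BYTES_PER_PARAM_BF16 / 1e9

for name, total in [("Gemma 4 E4B", 8), ("Gemma 4 26B-A4B", 26), ("Gemma 4 31B", 31)]:
    print(f"{name}: ~{weight_gb(total):.0f} GB bf16 weights")
```

One MoE caveat: all 26B parameters must be resident in VRAM even though only ~4B activate per token, so the MoE design reduces per-token compute, not memory footprint. At ~52 GB of bf16 weights, the 26B-A4B still leaves headroom on the 98GB card.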
2. IBM Granite 4.0 3B Vision (Released Mar 27, 2026) -- HIGH RELEVANCE
HuggingFace: ibm-granite/granite-4.0-3b-vision
Architecture: LoRA adapter (~0.5B) on Granite 4.0 Micro (3.5B dense LLM), Apache 2.0
Why it matters:
- We already have granite4-vision-sft at 1.0144 weighted_score -- THE BEST model in our entire eval!
- 85.5% exact-match accuracy on VAREX (structured form extraction), #3 among 2-4B models
- Purpose-built for structured JSON/HTML extraction from documents
- Very small footprint (3.5B params) -- can run multiple instances on our GPU
- The vLLM-served versions (granite4-vision-sft-vllm) scored 0.4286, suggesting a serving/prompt issue, NOT a model quality issue
Recommendation: HIGH PRIORITY -- Debug the vLLM serving issue for granite4-vision-sft. This model already dominates our eval.
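A score collapse from 1.0144 to 0.4286 with identical weights usually points at prompt formatting (chat template) or image preprocessing differences between serving stacks, not the model itself. One debugging path is to serve the merged checkpoint with the chat template pinned and diff a single response against the transformers baseline. This is a sketch, not a verified fix: the checkpoint path, template filename, and port are assumptions.

```shell
# Sketch of a serving repro -- paths and port are placeholders.
# Serve the merged SFT checkpoint (LoRA merged into the base), pinning the
# chat template so vLLM and transformers format prompts identically.
vllm serve /checkpoints/granite4-vision-sft-merged \
  --port 8000 \
  --trust-remote-code \
  --chat-template /checkpoints/granite4-vision-sft-merged/chat_template.jinja

# Send one eval prompt and diff the response against the transformers baseline.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/checkpoints/granite4-vision-sft-merged",
       "messages": [{"role": "user", "content": "Classify the garment in the attached image."}]}'
```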
3. Microsoft Phi-4-reasoning-vision-15B (Released Mar 4, 2026) -- MEDIUM RELEVANCE
HuggingFace: microsoft/Phi-4-reasoning-vision-15B
Architecture: 15B params, Phi-4-Reasoning backbone + SigLIP-2 vision encoder, mid-fusion, 16K context
Why it matters:
- Built specifically for visual reasoning with chain-of-thought
- Our Phi-4-multimodal-sft scored only 0.6513, but this is a fundamentally different model (Phi-4-Reasoning backbone, SigLIP-2 encoder), so that score is not a reliable predictor
- 15B fits on our 98GB GPU easily
- Could be strong on structured attribute extraction with reasoning
Recommendation: Worth evaluating as a base for SFT. The reasoning capabilities could help with harder fields like closure.
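One practical consequence of a reasoning model: it emits chain-of-thought before the answer, so the eval harness must pull the final JSON out of the transcript rather than `json.loads()` the whole response. A minimal sketch (the helper and the sample transcript are invented for illustration; it does not handle braces inside JSON string values):

```python
import json

def extract_last_json(text: str) -> dict:
    """Return the last balanced top-level {...} object in a model transcript.

    Reasoning models typically emit chain-of-thought before the final answer,
    so we scan for the last parseable JSON object instead of assuming the
    whole response is JSON.
    """
    depth, start, last = 0, None, None
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                try:
                    last = json.loads(text[start : i + 1])
                except json.JSONDecodeError:
                    pass
    if last is None:
        raise ValueError("no JSON object found in transcript")
    return last

# Invented transcript for illustration:
reply = (
    "The zipper runs the full length of the front, so the closure is a zip.\n"
    '{"category": "jacket", "closure": "zipper"}'
)
print(extract_last_json(reply))
```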
4. GLM-5V-Turbo by Z.ai/Zhipu (Released Apr 1, 2026) -- LOW RELEVANCE
HuggingFace: zai-org/GLM-5 (base model only, MIT license)
Architecture: 744B total / 40B active MoE, CogViT vision encoder, native multimodal
Recommendation: Monitor but do not prioritize. Too large for efficient garment classification.
5. Moondream 3 (Preview, ongoing 2026) -- MEDIUM RELEVANCE
HuggingFace: moondream/moondream3-preview
Architecture: 9B total / 2B active MoE, 32K context, SuperBPE tokenizer
Why it matters:
- Our moondream2-base scored 0.6979
- Moondream 3 is a major upgrade with MoE (9B total but only 2B active)
- Extremely efficient -- 2B active params means very fast inference
Recommendation: Re-evaluate once Moondream 3 exits preview.
6. NVIDIA Llama Nemotron Nano VL 8B (Earlier 2026) -- MEDIUM RELEVANCE
HuggingFace: nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1
Architecture: 8B, Llama-3.1-8B-Instruct + C-RADIOv2-VLM-H vision encoder
Why it matters:
- SOTA on OCRBench v2
- FP4 quantized version available
- Optimized for structured extraction from documents
Recommendation: Worth a base eval. OCR strength could help with brand and size fields.
Summary and Priority Actions
| Priority | Action | Expected Impact |
|---|---|---|
| P0 | Debug granite4-vision-sft vLLM serving | Unlock our BEST model (1.0144) for production |
| P1 | Evaluate Gemma 4 E4B-it and 26B-A4B-it as base | New architecture, Apache 2.0, strong vision benchmarks |
| P1 | Evaluate Gemma 4 E4B-it with SFT pipeline | MoE efficiency could match Qwen3-VL-8B at lower compute |
| P2 | Evaluate Phi-4-reasoning-vision-15B as base | Reasoning-focused model may help on harder fields |
| P2 | Evaluate Llama Nemotron Nano VL 8B as base | OCR strength for brand/size extraction |
| P3 | Monitor Moondream 3 for final release | Efficient 2B-active MoE for high-throughput inference |
Key Takeaway: The biggest news this week is Gemma 4 (April 2) and the confirmation that granite4-vision-sft already scores 1.0144 under transformers while its vLLM-served variant collapses to 0.4286. Fixing that serving path is the single highest-ROI action available right now.
Generated by HF Model Scout -- 2026-04-04