Daily Model Scout Report – 2026-04-10
by msudharsanan - opened
Current Baseline (3,500-sample hard eval, weighted composite score)
| Rank | Model | Architecture | Weighted Score |
|---|---|---|---|
| 1 | granite4-vision-sft | Granite 4.0 Vision | 1.0144 |
| 2 | qwen3-vl-8b-sft+grpo | Qwen3-VL 8B | 0.9131 |
| 3 | qwen3-vl-2b-sft-grpo-v9 | Qwen3-VL 2B | 0.8948 |
| 4 | qwen3-vl-8b-sft-grpo-nvfp4 | Qwen3-VL 8B (NVFP4) | 0.8945 |
| 5 | qwen3-vl-8b-instruct-base | Qwen3-VL 8B (base) | 0.8751 |
| 6 | qwen35-2b-base | Qwen3.5 2B (base) | 0.8437 |
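The composite scores above (note that values can exceed 1.0) are easier to interpret with the scoring shape in mind. As a minimal sketch, assuming the composite is a weighted sum over per-metric scores; the metric names and weights below are invented for illustration and are not the report's actual rubric:

```python
# Sketch of a weighted composite score. Metric names and weights here are
# hypothetical; the report does not publish its actual rubric.
def weighted_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-metric scores. Weights need not sum to 1,
    which is how composites above 1.0 (e.g. 1.0144) can arise."""
    return sum(weights[name] * metrics[name] for name in weights)

# Illustrative only: made-up metrics for a single model.
example = weighted_score(
    {"field_accuracy": 0.95, "json_validity": 0.99, "latency_bonus": 0.6},
    {"field_accuracy": 0.7, "json_validity": 0.3, "latency_bonus": 0.1},
)
```

Because the weights sum to 1.1 in this made-up example, the composite lands above 1.0 even though every individual metric is below it.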
New Models Found (Last 7 Days)
HIGH – Benchmark Immediately
1. Google Gemma 4 (Released April 2-3, 2026)
- Models: E2B (2.3B eff / 5.1B total), E4B (4.5B eff / 8B total), 26B-A4B (4B active / 26B MoE), 31B (dense)
- HuggingFace: gemma-4-E2B, gemma-4-E4B, gemma-4-26B-A4B-it, gemma-4-31B-it
- License: Apache 2.0
- Architecture: Dense + MoE variants, hybrid sliding-window/global attention, Per-Layer Embeddings (PLE), dedicated vision encoder with variable aspect ratios and configurable token budgets (70-1120 tokens), audio encoder (small models)
- Vision capabilities: Object detection (native JSON bounding boxes), OCR, chart comprehension, GUI detection, video understanding, spatial reasoning, captioning
- Key benchmarks: MMMU Pro 76.9% (31B), 52.6% (E4B); MATH-Vision 85.6% (31B)
- Fine-tuning: Supported via TRL, Unsloth, Vertex AI; LoRA/QLoRA compatible
- Why it matters:
- The E2B (2.3B effective) directly competes with our Qwen3.5-2B models; if its base zero-shot performance is strong, SFT+GRPO could push it past our current small-model scores
- The E4B (4.5B effective) fills a gap we don't currently cover: a mid-size model that could balance accuracy and speed
- The 26B-A4B MoE activates only 4B params at inference, potentially giving 8B-class accuracy at 2B-class latency
- Apache 2.0 license, native JSON output capabilities, and excellent fine-tuning ecosystem
- Recommended: Benchmark E2B-it and E4B-it zero-shot first; if promising, run SFT pipeline on E4B
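Since native JSON bounding boxes are one of the headline reasons to benchmark Gemma 4, a small parsing harness is worth having ready. This sketch assumes detections arrive as `{"box_2d": [y0, x0, y1, x1], "label": ...}` with coordinates normalized to 0-1000, following earlier Gemma releases; verify the exact schema against the Gemma 4 model card before relying on it:

```python
import json

def parse_detections(raw: str, width: int, height: int) -> list[dict]:
    """Parse model-emitted JSON detections into pixel-space boxes.

    Assumes each detection is {"box_2d": [y0, x0, y1, x1], "label": ...}
    with coordinates normalized to 0-1000 (the convention in earlier
    Gemma releases); this is an assumption, not a confirmed Gemma 4 spec.
    """
    out = []
    for det in json.loads(raw):
        y0, x0, y1, x1 = det["box_2d"]
        out.append({
            "label": det["label"],
            # Rescale normalized [0, 1000] coordinates to pixels.
            "box": (x0 * width // 1000, y0 * height // 1000,
                    x1 * width // 1000, y1 * height // 1000),
        })
    return out

# Usage with a hand-written sample response.
sample = '[{"box_2d": [100, 200, 500, 800], "label": "collar"}]'
dets = parse_detections(sample, width=1000, height=1000)
```

Keeping the parser strict (a `json.JSONDecodeError` surfaces immediately) also gives a quick signal on how reliably the model honors its JSON output mode during the zero-shot eval.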
MEDIUM – Worth Watching
2. Microsoft Phi-4-Reasoning-Vision-15B (Released March 4, 2026)
- HuggingFace: Phi-4-reasoning-vision-15B
- Parameters: 15B, SigLIP-2 vision encoder, mid-fusion architecture
- License: MIT
- Architecture: Built on Phi-4-Reasoning backbone with configurable thinking mode
- Why it matters: We tested base Phi-4-Multimodal (scored 0.6513 weighted), but this reasoning variant has a fundamentally different training approach with chain-of-thought. The thinking mode could help on our hardest classification samples. At 15B it fits comfortably on our RTX PRO 6000.
- Caveat: Phi-4 base performed poorly on our task; the reasoning variant may not overcome the fundamental architecture mismatch with structured JSON extraction
- Recommended: Zero-shot eval only; deprioritize behind Gemma 4
3. Alibaba Qwen3.6-Plus (Released April 2, 2026)
- Status: API-only (closed-source). Open-weight variants "coming later"
- Capabilities: 1M context, multimodal (vision + text), agentic coding, UI-to-code generation
- Why it matters: Represents the next generation of the Qwen VL line we heavily use. When open-weight variants drop (likely Qwen3.6-VL in 2B/8B sizes), they could be direct upgrades to our Qwen3-VL pipeline
- Recommended: Monitor for open-weight release; no action until weights are available
4. Moondream 3 + Segmentation Extension (Preview Sep 2025; Segmentation April 3, 2026)
- HuggingFace: moondream3-preview
- Parameters: 9B total / 2B active (MoE), 32K context
- Why it matters: Moondream2 scored 0.6979 weighted on our eval. Moondream 3 is a significant architecture upgrade (MoE, larger context, grounded reasoning). The new segmentation extension (April 3) suggests active development.
- Caveat: Still in preview; fine-tuning ecosystem less mature than Qwen/Gemma
- Recommended: Re-eval zero-shot when stable release drops
LOW – Tangentially Relevant
5. IBM Granite 4.0 3B Vision (Released March 27, 2026)
- Status: Already evaluated; scored 1.0144 weighted, our current best model
- HuggingFace: granite-4.0-3b-vision
- Note: Specialized for document extraction (charts, tables, KVP). Its strong performance on our garment task is surprising; further investigation into why it outperforms larger models is warranted. Check whether the vLLM deployment issues (granite4-vision-sft-vllm scored 0.4587) can be resolved for production use.
6. Meta Llama 4 Scout (Released earlier 2026)
- Parameters: 17B active / 109B total (16 experts), 10M context
- Why it matters: Impressive multimodal capabilities and huge context, but 109B total params make fine-tuning impractical on our hardware. The 10M context is irrelevant for single-image classification.
- Recommended: Skip unless quantized variants prove viable
7. InternVL3.5 (Released August 2025)
- Status: Not new (6+ months old), but we haven't evaluated the 3.5 series
- Models: 1B, 2B, 4B, 8B, 14B, 38B + MPO variants
- Why it matters: InternVL3-2B scored 0.7271 weighted for us. InternVL3.5-2B reportedly scores significantly higher on public benchmarks. Cascade RL training could be complementary to our GRPO/GTPO pipeline.
- Recommended: Low priority; our Qwen-based models already outperform InternVL significantly
Summary and Recommended Actions
| Priority | Action | Model | Rationale |
|---|---|---|---|
| P0 | Zero-shot eval | Gemma 4 E2B-it, E4B-it | Brand new (April 2), Apache 2.0, competitive sizes, native JSON output |
| P0 | Zero-shot eval | Gemma 4 26B-A4B-it | MoE with only 4B active; could match 8B accuracy at a fraction of the latency |
| P1 | Investigate | Granite 4.0 vLLM deployment | Our best model (1.0144) but vLLM serving is broken (drops to 0.4587) |
| P2 | Zero-shot eval | Phi-4-Reasoning-Vision-15B | Reasoning variant might improve over base Phi-4 (0.6513) |
| P3 | Monitor | Qwen3.6 open-weight release | Next-gen Qwen VL; no action until weights drop |
Report generated automatically by Denali-AI Model Scout
Eval baseline: 3,500-sample hard eval set with weighted composite scoring
Hardware target: NVIDIA RTX PRO 6000 (98GB VRAM)