Daily Model Scout Report — 2026-04-02


Org: Denali-AI | Task: Garment attribute classification (9-field JSON extraction)
Current best: qwen3-vl-8b-sft+grpo @ 0.9131 weighted overall (3,500-sample hard eval)
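For reference, the "weighted overall" metric is a per-field weighted exact-match accuracy over the 9 extracted JSON attributes. A minimal sketch of the scoring, using hypothetical field names and weights (the production schema and weights live in our eval harness):

```python
# Minimal sketch of the weighted-overall metric: per-field exact-match
# accuracy over the extracted JSON, combined with per-field weights.
# Field names and weights below are hypothetical placeholders.
FIELDS = {  # field -> weight
    "category": 2.0, "color": 1.0, "pattern": 1.0, "sleeve": 1.0,
    "neckline": 1.0, "fit": 1.0, "material": 1.0, "closure": 1.0,
    "length": 1.0,
}

def weighted_overall(preds, golds):
    """preds/golds: lists of dicts mapping each field to a label."""
    total_w = sum(FIELDS.values())
    # Per-field exact-match accuracy over the eval set
    per_field = {
        f: sum(p.get(f) == g.get(f) for p, g in zip(preds, golds)) / len(golds)
        for f in FIELDS
    }
    return sum(w * per_field[f] for f, w in FIELDS.items()) / total_w

# Example: two samples, one field wrong on the second sample
gold = [{f: "a" for f in FIELDS}, {f: "a" for f in FIELDS}]
pred = [dict(gold[0]), {**gold[1], "color": "b"}]
print(round(weighted_overall(pred, gold), 4))  # -> 0.95
```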


🔥 New Releases This Week (March 26 – April 2, 2026)

1. Google Gemma 4 — Released April 2, 2026 (TODAY)

| Model | Params (effective) | Context | License | Vision |
|---|---|---|---|---|
| gemma-4-E2B-it | 2.3B | 128K | Apache 2.0 | ✅ + Audio |
| gemma-4-E4B-it | 4.5B | 128K | Apache 2.0 | ✅ + Audio |
| gemma-4-31B-it | 31B | 256K | Apache 2.0 | ✅ |
| gemma-4-26B-A4B-it | 4B active (26B total MoE) | 256K | Apache 2.0 | ✅ |

Architecture: Dense/MoE with hybrid sliding-window + global attention, Per-Layer Embeddings (PLE), learned 2D vision positions, variable aspect ratios and image token budgets (70–1120 tokens per image)

Vision benchmarks: MMMU Pro 76.9% (31B), MATH-Vision 85.6% (31B), E4B scores MMMU Pro 52.6%

Why it matters for us:

  • gemma-4-E4B (4.5B effective) is a strong candidate to replace/complement our Qwen3-VL-2B models β€” more capable yet still small
  • gemma-4-E2B (2.3B) could be an edge deployment candidate with native vision
  • gemma-4-26B-A4B (4B active MoE) β€” extremely interesting for production: MoE efficiency with only 4B active params but 26B total knowledge. Fits easily on our RTX PRO 6000
  • Apache 2.0 license, Unsloth fine-tuning support confirmed, base models available for SFT
  • Variable image token budgets could help with inference speed tuning

Relevance: 🔴 HIGH — Benchmark immediately. The E4B and 26B-A4B models are prime fine-tuning candidates.
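On the image-token point: prefill cost grows roughly linearly with prompt length, so Gemma 4's 70–1120 image-token range is a wide latency dial. A back-of-envelope sketch, where the text prompt length is a hypothetical placeholder for our 9-field extraction prompt:

```python
# Back-of-envelope: relative prefill cost per request at different
# image token budgets (Gemma 4's stated range is 70-1120 tokens).
# The text prompt length below is a hypothetical placeholder.
TEXT_TOKENS = 400
MAX_IMAGE_TOKENS = 1120

def relative_prefill(image_tokens, text_tokens=TEXT_TOKENS):
    """Prompt size at a given image budget, relative to the max budget."""
    return (image_tokens + text_tokens) / (MAX_IMAGE_TOKENS + text_tokens)

for budget in (70, 280, 1120):
    print(budget, round(relative_prefill(budget), 2))
```

With these illustrative numbers, dropping from 1120 to 70 image tokens cuts prompt size to roughly a third, which is worth measuring against any accuracy loss on our hard eval.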


2. Qwen3.5-Omni — Released March 30, 2026

| Variant | Modalities | Context | License |
|---|---|---|---|
| Qwen3.5-Omni-Plus | Text + Image + Audio + Video | 256K | Closed |
| Qwen3.5-Omni-Flash | Text + Image + Audio + Video | 256K | Closed |
| Qwen3.5-Omni-Light | Text + Image + Audio + Video | 256K | Closed |

Architecture: Native Thinker-Talker multimodal, unified text/audio/video processing, 10+ hrs audio, 400+ sec 720p video

Why it matters for us:

  • SOTA on 215 audio and audio-visual benchmarks
  • However, CLOSED SOURCE β€” breaks Alibaba's open-source streak
  • Cannot fine-tune, cannot self-host without API costs
  • We already have strong Qwen3.5-VL results with open weights

Relevance: 🟡 MEDIUM — Monitor for open-weight release. Not actionable for fine-tuning today.


📋 Recently Released Models Worth Noting

3. Qwen3.5 Base VLM Family — Released February 16, 2026

We already have these integrated and benchmarked. Current standings on our hard eval:

  • qwen35-2b-base: 0.8437 weighted overall
  • qwen35-2b-sft-v7: 0.6369 (SFT degraded β€” format issues)
  • qwen35-2b-sft-grpo-gtpo-v8: 0.6535

These Qwen3.5 models underperform our Qwen3-VL-8B fine-tunes. The Qwen3.5 NVFP4 quantized models are broken (scoring ~0.43, similar to random baseline).
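One quick way to distinguish "model is broken" (like the NVFP4 quants above) from "model is merely weak" is the fraction of outputs that even parse as the expected 9-field JSON. A minimal sketch, assuming raw string outputs and hypothetical field names:

```python
import json

# Hypothetical field names; the real 9-field schema is internal.
EXPECTED_KEYS = {"category", "color", "pattern", "sleeve", "neckline",
                 "fit", "material", "closure", "length"}

def parse_rate(outputs):
    """Fraction of outputs that parse to a dict with exactly the expected keys."""
    ok = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON at all
        if isinstance(obj, dict) and set(obj) == EXPECTED_KEYS:
            ok += 1
    return ok / len(outputs)

outputs = [
    '{"category": "dress", "color": "red", "pattern": "solid", '
    '"sleeve": "long", "neckline": "v", "fit": "slim", '
    '"material": "cotton", "closure": "zip", "length": "midi"}',
    "not json at all",
]
print(parse_rate(outputs))  # -> 0.5
```

A near-random weighted score combined with a low parse rate points at output formatting (or a broken quantization), not at the model's visual understanding.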

Relevance: 🟢 LOW — Already benchmarked. Qwen3.5 underperforms Qwen3-VL for our task.


4. InternVL3.5 — Released August 2025

| Model | Params | Vision Encoder | Language Model | License |
|---|---|---|---|---|
| InternVL3.5-2B | 2.3B | InternViT-300M | Qwen3-2B | Apache 2.0 |
| InternVL3.5-4B | 4.7B | InternViT-300M | Qwen3-4B | Apache 2.0 |
| InternVL3.5-8B | 8.5B | InternViT-300M | Qwen3-8B | Apache 2.0 |

Key improvement: +16% reasoning gain and 4.05x inference speedup over InternVL3. Cascade RL training (MPO + GSPO).

Why it matters: Our InternVL3-2B models scored only 0.72 weighted overall — significantly worse than Qwen3-VL. InternVL3.5's +16% reasoning improvement might close that gap. The 4B variant is new territory we haven't tested.

Relevance: 🟡 MEDIUM — InternVL has underperformed for us, but the 3.5 generation's improvements are substantial enough to re-evaluate the 4B and 8B variants.


5. MiniCPM-V 4.5 — Released September 2025

| Model | Params | Architecture | License |
|---|---|---|---|
| MiniCPM-V-4.5 | 8.7B | Qwen3-8B + SigLIP2-400M | Apache 2.0 |

Key features: 96x video token compression, surpasses GPT-4o on OCR tasks, 4x fewer visual tokens than competitors

Why it matters: Built on Qwen3-8B (same base as our best model), but with SigLIP2 vision encoder and innovative 3D-Resampler. The reduced visual token count could significantly improve inference speed.

Relevance: 🟡 MEDIUM — Interesting architecture, though not a new release. Could be worth a base-model comparison.


📊 Current Leaderboard (Denali-AI Hard Eval, 3,500 samples)

| Rank | Model | Weighted Overall | Notes |
|---|---|---|---|
| 1 | qwen3-vl-8b-sft+grpo | 0.9131 | Best overall |
| 2 | qwen3-vl-2b-sft-grpo-v9 | 0.8948 | Best small model |
| 3 | qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 | Best quantized 8B |
| 4 | qwen3-vl-8b-instruct-base | 0.8751 | No fine-tuning |
| 5 | qwen3-vl-8b-instruct-nvfp4 | 0.8716 | Quantized base |
| 6 | qwen35-2b-base | 0.8437 | Qwen3.5 zero-shot |
| 7 | qwen3-vl-2b-sft-grpo-v9-nvfp4 | 0.8422 | Quantized small |
| 8 | qwen3-vl-2b-instruct-base | 0.7642 | 2B zero-shot |
| 9 | internvl3-2b-grpo-gtpo-full | 0.7271 | InternVL3 best |
| 10 | moondream2-base | 0.6979 | Tiny model baseline |

🎯 Recommended Actions

  1. IMMEDIATE: Benchmark Gemma 4 E4B and 26B-A4B base models on our hard eval set. These are brand new (released today), Apache 2.0, and architecturally novel. The E4B at 4.5B params and 26B-A4B with only 4B active params are sweet spots for our use case.

  2. THIS WEEK: Evaluate Gemma 4 E2B (2.3B) as a potential edge/mobile model replacement for our Qwen3-VL-2B pipeline.

  3. IF Gemma 4 base scores > 0.80: Initiate SFT fine-tuning pipeline adaptation for Gemma 4 architecture. Check Unsloth/TRL support (confirmed available).

  4. MONITOR: Qwen3.5-Omni for open-weight release. If weights drop, benchmark immediately.

  5. OPTIONAL: Re-evaluate InternVL3.5-4B as a mid-size option if Gemma 4 doesn't pan out.
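The gating in actions 1 and 3 can be sketched as a small triage step, with hypothetical model ids and hard-eval scores (only the 0.80 threshold comes from action 3):

```python
# Sketch of the benchmark-then-gate flow from actions 1 and 3.
# Model ids and scores below are illustrative placeholders.
SFT_THRESHOLD = 0.80  # action 3: only adapt the SFT pipeline above this

def triage(scores, threshold=SFT_THRESHOLD):
    """Split candidates into fine-tune-worthy vs. rejected by hard-eval score."""
    go = {m: s for m, s in scores.items() if s > threshold}
    no_go = {m: s for m, s in scores.items() if s <= threshold}
    return go, no_go

# Hypothetical hard-eval results for the two Gemma 4 candidates:
go, no_go = triage({"gemma-4-E4B-it": 0.86, "gemma-4-26B-A4B-it": 0.79})
print(sorted(go))     # candidates that proceed to SFT pipeline adaptation
print(sorted(no_go))  # candidates parked pending further investigation
```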


Report generated by Claude Code Model Scout — Denali-AI
Next report: 2026-04-03
