Daily Model Scout Report — 2026-04-02


Org: Denali-AI | Task: Garment attribute classification (9-field JSON extraction)
Current best: qwen3-vl-8b-sft+grpo @ 0.9131 weighted overall (3,500-sample hard eval)
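For reference, the "weighted overall" metric is a per-field weighted exact-match accuracy over the 9 extracted JSON attributes. A minimal sketch of the scoring, using hypothetical field names and weights (the production schema and weights live in our eval harness):

```python
# Minimal sketch of the weighted-overall metric: per-field exact-match
# accuracy over the extracted JSON, combined with per-field weights.
# Field names and weights below are hypothetical placeholders.
FIELDS = {  # field -> weight
    "category": 2.0, "color": 1.0, "pattern": 1.0, "sleeve": 1.0,
    "neckline": 1.0, "fit": 1.0, "material": 1.0, "closure": 1.0,
    "length": 1.0,
}

def weighted_overall(preds, golds):
    """preds/golds: lists of dicts mapping each field to a label."""
    total_w = sum(FIELDS.values())
    # Per-field exact-match accuracy over the eval set
    per_field = {
        f: sum(p.get(f) == g.get(f) for p, g in zip(preds, golds)) / len(golds)
        for f in FIELDS
    }
    return sum(w * per_field[f] for f, w in FIELDS.items()) / total_w

# Example: two samples, one field wrong on the second sample
gold = [{f: "a" for f in FIELDS}, {f: "a" for f in FIELDS}]
pred = [dict(gold[0]), {**gold[1], "color": "b"}]
print(round(weighted_overall(pred, gold), 4))  # -> 0.95
```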


🔥 New Releases This Week (March 26 – April 2, 2026)

1. Google Gemma 4 — Released April 2, 2026 (TODAY)

| Model | Params (effective) | Context | License | Vision |
|---|---|---|---|---|
| gemma-4-E2B-it | 2.3B | 128K | Apache 2.0 | ✅ + Audio |
| gemma-4-E4B-it | 4.5B | 128K | Apache 2.0 | ✅ + Audio |
| gemma-4-31B-it | 31B | 256K | Apache 2.0 | ✅ |
| gemma-4-26B-A4B-it | 4B active (26B total MoE) | 256K | Apache 2.0 | ✅ |

Architecture: Dense/MoE with hybrid sliding-window + global attention, Per-Layer Embeddings (PLE), learned 2D vision positions, variable aspect ratios and image token budgets (70–1120 tokens per image)

Vision benchmarks: MMMU Pro 76.9% (31B), MATH-Vision 85.6% (31B), E4B scores MMMU Pro 52.6%

Why it matters for us:

  • gemma-4-E4B (4.5B effective) is a strong candidate to replace/complement our Qwen3-VL-2B models β€” more capable yet still small
  • gemma-4-E2B (2.3B) could be an edge deployment candidate with native vision
  • gemma-4-26B-A4B (4B active MoE) β€” extremely interesting for production: MoE efficiency with only 4B active params but 26B total knowledge. Fits easily on our RTX PRO 6000
  • Apache 2.0 license, Unsloth fine-tuning support confirmed, base models available for SFT
  • Variable image token budgets could help with inference speed tuning

Relevance: 🔴 HIGH — Benchmark immediately. The E4B and 26B-A4B models are prime fine-tuning candidates.
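On the image-token point: prefill cost grows roughly linearly with prompt length, so Gemma 4's 70–1120 image-token range is a wide latency dial. A back-of-envelope sketch, where the text prompt length is a hypothetical placeholder for our 9-field extraction prompt:

```python
# Back-of-envelope: relative prefill cost per request at different
# image token budgets (Gemma 4's stated range is 70-1120 tokens).
# The text prompt length below is a hypothetical placeholder.
TEXT_TOKENS = 400
MAX_IMAGE_TOKENS = 1120

def relative_prefill(image_tokens, text_tokens=TEXT_TOKENS):
    """Prompt size at a given image budget, relative to the max budget."""
    return (image_tokens + text_tokens) / (MAX_IMAGE_TOKENS + text_tokens)

for budget in (70, 280, 1120):
    print(budget, round(relative_prefill(budget), 2))
```

With these illustrative numbers, dropping from 1120 to 70 image tokens cuts prompt size to roughly a third, which is worth measuring against any accuracy loss on our hard eval.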


2. Qwen3.5-Omni — Released March 30, 2026

| Variant | Modalities | Context | License |
|---|---|---|---|
| Qwen3.5-Omni-Plus | Text + Image + Audio + Video | 256K | Closed |
| Qwen3.5-Omni-Flash | Text + Image + Audio + Video | 256K | Closed |
| Qwen3.5-Omni-Light | Text + Image + Audio + Video | 256K | Closed |

Architecture: Native Thinker-Talker multimodal, unified text/audio/video processing, 10+ hrs audio, 400+ sec 720p video

Why it matters for us:

  • SOTA on 215 audio and audio-visual benchmarks
  • However, CLOSED SOURCE β€” breaks Alibaba's open-source streak
  • Cannot fine-tune, cannot self-host without API costs
  • We already have strong Qwen3.5-VL results with open weights

Relevance: 🟡 MEDIUM — Monitor for open-weight release. Not actionable for fine-tuning today.


📋 Recently Released Models Worth Noting

3. Qwen3.5 Base VLM Family — Released February 16, 2026

We already have these integrated and benchmarked. Current standings on our hard eval:

  • qwen35-2b-base: 0.8437 weighted overall
  • qwen35-2b-sft-v7: 0.6369 (SFT degraded β€” format issues)
  • qwen35-2b-sft-grpo-gtpo-v8: 0.6535

These Qwen3.5 models underperform our Qwen3-VL-8B fine-tunes. The Qwen3.5 NVFP4 quantized models are broken (scoring ~0.43, similar to random baseline).
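One quick way to distinguish "model is broken" (like the NVFP4 quants above) from "model is merely weak" is the fraction of outputs that even parse as the expected 9-field JSON. A minimal sketch, assuming raw string outputs and hypothetical field names:

```python
import json

# Hypothetical field names; the real 9-field schema is internal.
EXPECTED_KEYS = {"category", "color", "pattern", "sleeve", "neckline",
                 "fit", "material", "closure", "length"}

def parse_rate(outputs):
    """Fraction of outputs that parse to a dict with exactly the expected keys."""
    ok = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON at all
        if isinstance(obj, dict) and set(obj) == EXPECTED_KEYS:
            ok += 1
    return ok / len(outputs)

outputs = [
    '{"category": "dress", "color": "red", "pattern": "solid", '
    '"sleeve": "long", "neckline": "v", "fit": "slim", '
    '"material": "cotton", "closure": "zip", "length": "midi"}',
    "not json at all",
]
print(parse_rate(outputs))  # -> 0.5
```

A near-random weighted score combined with a low parse rate points at output formatting (or a broken quantization), not at the model's visual understanding.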

Relevance: 🟢 LOW — Already benchmarked. Qwen3.5 underperforms Qwen3-VL for our task.


4. InternVL3.5 — Released August 2025

| Model | Params | Vision Encoder | Language Model | License |
|---|---|---|---|---|
| InternVL3.5-2B | 2.3B | InternViT-300M | Qwen3-2B | Apache 2.0 |
| InternVL3.5-4B | 4.7B | InternViT-300M | Qwen3-4B | Apache 2.0 |
| InternVL3.5-8B | 8.5B | InternViT-300M | Qwen3-8B | Apache 2.0 |

Key improvement: +16% reasoning gain and 4.05x inference speedup over InternVL3. Cascade RL training (MPO + GSPO).

Why it matters: Our InternVL3-2B models scored only 0.72 weighted overall — significantly worse than Qwen3-VL. InternVL3.5's +16% reasoning improvement might close that gap. The 4B variant is new territory we haven't tested.

Relevance: 🟡 MEDIUM — InternVL has underperformed for us, but the 3.5 generation's improvements are substantial enough to re-evaluate the 4B and 8B variants.


5. MiniCPM-V 4.5 — Released September 2025

| Model | Params | Architecture | License |
|---|---|---|---|
| MiniCPM-V-4.5 | 8.7B | Qwen3-8B + SigLIP2-400M | Apache 2.0 |

Key features: 96x video token compression, surpasses GPT-4o on OCR tasks, 4x fewer visual tokens than competitors

Why it matters: Built on Qwen3-8B (same base as our best model), but with SigLIP2 vision encoder and innovative 3D-Resampler. The reduced visual token count could significantly improve inference speed.

Relevance: 🟡 MEDIUM — Interesting architecture, though not a new release. Could be worth a base-model comparison.


📊 Current Leaderboard (Denali-AI Hard Eval, 3,500 samples)

| Rank | Model | Weighted Overall | Notes |
|---|---|---|---|
| 1 | qwen3-vl-8b-sft+grpo | 0.9131 | Best overall |
| 2 | qwen3-vl-2b-sft-grpo-v9 | 0.8948 | Best small model |
| 3 | qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 | Best quantized 8B |
| 4 | qwen3-vl-8b-instruct-base | 0.8751 | No fine-tuning |
| 5 | qwen3-vl-8b-instruct-nvfp4 | 0.8716 | Quantized base |
| 6 | qwen35-2b-base | 0.8437 | Qwen3.5 zero-shot |
| 7 | qwen3-vl-2b-sft-grpo-v9-nvfp4 | 0.8422 | Quantized small |
| 8 | qwen3-vl-2b-instruct-base | 0.7642 | 2B zero-shot |
| 9 | internvl3-2b-grpo-gtpo-full | 0.7271 | InternVL3 best |
| 10 | moondream2-base | 0.6979 | Tiny model baseline |

🎯 Recommended Actions

  1. IMMEDIATE: Benchmark Gemma 4 E4B and 26B-A4B base models on our hard eval set. These are brand new (released today), Apache 2.0, and architecturally novel. The E4B at 4.5B params and 26B-A4B with only 4B active params are sweet spots for our use case.

  2. THIS WEEK: Evaluate Gemma 4 E2B (2.3B) as a potential edge/mobile model replacement for our Qwen3-VL-2B pipeline.

  3. IF Gemma 4 base scores > 0.80: Initiate SFT fine-tuning pipeline adaptation for Gemma 4 architecture. Check Unsloth/TRL support (confirmed available).

  4. MONITOR: Qwen3.5-Omni for open-weight release. If weights drop, benchmark immediately.

  5. OPTIONAL: Re-evaluate InternVL3.5-4B as a mid-size option if Gemma 4 doesn't pan out.
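The gating in actions 1 and 3 can be sketched as a small triage step, with hypothetical model ids and hard-eval scores (only the 0.80 threshold comes from action 3):

```python
# Sketch of the benchmark-then-gate flow from actions 1 and 3.
# Model ids and scores below are illustrative placeholders.
SFT_THRESHOLD = 0.80  # action 3: only adapt the SFT pipeline above this

def triage(scores, threshold=SFT_THRESHOLD):
    """Split candidates into fine-tune-worthy vs. rejected by hard-eval score."""
    go = {m: s for m, s in scores.items() if s > threshold}
    no_go = {m: s for m, s in scores.items() if s <= threshold}
    return go, no_go

# Hypothetical hard-eval results for the two Gemma 4 candidates:
go, no_go = triage({"gemma-4-E4B-it": 0.86, "gemma-4-26B-A4B-it": 0.79})
print(sorted(go))     # candidates that proceed to SFT pipeline adaptation
print(sorted(no_go))  # candidates parked pending further investigation
```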


Report generated by Claude Code Model Scout — Denali-AI
Next report: 2026-04-03
