Daily Model Scout Report — 2026-04-10

#8 by msudharsanan · Denali Advanced Integration org


Current Baseline (3,500-sample hard eval, weighted composite score)

| Rank | Model | Architecture | weighted_score |
|------|-------|--------------|----------------|
| 1 | granite4-vision-sft | Granite 4.0 Vision | 1.0144 |
| 2 | qwen3-vl-8b-sft+grpo | Qwen3-VL 8B | 0.9131 |
| 3 | qwen3-vl-2b-sft-grpo-v9 | Qwen3-VL 2B | 0.8948 |
| 4 | qwen3-vl-8b-sft-grpo-nvfp4 | Qwen3-VL 8B (NVFP4) | 0.8945 |
| 5 | qwen3-vl-8b-instruct-base | Qwen3-VL 8B (base) | 0.8751 |
| 6 | qwen35-2b-base | Qwen3.5 2B (base) | 0.8437 |
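For readers unfamiliar with the composite, a weighted score is just a weighted sum of per-metric results. The metric names and weights below are illustrative placeholders, not the actual eval config, but the sketch also shows how a composite can exceed 1.0 when the weights sum to more than 1:

```python
# Hypothetical sketch of a weighted composite score.
# Metric names and weights are placeholders, NOT the real eval config.
def weighted_composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-metric scores. Weights need not sum to 1,
    which is how a composite can exceed 1.0 (as granite4-vision-sft does)."""
    return sum(weights[m] * scores[m] for m in weights)

example_scores = {"field_accuracy": 0.92, "json_validity": 0.99, "exact_match": 0.81}
example_weights = {"field_accuracy": 0.5, "json_validity": 0.3, "exact_match": 0.3}
print(round(weighted_composite(example_scores, example_weights), 4))
```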

New Models Found (Last 7 Days)

HIGH — Benchmark Immediately

1. Google Gemma 4 (Released April 2-3, 2026)

  • Models: E2B (2.3B eff / 5.1B total), E4B (4.5B eff / 8B total), 26B-A4B (4B active / 26B MoE), 31B (dense)
  • HuggingFace: gemma-4-E2B, gemma-4-E4B, gemma-4-26B-A4B-it, gemma-4-31B-it
  • License: Apache 2.0
  • Architecture: Dense + MoE variants, hybrid sliding-window/global attention, Per-Layer Embeddings (PLE), dedicated vision encoder with variable aspect ratios and configurable token budgets (70-1120 tokens), audio encoder (small models)
  • Vision capabilities: Object detection (native JSON bounding boxes), OCR, chart comprehension, GUI detection, video understanding, spatial reasoning, captioning
  • Key benchmarks: MMMU Pro 76.9% (31B), 52.6% (E4B); MATH-Vision 85.6% (31B)
  • Fine-tuning: Supported via TRL, Unsloth, Vertex AI; LoRA/QLoRA compatible
  • Why it matters:
    • The E2B (2.3B effective) directly competes with our Qwen3.5-2B models β€” if its base zero-shot is strong, SFT+GRPO could push it past our current small-model scores
    • The E4B (4.5B effective) fills a gap we don't currently cover β€” a mid-size model that could balance accuracy and speed
    • The 26B-A4B MoE activates only 4B params at inference β€” potentially 8B-class accuracy at 2B-class latency
    • Apache 2.0 license, native JSON output capabilities, and excellent fine-tuning ecosystem
    • Recommended: Benchmark E2B-it and E4B-it zero-shot first; if promising, run SFT pipeline on E4B

MEDIUM — Worth Watching

2. Microsoft Phi-4-Reasoning-Vision-15B (Released March 4, 2026)

  • HuggingFace: Phi-4-reasoning-vision-15B
  • Parameters: 15B, SigLIP-2 vision encoder, mid-fusion architecture
  • License: MIT
  • Architecture: Built on Phi-4-Reasoning backbone with configurable thinking mode
  • Why it matters: We tested base Phi-4-Multimodal (scored 0.6513 weighted), but this reasoning variant has a fundamentally different training approach with chain-of-thought. The thinking mode could help on our hardest classification samples. At 15B it fits comfortably on our RTX PRO 6000.
  • Caveat: Phi-4 base performed poorly on our task β€” the reasoning variant may not overcome the fundamental architecture mismatch with structured JSON extraction
  • Recommended: Zero-shot eval only; deprioritize behind Gemma 4

3. Alibaba Qwen3.6-Plus (Released April 2, 2026)

  • Status: API-only (closed-source). Open-weight variants "coming later"
  • Capabilities: 1M context, multimodal (vision + text), agentic coding, UI-to-code generation
  • Why it matters: Represents the next generation of the Qwen VL line we heavily use. When open-weight variants drop (likely Qwen3.6-VL in 2B/8B sizes), they could be direct upgrades to our Qwen3-VL pipeline
  • Recommended: Monitor for open-weight release; no action until weights are available

4. Moondream 3 + Segmentation Extension (Preview Sep 2025; Segmentation April 3, 2026)

  • HuggingFace: moondream3-preview
  • Parameters: 9B total / 2B active (MoE), 32K context
  • Why it matters: Moondream2 scored 0.6979 weighted on our eval. Moondream 3 is a significant architecture upgrade (MoE, larger context, grounded reasoning). The new segmentation extension (April 3) suggests active development.
  • Caveat: Still in preview; fine-tuning ecosystem less mature than Qwen/Gemma
  • Recommended: Re-eval zero-shot when stable release drops

LOW — Tangentially Relevant

5. IBM Granite 4.0 3B Vision (Released March 27, 2026)

  • Status: Already evaluated β€” scored 1.0144 weighted, our current best model
  • HuggingFace: granite-4.0-3b-vision
  • Note: Specialized for document extraction (charts, tables, KVP). Its strong performance on our garment task is a surprise β€” further investigation into why it outperforms larger models is warranted. Check if the vLLM deployment issues (granite4-vision-sft-vllm scored 0.4587) can be resolved for production use.

6. Meta Llama 4 Scout (Released earlier 2026)

  • Parameters: 17B active / 109B total (16 experts), 10M context
  • Why it matters: Impressive multimodal capabilities and huge context, but 109B total params makes fine-tuning impractical on our hardware. The 10M context is irrelevant for single-image classification.
  • Recommended: Skip unless quantized variants prove viable

7. InternVL3.5 (Released August 2025)

  • Status: Not new (6+ months old), but we haven't evaluated the 3.5 series
  • Models: 1B, 2B, 4B, 8B, 14B, 38B + MPO variants
  • Why it matters: InternVL3-2B scored 0.7271 weighted for us. InternVL3.5-2B reportedly scores significantly higher than InternVL3-2B on public benchmarks. Cascade RL training could be complementary to our GRPO/GTPO pipeline.
  • Recommended: Low priority β€” our Qwen-based models already outperform InternVL significantly

Summary and Recommended Actions

| Priority | Action | Model | Rationale |
|----------|--------|-------|-----------|
| P0 | Zero-shot eval | Gemma 4 E2B-it, E4B-it | Brand new (April 2), Apache 2.0, competitive sizes, native JSON output |
| P0 | Zero-shot eval | Gemma 4 26B-A4B-it | MoE with only 4B active — could match 8B accuracy at a fraction of the latency |
| P1 | Investigate | Granite 4.0 vLLM deployment | Our best model (1.0144) but vLLM serving is broken (drops to 0.4587) |
| P2 | Zero-shot eval | Phi-4-Reasoning-Vision-15B | Reasoning variant might improve over base Phi-4 (0.6513) |
| P3 | Monitor | Qwen3.6 open-weight release | Next-gen Qwen VL; no action until weights drop |

Report generated automatically by Denali-AI Model Scout
Eval baseline: 3,500-sample hard eval set with weighted composite scoring
Hardware target: NVIDIA RTX PRO 6000 (98GB VRAM)
