Daily Model Scout Report – 2026-04-10
by msudharsanan - opened
Current Baseline (3,500-sample hard eval, weighted composite score)
| Rank | Model | Architecture | Weighted Score |
|---|---|---|---|
| 1 | granite4-vision-sft | Granite 4.0 Vision | 1.0144 |
| 2 | qwen3-vl-8b-sft+grpo | Qwen3-VL 8B | 0.9131 |
| 3 | qwen3-vl-2b-sft-grpo-v9 | Qwen3-VL 2B | 0.8948 |
| 4 | qwen3-vl-8b-sft-grpo-nvfp4 | Qwen3-VL 8B (NVFP4) | 0.8945 |
| 5 | qwen3-vl-8b-instruct-base | Qwen3-VL 8B (base) | 0.8751 |
| 6 | qwen35-2b-base | Qwen3.5 2B (base) | 0.8437 |
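The composite scores above (note that values can exceed 1.0) are easier to interpret with the scoring shape in mind. As a minimal sketch, assuming the composite is a weighted sum over per-metric scores; the metric names and weights below are invented for illustration and are not the report's actual rubric:

```python
# Sketch of a weighted composite score. Metric names and weights here are
# hypothetical; the report does not publish its actual rubric.
def weighted_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-metric scores. Weights need not sum to 1,
    which is how composites above 1.0 (e.g. 1.0144) can arise."""
    return sum(weights[name] * metrics[name] for name in weights)

# Illustrative only: made-up metrics for a single model.
example = weighted_score(
    {"field_accuracy": 0.95, "json_validity": 0.99, "latency_bonus": 0.6},
    {"field_accuracy": 0.7, "json_validity": 0.3, "latency_bonus": 0.1},
)
```

Because the weights sum to 1.1 in this made-up example, the composite lands above 1.0 even though every individual metric is below it.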
New Models Found (Last 7 Days)
HIGH – Benchmark Immediately
1. Google Gemma 4 (Released April 2-3, 2026)
- Models: E2B (2.3B eff / 5.1B total), E4B (4.5B eff / 8B total), 26B-A4B (4B active / 26B MoE), 31B (dense)
- HuggingFace: gemma-4-E2B, gemma-4-E4B, gemma-4-26B-A4B-it, gemma-4-31B-it
- License: Apache 2.0
- Architecture: Dense + MoE variants, hybrid sliding-window/global attention, Per-Layer Embeddings (PLE), dedicated vision encoder with variable aspect ratios and configurable token budgets (70-1120 tokens), audio encoder (small models)
- Vision capabilities: Object detection (native JSON bounding boxes), OCR, chart comprehension, GUI detection, video understanding, spatial reasoning, captioning
- Key benchmarks: MMMU Pro 76.9% (31B), 52.6% (E4B); MATH-Vision 85.6% (31B)
- Fine-tuning: Supported via TRL, Unsloth, Vertex AI; LoRA/QLoRA compatible
- Why it matters:
- The E2B (2.3B effective) directly competes with our Qwen3.5-2B models; if its base zero-shot performance is strong, SFT+GRPO could push it past our current small-model scores
- The E4B (4.5B effective) fills a gap we don't currently cover: a mid-size model that could balance accuracy and speed
- The 26B-A4B MoE activates only 4B params at inference, potentially giving 8B-class accuracy at 2B-class latency
- Apache 2.0 license, native JSON output capabilities, and excellent fine-tuning ecosystem
- Recommended: Benchmark E2B-it and E4B-it zero-shot first; if promising, run SFT pipeline on E4B
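Since native JSON bounding boxes are one of the headline reasons to benchmark Gemma 4, a small parsing harness is worth having ready. This sketch assumes detections arrive as `{"box_2d": [y0, x0, y1, x1], "label": ...}` with coordinates normalized to 0-1000, following earlier Gemma releases; verify the exact schema against the Gemma 4 model card before relying on it:

```python
import json

def parse_detections(raw: str, width: int, height: int) -> list[dict]:
    """Parse model-emitted JSON detections into pixel-space boxes.

    Assumes each detection is {"box_2d": [y0, x0, y1, x1], "label": ...}
    with coordinates normalized to 0-1000 (the convention in earlier
    Gemma releases); this is an assumption, not a confirmed Gemma 4 spec.
    """
    out = []
    for det in json.loads(raw):
        y0, x0, y1, x1 = det["box_2d"]
        out.append({
            "label": det["label"],
            # Rescale normalized [0, 1000] coordinates to pixels.
            "box": (x0 * width // 1000, y0 * height // 1000,
                    x1 * width // 1000, y1 * height // 1000),
        })
    return out

# Usage with a hand-written sample response.
sample = '[{"box_2d": [100, 200, 500, 800], "label": "collar"}]'
dets = parse_detections(sample, width=1000, height=1000)
```

Keeping the parser strict (a `json.JSONDecodeError` surfaces immediately) also gives a quick signal on how reliably the model honors its JSON output mode during the zero-shot eval.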
MEDIUM – Worth Watching
2. Microsoft Phi-4-Reasoning-Vision-15B (Released March 4, 2026)
- HuggingFace: Phi-4-reasoning-vision-15B
- Parameters: 15B, SigLIP-2 vision encoder, mid-fusion architecture
- License: MIT
- Architecture: Built on Phi-4-Reasoning backbone with configurable thinking mode
- Why it matters: We tested base Phi-4-Multimodal (scored 0.6513 weighted), but this reasoning variant has a fundamentally different training approach with chain-of-thought. The thinking mode could help on our hardest classification samples. At 15B it fits comfortably on our RTX PRO 6000.
- Caveat: Phi-4 base performed poorly on our task; the reasoning variant may not overcome the fundamental architecture mismatch with structured JSON extraction
- Recommended: Zero-shot eval only; deprioritize behind Gemma 4
3. Alibaba Qwen3.6-Plus (Released April 2, 2026)
- Status: API-only (closed-source). Open-weight variants "coming later"
- Capabilities: 1M context, multimodal (vision + text), agentic coding, UI-to-code generation
- Why it matters: Represents the next generation of the Qwen VL line we heavily use. When open-weight variants drop (likely Qwen3.6-VL in 2B/8B sizes), they could be direct upgrades to our Qwen3-VL pipeline
- Recommended: Monitor for open-weight release; no action until weights are available
4. Moondream 3 + Segmentation Extension (Preview Sep 2025; Segmentation April 3, 2026)
- HuggingFace: moondream3-preview
- Parameters: 9B total / 2B active (MoE), 32K context
- Why it matters: Moondream2 scored 0.6979 weighted on our eval. Moondream 3 is a significant architecture upgrade (MoE, larger context, grounded reasoning). The new segmentation extension (April 3) suggests active development.
- Caveat: Still in preview; fine-tuning ecosystem less mature than Qwen/Gemma
- Recommended: Re-eval zero-shot when stable release drops
LOW – Tangentially Relevant
5. IBM Granite 4.0 3B Vision (Released March 27, 2026)
- Status: Already evaluated; scored 1.0144 weighted, our current best model
- HuggingFace: granite-4.0-3b-vision
- Note: Specialized for document extraction (charts, tables, KVP). Its strong performance on our garment task is surprising; further investigation into why it outperforms larger models is warranted. Check whether the vLLM deployment issues (granite4-vision-sft-vllm scored 0.4587) can be resolved for production use.
6. Meta Llama 4 Scout (Released earlier 2026)
- Parameters: 17B active / 109B total (16 experts), 10M context
- Why it matters: Impressive multimodal capabilities and huge context, but 109B total params make fine-tuning impractical on our hardware. The 10M context is irrelevant for single-image classification.
- Recommended: Skip unless quantized variants prove viable
7. InternVL3.5 (Released August 2025)
- Status: Not new (6+ months old), but we haven't evaluated the 3.5 series
- Models: 1B, 2B, 4B, 8B, 14B, 38B + MPO variants
- Why it matters: InternVL3-2B scored 0.7271 weighted for us. InternVL3.5-2B reportedly scores significantly higher on public benchmarks. Cascade RL training could be complementary to our GRPO/GTPO pipeline.
- Recommended: Low priority; our Qwen-based models already outperform InternVL significantly
Summary and Recommended Actions
| Priority | Action | Model | Rationale |
|---|---|---|---|
| P0 | Zero-shot eval | Gemma 4 E2B-it, E4B-it | Brand new (April 2), Apache 2.0, competitive sizes, native JSON output |
| P0 | Zero-shot eval | Gemma 4 26B-A4B-it | MoE with only 4B active; could match 8B accuracy at a fraction of the latency |
| P1 | Investigate | Granite 4.0 vLLM deployment | Our best model (1.0144) but vLLM serving is broken (drops to 0.4587) |
| P2 | Zero-shot eval | Phi-4-Reasoning-Vision-15B | Reasoning variant might improve over base Phi-4 (0.6513) |
| P3 | Monitor | Qwen3.6 open-weight release | Next-gen Qwen VL; no action until weights drop |
Report generated automatically by Denali-AI Model Scout
Eval baseline: 3,500-sample hard eval set with weighted composite scoring
Hardware target: NVIDIA RTX PRO 6000 (98GB VRAM)