Daily Model Scout Report — 2026-04-12

by msudharsanan (Denali Advanced Integration org)

Scout scope: all VLM architectures on Hugging Face, created or updated April 5–12, 2026
Baseline: Our best models on the 3,500-sample hard eval set (weighted composite score)

| Model | Weighted score | Note |
| --- | --- | --- |
| qwen3-vl-8b-sft+grpo | 0.9131 | best overall |
| qwen3-vl-2b-sft-grpo-v9 | 0.8948 | best small |
| qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 | best quantized |
| qwen35-2b-base | 0.8437 | best Qwen3.5 base |
| granite4-vision-sft | 1.0144 | highest raw score; needs vLLM validation |
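For readers outside the team, the "weighted composite score" can be sketched roughly as below. The category names and weights here are hypothetical placeholders, not the actual configuration of the 3,500-sample hard eval set:

```python
# Hypothetical sketch of a weighted composite score over per-category
# accuracies. Categories and weights are illustrative only; the real
# hard eval set uses its own weighting.

def weighted_composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-category scores, normalized by total weight."""
    total_weight = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total_weight

scores = {"json_validity": 0.98, "field_accuracy": 0.91, "rare_classes": 0.84}
weights = {"json_validity": 1.0, "field_accuracy": 2.0, "rare_classes": 1.5}

print(round(weighted_composite(scores, weights), 4))
```

Note that an unnormalized variant (or per-category scores above 1.0 from raw generation metrics) can exceed 1.0, which is why the granite4-vision-sft figure above still needs vLLM validation before being trusted.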

HIGH Relevance — Benchmark Immediately

1. Gemma 4 (Google DeepMind) — Released early April 2026

  • Models: gemma-4-E2B (2.3B eff / 5.1B total), gemma-4-E4B (4.5B eff / 8B total), gemma-4-26B-A4B (MoE, 4B active / 26B total), gemma-4-31B (31B dense)
  • Architecture: Dense (E2B/E4B/31B) and MoE (26B-A4B). Native multimodal with learned 2D vision positions, variable aspect ratios, configurable token budgets. Shared KV cache for efficiency. Per-Layer Embeddings (PLE) for richer representations.
  • License: Apache 2.0
  • Why it matters:
    • E4B-it (8B total, 4.5B effective) is directly comparable to our Qwen3-VL-8B slot but with a newer architecture. LoRA fine-tuning needs only ~17GB VRAM, and QLoRA fits on a 16GB card. Full TRL/SFTTrainer support from day one.
    • 26B-A4B (MoE) is the standout: only 4B active params per token but 26B total capacity — could deliver 8B-class accuracy at 2B-class inference cost. MMMU Pro: 73.8%, MATH-Vision: 82.4%.
    • E2B (2.3B eff) could replace our Qwen3.5-0.8B/2B small models, with stronger multimodal coverage that extends to audio and video input.
    • Massive community momentum: 108K+ downloads for E4B-it in first week, Unsloth GGUF ports already available.
  • Recommended action: Fine-tune gemma-4-E4B-it and gemma-4-26B-A4B-it with our ORR SFT pipeline. The MoE variant is especially interesting for production (low active params = fast inference).
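As a starting point for that run, a QLoRA SFT configuration with TRL/PEFT might look like the sketch below. The model id, target-module names, and hyperparameters are assumptions to be checked against the actual model card, not a tested recipe:

```python
# Hypothetical QLoRA SFT setup sketch for gemma-4-E4B-it.
# Assumptions: attention projection module names, LoRA rank, and batch
# sizing. Requires transformers, peft, and trl.
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                  # 4-bit base weights (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)

train_config = SFTConfig(
    output_dir="gemma-4-e4b-orr-sft",   # hypothetical run name
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
)
```

The same config skeleton should carry over to the 26B-A4B MoE variant, with the caveat that MoE expert layers may need different (or no) LoRA targeting.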

2. InternVL3.5-2B / 8B (OpenGVLab) — Released August 2025, but HF-format variants recently added

  • Models: InternVL3_5-2B (2.3B), InternVL3_5-8B (8.5B), plus 1B/4B/30B/38B variants
  • Architecture: ViT-MLP-LLM with InternViT-300M vision encoder + Qwen3 LLM backbone. Cascade Reinforcement Learning (offline RL → online RL). Visual Resolution Router (ViR) for dynamic token efficiency.
  • License: Apache 2.0
  • Why it matters:
    • We tested InternVL3-2B (scored 0.7271) — InternVL3.5 adds Cascade RL and ViR which should improve structured output quality.
    • Same Qwen3 backbone as our best models, so our reward engine and GRPO/GTPO pipeline should transfer well.
    • The 2B variant is a direct comparison target for our qwen3-vl-2b-sft-grpo-v9 (0.8948).
  • Recommended action: Benchmark InternVL3.5-2B base, then SFT if base score exceeds InternVL3-2B's 0.7271.
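The gating rule in that recommendation is simple enough to encode for the scout pipeline. The threshold is the InternVL3-2B score quoted above; the helper name is illustrative:

```python
# Decide whether a new base model earns an SFT run: its base eval score
# must beat the best prior result for the family it would replace.
# Illustrative helper; 0.7271 is the InternVL3-2B score from this report.

INTERNVL3_2B_BASELINE = 0.7271

def promote_to_sft(base_score: float, baseline: float = INTERNVL3_2B_BASELINE) -> bool:
    """True if the new base model strictly beats the prior-generation baseline."""
    return base_score > baseline
```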

MEDIUM Relevance — Worth Watching

3. MiniCPM-V 4.5 (OpenBMB) — Released August 2025

  • Model: MiniCPM-V-4_5 (8.7B)
  • Architecture: Qwen3-8B + SigLIP2-400M, unified 3D-Resampler, fast/deep thinking modes
  • License: Apache 2.0
  • Why it matters: Surpasses GPT-4o-latest on OpenCompass with only 8.7B params. Strong OCR and document understanding. However, it's optimized for conversational understanding rather than structured classification — our JSON extraction task may not benefit from its strengths.
  • Status: Not yet evaluated on our benchmark.

4. Qwen3.5 Native Multimodal (Alibaba) — Released Feb-March 2026

  • Models: 0.8B through 397B-A17B with native early-fusion multimodal training
  • Why it matters: We already have Qwen3.5-2B evaluated (base: 0.8437, ORR-SFT: 0.7964). The 4B and 9B sizes remain untested with our full ORR pipeline. The native multimodal fusion could give better vision understanding than the separate Qwen3-VL encoder approach.
  • Recommended action: Run ORR SFT on Qwen3.5-4B and compare against qwen3-vl-2b-sft-grpo-v9 (0.8948).

5. Moondream 3 Preview — Released September 2025

  • Model: moondream3-preview (9B total, 2B active MoE)
  • Architecture: MoE with SigLIP vision encoder, 32K context, grounded visual reasoning
  • License: Apache 2.0
  • Why it matters: We tested Moondream2 (0.6979 weighted). Moondream3 with MoE (2B active / 9B total) could be a significant jump. Efficient inference profile similar to Gemma-4 26B-A4B concept.
  • Status: Not yet evaluated.

LOW Relevance — Noted

6. GLM-5V-Turbo (Z.ai / Zhipu) — Released April 1, 2026

  • 744B params (MoE, 40B active). Not open source — API only at $1.20/$4.00 per M tokens. Cannot fine-tune. Irrelevant for our pipeline.

7. Holo3-35B-A3B (H Company) — Released March 31, 2026

  • Holo3-35B-A3B — Fine-tuned from Qwen3.5-35B-A3B, 3B active params. Optimized for GUI agents (screen reading, clicking), not image classification. Apache 2.0 but wrong task domain.

8. Phi-4-Reasoning-Vision-15B (Microsoft) — Released March 4, 2026

  • 15B params, SigLIP-2 encoder. We already tested Phi-4-Multimodal variants (best: 0.6513 with SFT). The Phi-4 architecture consistently underperforms Qwen3 on our structured JSON extraction task.

9. Baidu Qianfan-OCR — Trending April 5, 2026

  • Specialized OCR model for Chinese/multilingual document understanding. Not suitable for garment classification.

Summary & Recommended Next Steps

| Priority | Model | Action |
| --- | --- | --- |
| 🔴 P0 | Gemma 4 E4B-it (8B) | SFT + GRPO eval; direct competitor to our Qwen3-VL-8B slot |
| 🔴 P0 | Gemma 4 26B-A4B-it (MoE, 4B active) | SFT eval; could match 8B accuracy at 2B inference cost |
| 🟡 P1 | InternVL3.5-2B | Base eval first, then SFT if promising |
| 🟡 P1 | Gemma 4 E2B-it (2.3B eff) | Base eval; potential Qwen3.5-0.8B/2B replacement |
| 🟢 P2 | Qwen3.5-4B | ORR SFT; untested size point in a proven family |
| 🟢 P2 | MiniCPM-V 4.5 | Base eval on our benchmark |
| 🟢 P2 | Moondream 3 Preview | Base eval; MoE efficiency play |

Key trend: MoE architectures are now available at every scale (Gemma-4 26B-A4B, Moondream3, Holo3). The efficiency gains from low active-param counts could let us run 8B-quality models at 2B inference budgets on the RTX PRO 6000.
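The efficiency claim can be made concrete with back-of-the-envelope arithmetic. Decode-time compute scales roughly with active parameter count (the usual rough estimate is ~2 FLOPs per parameter per token), so the comparison below is indicative, not a benchmark:

```python
# Rough per-token decode compute for the MoE models named above, using
# active parameter count as a FLOPs proxy (FLOPs ~= 2 * active params).
# Parameter figures are taken from this report.

models = {
    "gemma-4-26B-A4B":    {"total_b": 26.0, "active_b": 4.0},
    "moondream3-preview": {"total_b": 9.0,  "active_b": 2.0},
    "holo3-35B-A3B":      {"total_b": 35.0, "active_b": 3.0},
    "qwen3-vl-8b (dense)": {"total_b": 8.0, "active_b": 8.0},
}

for name, p in models.items():
    gflops_per_token = 2 * p["active_b"]            # GFLOPs proxy per token
    capacity_ratio = p["total_b"] / p["active_b"]   # total capacity vs. compute
    print(f"{name}: ~{gflops_per_token:.0f} GFLOPs/token, "
          f"{capacity_ratio:.1f}x capacity over active params")
```

By this proxy, gemma-4-26B-A4B decodes at roughly half the per-token cost of our dense 8B models while carrying over three times their total capacity, which is the core of the P0 case above.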


Report generated by Model Scout — Denali-AI
