Daily Model Scout Report — 2026-04-12
Scout scope: All VLM architectures on HuggingFace, created or updated April 5–12, 2026
Baseline: Our best models on the 3,500-sample hard eval set (weighted composite score)
| Model | Weighted score | Notes |
|---|---|---|
| qwen3-vl-8b-sft+grpo | 0.9131 | best overall |
| qwen3-vl-2b-sft-grpo-v9 | 0.8948 | best small |
| qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 | best quantized |
| qwen35-2b-base | 0.8437 | best Qwen3.5 base |
| granite4-vision-sft | 1.0144 | highest raw score; needs vLLM validation |
HIGH Relevance — Benchmark Immediately
1. Gemma 4 (Google DeepMind) — Released early April 2026
- Models: gemma-4-E2B (2.3B eff / 5.1B total), gemma-4-E4B (4.5B eff / 8B total), gemma-4-26B-A4B (MoE, 4B active / 26B total), gemma-4-31B (31B dense)
- Architecture: Dense (E2B/E4B/31B) and MoE (26B-A4B). Native multimodal with learned 2D vision positions, variable aspect ratios, configurable token budgets. Shared KV cache for efficiency. Per-Layer Embeddings (PLE) for richer representations.
- License: Apache 2.0
- Why it matters:
- E4B-it (8B total, 4.5B effective) is directly comparable to our Qwen3-VL-8B slot but with a newer architecture. LoRA fine-tuning requires only ~17GB VRAM (QLoRA on 16GB). Full TRL/SFTTrainer support from day one.
- 26B-A4B (MoE) is the standout: only 4B active params per token but 26B total capacity — could deliver 8B-class accuracy at 2B-class inference cost. MMMU Pro: 73.8%, MATH-Vision: 82.4%.
- E2B (2.3B eff) could replace our Qwen3.5-0.8B/2B small models with better vision capabilities including audio/video.
- Massive community momentum: 108K+ downloads for E4B-it in first week, Unsloth GGUF ports already available.
- Recommended action: Fine-tune gemma-4-E4B-it and gemma-4-26B-A4B-it with our ORR SFT pipeline (a minimal QLoRA sketch follows this item). The MoE variant is especially interesting for production (low active params = fast inference).
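Since the report calls out day-one TRL/SFTTrainer support and the ~17GB QLoRA footprint, here is a minimal sketch of what that fine-tune could look like. This is an illustration, not our pipeline: the repo id `google/gemma-4-E4B-it` and the dataset path `orr/garment-sft` are assumed placeholders, and dataset formatting/collation details are elided.

```python
# Minimal QLoRA SFT sketch for gemma-4-E4B-it with TRL.
# NOTE: "google/gemma-4-E4B-it" and "orr/garment-sft" are assumed
# placeholder names, not confirmed identifiers.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "google/gemma-4-E4B-it"  # assumed repo id

# 4-bit NF4 quantization keeps the weights small enough for QLoRA on a 16GB card.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# LoRA adapters on all linear layers; ranks/alphas are starting points, not tuned.
peft_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules="all-linear", task_type="CAUSAL_LM",
)

train_ds = load_dataset("orr/garment-sft", split="train")  # placeholder dataset

args = SFTConfig(
    output_dir="gemma4-e4b-orr-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    peft_config=peft_cfg,
    processing_class=processor,  # VLM SFT needs the full processor, not just a tokenizer
)
trainer.train()
```

With 4-bit weights plus LoRA adapters and gradient checkpointing, this setup should land in the ~17GB VRAM range the report cites for the 8B-total model.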
2. InternVL3.5-2B / 8B (OpenGVLab) — Released August 2025, but HF-format variants recently added
- Models: InternVL3_5-2B (2.3B), InternVL3_5-8B (8.5B), plus 1B/4B/30B/38B variants
- Architecture: ViT-MLP-LLM with InternViT-300M vision encoder + Qwen3 LLM backbone. Cascade Reinforcement Learning (offline RL → online RL). Visual Resolution Router (ViR) for dynamic token efficiency.
- License: Apache 2.0
- Why it matters:
- We tested InternVL3-2B (scored 0.7271) — InternVL3.5 adds Cascade RL and ViR which should improve structured output quality.
- Same Qwen3 backbone as our best models, so our reward engine and GRPO/GTPO pipeline should transfer well.
- The 2B variant is a direct comparison target for our qwen3-vl-2b-sft-grpo-v9 (0.8948).
- Recommended action: Benchmark InternVL3.5-2B base, then SFT only if the base score exceeds InternVL3-2B's 0.7271 (gating sketch below).
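A sketch of that gate, under stated assumptions: the repo id follows the `InternVL3_5-2B` naming above but is unconfirmed, and `load_hard_eval_set()` / `score_weighted()` stand in for our internal eval-set loader and weighted composite scorer.

```python
# Base-eval gate for InternVL3.5-2B: run the hard set through vLLM,
# compute the weighted composite score, and only queue SFT if the base
# model beats InternVL3-2B's 0.7271.
from vllm import LLM, SamplingParams

INTERNVL3_2B_BASELINE = 0.7271  # from our previous InternVL3-2B run

llm = LLM(model="OpenGVLab/InternVL3_5-2B", trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=512)  # greedy for reproducible scores

samples = load_hard_eval_set()  # hypothetical: 3,500 (image, prompt, label) rows
requests = [
    {"prompt": s.prompt, "multi_modal_data": {"image": s.image}}
    for s in samples
]
outputs = llm.generate(requests, params)

score = score_weighted(  # hypothetical: our weighted composite scorer
    predictions=[o.outputs[0].text for o in outputs],
    references=[s.label for s in samples],
)
print(f"InternVL3.5-2B base weighted score: {score:.4f}")
if score > INTERNVL3_2B_BASELINE:
    print("Base beats InternVL3-2B; queue ORR SFT run.")
```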
MEDIUM Relevance — Worth Watching
3. MiniCPM-V 4.5 (OpenBMB) — Released August 2025
- Model: MiniCPM-V-4_5 (8.7B)
- Architecture: Qwen3-8B + SigLIP2-400M, unified 3D-Resampler, fast/deep thinking modes
- License: Apache 2.0
- Why it matters: Surpasses GPT-4o-latest on OpenCompass with only 8.7B params. Strong OCR and document understanding. However, it's optimized for conversational understanding rather than structured classification — our JSON extraction task may not benefit from its strengths.
- Status: Not yet evaluated on our benchmark.
4. Qwen3.5 Native Multimodal (Alibaba) — Released Feb-March 2026
- Models: 0.8B through 397B-A17B with native early-fusion multimodal training
- Why it matters: We already have Qwen3.5-2B evaluated (base: 0.8437, ORR-SFT: 0.7964). The 4B and 9B sizes remain untested with our full ORR pipeline. The native multimodal fusion could give better vision understanding than the separate Qwen3-VL encoder approach.
- Recommended action: Run ORR SFT on Qwen3.5-4B and compare against qwen3-vl-2b-sft-grpo-v9 (0.8948).
5. Moondream 3 Preview — Released September 2025
- Model: moondream3-preview (9B total, 2B active MoE)
- Architecture: MoE with SigLIP vision encoder, 32K context, grounded visual reasoning
- License: Apache 2.0
- Why it matters: We tested Moondream2 (0.6979 weighted). Moondream3 with MoE (2B active / 9B total) could be a significant jump. Efficient inference profile similar to Gemma-4 26B-A4B concept.
- Status: Not yet evaluated.
LOW Relevance — Noted
6. GLM-5V-Turbo (Z.ai / Zhipu) — Released April 1, 2026
- 744B params (MoE, 40B active). Not open source — API only at $1.20/$4.00 per M tokens. Cannot fine-tune. Irrelevant for our pipeline.
7. Holo3-35B-A3B (H Company) — Released March 31, 2026
- Fine-tuned from Qwen3.5-35B-A3B, 3B active params. Optimized for GUI agents (screen reading, clicking), not image classification. Apache 2.0, but wrong task domain.
8. Phi-4-Reasoning-Vision-15B (Microsoft) — Released March 4, 2026
- 15B params, SigLIP-2 encoder. We already tested Phi-4-Multimodal variants (best: 0.6513 with SFT). The Phi-4 architecture consistently underperforms Qwen3 on our structured JSON extraction task.
9. Baidu Qianfan-OCR — Trending April 5, 2026
- Specialized OCR model for Chinese/multilingual document understanding. Not suitable for garment classification.
Summary & Recommended Next Steps
| Priority | Model | Action |
|---|---|---|
| 🔴 P0 | Gemma 4 E4B-it (8B) | SFT + GRPO eval — direct competitor to our Qwen3-VL-8B slot |
| 🔴 P0 | Gemma 4 26B-A4B-it (MoE, 4B active) | SFT eval — could match 8B accuracy at 2B inference cost |
| 🟡 P1 | InternVL3.5-2B | Base eval first, then SFT if promising |
| 🟡 P1 | Gemma 4 E2B-it (2.3B eff) | Base eval — potential Qwen3.5-0.8B/2B replacement |
| 🟢 P2 | Qwen3.5-4B | ORR SFT — untested size point in a proven family |
| 🟢 P2 | MiniCPM-V 4.5 | Base eval on our benchmark |
| 🟢 P2 | Moondream 3 Preview | Base eval — MoE efficiency play |
Key trend: MoE architectures are now available at every scale (Gemma-4 26B-A4B, Moondream3, Holo3). The efficiency gains from low active-param counts could let us run 8B-quality models at 2B inference budgets on the RTX PRO 6000.
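A back-of-envelope check of that trend: weight VRAM scales with total params while per-token compute scales with active params, which is exactly why a 26B-A4B MoE can promise 8B-class quality at 2B-class latency. Numbers below are rough weight-only estimates, and the 96GB capacity assumed for the RTX PRO 6000 is my assumption, not from the report.

```python
# Rough MoE sizing: weight memory tracks *total* params, per-token FLOPs
# track *active* params. Ignores KV cache and activations.
def weight_vram_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Weight memory only, in GB (1B params * bytes/param)."""
    return total_params_b * bytes_per_param

GPU_VRAM_GB = 96  # assumed RTX PRO 6000 capacity

for name, total_b, active_b in [
    ("gemma-4-26B-A4B", 26, 4),
    ("moondream3-preview", 9, 2),
    ("qwen3-vl-8b (dense)", 8, 8),
]:
    bf16 = weight_vram_gb(total_b, 2.0)  # 2 bytes/param
    fp4 = weight_vram_gb(total_b, 0.5)   # ~0.5 bytes/param (e.g. NVFP4)
    fits = "fits" if bf16 <= GPU_VRAM_GB else "needs quantization"
    print(f"{name}: ~{bf16:.0f} GB bf16 / ~{fp4:.1f} GB 4-bit weights, "
          f"~{active_b}B params touched per token ({fits} in {GPU_VRAM_GB} GB)")
```

Even the 26B MoE fits comfortably in bf16 on this assumed budget, so the binding constraint is throughput, where the 4B active-param count is what matters.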
Report generated by Model Scout — Denali-AI
Baselines: 3,500-sample hard eval set, weighted composite scoring