Daily VLM Scout Report – 2026-04-13
Scope: All new/updated Vision-Language Models on HuggingFace relevant to garment attribute classification (past 7 days)
Current Baseline (3,500-sample hard eval, weighted composite)
| Model | Weighted Score | Notes |
|---|---|---|
| qwen3-vl-8b-sft+grpo | 0.9131 | Best SFT+GRPO model |
| granite4-vision-sft | 1.0144 | Best raw score (SFT only, needs GRPO) |
| qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 | Best quantized |
| qwen3-vl-2b-sft-grpo-v9 | 0.8948 | Best small model |
| qwen35-2b-base | 0.8437 | Best Qwen3.5 base |
Note: Granite4-Vision-SFT achieved a 1.0144 weighted score with SFT alone – higher than our Qwen3-VL-8B SFT+GRPO. Adding GRPO/GTPO to Granite4 could push it further still.
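For reference, the composite metric is assumed below to be a weight-normalized mean of per-attribute scores (a composite above 1.0, like Granite's 1.0144, implies at least one per-attribute metric can exceed 1.0 under that form). Attribute names and weights here are illustrative, not the real eval configuration:

```python
def weighted_composite(scores, weights):
    """Weight-normalized mean of per-attribute scores.

    scores:  dict attribute -> metric value
    weights: dict attribute -> relative weight (same keys)
    """
    total = sum(weights.values())
    return sum(scores[a] * weights[a] for a in weights) / total

# Illustrative values only -- not the actual eval suite.
scores = {"neckline": 0.91, "sleeve_length": 0.95, "pattern": 0.88}
weights = {"neckline": 2.0, "sleeve_length": 1.0, "pattern": 1.0}
composite = weighted_composite(scores, weights)
```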
HIGH Priority – Benchmark Immediately
1. Google Gemma 4 (E4B / 26B-A4B)
- Released: April 2, 2026
- HuggingFace: google/gemma-4-E4B-it | google/gemma-4-26B-A4B-it
- Architecture: Gemma 4 with Per-Layer Embeddings (PLE) and MoE routing
- Parameters: E4B = 4.5B effective (8B w/ embeddings); 26B-A4B = 4B active / 26B total MoE
- Vision: Learned 2D positions, variable aspect ratio, configurable token budgets (70-1120 tokens)
- License: Apache 2.0
- Benchmarks: MMMU Pro: E4B 52.6%, 26B-A4B 73.8% (vs Gemma 3 27B: 49.7%)
- Fine-tuning: TRL, Unsloth, PEFT/QLoRA all supported (note: MoE variant has QLoRA limitations with bitsandbytes)
- Why relevant: Brand new architecture from Google. The 26B-A4B activates only 4B params per token – faster than dense 8B models with potentially higher accuracy. Both are Apache 2.0.
- Recommended action: Benchmark E4B-it and 26B-A4B-it base, then SFT the best performer.
- Relevance: HIGH
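Whichever Gemma 4 variant wins, the benchmarking loop itself is model-agnostic. A minimal harness sketch – the `predict` callable is a stand-in for the actual Gemma 4 generate call, and the attribute names and label sets are illustrative:

```python
from collections import Counter

def run_attribute_eval(samples, predict, valid_labels):
    """Per-attribute accuracy for a candidate VLM.

    samples: iterable of (image, attribute, gold_label) tuples
    predict: callable (image, attribute) -> raw model output string
    valid_labels: accepted label set; anything outside it counts as wrong
    """
    hits, totals = Counter(), Counter()
    for image, attribute, gold in samples:
        pred = predict(image, attribute).strip().lower()
        totals[attribute] += 1
        if pred in valid_labels and pred == gold.strip().lower():
            hits[attribute] += 1
    return {a: hits[a] / totals[a] for a in totals}

# Stubbed predictor standing in for the real model call.
def fake_predict(image, attribute):
    return "crew neck" if attribute == "neckline" else "long"

samples = [
    (None, "neckline", "crew neck"),
    (None, "neckline", "v-neck"),
    (None, "sleeve_length", "long"),
]
per_attr = run_attribute_eval(
    samples, fake_predict, {"crew neck", "v-neck", "long", "short"}
)
```

The per-attribute dict then feeds straight into the weighted composite used for the baseline table.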
2. OpenGVLab InternVL3.5 (2B / 8B)
- HuggingFace: OpenGVLab/InternVL3_5-2B | OpenGVLab/InternVL3_5-8B
- Architecture: ViT-MLP-LLM with InternViT-300M/6B + Qwen3 backbone
- Parameters: 1B, 2B, 4B, 8B, 14B variants
- Benchmarks: InternVL3.5-8B MMMU: 73.4 (vs InternVL3-8B: 44.3 – a +29 point jump); InternVL3.5-2B avg: 76.5 across 9 benchmarks
- Training: Cascade RL (MPO offline then GSPO online), LoRA fine-tuning supported
- License: Apache 2.0 (check model card)
- Why relevant: We tested InternVL3-2B (scored 0.64-0.68 weighted). The 3.5 generation shows +16% reasoning gains and 4x inference speedup. Qwen3 backbone aligns with our existing expertise.
- Recommended action: Benchmark InternVL3.5-2B and 8B base, then SFT+GRPO.
- Relevance: HIGH
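Since LoRA fine-tuning is supported, a starting adapter config is cheap to try. The hyperparameters and target module names below are assumptions, not tuned values – the actual attention projection names should be checked against the checkpoint:

```python
# Hedged sketch: a starting LoRA configuration for InternVL3.5 SFT.
from peft import LoraConfig

internvl_lora = LoraConfig(
    r=16,                     # adapter rank
    lora_alpha=32,            # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

This would be applied with `peft.get_peft_model(model, internvl_lora)` on the loaded backbone before SFT.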
3. Granite 4.0 3B Vision β GRPO/GTPO Expansion
- Released: March 27, 2026
- HuggingFace: ibm-granite/granite-4.0-3b-vision
- Architecture: LoRA adapter on Granite 4.0 Micro, modular vision/language
- Parameters: 3B | License: Apache 2.0
- Why relevant: Our SFT-only model scored 1.0144 weighted – the highest in our eval suite, surpassing qwen3-vl-8b-sft+grpo (0.9131). Adding GRPO and GTPO stages could push this significantly further.
- Recommended action: Priority GRPO training run on granite4-vision-sft.
- Relevance: HIGH
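For the GRPO stage, the simplest reward is exact match on the gold attribute label. A minimal sketch in the shape TRL-style reward functions take – the `gold_label` column name is an assumption about our dataset schema, and the real GTPO reward shaping may differ:

```python
def attribute_match_reward(completions, gold_label, **kwargs):
    """GRPO reward: 1.0 for an exact (case-insensitive) attribute match, else 0.0.

    completions: sampled model outputs for one batch
    gold_label:  matching gold labels, assumed passed through as a dataset column
    """
    return [
        1.0 if c.strip().lower() == g.strip().lower() else 0.0
        for c, g in zip(completions, gold_label)
    ]
```

In TRL this would be handed to `GRPOTrainer(reward_funcs=attribute_match_reward, ...)` together with the granite4-vision-sft checkpoint.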
MEDIUM Priority – Worth Watching
4. Google Gemma 4 31B (Dense)
- HuggingFace: google/gemma-4-31B-it
- 31B dense, 256K context, MMMU Pro: 76.9%, Apache 2.0
- Fits on an RTX PRO 6000 but inference is slow. Good as a teacher model for distillation.
- Relevance: MEDIUM
5. Moondream 3 Preview
- HuggingFace: moondream/moondream3-preview
- 9B MoE / 2B active, SigLIP encoder, 32K context
- Native detection, pointing, counting, OCR, reasoning mode
- Our Moondream2 scored 0.6385. MoE upgrade + reasoning may help on hard samples.
- Relevance: MEDIUM
6. Google Gemma 4 E2B
- HuggingFace: google/gemma-4-E2B-it
- 2.3B effective, audio+vision+text, Apache 2.0
- MMMU Pro: 44.2% – weaker vision but novel audio capability.
- Relevance: MEDIUM
7. MiniCPM-V 4.5
- HuggingFace: openbmb/MiniCPM-V-4_5
- SigLIP + Qwen2.5-7B, ~8B params, surpasses GPT-4o on OpenCompass (77.0 avg)
- Switchable fast/deep thinking modes. License needs verification.
- Relevance: MEDIUM
LOW Priority – Tangentially Relevant
| Model | Why Low |
|---|---|
| Mistral Small 4 (119B MoE, 6B active) | Too large for classification pipeline |
| PaddleOCR-VL 1.5 (0.9B) | Document/OCR specialist, not garment tasks |
| MinerU2.5 (1.2B) | Document parsing only |
| LLaVA-OneVision-1.5 (4B/8B) | Not new (Sep 2025); we've moved to Qwen3+ |
| SmolDocling (256M) | Compact document model, not classification |
Key Takeaways
Granite4-Vision GRPO is the top action item. SFT-only Granite4 already beats our best Qwen3-VL-8B SFT+GRPO. GRPO/GTPO could yield a significant new SOTA.
Gemma 4 is the most significant new release. E4B and 26B-A4B offer excellent size/performance tradeoffs with Apache 2.0 and strong fine-tuning support. The 26B-A4B (4B active MoE) is especially promising.
InternVL3.5 deserves a second look. Our InternVL3 results were underwhelming (0.63-0.68), but 3.5 shows massive improvements (+29 MMMU points at 8B scale). Qwen3 backbone makes fine-tuning familiar.
No new Qwen VL releases this week. Qwen3.5-VL hasn't dropped separate VL checkpoints – VL capability is embedded in native Qwen3.5 via early fusion. We're already testing these.
Generated by Claude Code – Denali-AI Model Scout
Next scan: 2026-04-14