Daily VLM Scout Report - 2026-04-13

Scope: All new/updated Vision-Language Models on HuggingFace relevant to garment attribute classification (past 7 days)

Current Baseline (3,500-sample hard eval, weighted composite)

| Model | Weighted Score | Notes |
| --- | --- | --- |
| qwen3-vl-8b-sft+grpo | 0.9131 | Best overall |
| granite4-vision-sft | 1.0144 | Best raw score (SFT only, needs GRPO) |
| qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 | Best quantized |
| qwen3-vl-2b-sft-grpo-v9 | 0.8948 | Best small model |
| qwen35-2b-base | 0.8437 | Best Qwen3.5 base |

Note: Granite4-Vision-SFT achieved a 1.0144 weighted score with SFT alone, higher than our Qwen3-VL-8B SFT+GRPO. Adding GRPO/GTPO to Granite4 could push it even further.
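
For context on how these numbers are read, here is a minimal sketch of a weighted composite over per-attribute accuracies. The attribute names and weights are illustrative placeholders, not our actual eval config, and the real metric is evidently not normalized to [0, 1] (Granite4 scores above 1.0), so treat this as a shape-of-the-calculation sketch only.

```python
from typing import Dict

def weighted_composite(per_attribute_accuracy: Dict[str, float],
                       weights: Dict[str, float]) -> float:
    """Weighted mean of per-attribute accuracies (placeholder formula)."""
    total = sum(weights.values())
    return sum(per_attribute_accuracy[attr] * w
               for attr, w in weights.items()) / total

# Illustrative attributes and weights only -- not the production eval config.
acc = {"sleeve_length": 0.91, "neckline": 0.88, "pattern": 0.84, "closure": 0.86}
w = {"sleeve_length": 1.0, "neckline": 1.0, "pattern": 1.5, "closure": 2.0}
print(f"weighted composite: {weighted_composite(acc, w):.4f}")
```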


HIGH Priority - Benchmark Immediately

1. Google Gemma 4 (E4B / 26B-A4B)

  • Released: April 2, 2026
  • HuggingFace: google/gemma-4-E4B-it | google/gemma-4-26B-A4B-it
  • Architecture: Gemma 4 with Per-Layer Embeddings (PLE) and MoE routing
  • Parameters: E4B = 4.5B effective (8B w/ embeddings); 26B-A4B = 4B active / 26B total MoE
  • Vision: Learned 2D positions, variable aspect ratio, configurable token budgets (70-1120 tokens)
  • License: Apache 2.0
  • Benchmarks: MMMU Pro: E4B 52.6%, 26B-A4B 73.8% (vs Gemma 3 27B: 49.7%)
  • Fine-tuning: TRL, Unsloth, PEFT/QLoRA all supported (note: MoE variant has QLoRA limitations with bitsandbytes)
  • Why relevant: Brand-new architecture from Google. The 26B-A4B activates only 4B params per token, so it should be faster than dense 8B models with potentially higher accuracy. Both are Apache 2.0.
  • Recommended action: Benchmark E4B-it and 26B-A4B-it base, then SFT the best performer (a QLoRA SFT setup sketch follows this list).
  • Relevance: HIGH
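
Minimal QLoRA SFT setup sketch for E4B via TRL/PEFT/bitsandbytes. The AutoModelForImageTextToText class, LoRA target modules, and dataset path are assumptions (not verified against the Gemma 4 model card), and the vision data collator that batches images through the processor is omitted.

```python
# Hedged sketch: QLoRA SFT for gemma-4-E4B-it with TRL + PEFT.
# Model class, target modules, and dataset path are assumptions; the VLM data
# collator (image + prompt batching through the processor) is omitted.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "google/gemma-4-E4B-it"

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForImageTextToText.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

peft_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                      task_type="CAUSAL_LM")

# Placeholder path to our garment-attribute SFT set (image path + prompt + label).
train_ds = load_dataset("json", data_files="garment_sft_train.jsonl", split="train")

args = SFTConfig(output_dir="gemma4-e4b-garment-sft",
                 per_device_train_batch_size=2,
                 gradient_accumulation_steps=8,
                 num_train_epochs=1,
                 bf16=True)

trainer = SFTTrainer(model=model, args=args, train_dataset=train_ds,
                     peft_config=peft_cfg, processing_class=processor)
trainer.train()
```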

2. OpenGVLab InternVL3.5 (2B / 8B)

  • HuggingFace: OpenGVLab/InternVL3_5-2B | OpenGVLab/InternVL3_5-8B
  • Architecture: ViT-MLP-LLM with InternViT-300M/6B + Qwen3 backbone
  • Parameters: 1B, 2B, 4B, 8B, 14B variants
  • Benchmarks: InternVL3.5-8B MMMU: 73.4 (vs InternVL3-8B: 44.3, a +29-point jump); InternVL3.5-2B avg: 76.5 across 9 benchmarks
  • Training: Cascade RL (MPO offline then GSPO online), LoRA fine-tuning supported
  • License: Apache 2.0 (check model card)
  • Why relevant: We tested InternVL3-2B (scored 0.64-0.68 weighted). The 3.5 generation shows +16% reasoning gains and a 4x inference speedup. The Qwen3 backbone aligns with our existing expertise.
  • Recommended action: Benchmark InternVL3.5-2B and 8B base, then SFT+GRPO (a zero-shot query sketch follows this list).
  • Relevance: HIGH
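
Quick zero-shot query sketch for the benchmark pass, following the InternVL3 model-card chat() convention with trust_remote_code; we're assuming 3.5 keeps the same interface, and the single-tile 448x448 preprocessing below is a simplification of the card's dynamic-tiling helper.

```python
# Hedged sketch: zero-shot garment-attribute query against InternVL3.5-2B.
# Assumes the InternVL3-style chat() interface carries over to 3.5.
import torch
from PIL import Image
from torchvision import transforms as T
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "OpenGVLab/InternVL3_5-2B"
IMAGENET_MEAN, IMAGENET_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)

model = AutoModel.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Single 448x448 tile; the model card's dynamic tiling would split larger images.
transform = T.Compose([T.Resize((448, 448)), T.ToTensor(),
                       T.Normalize(IMAGENET_MEAN, IMAGENET_STD)])
pixel_values = transform(Image.open("garment.jpg").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16).cuda()

question = ("<image>\nWhat is the sleeve length of this garment? "
            "Answer with one of: sleeveless, short, three-quarter, long.")
response = model.chat(tokenizer, pixel_values, question,
                      generation_config=dict(max_new_tokens=32, do_sample=False))
print(response)
```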

3. Granite 4.0 3B Vision - GRPO/GTPO Expansion

  • Released: March 27, 2026
  • HuggingFace: ibm-granite/granite-4.0-3b-vision
  • Architecture: LoRA adapter on Granite 4.0 Micro, modular vision/language
  • Parameters: 3B | License: Apache 2.0
  • Why relevant: Our SFT-only model scored 1.0144 weighted, the highest in our eval suite, surpassing qwen3-vl-8b-sft+grpo (0.9131). Adding GRPO and GTPO stages could push this significantly further.
  • Recommended action: Priority GRPO training run on granite4-vision-sft (a GRPOTrainer setup sketch follows this list).
  • Relevance: HIGH
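
GRPOTrainer setup sketch via TRL. The SFT checkpoint path, dataset path, and the exact-match reward are placeholders; GRPO on a vision model also needs a recent TRL release with VLM support, so verify versions before launching the run.

```python
# Hedged sketch: GRPO stage on top of granite4-vision-sft with TRL's GRPOTrainer.
# Checkpoint path, dataset path, and reward are placeholders; assumes plain-string
# prompts/completions rather than TRL's conversational format.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def attribute_match_reward(completions, label, **kwargs):
    """+1 when the completion contains the gold attribute value, else 0."""
    return [1.0 if gold.lower() in text.lower() else 0.0
            for text, gold in zip(completions, label)]

# Expects columns: "prompt" (question about the garment) and "label" (gold value).
train_ds = load_dataset("json", data_files="garment_grpo_prompts.jsonl", split="train")

args = GRPOConfig(output_dir="granite4-vision-sft-grpo",
                  per_device_train_batch_size=8,
                  num_generations=8,   # completions per prompt for group-relative advantage
                  max_completion_length=64,
                  bf16=True)

trainer = GRPOTrainer(model="./checkpoints/granite4-vision-sft",  # placeholder path
                      reward_funcs=attribute_match_reward,
                      args=args,
                      train_dataset=train_ds)
trainer.train()
```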

MEDIUM Priority - Worth Watching

4. Google Gemma 4 31B (Dense)

  • HuggingFace: google/gemma-4-31B-it
  • 31B dense, 256K context, MMMU Pro: 76.9%, Apache 2.0
  • Fits on an RTX PRO 6000 but inference is slow. Good as a teacher model for distillation (a KD-loss sketch follows this list).
  • Relevance: MEDIUM
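
If we go the teacher route, a generic temperature-scaled distillation loss is sketched below (soft-label KL only, not a full training loop); nothing here is specific to Gemma 4, and the toy tensor shapes are placeholders.

```python
# Hedged sketch: token-level soft-label distillation loss (teacher -> student).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Temperature-softened KL(teacher || student), averaged per token."""
    s = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, 1)
    t = F.softmax(teacher_logits / temperature, dim=-1).flatten(0, 1)
    # T^2 keeps gradient magnitudes comparable to a hard-label CE term.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Toy shapes (batch, seq_len, vocab) -- placeholders, not real model outputs.
student = torch.randn(2, 8, 32000, requires_grad=True)
teacher = torch.randn(2, 8, 32000)
print(distillation_loss(student, teacher))
```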

5. Moondream 3 Preview

  • HuggingFace: moondream/moondream3-preview
  • 9B MoE / 2B active, SigLIP encoder, 32K context
  • Native detection, pointing, counting, OCR, reasoning mode
  • Our Moondream2 scored 0.6385. MoE upgrade + reasoning may help on hard samples.
  • Relevance: MEDIUM

6. Google Gemma 4 E2B

  • HuggingFace: google/gemma-4-E2B-it
  • 2.3B effective, audio+vision+text, Apache 2.0
  • MMMU Pro: 44.2%; weaker vision but novel audio capability.
  • Relevance: MEDIUM

7. MiniCPM-V 4.5

  • HuggingFace: openbmb/MiniCPM-V-4_5
  • SigLIP + Qwen2.5-7B, ~8B params, surpasses GPT-4o on OpenCompass (77.0 avg)
  • Switchable fast/deep thinking modes. License needs verification.
  • Relevance: MEDIUM

LOW Priority - Tangentially Relevant

| Model | Why Low |
| --- | --- |
| Mistral Small 4 (119B MoE, 6B active) | Too large for the classification pipeline |
| PaddleOCR-VL 1.5 (0.9B) | Document/OCR specialist, not garment tasks |
| MinerU2.5 (1.2B) | Document parsing only |
| LLaVA-OneVision-1.5 (4B/8B) | Not new (Sep 2025); we've moved to Qwen3+ |
| SmolDocling (256M) | Compact document model, not classification |

Key Takeaways

  1. Granite4-Vision GRPO is the top action item. SFT-only Granite4 already beats our best Qwen3-VL-8B SFT+GRPO. GRPO/GTPO could yield a significant new SOTA.

  2. Gemma 4 is the most significant new release. E4B and 26B-A4B offer excellent size/performance tradeoffs with Apache 2.0 and strong fine-tuning support. The 26B-A4B (4B active MoE) is especially promising.

  3. InternVL3.5 deserves a second look. Our InternVL3 results were underwhelming (0.63-0.68), but 3.5 shows massive improvements (+29 MMMU points at 8B scale). Qwen3 backbone makes fine-tuning familiar.

  4. No new Qwen VL releases this week. Qwen3.5-VL hasn't dropped separate VL checkpoints; VL capability is embedded in native Qwen3.5 via early fusion. We're already testing these.


Generated by Claude Code - Denali-AI Model Scout
Next scan: 2026-04-14
