Daily VLM Scout Report - 2026-04-13

Scope: All new/updated Vision-Language Models on HuggingFace relevant to garment attribute classification (past 7 days)

Current Baseline (3,500-sample hard eval, weighted composite)

| Model | Weighted Score | Notes |
| --- | --- | --- |
| qwen3-vl-8b-sft+grpo | 0.9131 | Best overall |
| granite4-vision-sft | 1.0144 | Best raw score (SFT only, needs GRPO) |
| qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 | Best quantized |
| qwen3-vl-2b-sft-grpo-v9 | 0.8948 | Best small model |
| qwen35-2b-base | 0.8437 | Best Qwen3.5 base |

Note: Granite4-Vision-SFT achieved a 1.0144 weighted score with SFT alone, higher than our Qwen3-VL-8B SFT+GRPO. Adding GRPO/GTPO to Granite4 could push it even further.
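
For context on how these numbers are read, here is a minimal sketch of a weighted composite over per-attribute accuracies. The attribute names and weights are illustrative placeholders, not our actual eval config, and the real metric is evidently not normalized to [0, 1] (Granite4 scores above 1.0), so treat this as a shape-of-the-calculation sketch only.

```python
from typing import Dict

def weighted_composite(per_attribute_accuracy: Dict[str, float],
                       weights: Dict[str, float]) -> float:
    """Weighted mean of per-attribute accuracies (placeholder formula)."""
    total = sum(weights.values())
    return sum(per_attribute_accuracy[attr] * w
               for attr, w in weights.items()) / total

# Illustrative attributes and weights only -- not the production eval config.
acc = {"sleeve_length": 0.91, "neckline": 0.88, "pattern": 0.84, "closure": 0.86}
w = {"sleeve_length": 1.0, "neckline": 1.0, "pattern": 1.5, "closure": 2.0}
print(f"weighted composite: {weighted_composite(acc, w):.4f}")
```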


HIGH Priority - Benchmark Immediately

1. Google Gemma 4 (E4B / 26B-A4B)

  • Released: April 2, 2026
  • HuggingFace: google/gemma-4-E4B-it | google/gemma-4-26B-A4B-it
  • Architecture: Gemma 4 with Per-Layer Embeddings (PLE) and MoE routing
  • Parameters: E4B = 4.5B effective (8B w/ embeddings); 26B-A4B = 4B active / 26B total MoE
  • Vision: Learned 2D positions, variable aspect ratio, configurable token budgets (70-1120 tokens)
  • License: Apache 2.0
  • Benchmarks: MMMU Pro: E4B 52.6%, 26B-A4B 73.8% (vs Gemma 3 27B: 49.7%)
  • Fine-tuning: TRL, Unsloth, PEFT/QLoRA all supported (note: MoE variant has QLoRA limitations with bitsandbytes)
  • Why relevant: Brand-new architecture from Google. The 26B-A4B activates only 4B params per token, so it should be faster than dense 8B models with potentially higher accuracy. Both are Apache 2.0.
  • Recommended action: Benchmark E4B-it and 26B-A4B-it base, then SFT the best performer (a QLoRA SFT setup sketch follows this list).
  • Relevance: HIGH
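
Minimal QLoRA SFT setup sketch for E4B via TRL/PEFT/bitsandbytes. The AutoModelForImageTextToText class, LoRA target modules, and dataset path are assumptions (not verified against the Gemma 4 model card), and the vision data collator that batches images through the processor is omitted.

```python
# Hedged sketch: QLoRA SFT for gemma-4-E4B-it with TRL + PEFT.
# Model class, target modules, and dataset path are assumptions; the VLM data
# collator (image + prompt batching through the processor) is omitted.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "google/gemma-4-E4B-it"

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForImageTextToText.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

peft_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                      task_type="CAUSAL_LM")

# Placeholder path to our garment-attribute SFT set (image path + prompt + label).
train_ds = load_dataset("json", data_files="garment_sft_train.jsonl", split="train")

args = SFTConfig(output_dir="gemma4-e4b-garment-sft",
                 per_device_train_batch_size=2,
                 gradient_accumulation_steps=8,
                 num_train_epochs=1,
                 bf16=True)

trainer = SFTTrainer(model=model, args=args, train_dataset=train_ds,
                     peft_config=peft_cfg, processing_class=processor)
trainer.train()
```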

2. OpenGVLab InternVL3.5 (2B / 8B)

  • HuggingFace: OpenGVLab/InternVL3_5-2B | OpenGVLab/InternVL3_5-8B
  • Architecture: ViT-MLP-LLM with InternViT-300M/6B + Qwen3 backbone
  • Parameters: 1B, 2B, 4B, 8B, 14B variants
  • Benchmarks: InternVL3.5-8B MMMU: 73.4 (vs InternVL3-8B: 44.3, a +29-point jump); InternVL3.5-2B avg: 76.5 across 9 benchmarks
  • Training: Cascade RL (MPO offline then GSPO online), LoRA fine-tuning supported
  • License: Apache 2.0 (check model card)
  • Why relevant: We tested InternVL3-2B (scored 0.64-0.68 weighted). The 3.5 generation shows +16% reasoning gains and a 4x inference speedup. The Qwen3 backbone aligns with our existing expertise.
  • Recommended action: Benchmark InternVL3.5-2B and 8B base, then SFT+GRPO (a zero-shot query sketch follows this list).
  • Relevance: HIGH
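
Quick zero-shot query sketch for the benchmark pass, following the InternVL3 model-card chat() convention with trust_remote_code; we're assuming 3.5 keeps the same interface, and the single-tile 448x448 preprocessing below is a simplification of the card's dynamic-tiling helper.

```python
# Hedged sketch: zero-shot garment-attribute query against InternVL3.5-2B.
# Assumes the InternVL3-style chat() interface carries over to 3.5.
import torch
from PIL import Image
from torchvision import transforms as T
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "OpenGVLab/InternVL3_5-2B"
IMAGENET_MEAN, IMAGENET_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)

model = AutoModel.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Single 448x448 tile; the model card's dynamic tiling would split larger images.
transform = T.Compose([T.Resize((448, 448)), T.ToTensor(),
                       T.Normalize(IMAGENET_MEAN, IMAGENET_STD)])
pixel_values = transform(Image.open("garment.jpg").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16).cuda()

question = ("<image>\nWhat is the sleeve length of this garment? "
            "Answer with one of: sleeveless, short, three-quarter, long.")
response = model.chat(tokenizer, pixel_values, question,
                      generation_config=dict(max_new_tokens=32, do_sample=False))
print(response)
```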

3. Granite 4.0 3B Vision - GRPO/GTPO Expansion

  • Released: March 27, 2026
  • HuggingFace: ibm-granite/granite-4.0-3b-vision
  • Architecture: LoRA adapter on Granite 4.0 Micro, modular vision/language
  • Parameters: 3B | License: Apache 2.0
  • Why relevant: Our SFT-only model scored 1.0144 weighted, the highest in our eval suite, surpassing qwen3-vl-8b-sft+grpo (0.9131). Adding GRPO and GTPO stages could push this significantly further.
  • Recommended action: Priority GRPO training run on granite4-vision-sft (a GRPOTrainer setup sketch follows this list).
  • Relevance: HIGH
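
GRPOTrainer setup sketch via TRL. The SFT checkpoint path, dataset path, and the exact-match reward are placeholders; GRPO on a vision model also needs a recent TRL release with VLM support, so verify versions before launching the run.

```python
# Hedged sketch: GRPO stage on top of granite4-vision-sft with TRL's GRPOTrainer.
# Checkpoint path, dataset path, and reward are placeholders; assumes plain-string
# prompts/completions rather than TRL's conversational format.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def attribute_match_reward(completions, label, **kwargs):
    """+1 when the completion contains the gold attribute value, else 0."""
    return [1.0 if gold.lower() in text.lower() else 0.0
            for text, gold in zip(completions, label)]

# Expects columns: "prompt" (question about the garment) and "label" (gold value).
train_ds = load_dataset("json", data_files="garment_grpo_prompts.jsonl", split="train")

args = GRPOConfig(output_dir="granite4-vision-sft-grpo",
                  per_device_train_batch_size=8,
                  num_generations=8,   # completions per prompt for group-relative advantage
                  max_completion_length=64,
                  bf16=True)

trainer = GRPOTrainer(model="./checkpoints/granite4-vision-sft",  # placeholder path
                      reward_funcs=attribute_match_reward,
                      args=args,
                      train_dataset=train_ds)
trainer.train()
```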

MEDIUM Priority - Worth Watching

4. Google Gemma 4 31B (Dense)

  • HuggingFace: google/gemma-4-31B-it
  • 31B dense, 256K context, MMMU Pro: 76.9%, Apache 2.0
  • Fits on an RTX PRO 6000 but inference is slow. Good as a teacher model for distillation (a KD-loss sketch follows this list).
  • Relevance: MEDIUM
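
If we go the teacher route, a generic temperature-scaled distillation loss is sketched below (soft-label KL only, not a full training loop); nothing here is specific to Gemma 4, and the toy tensor shapes are placeholders.

```python
# Hedged sketch: token-level soft-label distillation loss (teacher -> student).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Temperature-softened KL(teacher || student), averaged per token."""
    s = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, 1)
    t = F.softmax(teacher_logits / temperature, dim=-1).flatten(0, 1)
    # T^2 keeps gradient magnitudes comparable to a hard-label CE term.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Toy shapes (batch, seq_len, vocab) -- placeholders, not real model outputs.
student = torch.randn(2, 8, 32000, requires_grad=True)
teacher = torch.randn(2, 8, 32000)
print(distillation_loss(student, teacher))
```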

5. Moondream 3 Preview

  • HuggingFace: moondream/moondream3-preview
  • 9B MoE / 2B active, SigLIP encoder, 32K context
  • Native detection, pointing, counting, OCR, reasoning mode
  • Our Moondream2 scored 0.6385. MoE upgrade + reasoning may help on hard samples.
  • Relevance: MEDIUM

6. Google Gemma 4 E2B

  • HuggingFace: google/gemma-4-E2B-it
  • 2.3B effective, audio+vision+text, Apache 2.0
  • MMMU Pro: 44.2%; weaker vision but novel audio capability.
  • Relevance: MEDIUM

7. MiniCPM-V 4.5

  • HuggingFace: openbmb/MiniCPM-V-4_5
  • SigLIP + Qwen2.5-7B, ~8B params, surpasses GPT-4o on OpenCompass (77.0 avg)
  • Switchable fast/deep thinking modes. License needs verification.
  • Relevance: MEDIUM

LOW Priority - Tangentially Relevant

| Model | Why Low |
| --- | --- |
| Mistral Small 4 (119B MoE, 6B active) | Too large for the classification pipeline |
| PaddleOCR-VL 1.5 (0.9B) | Document/OCR specialist, not garment tasks |
| MinerU2.5 (1.2B) | Document parsing only |
| LLaVA-OneVision-1.5 (4B/8B) | Not new (Sep 2025); we've moved to Qwen3+ |
| SmolDocling (256M) | Compact document model, not classification |

Key Takeaways

  1. Granite4-Vision GRPO is the top action item. SFT-only Granite4 already beats our best Qwen3-VL-8B SFT+GRPO. GRPO/GTPO could yield a significant new SOTA.

  2. Gemma 4 is the most significant new release. E4B and 26B-A4B offer excellent size/performance tradeoffs with Apache 2.0 and strong fine-tuning support. The 26B-A4B (4B active MoE) is especially promising.

  3. InternVL3.5 deserves a second look. Our InternVL3 results were underwhelming (0.63-0.68), but 3.5 shows massive improvements (+29 MMMU points at 8B scale). Qwen3 backbone makes fine-tuning familiar.

  4. No new Qwen VL releases this week. Qwen3.5-VL hasn't dropped separate VL checkpoints; VL capability is embedded in native Qwen3.5 via early fusion. We're already testing these.


Generated by Claude Code - Denali-AI Model Scout
Next scan: 2026-04-14
