Daily Model Scout Report - 2026-04-02
Scope
Searched HuggingFace and the broader web for all VLM releases created or updated in the last 7 days (since 2026-03-26), plus significant recent releases that may have been missed in prior reports.
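For reproducibility, this is roughly the query behind the HuggingFace half of the scan - a minimal sketch, assuming the scout filters on the `image-text-to-text` pipeline tag (some VLMs are tagged differently, so the real scan likely checks several tags):

```python
from datetime import datetime, timezone
from huggingface_hub import HfApi

# Weekly scan sketch: list models by most recent modification and keep
# those touched since the cutoff. full=True ensures last_modified is
# populated on each returned ModelInfo.
api = HfApi()
cutoff = datetime(2026, 3, 26, tzinfo=timezone.utc)

for model in api.list_models(
    pipeline_tag="image-text-to-text",  # assumption: one tag of several scanned
    sort="lastModified",
    direction=-1,
    limit=500,
    full=True,
):
    if model.last_modified is None or model.last_modified < cutoff:
        break  # sorted descending, so stop at the first stale entry
    print(model.id, model.last_modified.date())
```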
Current Best Models (Denali-AI)
| Model | Weighted Score | Notes |
|---|---|---|
| qwen3-vl-8b-sft+grpo | 0.9131 | Best overall |
| qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 | Best quantized |
| qwen3-vl-2b-sft-grpo-v9 | 0.8948 | Best small model |
| qwen35-2b-base | 0.8437 | Best Qwen3.5 base (no fine-tune) |
NEW / UPDATED MODELS FOUND
1. Qwen3.5 Small Series (0.8B / 2B / 4B / 9B) - Released March 2, 2026
- HuggingFace: Qwen/Qwen3.5-4B, Qwen/Qwen3.5-9B, Qwen/Qwen3.5-2B, Qwen/Qwen3.5-0.8B
- Architecture: Qwen3.5 native multimodal (early fusion), Gated Delta Networks + sparse MoE, Apache 2.0
- Key capabilities: Native vision-language with 262K context, MMMU-Pro 69.2% (9B) vs Qwen3-VL-8B 56.6%, OmniDocBench 90.8 (family-wide)
- Why it matters: The 4B model is a NEW size point not yet benchmarked by Denali-AI. The 9B outperforms Qwen3-VL-8B on MMMU-Pro by +12.6 points. We already have Qwen3.5-2B and 0.8B SFT results (82.44% and 79.44%), but the 4B could fill the gap between 2B and 9B, and the 9B with SFT+GRPO could surpass the current 0.9131 champion.
- Relevance: HIGH - Benchmark Qwen3.5-4B and Qwen3.5-9B with SFT+GRPO immediately (a quick smoke-test sketch follows this item)
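Before either model enters the pipeline, a quick smoke test is worth running. The sketch below assumes Qwen3.5 keeps the standard transformers image-text-to-text interface (`AutoProcessor` + `AutoModelForImageTextToText`); the image path and prompt are illustrative, so check the model card if loading fails:

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

# Smoke test for a new size point before SFT+GRPO. Interface is an
# assumption about the Qwen3.5 release; verify against the model card.
model_id = "Qwen/Qwen3.5-4B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("sample_garment.jpg")  # illustrative file name
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this garment as JSON."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```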
2. Kimi-K2.5 (1T total / 32B active, MoE) - Released January 27, 2026
- HuggingFace: moonshotai/Kimi-K2.5
- Architecture: MoE (1T total, 32B active, 384 experts), MoonViT 400M vision encoder, MIT license
- Key capabilities: Native multimodal, outperforms GPT-5.2 and Claude 4.5 Opus on some vision benchmarks, Agent Swarm mode
- Why it matters: Extremely strong vision capabilities. At 32B active params it fits on a single RTX PRO 6000 (98GB). MIT license is very permissive. However, fine-tuning a 1T MoE is non-trivial.
- Relevance: MEDIUM - Worth evaluating as a zero-shot base, but fine-tuning complexity is high. Evaluate base performance first.
3. Granite 4.0 3B Vision (IBM) - Released April 1, 2026 (THIS WEEK)
- HuggingFace: ibm-granite/granite-4.0-3b-vision
- Architecture: LoRA adapter (~0.5B) on Granite 4.0 Micro (3.5B), DeepStack Injection for visual features
- Key capabilities: Enterprise document extraction, table/chart/KVP parsing, 85.5% exact-match (zero-shot), 3rd among 2-4B models on VAREX
- Why it matters: Designed for structured data extraction β conceptually similar to our JSON extraction task. DeepStack Injection is a novel approach worth understanding.
- Relevance: MEDIUM - Specialized for document extraction rather than garment classification, but the structured extraction architecture may offer insights. Worth a quick base eval (a generic adapter sketch follows this item).
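DeepStack Injection itself has no public reference implementation we know of, but the surrounding pattern - a small trainable adapter on a frozen base - is easy to replicate with peft. A minimal sketch, with placeholder target modules (inspect the real module names before training):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

# Generic adapter-on-frozen-base sketch (NOT IBM's DeepStack method).
# The repo id comes from the report; target_modules are placeholders,
# so print(base) first to find the actual projection layer names.
base = AutoModelForImageTextToText.from_pretrained("ibm-granite/granite-4.0-3b-vision")

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # placeholder module names
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # confirms only adapter weights are trainable
```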
4. GLM-5V-Turbo (Z.ai) - Released April 1, 2026 (THIS WEEK)
- Architecture: CogViT vision encoder, MTP architecture, 200K context
- Key capabilities: Vision coding model, Design2Code 94.8 score, optimized for agentic workflows
- Relevance: LOW - Closed source, coding-focused. Not suitable for fine-tuning on garment classification.
5. Qianfan-OCR (Baidu, 4B) - Released March 18, 2026
- HuggingFace: baidu/Qianfan-OCR
- Architecture: End-to-end VLM for document intelligence, 4B params
- Key capabilities: #1 on OmniDocBench v1.5 (93.12), image-to-Markdown, prompt-driven extraction
- Relevance: LOW - OCR-focused, unlikely to outperform on garment attribute recognition.
6. Moondream 3 Preview (9B total / 2B active, MoE) - Updated March 2026
- HuggingFace: moondream/moondream3-preview
- Architecture: MoE (64 experts, 8 active), 32K context, native pointing/counting/detection
- Key capabilities: Frontier-level reasoning, grounded visual understanding, segmentation update in March 2026 with 40% faster inference
- Why it matters: We already benchmarked Moondream2 (63.85% base). Moondream3 is a major architecture upgrade with MoE. 2B active params = very fast inference.
- Relevance: MEDIUM - Worth benchmarking base performance to see whether the MoE upgrade closes the gap. The model is gated; a loading sketch follows this item.
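Since the repo is gated, access has to be requested on the model page before anything downloads. Moondream 2 exposed a `query()` helper via `trust_remote_code`; the sketch below assumes (unverified) that the 3 preview keeps a similar interface:

```python
from PIL import Image
from transformers import AutoModelForCausalLM

# Gated repo: request access on the model page, then authenticate with
# `huggingface-cli login` (or pass token=...). The query() call mirrors
# Moondream 2's remote-code API and is an assumption for the preview.
model = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream3-preview",
    trust_remote_code=True,
    device_map="auto",
)
image = Image.open("sample_garment.jpg")  # illustrative file name
print(model.query(image, "What type of garment is shown?"))
```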
7. MolmoWeb-8B (Allen AI) - Released March 24, 2026
- HuggingFace: allenai/MolmoWeb-8B
- Architecture: Molmo 2 family, 8B params, visual web agent
- Relevance: LOW - Web agent specialization doesn't transfer well to garment classification.
8. LightOnOCR-2-1B & DeepSeek-OCR-2 (3B)
- Relevance: LOW - Both OCR-specialized, not relevant for garment classification.
PRIORITY ACTIONS
Immediate (This Week)
- Benchmark Qwen3.5-4B with the SFT+GRPO pipeline - a completely untested size point between the benchmarked 2B (0.8948) and the 9B.
- Benchmark Qwen3.5-9B with SFT+GRPO - significantly outperforms Qwen3-VL-8B on public vision benchmarks, with strong potential to beat the current 0.9131 champion (a minimal GRPO-stage sketch follows this list).
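The SFT stage is our existing recipe; for the GRPO stage, here is a minimal sketch using TRL's `GRPOTrainer` with a placeholder exact-match reward. The dataset file, column names, and reward function are assumptions standing in for the internal pipeline:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GRPO stage only (SFT comes first in the real pipeline). The trainer
# samples num_generations completions per prompt and reinforces those
# the reward function scores highest.
dataset = load_dataset("json", data_files="garment_train.jsonl", split="train")

def exact_match_reward(completions, label, **kwargs):
    # `label` is assumed to be a dataset column; TRL forwards extra
    # columns to the reward function alongside the sampled completions.
    return [1.0 if c.strip() == l.strip() else 0.0 for c, l in zip(completions, label)]

trainer = GRPOTrainer(
    model="Qwen/Qwen3.5-4B",  # the untested size point from item 1
    reward_funcs=exact_match_reward,
    args=GRPOConfig(output_dir="qwen35-4b-grpo", num_generations=8),
    train_dataset=dataset,  # expects a "prompt" column
)
trainer.train()
```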
Short-term (Next 1-2 Weeks)
- Evaluate Kimi-K2.5 zero-shot on the 3.5k hard eval set - MIT license, 32B active params, fits on an RTX PRO 6000 (see the eval-loop sketch after this list).
- Evaluate Moondream 3 Preview base - an MoE architecture with only 2B active params could offer the best speed/accuracy tradeoff.
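Both short-term items reduce to the same zero-shot loop over the hard set. A minimal sketch with placeholder file and column names; `generate_fn` stands in for whatever per-model wrapper the harness provides:

```python
import json

def evaluate(generate_fn, eval_path="hard_eval_3500.jsonl"):
    """Exact-match accuracy of a zero-shot model over the hard eval set.

    Placeholder schema: one JSON object per line with "image", "prompt",
    and "label" fields; generate_fn(image_path, prompt) -> str.
    """
    correct = total = 0
    with open(eval_path) as f:
        for line in f:
            ex = json.loads(line)
            pred = generate_fn(ex["image"], ex["prompt"])
            correct += int(pred.strip() == ex["label"].strip())
            total += 1
    return correct / total

# e.g. accuracy = evaluate(kimi_generate)  # thin wrapper around Kimi-K2.5
```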
Watching
- Granite 4.0 3B Vision - Novel DeepStack architecture for structured extraction. A quick base eval would determine relevance.
- GLM-5V-Turbo - Monitor for an open-weight release. Currently API-only.
ARCHITECTURE SUMMARY
| Model | Params (Active) | Architecture | License | Released | Priority |
|---|---|---|---|---|---|
| Qwen3.5-4B | 4B | Native multimodal MoE | Apache 2.0 | Mar 2, 2026 | HIGH |
| Qwen3.5-9B | 9B | Native multimodal MoE | Apache 2.0 | Mar 2, 2026 | HIGH |
| Kimi-K2.5 | 32B (of 1T) | MoE + MoonViT | MIT | Jan 27, 2026 | MEDIUM |
| Moondream 3 | 2B (of 9B) | MoE | Gated | Mar 2026 update | MEDIUM |
| Granite 4.0 3B Vision | ~4B | DeepStack LoRA | Apache 2.0 | Apr 1, 2026 | MEDIUM |
| GLM-5V-Turbo | Unknown | CogViT + MTP | Closed | Apr 1, 2026 | LOW |
NOTES
- No new releases from Florence-3, PaliGemma3, InternVL4, or LLaVA-Next were found in the past 7 days.
- The Qwen3.5 small series (released March 2) remains the most significant recent development for our pipeline, especially the untested 4B variant.
- The InternVL3.5 family (August 2025) and SmolVLM2 (late 2025) have not been updated in the past week.
- No new fashion/garment-specific VLM fine-tunes were found on HuggingFace.
Report generated 2026-04-02 by Denali-AI Model Scout