Daily Model Scout Report - 2026-04-02
Scope
Searched HuggingFace and the broader web for all VLM releases created or updated in the last 7 days (since 2026-03-26), plus significant recent releases that may have been missed in prior reports.
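For reproducibility, this is roughly the query behind the HuggingFace half of the scan - a minimal sketch, assuming the scout filters on the `image-text-to-text` pipeline tag (some VLMs are tagged differently, so the real scan likely checks several tags):

```python
from datetime import datetime, timezone
from huggingface_hub import HfApi

# Weekly scan sketch: list models by most recent modification and keep
# those touched since the cutoff. full=True ensures last_modified is
# populated on each returned ModelInfo.
api = HfApi()
cutoff = datetime(2026, 3, 26, tzinfo=timezone.utc)

for model in api.list_models(
    pipeline_tag="image-text-to-text",  # assumption: one tag of several scanned
    sort="lastModified",
    direction=-1,
    limit=500,
    full=True,
):
    if model.last_modified is None or model.last_modified < cutoff:
        break  # sorted descending, so stop at the first stale entry
    print(model.id, model.last_modified.date())
```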
Current Best Models (Denali-AI)
| Model | Weighted Score | Notes |
|---|---|---|
| qwen3-vl-8b-sft+grpo | 0.9131 | Best overall |
| qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 | Best quantized |
| qwen3-vl-2b-sft-grpo-v9 | 0.8948 | Best small model |
| qwen35-2b-base | 0.8437 | Best Qwen3.5 base (no fine-tune) |
NEW / UPDATED MODELS FOUND
1. Qwen3.5 Small Series (0.8B / 2B / 4B / 9B) - Released March 2, 2026
- HuggingFace: Qwen/Qwen3.5-4B, Qwen/Qwen3.5-9B, Qwen/Qwen3.5-2B, Qwen/Qwen3.5-0.8B
- Architecture: Qwen3.5 native multimodal (early fusion), Gated Delta Networks + sparse MoE, Apache 2.0
- Key capabilities: Native vision-language with 262K context, MMMU-Pro 69.2% (9B) vs Qwen3-VL-8B 56.6%, OmniDocBench 90.8 (family-wide)
- Why it matters: The 4B model is a NEW size point not yet benchmarked by Denali-AI. The 9B outperforms Qwen3-VL-8B on MMMU-Pro by +12.6 points. We already have Qwen3.5-2B and 0.8B SFT results (82.44% and 79.44%), but the 4B could fill the gap between 2B and 9B, and the 9B with SFT+GRPO could surpass the current 0.9131 champion.
- Relevance: HIGH - Benchmark Qwen3.5-4B and Qwen3.5-9B with SFT+GRPO immediately (a quick smoke-test sketch follows this item)
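Before either model enters the pipeline, a quick smoke test is worth running. The sketch below assumes Qwen3.5 keeps the standard transformers image-text-to-text interface (`AutoProcessor` + `AutoModelForImageTextToText`); the image path and prompt are illustrative, so check the model card if loading fails:

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

# Smoke test for a new size point before SFT+GRPO. Interface is an
# assumption about the Qwen3.5 release; verify against the model card.
model_id = "Qwen/Qwen3.5-4B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("sample_garment.jpg")  # illustrative file name
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this garment as JSON."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```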
2. Kimi-K2.5 (1T total / 32B active, MoE) - Released January 27, 2026
- HuggingFace: moonshotai/Kimi-K2.5
- Architecture: MoE (1T total, 32B active, 384 experts), MoonViT 400M vision encoder, MIT license
- Key capabilities: Native multimodal, outperforms GPT-5.2 and Claude 4.5 Opus on some vision benchmarks, Agent Swarm mode
- Why it matters: Extremely strong vision capabilities. At 32B active params it fits on a single RTX PRO 6000 (98GB). MIT license is very permissive. However, fine-tuning a 1T MoE is non-trivial.
- Relevance: MEDIUM - Worth evaluating as a zero-shot base, but fine-tuning complexity is high. Evaluate base performance first.
3. Granite 4.0 3B Vision (IBM) - Released April 1, 2026 (THIS WEEK)
- HuggingFace: ibm-granite/granite-4.0-3b-vision
- Architecture: LoRA adapter (~0.5B) on Granite 4.0 Micro (3.5B), DeepStack Injection for visual features
- Key capabilities: Enterprise document extraction, table/chart/KVP parsing, 85.5% exact-match (zero-shot), 3rd among 2-4B models on VAREX
- Why it matters: Designed for structured data extraction β conceptually similar to our JSON extraction task. DeepStack Injection is a novel approach worth understanding.
- Relevance: MEDIUM - Specialized for document extraction rather than garment classification, but the structured extraction architecture may offer insights. Worth a quick base eval (a generic adapter sketch follows this item).
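DeepStack Injection itself has no public reference implementation we know of, but the surrounding pattern - a small trainable adapter on a frozen base - is easy to replicate with peft. A minimal sketch, with placeholder target modules (inspect the real module names before training):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

# Generic adapter-on-frozen-base sketch (NOT IBM's DeepStack method).
# The repo id comes from the report; target_modules are placeholders,
# so print(base) first to find the actual projection layer names.
base = AutoModelForImageTextToText.from_pretrained("ibm-granite/granite-4.0-3b-vision")

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # placeholder module names
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # confirms only adapter weights are trainable
```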
4. GLM-5V-Turbo (Z.ai) - Released April 1, 2026 (THIS WEEK)
- Architecture: CogViT vision encoder, MTP architecture, 200K context
- Key capabilities: Vision coding model, Design2Code 94.8 score, optimized for agentic workflows
- Relevance: LOW - Closed source, coding-focused. Not suitable for fine-tuning on garment classification.
5. Qianfan-OCR (Baidu, 4B) - Released March 18, 2026
- HuggingFace: baidu/Qianfan-OCR
- Architecture: End-to-end VLM for document intelligence, 4B params
- Key capabilities: #1 on OmniDocBench v1.5 (93.12), image-to-Markdown, prompt-driven extraction
- Relevance: LOW - OCR-focused, unlikely to outperform on garment attribute recognition.
6. Moondream 3 Preview (9B total / 2B active, MoE) - Updated March 2026
- HuggingFace: moondream/moondream3-preview
- Architecture: MoE (64 experts, 8 active), 32K context, native pointing/counting/detection
- Key capabilities: Frontier-level reasoning, grounded visual understanding, segmentation update in March 2026 with 40% faster inference
- Why it matters: We already benchmarked Moondream2 (63.85% base). Moondream3 is a major architecture upgrade with MoE. 2B active params = very fast inference.
- Relevance: MEDIUM - Worth benchmarking base performance to see whether the MoE upgrade closes the gap. The model is gated; a loading sketch follows this item.
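Since the repo is gated, access has to be requested on the model page before anything downloads. Moondream 2 exposed a `query()` helper via `trust_remote_code`; the sketch below assumes (unverified) that the 3 preview keeps a similar interface:

```python
from PIL import Image
from transformers import AutoModelForCausalLM

# Gated repo: request access on the model page, then authenticate with
# `huggingface-cli login` (or pass token=...). The query() call mirrors
# Moondream 2's remote-code API and is an assumption for the preview.
model = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream3-preview",
    trust_remote_code=True,
    device_map="auto",
)
image = Image.open("sample_garment.jpg")  # illustrative file name
print(model.query(image, "What type of garment is shown?"))
```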
7. MolmoWeb-8B (Allen AI) - Released March 24, 2026
- HuggingFace: allenai/MolmoWeb-8B
- Architecture: Molmo 2 family, 8B params, visual web agent
- Relevance: LOW - Web agent specialization doesn't transfer well to garment classification.
8. LightOnOCR-2-1B & DeepSeek-OCR-2 (3B)
- Relevance: LOW - Both OCR-specialized, not relevant for garment classification.
PRIORITY ACTIONS
Immediate (This Week)
- Benchmark Qwen3.5-4B with the SFT+GRPO pipeline - a completely untested size point between the benchmarked 2B (0.8948) and the 9B.
- Benchmark Qwen3.5-9B with SFT+GRPO - significantly outperforms Qwen3-VL-8B on public vision benchmarks, with strong potential to beat the current 0.9131 champion (a minimal GRPO-stage sketch follows this list).
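The SFT stage is our existing recipe; for the GRPO stage, here is a minimal sketch using TRL's `GRPOTrainer` with a placeholder exact-match reward. The dataset file, column names, and reward function are assumptions standing in for the internal pipeline:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GRPO stage only (SFT comes first in the real pipeline). The trainer
# samples num_generations completions per prompt and reinforces those
# the reward function scores highest.
dataset = load_dataset("json", data_files="garment_train.jsonl", split="train")

def exact_match_reward(completions, label, **kwargs):
    # `label` is assumed to be a dataset column; TRL forwards extra
    # columns to the reward function alongside the sampled completions.
    return [1.0 if c.strip() == l.strip() else 0.0 for c, l in zip(completions, label)]

trainer = GRPOTrainer(
    model="Qwen/Qwen3.5-4B",  # the untested size point from item 1
    reward_funcs=exact_match_reward,
    args=GRPOConfig(output_dir="qwen35-4b-grpo", num_generations=8),
    train_dataset=dataset,  # expects a "prompt" column
)
trainer.train()
```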
Short-term (Next 1-2 Weeks)
- Evaluate Kimi-K2.5 zero-shot on the 3.5k hard eval set - MIT license, 32B active params, fits on an RTX PRO 6000 (see the eval-loop sketch after this list).
- Evaluate Moondream 3 Preview base - an MoE architecture with only 2B active params could offer the best speed/accuracy tradeoff.
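Both short-term items reduce to the same zero-shot loop over the hard set. A minimal sketch with placeholder file and column names; `generate_fn` stands in for whatever per-model wrapper the harness provides:

```python
import json

def evaluate(generate_fn, eval_path="hard_eval_3500.jsonl"):
    """Exact-match accuracy of a zero-shot model over the hard eval set.

    Placeholder schema: one JSON object per line with "image", "prompt",
    and "label" fields; generate_fn(image_path, prompt) -> str.
    """
    correct = total = 0
    with open(eval_path) as f:
        for line in f:
            ex = json.loads(line)
            pred = generate_fn(ex["image"], ex["prompt"])
            correct += int(pred.strip() == ex["label"].strip())
            total += 1
    return correct / total

# e.g. accuracy = evaluate(kimi_generate)  # thin wrapper around Kimi-K2.5
```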
Watching
- Granite 4.0 3B Vision - Novel DeepStack architecture for structured extraction. A quick base eval would determine relevance.
- GLM-5V-Turbo - Monitor for an open-weight release. Currently API-only.
ARCHITECTURE SUMMARY
| Model | Params (Active) | Architecture | License | Released | Priority |
|---|---|---|---|---|---|
| Qwen3.5-4B | 4B | Native multimodal MoE | Apache 2.0 | Mar 2, 2026 | HIGH |
| Qwen3.5-9B | 9B | Native multimodal MoE | Apache 2.0 | Mar 2, 2026 | HIGH |
| Kimi-K2.5 | 32B (of 1T) | MoE + MoonViT | MIT | Jan 27, 2026 | MEDIUM |
| Moondream 3 | 2B (of 9B) | MoE | Gated | Mar 2026 update | MEDIUM |
| Granite 4.0 3B Vision | ~4B | DeepStack LoRA | Apache 2.0 | Apr 1, 2026 | MEDIUM |
| GLM-5V-Turbo | Unknown | CogViT + MTP | Closed | Apr 1, 2026 | LOW |
NOTES
- No new releases from Florence-3, PaliGemma3, InternVL4, or LLaVA-Next were found in the past 7 days.
- The Qwen3.5 small series (released March 2) remains the most significant recent development for our pipeline, especially the untested 4B variant.
- The InternVL3.5 family (August 2025) and SmolVLM2 (late 2025) have not been updated in the past week.
- No new fashion/garment-specific VLM fine-tunes were found on HuggingFace.
Report generated 2026-04-02 by Denali-AI Model Scout