Daily Model Scout Report — 2026-04-02

#2
by msudharsanan - opened
Denali Advanced Integration org


Scope

Searched HuggingFace and the broader web for all VLM releases created or updated in the last 7 days (since 2026-03-26), plus significant recent releases that may have been missed in prior reports.

Current Best Models (Denali-AI)

Model                        Weighted Score   Notes
qwen3-vl-8b-sft+grpo         0.9131           Best overall
qwen3-vl-8b-sft-grpo-nvfp4   0.8945           Best quantized
qwen3-vl-2b-sft-grpo-v9      0.8948           Best small model
qwen35-2b-base               0.8437           Best Qwen3.5 base (no fine-tune)

NEW / UPDATED MODELS FOUND

1. Qwen3.5 Small Series (0.8B / 2B / 4B / 9B) — Released March 2, 2026

  • HuggingFace: Qwen/Qwen3.5-4B, Qwen/Qwen3.5-9B, Qwen/Qwen3.5-2B, Qwen/Qwen3.5-0.8B
  • Architecture: Qwen3.5 native multimodal (early fusion), Gated Delta Networks + sparse MoE, Apache 2.0
  • Key capabilities: Native vision-language with 262K context, MMMU-Pro 69.2% (9B) vs Qwen3-VL-8B 56.6%, OmniDocBench 90.8 (family-wide)
  • Why it matters: The 4B model is a NEW size point not yet benchmarked by Denali-AI. The 9B outperforms Qwen3-VL-8B on MMMU-Pro by +12.6 points. We already have Qwen3.5-2B and 0.8B SFT results (82.44% and 79.44%) but the 4B could fill the gap between 2B and 9B. Also, the 9B with proper SFT+GRPO could potentially surpass the current 0.9131 champion.
  • Relevance: HIGH — Benchmark Qwen3.5-4B and Qwen3.5-9B with SFT+GRPO immediately

2. Kimi-K2.5 (1T total / 32B active, MoE) — Released January 27, 2026

  • HuggingFace: moonshotai/Kimi-K2.5
  • Architecture: MoE (1T total, 32B active, 384 experts), MoonViT 400M vision encoder, MIT license
  • Key capabilities: Native multimodal, outperforms GPT-5.2 and Claude 4.5 Opus on some vision benchmarks, Agent Swarm mode
  • Why it matters: Extremely strong vision capabilities. Only 32B params are active per token, so decode compute is modest, but the full 1T MoE weights must still be resident or offloaded; a single RTX PRO 6000 (98GB) cannot hold them without aggressive quantization and expert offloading. MIT license is very permissive. However, fine-tuning a 1T MoE is non-trivial.
  • Relevance: MEDIUM — Worth evaluating as a zero-shot base, but fine-tuning complexity is high. Evaluate base performance first.
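The single-GPU question for MoE models comes down to total vs. active weights. A back-of-envelope sketch (hypothetical helper; fp8 weights and ~20% runtime overhead are assumed figures, not measured):

```python
def vram_gib(params_billion: float, bytes_per_param: float = 1.0,
             overhead: float = 1.2) -> float:
    """Rough weight-memory estimate in GiB: parameter bytes plus
    ~20% headroom for KV cache and activations (assumed, not measured)."""
    return params_billion * 1e9 * bytes_per_param * overhead / 2**30

# Kimi-K2.5: only 32B params are active per token, but all 1T MoE
# weights must be resident (or offloaded) to serve arbitrary tokens.
print(f"active-expert working set: ~{vram_gib(32):.0f} GiB")    # well under 98 GiB
print(f"full 1T weights at fp8:    ~{vram_gib(1000):.0f} GiB")  # far over 98 GiB
```

Under these assumptions the active experts alone fit comfortably, which is what makes the latency attractive, but resident weights dominate the deployment footprint.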

3. Granite 4.0 3B Vision (IBM) — Released April 1, 2026 (THIS WEEK)

  • HuggingFace: ibm-granite/granite-4.0-3b-vision
  • Architecture: LoRA adapter (~0.5B) on Granite 4.0 Micro (3.5B), DeepStack Injection for visual features
  • Key capabilities: Enterprise document extraction, table/chart/KVP parsing, 85.5% exact-match (zero-shot), 3rd among 2-4B models on VAREX
  • Why it matters: Designed for structured data extraction — conceptually similar to our JSON extraction task. DeepStack Injection is a novel approach worth understanding.
  • Relevance: MEDIUM — Specialized for document extraction rather than garment classification, but the structured extraction architecture may offer insights. Worth a quick base eval.
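For context on the 85.5% figure: exact-match on structured extraction is typically all-or-nothing per document. A minimal scorer for a JSON-output task might look like the sketch below (field names are hypothetical, not an actual Denali schema):

```python
import json

def exact_match(prediction: str, gold: dict) -> bool:
    """All-or-nothing scoring: the model's raw output must parse as JSON
    and equal the gold record field-for-field (key order ignored)."""
    try:
        parsed = json.loads(prediction)
    except json.JSONDecodeError:
        return False
    return parsed == gold

gold = {"sleeve": "long", "neckline": "crew", "pattern": "striped"}
print(exact_match('{"pattern": "striped", "sleeve": "long", "neckline": "crew"}', gold))  # True
print(exact_match('{"sleeve": "long"}', gold))  # False: no partial credit
```

The strictness is the point: one wrong or missing field zeroes the document, which is why exact-match numbers run well below per-field accuracy.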

4. GLM-5V-Turbo (Z.ai) — Released April 1, 2026 (THIS WEEK)

  • Architecture: CogViT vision encoder, MTP architecture, 200K context
  • Key capabilities: Vision coding model, Design2Code 94.8 score, optimized for agentic workflows
  • Relevance: LOW — Closed source, coding-focused. Not suitable for fine-tuning on garment classification.

5. Qianfan-OCR (Baidu, 4B) — Released March 18, 2026

  • HuggingFace: baidu/Qianfan-OCR
  • Architecture: End-to-end VLM for document intelligence, 4B params
  • Key capabilities: #1 on OmniDocBench v1.5 (93.12), image-to-Markdown, prompt-driven extraction
  • Relevance: LOW — OCR-focused, unlikely to outperform on garment attribute recognition.

6. Moondream 3 Preview (9B total / 2B active, MoE) — Updated March 2026

  • HuggingFace: moondream/moondream3-preview
  • Architecture: MoE (64 experts, 8 active), 32K context, native pointing/counting/detection
  • Key capabilities: Frontier-level reasoning, grounded visual understanding, segmentation update in March 2026 with 40% faster inference
  • Why it matters: We already benchmarked Moondream2 (63.85% base). Moondream3 is a major architecture upgrade with MoE. 2B active params = very fast inference.
  • Relevance: MEDIUM — Worth benchmarking base performance to see if the MoE upgrade closes the gap. Model is gated.

7. MolmoWeb-8B (Allen AI) — Released March 24, 2026

  • HuggingFace: allenai/MolmoWeb-8B
  • Architecture: Molmo 2 family, 8B params, visual web agent
  • Relevance: LOW β€” Web agent specialization doesn't transfer well to garment classification.

8. LightOnOCR-2-1B & DeepSeek-OCR-2 (3B)

  • Relevance: LOW — Both OCR-specialized, not relevant for garment classification.

PRIORITY ACTIONS

Immediate (This Week)

  1. Benchmark Qwen3.5-4B with the SFT+GRPO pipeline — a completely untested size point between our 2B (0.8948) and the new 9B.
  2. Benchmark Qwen3.5-9B with SFT+GRPO — significantly outperforms Qwen3-VL-8B on public vision benchmarks. Strong potential to beat the current 0.9131 champion.

Short-term (Next 1-2 Weeks)

  1. Evaluate Kimi-K2.5 zero-shot on the 3.5k hard eval set — MIT license, 32B active params; note that the 1T total weights mean single-GPU serving on an RTX PRO 6000 (98GB) still requires quantization and expert offloading.
  2. Evaluate Moondream 3 Preview base — MoE architecture with only 2B active params could offer the best speed/accuracy tradeoff.

Watching

  1. Granite 4.0 3B Vision — Novel DeepStack architecture for structured extraction. A quick base eval would determine relevance.
  2. GLM-5V-Turbo — Monitor for an open-weight release. Currently API-only.

ARCHITECTURE SUMMARY

Model                   Params (Active)   Architecture            License      Released          Priority
Qwen3.5-4B              4B                Native multimodal MoE   Apache 2.0   Mar 2, 2026       HIGH
Qwen3.5-9B              9B                Native multimodal MoE   Apache 2.0   Mar 2, 2026       HIGH
Kimi-K2.5               32B (of 1T)       MoE + MoonViT           MIT          Jan 27, 2026      MEDIUM
Moondream 3             2B (of 9B)        MoE                     Gated        Mar 2026 update   MEDIUM
Granite 4.0 3B Vision   ~4B               DeepStack LoRA          Apache 2.0   Apr 1, 2026       MEDIUM
GLM-5V-Turbo            Unknown           CogViT + MTP            Closed       Apr 1, 2026       LOW

NOTES

  • No new releases from Florence-3, PaliGemma3, InternVL4, or LLaVA-Next were found in the past 7 days.
  • The Qwen3.5 small series (released March 2) remains the most significant recent development for our pipeline, especially the untested 4B variant.
  • The InternVL3.5 family (August 2025) and SmolVLM2 (late 2025) have not been updated in the past week.
  • No new fashion/garment-specific VLM fine-tunes were found on HuggingFace.

Report generated 2026-04-02 by Denali-AI Model Scout

Sources:
