Daily Model Scout Report - 2026-05-01

by msudharsanan (Denali Advanced Integration org)

VLM landscape for the last 7 days (cutoff 2026-04-24). Several major upstream releases this week. Comparing against our current best (qwen3-vl-8b-sft+grpo @ 0.9131 weighted on the 3.5k-hard set; small-class best qwen3-vl-2b-sft-grpo-v9 @ 0.8948; quantized best qwen3-vl-8b-sft-grpo-nvfp4 @ 0.8945).

🔥 High relevance - benchmark immediately

Google Gemma-4 (released 2026-04-28, Apache 2.0)

First-party VLM family from Google; all variants run the image-text-to-text pipeline.

| Model | Size | Notes |
|---|---|---|
| google/gemma-4-31B-it | 31B dense | Flagship; 7.4M downloads, 2460 likes; fits in FP8 on 98 GB |
| google/gemma-4-26B-A4B-it | 26B MoE / 4B active | Fast like the 2B class, capacity of 26B |
| google/gemma-4-E4B-it | ~4B (any-to-any) | Edge variant, multimodal in/out |
| google/gemma-4-E2B-it | ~2B (any-to-any) | Direct size-class match for our 2B small model |

Why investigate: Apache-2.0 license from a well-resourced lab. The 26B-A4B-it MoE in particular promises 2B-class inference speed with 26B-class quality, which could reset our small-model leaderboard. E2B-it is a direct size match against our qwen3-vl-2b-sft-grpo-v9 (0.8948).
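The "fits FP8 on 98 GB" calls in the table above can be sanity-checked with a back-of-envelope weight-memory estimate (a sketch only: weights at a given bit width, ignoring KV cache, activations, and vision-tower overhead):

```python
def weight_gb(params_b: float, bits: int) -> float:
    """Approximate weight memory in GB for a model with params_b billion
    parameters stored at the given bit width (weights only)."""
    return params_b * bits / 8

# gemma-4-31B-it: ~31 GB at FP8 vs ~62 GB at BF16. Both fit a 98 GB GPU,
# but FP8 leaves far more headroom for KV cache and batch size.
print(weight_gb(31, 8), weight_gb(31, 16))  # 31.0 62.0
```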

Qwen3.6 (released 2026-04-24, Apache 2.0)

Direct architectural successor to the Qwen3-VL/Qwen3.5-VL line we already use heavily.

| Model | Size | Notes |
|---|---|---|
| Qwen/Qwen3.6-35B-A3B | 35B MoE / 3B active | 2.2M downloads, 1543 likes |
| Qwen/Qwen3.6-35B-A3B-FP8 | FP8 quant | Fits on 98 GB |
| Qwen/Qwen3.6-27B | 27B dense | 906k downloads, 1049 likes |
| Qwen/Qwen3.6-27B-FP8 | FP8 quant | Fits on 98 GB |

Why investigate: the MoE 35B-A3B should give Qwen3-VL-8B-class throughput with 35B-class capacity. Our SFT+GRPO+GTPO pipeline is already validated on Qwen3-VL, so porting to 3.6 is a near drop-in. Strong candidate to surpass our 0.9131 ceiling.

Qwen3.5 flagship VLMs (released 2026-04-24, Apache 2.0)

Larger Qwen3.5-VL variants that we previously skipped due to size.

| Model | Size | Notes |
|---|---|---|
| Qwen/Qwen3.5-122B-A10B-FP8 | 122B MoE / 10B active, FP8 | ≈61 GB weights → fits 98 GB |
| Qwen/Qwen3.5-122B-A10B-GPTQ-Int4 | 122B MoE, Int4 | ≈30 GB → easily fits |
| Qwen/Qwen3.5-397B-A17B | 397B flagship | Too large for 98 GB even in Int4 |

Why investigate: We already eval the 2B and 9B Qwen3.5 sizes; the 122B-A10B FP8/Int4 is the missing scale-up data point. 10B active params ≈ Qwen3-VL-8B inference cost.
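The "10B active ≈ 8B dense inference cost" claim follows from the standard rough rule that decode compute per generated token scales with the *active* parameter count, not the total. A sketch (real throughput also depends on memory bandwidth, expert-routing overhead, and kernel efficiency):

```python
def decode_flops_per_token(active_params_b: float) -> float:
    """Rough decode cost: ~2 FLOPs per active parameter per generated token."""
    return 2 * active_params_b * 1e9

moe_122b = decode_flops_per_token(10)  # Qwen3.5-122B-A10B: 10B params active
dense_8b = decode_flops_per_token(8)   # Qwen3-VL-8B: all 8B params active
print(moe_122b / dense_8b)  # 1.25 -- same ballpark per-token compute
```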

⚠️ Medium relevance - worth watching

Mistral Medium 3.5 / Small 4 (2026-04-27 to 2026-04-30)

  • mistralai/Mistral-Medium-3.5-128B: 128B VLM (Mistral3ForConditionalGeneration, vision tower confirmed in config). License: other (Mistral non-commercial / research); a license blocker for production.
  • mistralai/Mistral-Small-4-119B-2603: 119B VLM, Apache 2.0. Likely too large at full precision for our 98 GB GPU; will need community FP8/Int4.

Why medium: vision-capable, and Apache-licensed in the case of Small 4, but unproven on structured JSON extraction and untested in our SFT+GRPO pipeline. Watch for community quants; defer until size and license fit are clear.

IBM Granite 4.0 Vision (released 2026-04-30, Apache 2.0)

Why medium: Direct size competitor to our 2B small-model class. We already have a Granite4-Vision-SFT entry in the eval results (weighted_score 1.0144, which is suspicious and needs validation). The 4.0-3B is a fresh upstream release; worth a clean eval on the 3.5k-hard set before committing to an SFT run.

❌ Low relevance / skip

  • DeepSeek-V4 family (DeepSeek-V4-Pro, -Flash, etc., 2026-04-27): text-generation only, no vision tower.
  • Qwen SAE-Res checkpoints: interpretability artifacts, not models.
  • Various community fine-tunes of the above (uncensored / GGUF / MLX): no benefit over upstream for our SFT pipeline.

Recommendation

Order of effort:

  1. Eval base Qwen/Qwen3.6-35B-A3B-FP8, google/gemma-4-26B-A4B-it, google/gemma-4-E2B-it, and Qwen/Qwen3.5-122B-A10B-GPTQ-Int4 on the 3.5k-hard set (no fine-tuning, just baselines). Cheap, and it tells us whether any beat our base qwen3-vl-8b-instruct (0.8751).
  2. If any baseline scores ≥0.85, kick off SFT+GRPO using our existing pipeline.
  3. Re-validate the Granite4-Vision-SFT 1.0144 score (out of range, possibly a scoring bug) before comparing against the new 4.0-3B vision base.
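The triage in steps 1-3 amounts to a range check plus a threshold gate, sketched below. The model names and scores are placeholders for illustration, not real eval results:

```python
def triage(baselines: dict[str, float], sft_gate: float = 0.85) -> list[str]:
    """Return the models worth an SFT+GRPO run: the weighted score must be a
    valid value in [0, 1] and clear the SFT gate."""
    winners = []
    for model, score in baselines.items():
        if not 0.0 <= score <= 1.0:
            # Catches anomalies like the 1.0144 Granite entry: re-check the
            # scorer before trusting any comparison against this number.
            raise ValueError(f"{model}: weighted score {score} out of range")
        if score >= sft_gate:
            winners.append(model)
    return winners

# Hypothetical baseline numbers, for illustration only:
print(triage({"candidate-a-fp8": 0.88, "candidate-b-e2b": 0.81}))
# ['candidate-a-fp8']
```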

Auto-generated by /hf-model-scout on 2026-05-01.
