Daily Model Scout Report β€” 2026-04-16

#13
by msudharsanan - opened
Denali Advanced Integration org

Scope

Scan of HuggingFace for VLMs created or modified between 2026-04-09 and 2026-04-16, across all architectures. Current baseline for comparison (weighted_score on our 3,500-sample hard eval):

Model                         Weighted Score
qwen3-vl-8b-sft+grpo          0.9131  (best overall)
qwen3-vl-2b-sft-grpo-v9       0.8948  (best small)
qwen3-vl-8b-sft-grpo-nvfp4    0.8945  (best quantized)
qwen35-2b-base                0.8437  (best Qwen3.5 base)
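The scan itself reduces to a date-window and pipeline-tag filter over model metadata. A minimal sketch, assuming each candidate is a dict with `id`, `pipeline_tag`, and `lastModified` fields (in practice this metadata would come from `huggingface_hub.HfApi.list_models`, which is not called here):

```python
from datetime import date, datetime

# Scout window used in this report.
WINDOW_START = date(2026, 4, 9)
WINDOW_END = date(2026, 4, 16)

# Pipeline tags we treat as "VLM" for scouting purposes (assumed set).
VLM_TAGS = {"image-text-to-text", "any-to-any"}

def in_scout_window(model: dict) -> bool:
    """True if the model is a VLM created or modified inside the window."""
    if model["pipeline_tag"] not in VLM_TAGS:
        return False
    modified = datetime.fromisoformat(model["lastModified"]).date()
    return WINDOW_START <= modified <= WINDOW_END

catalog = [
    {"id": "Qwen/Qwen3.6-35B-A3B", "pipeline_tag": "image-text-to-text",
     "lastModified": "2026-04-15T00:00:00"},
    {"id": "google/gemma-4-E4B-it", "pipeline_tag": "any-to-any",
     "lastModified": "2026-04-10T00:00:00"},
    {"id": "zai-org/GLM-4.7-Flash", "pipeline_tag": "text-generation",
     "lastModified": "2026-04-12T00:00:00"},
]

hits = [m["id"] for m in catalog if in_scout_window(m)]
print(hits)  # text-only GLM-4.7-Flash is filtered out
```

The same predicate explains the "Skipped" section below: text-only models fail the tag check, and releases older than the window fail the date check.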

Candidates

1. Qwen/Qwen3.6-35B-A3B β€” Relevance: HIGH

  • Link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
  • Created: 2026-04-15 (1 day old)
  • Size: 35B total / 3B active (MoE, 256 experts, 8 routed + 1 shared)
  • Pipeline: image-text-to-text β€” native multimodal (image + video)
  • Context: 256K native, 1M with YaRN
  • License: Apache 2.0
  • VRAM: ~72 GB BF16, ~36 GB FP8 β€” fits comfortably on RTX PRO 6000 98GB
  • Reported benchmarks: MMLU-Pro 85.2, GPQA 86.0, Video-MMMU 83.7, SWE-bench Verified 73.4
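The VRAM figures above follow from simple weight-memory arithmetic. A sketch that ignores KV cache, activations, and framework overhead (so real usage runs a few GB higher, consistent with the ~72/~36 GB quoted):

```python
def weight_vram_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return total_params_b * bytes_per_param

# 35B total parameters: all experts must be resident even though
# only ~3B are active per token, so total (not active) count drives VRAM.
bf16 = weight_vram_gb(35, 2.0)  # BF16 = 2 bytes/param
fp8 = weight_vram_gb(35, 1.0)   # FP8 = 1 byte/param
print(bf16, fp8)
```

Note the MoE caveat in the comment: active-parameter count sets inference speed, but total-parameter count sets memory.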

Why it may beat our best (0.9131):

  • Direct Qwen3-VL successor β€” our pipeline (Qwen3-VL-8B SFT+GRPO) should port with minimal changes.
  • MoE with 3B active parameters means per-token inference cost near our 2B models while drawing on 35B total parameters of capacity.
  • Same chat template / processor family, so our eval harness and reward engine likely work out of the box.
  • 301 HF likes within one day of release signal strong community reception.

Action: Clone, run zero-shot on the 3,500 eval set, then SFT+GRPO with existing config. Strong contender to top the leaderboard.


2. google/gemma-4-E4B-it β€” Relevance: HIGH

  • Link: https://huggingface.co/google/gemma-4-E4B-it
  • Created: 2026-03-02; lastModified 2026-04-10 (within window)
  • Size: ~4.5B effective (8B with embeddings), dense; ~150M vision encoder
  • Pipeline: any-to-any (image + text + audio)
  • Context: 128K
  • License: Apache 2.0
  • Downloads: 1.8M β€” proven in the wild
  • Reported benchmarks: MMMU-Pro 52.6, MATH-Vision 59.5 (beats Gemma 3 27B)

Why it may beat our best (0.9131):

  • A different architectural family β€” first real non-Qwen competitor worth benchmarking since Granite-4-Vision. Our Granite4-Vision-SFT reached 88.25% on the 100-sample eval, so Gemma 4's stronger vision stack could exceed it.
  • Gemma 4 E4B reportedly outperforms Gemma 3 27B on vision, so its vision encoder is substantially stronger per-parameter.
  • Native function-calling makes structured JSON output stable pre-SFT β€” may close the format gap that Florence-2 suffers from.
  • 4.5B effective is a reasonable middle ground between our 2B and 8B deployments.

Action: Zero-shot eval first to see where Gemma's base vision stands relative to our Qwen bases (qwen35-2b-base scored 0.8437). If the base lands in the competitive ~0.80+ band alongside Qwen3-VL-2B, proceed with SFT+GRPO.


3. google/gemma-4-E2B-it β€” Relevance: HIGH

  • Link: https://huggingface.co/google/gemma-4-E2B-it
  • Created: 2026-03-02; lastModified 2026-04-10 (within window)
  • Size: ~5.1B parameters BF16 (E2B = "effective 2B" per Google naming)
  • Pipeline: any-to-any
  • License: Apache 2.0
  • Downloads: 1.4M

Why it matters: Direct size-class competitor to qwen3-vl-2b-sft-grpo-v9 (0.8948). If Gemma 4 E2B matches or beats Qwen3-VL-2B on our hard eval, we gain a second small-model family to hedge deployment options and diversify our ensemble.

Action: Run zero-shot first; benchmark decision contingent on baseline being β‰₯ 0.70.
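The per-candidate decisions in items 1-3 all follow the same gate, so it is worth writing down once. A sketch using the thresholds named in this report (0.70 floor for small models, ~0.80+ band to justify the SFT+GRPO budget); the gates are editorial choices, not fixed policy:

```python
def triage(zero_shot_score: float, small_model: bool = False) -> str:
    """Map a candidate's zero-shot weighted_score to a next action.

    Small models (2B class) must clear 0.70 to be worth further
    benchmarking; any candidate in the ~0.80+ band gets full SFT+GRPO.
    """
    if small_model and zero_shot_score < 0.70:
        return "skip"
    if zero_shot_score >= 0.80:
        return "sft+grpo"
    return "watchlist"

print(triage(0.84))                    # strong base -> sft+grpo
print(triage(0.65, small_model=True))  # below small-model floor -> skip
print(triage(0.75))                    # middling base -> watchlist
```

This keeps the sweep cheap: only candidates that clear the gate consume SFT+GRPO compute.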


4. google/gemma-4-31B-it β€” Relevance: MEDIUM

Why watch: Dense 31B VLM with strong reported vision benchmarks (MMMU 73.8, MATH-Vision 82.4 on the A4B sibling). However, 31B dense is 10x our active-compute budget vs. Qwen3.6-35B-A3B's 3B active β€” harder to justify unless zero-shot is dramatically stronger.

Action: Defer until after Qwen3.6-35B-A3B and Gemma 4 E4B results.


5. google/gemma-4-26B-A4B-it β€” Relevance: MEDIUM

Why watch: Closest direct peer to Qwen3.6-35B-A3B (both MoE, ~3B active). Good for apples-to-apples comparison across families at fixed active-compute.

Action: Benchmark in the same sweep as Qwen3.6-35B-A3B.


6. pingmong/Qwen3-VL-{2B,8B}-Instruct-fashion-product-images-small β€” Relevance: LOW

Why noted: Fashion-domain fine-tunes on the same base we use. Without a model card, training quality and label schema match are unverifiable. If their 9-field schema differs from ours, inference will be noise.

Action: Low priority. Skip unless bandwidth is free β€” our own SFT+GRPO pipeline likely already subsumes their training signal.
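The schema-mismatch risk flagged above is cheap to check mechanically before spending any eval budget. A sketch assuming the label schema is a fixed set of nine field names; `FIELDS` below is illustrative, not our actual schema, which lives in the eval config:

```python
# Hypothetical 9-field label schema (placeholder names, not our real config).
FIELDS = {"category", "color", "pattern", "material", "sleeve",
          "neckline", "fit", "season", "gender"}

def schema_compatible(sample_output: dict) -> bool:
    """True if a model's JSON output carries exactly the expected fields."""
    return set(sample_output) == FIELDS

probe = {f: "" for f in FIELDS}
print(schema_compatible(probe))                   # exact match
print(schema_compatible({**probe, "brand": ""}))  # extra field -> mismatch
```

Running this against a handful of the fine-tune's outputs would settle the "inference will be noise" question in minutes.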


Skipped (surfaced but not relevant)

  • LiquidAI/LFM2.5-VL-450M β€” released Nov 2025, not new; model card explicitly notes it's "not well-suited for knowledge-intensive tasks."
  • zai-org/GLM-4.7-Flash β€” text-only, not a VLM.
  • OpenGVLab/InternVL3_5-8B β€” released Aug 2025, already beyond our scout window. Worth a dedicated revisit given CascadeRL and 16% reasoning gain vs. InternVL3, but out of scope for today.
  • Various community quantizations of Qwen3-VL, Gemma 4, etc. β€” not new architectures.
  • No new InternVL4, Florence-3, MiniCPM-V5, SmolVLM3, Idefics4, Molmo2, or Moondream3 releases detected.

Recommended Next Steps

  1. Benchmark Qwen/Qwen3.6-35B-A3B immediately β€” same Qwen family, highest ceiling, lowest porting cost.
  2. Zero-shot eval google/gemma-4-E4B-it and google/gemma-4-E2B-it β€” first serious non-Qwen contenders in months; decide SFT budget based on base scores.
  3. Fold gemma-4-26B-A4B-it into the same sweep as Qwen3.6-35B-A3B for fair MoE-vs-MoE comparison.

Best current benchmark to beat: qwen3-vl-8b-sft+grpo at 0.9131 weighted.
