Daily Model Scout Report β€” 2026-04-16

#13
by msudharsanan - opened
Denali Advanced Integration org

Scope

Scan of HuggingFace for VLMs created or modified between 2026-04-09 and 2026-04-16, across all architectures. Current baseline for comparison (weighted_score on our 3,500-sample hard eval):

Model                         Weighted Score
qwen3-vl-8b-sft+grpo          0.9131  (best overall)
qwen3-vl-2b-sft-grpo-v9       0.8948  (best small)
qwen3-vl-8b-sft-grpo-nvfp4    0.8945  (best quantized)
qwen35-2b-base                0.8437  (best Qwen3.5 base)
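The scan itself reduces to a date-window and pipeline-tag filter over model metadata. A minimal sketch, assuming each candidate is a dict with `id`, `pipeline_tag`, and `lastModified` fields (in practice this metadata would come from `huggingface_hub.HfApi.list_models`, which is not called here):

```python
from datetime import date, datetime

# Scout window used in this report.
WINDOW_START = date(2026, 4, 9)
WINDOW_END = date(2026, 4, 16)

# Pipeline tags we treat as "VLM" for scouting purposes (assumed set).
VLM_TAGS = {"image-text-to-text", "any-to-any"}

def in_scout_window(model: dict) -> bool:
    """True if the model is a VLM created or modified inside the window."""
    if model["pipeline_tag"] not in VLM_TAGS:
        return False
    modified = datetime.fromisoformat(model["lastModified"]).date()
    return WINDOW_START <= modified <= WINDOW_END

catalog = [
    {"id": "Qwen/Qwen3.6-35B-A3B", "pipeline_tag": "image-text-to-text",
     "lastModified": "2026-04-15T00:00:00"},
    {"id": "google/gemma-4-E4B-it", "pipeline_tag": "any-to-any",
     "lastModified": "2026-04-10T00:00:00"},
    {"id": "zai-org/GLM-4.7-Flash", "pipeline_tag": "text-generation",
     "lastModified": "2026-04-12T00:00:00"},
]

hits = [m["id"] for m in catalog if in_scout_window(m)]
print(hits)  # text-only GLM-4.7-Flash is filtered out
```

The same predicate explains the "Skipped" section below: text-only models fail the tag check, and releases older than the window fail the date check.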

Candidates

1. Qwen/Qwen3.6-35B-A3B β€” Relevance: HIGH

  • Link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
  • Created: 2026-04-15 (1 day old)
  • Size: 35B total / 3B active (MoE, 256 experts, 8 routed + 1 shared)
  • Pipeline: image-text-to-text β€” native multimodal (image + video)
  • Context: 256K native, 1M with YaRN
  • License: Apache 2.0
  • VRAM: ~72 GB BF16, ~36 GB FP8 β€” fits comfortably on RTX PRO 6000 98GB
  • Reported benchmarks: MMLU-Pro 85.2, GPQA 86.0, Video-MMMU 83.7, SWE-bench Verified 73.4
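The VRAM figures above follow from simple weight-memory arithmetic. A sketch that ignores KV cache, activations, and framework overhead (so real usage runs a few GB higher, consistent with the ~72/~36 GB quoted):

```python
def weight_vram_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return total_params_b * bytes_per_param

# 35B total parameters: all experts must be resident even though
# only ~3B are active per token, so total (not active) count drives VRAM.
bf16 = weight_vram_gb(35, 2.0)  # BF16 = 2 bytes/param
fp8 = weight_vram_gb(35, 1.0)   # FP8 = 1 byte/param
print(bf16, fp8)
```

Note the MoE caveat in the comment: active-parameter count sets inference speed, but total-parameter count sets memory.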

Why it may beat our best (0.9131):

  • Direct Qwen3-VL successor β€” our pipeline (Qwen3-VL-8B SFT+GRPO) should port with minimal changes.
  • MoE with 3B active parameters means per-token inference cost near our 2B models while drawing on 35B total parameters of capacity.
  • Same chat template / processor family, so our eval harness and reward engine likely work out of the box.
  • 301 HF likes within one day of release signal strong community reception.

Action: Clone, run zero-shot on the 3,500 eval set, then SFT+GRPO with existing config. Strong contender to top the leaderboard.


2. google/gemma-4-E4B-it β€” Relevance: HIGH

  • Link: https://huggingface.co/google/gemma-4-E4B-it
  • Created: 2026-03-02; lastModified 2026-04-10 (within window)
  • Size: ~4.5B effective (8B with embeddings), dense; ~150M vision encoder
  • Pipeline: any-to-any (image + text + audio)
  • Context: 128K
  • License: Apache 2.0
  • Downloads: 1.8M β€” proven in the wild
  • Reported benchmarks: MMMU-Pro 52.6, MATH-Vision 59.5 (beats Gemma 3 27B)

Why it may beat our best (0.9131):

  • A different architectural family β€” first real non-Qwen competitor worth benchmarking since Granite-4-Vision. Our Granite4-Vision-SFT reached 88.25% on the 100-sample eval, so Gemma 4's stronger vision stack could exceed it.
  • Gemma 4 E4B reportedly outperforms Gemma 3 27B on vision, so its vision encoder is substantially stronger per-parameter.
  • Native function-calling makes structured JSON output stable pre-SFT β€” may close the format gap that Florence-2 suffers from.
  • 4.5B effective is a reasonable middle ground between our 2B and 8B deployments.

Action: Zero-shot eval first to see where Gemma's base vision stands relative to our Qwen bases (qwen35-2b-base scored 0.8437). If the base lands in the competitive ~0.80+ band alongside Qwen3-VL-2B, proceed with SFT+GRPO.


3. google/gemma-4-E2B-it β€” Relevance: HIGH

  • Link: https://huggingface.co/google/gemma-4-E2B-it
  • Created: 2026-03-02; lastModified 2026-04-10 (within window)
  • Size: ~5.1B parameters BF16 (E2B = "effective 2B" per Google naming)
  • Pipeline: any-to-any
  • License: Apache 2.0
  • Downloads: 1.4M

Why it matters: Direct size-class competitor to qwen3-vl-2b-sft-grpo-v9 (0.8948). If Gemma 4 E2B matches or beats Qwen3-VL-2B on our hard eval, we gain a second small-model family to hedge deployment options and diversify our ensemble.

Action: Run zero-shot first; benchmark decision contingent on baseline being β‰₯ 0.70.
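The per-candidate decisions in items 1-3 all follow the same gate, so it is worth writing down once. A sketch using the thresholds named in this report (0.70 floor for small models, ~0.80+ band to justify the SFT+GRPO budget); the gates are editorial choices, not fixed policy:

```python
def triage(zero_shot_score: float, small_model: bool = False) -> str:
    """Map a candidate's zero-shot weighted_score to a next action.

    Small models (2B class) must clear 0.70 to be worth further
    benchmarking; any candidate in the ~0.80+ band gets full SFT+GRPO.
    """
    if small_model and zero_shot_score < 0.70:
        return "skip"
    if zero_shot_score >= 0.80:
        return "sft+grpo"
    return "watchlist"

print(triage(0.84))                    # strong base -> sft+grpo
print(triage(0.65, small_model=True))  # below small-model floor -> skip
print(triage(0.75))                    # middling base -> watchlist
```

This keeps the sweep cheap: only candidates that clear the gate consume SFT+GRPO compute.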


4. google/gemma-4-31B-it β€” Relevance: MEDIUM

Why watch: Dense 31B VLM with strong reported vision benchmarks (MMMU 73.8, MATH-Vision 82.4 on the A4B sibling). However, 31B dense is 10x our active-compute budget vs. Qwen3.6-35B-A3B's 3B active β€” harder to justify unless zero-shot is dramatically stronger.

Action: Defer until after Qwen3.6-35B-A3B and Gemma 4 E4B results.


5. google/gemma-4-26B-A4B-it β€” Relevance: MEDIUM

Why watch: Closest direct peer to Qwen3.6-35B-A3B (both MoE, ~3B active). Good for apples-to-apples comparison across families at fixed active-compute.

Action: Benchmark in the same sweep as Qwen3.6-35B-A3B.


6. pingmong/Qwen3-VL-{2B,8B}-Instruct-fashion-product-images-small β€” Relevance: LOW

Why noted: Fashion-domain fine-tunes on the same base we use. Without a model card, training quality and label schema match are unverifiable. If their 9-field schema differs from ours, inference will be noise.

Action: Low priority. Skip unless bandwidth is free β€” our own SFT+GRPO pipeline likely already subsumes their training signal.
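The schema-mismatch risk flagged above is cheap to check mechanically before spending any eval budget. A sketch assuming the label schema is a fixed set of nine field names; `FIELDS` below is illustrative, not our actual schema, which lives in the eval config:

```python
# Hypothetical 9-field label schema (placeholder names, not our real config).
FIELDS = {"category", "color", "pattern", "material", "sleeve",
          "neckline", "fit", "season", "gender"}

def schema_compatible(sample_output: dict) -> bool:
    """True if a model's JSON output carries exactly the expected fields."""
    return set(sample_output) == FIELDS

probe = {f: "" for f in FIELDS}
print(schema_compatible(probe))                   # exact match
print(schema_compatible({**probe, "brand": ""}))  # extra field -> mismatch
```

Running this against a handful of the fine-tune's outputs would settle the "inference will be noise" question in minutes.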


Skipped (surfaced but not relevant)

  • LiquidAI/LFM2.5-VL-450M β€” released Nov 2025, not new; model card explicitly notes it's "not well-suited for knowledge-intensive tasks."
  • zai-org/GLM-4.7-Flash β€” text-only, not a VLM.
  • OpenGVLab/InternVL3_5-8B β€” released Aug 2025, already beyond our scout window. Worth a dedicated revisit given CascadeRL and 16% reasoning gain vs. InternVL3, but out of scope for today.
  • Various community quantizations of Qwen3-VL, Gemma 4, etc. β€” not new architectures.
  • No new InternVL4, Florence-3, MiniCPM-V5, SmolVLM3, Idefics4, Molmo2, or Moondream3 releases detected.

Recommended Next Steps

  1. Benchmark Qwen/Qwen3.6-35B-A3B immediately β€” same Qwen family, highest ceiling, lowest porting cost.
  2. Zero-shot eval google/gemma-4-E4B-it and google/gemma-4-E2B-it β€” first serious non-Qwen contenders in months; decide SFT budget based on base scores.
  3. Fold gemma-4-26B-A4B-it into the same sweep as Qwen3.6-35B-A3B for fair MoE-vs-MoE comparison.

Best current benchmark to beat: qwen3-vl-8b-sft+grpo at 0.9131 weighted.
