Daily Model Scout Report - 2026-04-16
Scope
Scan of HuggingFace for VLMs created or modified between 2026-04-09 and 2026-04-16, broad across architectures. Current baselines for comparison (weighted_score on our 3,500-sample hard eval):
| Model | Weighted Score |
|---|---|
| qwen3-vl-8b-sft+grpo | 0.9131 (best overall) |
| qwen3-vl-2b-sft-grpo-v9 | 0.8948 (best small) |
| qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 (best quantized) |
| qwen35-2b-base | 0.8437 (best Qwen3.5 base) |
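For readers new to the metric: a minimal sketch of how a weighted score of this kind is computed. The category names and weights below are hypothetical placeholders, not our actual harness config.

```python
def weighted_score(per_category_acc, weights):
    """Weighted mean of per-category accuracies.

    Illustration only: the real eval's categories and weights
    live in our harness config, not here.
    """
    total_w = sum(weights.values())
    return sum(per_category_acc[c] * weights[c] for c in weights) / total_w

# Hypothetical example values:
acc = {"ocr": 0.95, "grounding": 0.88, "reasoning": 0.91}
w = {"ocr": 1.0, "grounding": 2.0, "reasoning": 2.0}
score = weighted_score(acc, w)  # (0.95*1 + 0.88*2 + 0.91*2) / 5 = 0.906
```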
Candidates
1. Qwen/Qwen3.6-35B-A3B - Relevance: HIGH
- Link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
- Created: 2026-04-15 (1 day old)
- Size: 35B total / 3B active (MoE, 256 experts, 8 routed + 1 shared)
- Pipeline: image-text-to-text (native multimodal: image + video)
- Context: 256K native, 1M with YaRN
- License: Apache 2.0
- VRAM: ~72 GB BF16, ~36 GB FP8; fits comfortably on an RTX PRO 6000 (98 GB)
- Reported benchmarks: MMLU-Pro 85.2, GPQA 86.0, VideoMMU 83.7, SWE-bench Verified 73.4
Why it may beat our best (0.9131):
- Direct Qwen3-VL successor: our pipeline (Qwen3-VL-8B SFT+GRPO) should port with minimal changes.
- MoE 3B-active means inference speed comparable to our 2B models but capacity of a 35B dense model.
- Same chat template / processor family, so our eval harness and reward engine likely work out of the box.
- 301 HF likes within one day of release signal strong community reception.
Action: Clone, run zero-shot on the 3,500 eval set, then SFT+GRPO with existing config. Strong contender to top the leaderboard.
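The VRAM figures above follow from simple bytes-per-parameter arithmetic. A quick sanity check (weights only; KV cache, activations, and runtime overhead come on top):

```python
def weight_vram_gb(params_b, bytes_per_param):
    """Approximate weight memory in GB for a model with
    `params_b` billion parameters at a given precision.
    Weights only; KV cache and activations come on top."""
    return params_b * 1e9 * bytes_per_param / 1e9  # = params_b * bytes_per_param

# Qwen3.6-35B-A3B: all 35B expert weights must be resident,
# even though only ~3B are active per token.
bf16 = weight_vram_gb(35, 2.0)   # 70 GB, in line with the ~72 GB estimate above
fp8  = weight_vram_gb(35, 1.0)   # 35 GB
```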
2. google/gemma-4-E4B-it - Relevance: HIGH
- Link: https://huggingface.co/google/gemma-4-E4B-it
- Created: 2026-03-02; lastModified 2026-04-10 (within window)
- Size: ~4.5B effective (8B with embeddings), dense; ~150M vision encoder
- Pipeline: any-to-any (image + text + audio)
- Context: 128K
- License: Apache 2.0
- Downloads: 1.8M; proven in the wild
- Reported benchmarks: MMMU-Pro 52.6, MATH-Vision 59.5 (beats Gemma 3 27B)
Why it may beat our best (0.9131):
- A different architectural family: the first real non-Qwen competitor worth benchmarking since Granite-4-Vision. Our Granite4-Vision-SFT reached 88.25% on the 100-sample eval, so Gemma 4's stronger vision stack could exceed it.
- Gemma 4 E4B reportedly outperforms Gemma 3 27B on vision, so its vision encoder is substantially stronger per-parameter.
- Native function-calling makes structured JSON output stable pre-SFT and may close the format gap that Florence-2 suffers from.
- 4.5B effective is a reasonable middle ground between our 2B and 8B deployments.
Action: Zero-shot eval first to see where Gemma's base vision stands against our Qwen baselines (best base: 0.8437). If the base lands in the Qwen3-VL-2B-competitive band (~0.80+), proceed with SFT+GRPO.
3. google/gemma-4-E2B-it - Relevance: HIGH
- Link: https://huggingface.co/google/gemma-4-E2B-it
- Created: 2026-03-02; lastModified 2026-04-10 (within window)
- Size: ~5.1B parameters BF16 (E2B = "effective 2B" per Google naming)
- Pipeline: any-to-any
- License: Apache 2.0
- Downloads: 1.4M
Why it matters: Direct size-class competitor to qwen3-vl-2b-sft-grpo-v9 (0.8948). If Gemma 4 E2B matches or beats Qwen3-VL-2B on our hard eval, we gain a second small-model family to hedge deployment options and diversify our ensemble.
Action: Run zero-shot first; proceed to full benchmarking only if the baseline is ≥ 0.70.
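The zero-shot gating described for both Gemma candidates reduces to a simple decision rule. The thresholds come from the actions above (≥ 0.70 floor, ~0.80+ competitive band); the function itself is only an illustration, not part of our harness.

```python
def sft_decision(model_name, zero_shot_score, floor=0.70, competitive_band=0.80):
    """Decide whether a candidate earns SFT+GRPO budget
    based on its zero-shot weighted score."""
    if zero_shot_score < floor:
        return f"{model_name}: drop (below {floor} floor)"
    if zero_shot_score >= competitive_band:
        return f"{model_name}: proceed with SFT+GRPO"
    return f"{model_name}: borderline; re-check prompt/format before committing GPU budget"
```

For example, a hypothetical 0.74 zero-shot baseline would land in the borderline bucket rather than triggering an immediate SFT run.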
4. google/gemma-4-31B-it - Relevance: MEDIUM
- Link: https://huggingface.co/google/gemma-4-31B-it
- Size: 31.3B dense
- License: Apache 2.0
- VRAM: ~63 GB BF16; fits on an RTX PRO 6000 (98 GB)
- Downloads: 3.2M
Why watch: Dense 31B VLM with strong reported vision benchmarks (MMMU-Pro 73.8, MATH-Vision 82.4 on the A4B sibling). However, 31B dense is ~10x our active-compute budget vs. Qwen3.6-35B-A3B's 3B active; harder to justify unless zero-shot is dramatically stronger.
Action: Defer until after Qwen3.6-35B-A3B and Gemma 4 E4B results.
5. google/gemma-4-26B-A4B-it - Relevance: MEDIUM
- Link: https://huggingface.co/google/gemma-4-26B-A4B-it
- Size: 25.2B total / 3.8B active (MoE)
- License: Apache 2.0
- Reported: MMMU-Pro 73.8, MATH-Vision 82.4
Why watch: Closest direct peer to Qwen3.6-35B-A3B (both MoE, ~3B active). Good for apples-to-apples comparison across families at fixed active-compute.
Action: Benchmark in the same sweep as Qwen3.6-35B-A3B.
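To make the active-compute argument in items 4-5 concrete: per-token FLOPs scale roughly with active parameters, so the counts cited above imply these cost ratios against our ~3B-active budget.

```python
# Total vs. active parameter counts as cited in the entries above.
candidates = {
    "Qwen3.6-35B-A3B":    {"total_b": 35.0, "active_b": 3.0},
    "gemma-4-26B-A4B-it": {"total_b": 25.2, "active_b": 3.8},
    "gemma-4-31B-it":     {"total_b": 31.3, "active_b": 31.3},  # dense: all params active
}

for name, p in candidates.items():
    # Rough proxy: relative per-token inference cost vs. a 3B-active budget.
    rel_cost = p["active_b"] / 3.0
    print(f"{name}: {p['active_b']}B active -> ~{rel_cost:.1f}x our active-compute budget")
```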
6. pingmong/Qwen3-VL-{2B,8B}-Instruct-fashion-product-images-small - Relevance: LOW
- Links:
- https://huggingface.co/pingmong/Qwen3-VL-8B-Instruct-fashion-product-images-small (created 2026-04-09)
- https://huggingface.co/pingmong/Qwen3-VL-2B-Instruct-fashion-product-images-small (created 2026-04-10)
- Size: 2B / 8B (Qwen3-VL base)
- Model card: missing; no training data, task, or metrics documented
Why noted: Fashion-domain fine-tunes on the same base we use. Without a model card, training quality and label schema match are unverifiable. If their 9-field schema differs from ours, inference will be noise.
Action: Low priority. Skip unless bandwidth is free; our own SFT+GRPO pipeline likely already subsumes their training signal.
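A cheap pre-filter before spending any eval budget on these fine-tunes is a field-set comparison between their output schema and ours. Both field lists below are hypothetical placeholders; neither schema is documented, which is exactly the problem.

```python
def schema_compatible(ours, theirs):
    """Return (missing, extra) field sets relative to our schema.
    Any mismatch means their outputs would be noise for our eval."""
    ours, theirs = set(ours), set(theirs)
    return ours - theirs, theirs - ours

# Hypothetical 9-field schema for illustration only:
our_fields = ["category", "color", "pattern", "sleeve", "neckline",
              "material", "fit", "season", "style"]
their_fields = ["category", "color", "pattern", "style"]  # unknown in reality
missing, extra = schema_compatible(our_fields, their_fields)
# Non-empty `missing` -> their fine-tune cannot serve our task as-is.
```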
Skipped (surfaced but not relevant)
- LiquidAI/LFM2.5-VL-450M - released Nov 2025, not new; the model card explicitly notes it is "not well-suited for knowledge-intensive tasks."
- zai-org/GLM-4.7-Flash - text-only, not a VLM.
- OpenGVLab/InternVL3_5-8B - released Aug 2025, outside our scout window. Worth a dedicated revisit given CascadeRL and its reported 16% reasoning gain over InternVL3, but out of scope for today.
- Various community quantizations of Qwen3-VL, Gemma 4, etc. - not new architectures.
- No new InternVL4, Florence-3, MiniCPM-V5, SmolVLM3, Idefics4, Molmo2, or Moondream3 releases detected.
Recommended Next Steps
- Benchmark Qwen/Qwen3.6-35B-A3B immediately: same Qwen family, highest ceiling, lowest porting cost.
- Zero-shot eval google/gemma-4-E4B-it and google/gemma-4-E2B-it: first serious non-Qwen contenders in months; decide SFT budget based on base scores.
- Fold gemma-4-26B-A4B-it into the same sweep as Qwen3.6-35B-A3B for a fair MoE-vs-MoE comparison.
Best current benchmark to beat: qwen3-vl-8b-sft+grpo at 0.9131 weighted.