Daily Model Scout Report — 2026-05-07

#24 · opened by msudharsanan (Denali Advanced Integration org)

Scope: New VLMs released in the last ~7 days (after 2026-04-30) that could move the needle on our 3,500-sample hard eval (current best: qwen3-vl-8b-sft+grpo at weighted_score 0.9131).

Method: Searched HF API image-text-to-text, image-to-text, visual-question-answering pipeline tags + targeted searches for Gemma-4, Qwen3.6, Qwen3-VL, InternVL3.5, Florence, MiniCPM-V, Phi-4, Cosmos, LFM2.5-VL, PaliGemma, SmolVLM, Hunyuan, Molmo, Moondream, LLaVA, Idefics, fashion/garment/clothing fine-tunes.

TL;DR: Google released Gemma 4 today (2026-05-07). Four sizes, native multimodal, Apache 2.0, with reported vision benchmarks that dramatically exceed Gemma 3 (MMMU Pro 76.9% vs 49.7%, MATH-Vision 85.6% vs 46.0%). This is the most important VLM release for our pipeline since Qwen3-VL-8B-Instruct, and a serious challenger to the Qwen3.6 family flagged yesterday. Two new candidates are HIGH priority for benchmarking; everything else this week is incremental.


🔥 HIGH — benchmark immediately

1. google/gemma-4-31B-it — Relevance: HIGH

  • Released: 2026-05-07 (today) · Downloads (already): 8.59M · Likes: 2,552
  • Link: https://huggingface.co/google/gemma-4-31B-it
  • Architecture: Gemma4ForConditionalGeneration, dense 30.7B, 60 layers, hybrid local-sliding + global attention, p-RoPE on global layers, vision encoder ~550M params, 256K context window. Native system role. Configurable thinking mode.
  • License: Apache 2.0 (with Gemma usage policy) — clean for ReLo commercial.
  • Fits our hardware? Yes. BF16 ≈ 62 GB → comfortable on RTX PRO 6000 98 GB.
  • Why it might help:
    • Vision benchmarks reported by Google: MMMU Pro 76.9%, MATH-Vision 85.6%, OmniDocBench 0.131 edit distance (lower is better; Gemma 3 27B was 0.365). The OmniDocBench delta in particular is what matters for our hard fields (text-heavy: brand, size, defect callouts).
    • Capability list explicitly includes object detection, document/PDF parsing, screen/UI understanding, chart comprehension, multilingual OCR, handwriting, and pointing — all directly relevant to garment-attribute extraction.
    • Variable aspect ratio + variable resolution image processing → no forced square crops, which have hurt our brand and size accuracy on tall/wide tag photos.
  • Recommendation: Probe with our standard inference template on the 100-sample eval first (low cost, fast signal). If overall ≥ Qwen3-VL-8B-Instruct base (78.14%), kick off the standard SFT+GRPO pipeline on apparel-capture-8k-train (7,672 rows) and run the 3.5k hard eval.
  • Risk / unknowns: Brand-new arch (gemma4); Liger / Unsloth / vLLM / NVFP4 paths likely all need wiring up. Treat the first run as the integration shakedown — same gotcha story as Granite Vision 4.x.
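As a sanity check on the fit claim above, the weights-only arithmetic (2 bytes per BF16 parameter, decimal GB; activations and KV cache come on top) looks like:

```python
def bf16_weight_gb(params_billion: float) -> float:
    """Weights-only footprint in decimal GB at 2 bytes per BF16 parameter."""
    return params_billion * 2.0

# gemma-4-31B-it: 30.7B dense -> 61.4 GB of weights, consistent with the
# ~62 GB figure above; leaves ~36 GB on a 98 GB RTX PRO 6000 for activations
# and KV cache (a 256K context will eat into that headroom fast).
headroom_gb = 98 - bf16_weight_gb(30.7)
```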

2. google/gemma-4-26B-A4B-it — Relevance: HIGH

  • Released: 2026-05-07 (today) · Downloads: 6.83M · Likes: 902
  • Link: https://huggingface.co/google/gemma-4-26B-A4B-it
  • Architecture: MoE, 25.2B total / 3.8B active, 30 layers, 128 experts (8 routed + 1 shared), 256K context, vision encoder ~550M. Gemma4ForConditionalGeneration.
  • License: Apache 2.0.
  • Fits our hardware? Yes. BF16 ≈ 50 GB on disk; the full set of experts has to be resident, but well under 98 GB.
  • Why it might help: Same vision capability stack as the 31B dense, but inference cost ≈ a 4B dense model. For ReLo throughput at 8k+ images/day this is the more deployable shape. Reported MMMU Pro 73.8% / OmniDocBench 0.149 — close to the 31B and well above any of our current models on these proxies.
  • Recommendation: Pair this with #1 — if both train cleanly on the same recipe, the MoE almost certainly wins on $/throughput and is the better production candidate. This is the natural head-to-head with Qwen/Qwen3.6-35B-A3B (covered in yesterday's report #23).
  • Risk: MoE training in TRL/Unsloth has historically been the most painful path — expect to land a few patches. Cosmos-style downstream FP8/NVFP4 quant of the experts is also unproven.
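The MoE trade-off in #2 is worth making explicit: memory scales with total parameters (every expert must be resident), while per-token compute scales with active parameters. A rough sketch of the numbers above:

```python
BYTES_PER_PARAM_BF16 = 2.0

def moe_resident_gb(total_params_b: float) -> float:
    """All experts must be loaded, so memory follows TOTAL params (decimal GB)."""
    return total_params_b * BYTES_PER_PARAM_BF16

def active_compute_ratio(active_params_b: float, dense_params_b: float) -> float:
    """Per-token FLOPs proxy: active params vs a dense reference model."""
    return active_params_b / dense_params_b

# gemma-4-26B-A4B: 25.2B total -> ~50 GB resident (fits the 98 GB card),
# but only 3.8B active per token -> roughly 1/8 of the 31B dense's compute.
resident_gb = moe_resident_gb(25.2)
ratio = active_compute_ratio(3.8, 30.7)   # ~0.12
```

This is the arithmetic behind "inference cost ≈ a 4B dense model" despite a 31B-class memory footprint.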

🌟 MEDIUM — worth watching, run small probes

3. google/gemma-4-E4B-it — Relevance: MEDIUM

  • Released: 2026-05-07 · Downloads: 5.49M · Likes: 938
  • Link: https://huggingface.co/google/gemma-4-E4B-it
  • Architecture: 4.5B effective params (8B with PLE — Per-Layer Embeddings, used for lookups only), vision encoder ~150M, 128K context. Also includes audio.
  • Why MEDIUM: Reported MMMU Pro 52.6% / MATH-Vision 59.5% — well below the 26B/31B siblings and below Qwen3-VL-8B-Instruct base (78.14%). Probably won't win the 9-field eval as a base, but is the obvious replacement candidate for Qwen3.5-0.8B / 2B as our small-deploy class if SFT+GRPO closes the gap. Audio-capable variant could matter for future "describe-the-defect" voice annotation use cases.
  • Recommendation: 100-sample probe + light SFT compare against qwen3-vl-2b-sft-grpo-v9 (0.8948 weighted) before committing to a full pipeline.

4. google/gemma-4-E2B-it — Relevance: MEDIUM

  • Released: 2026-05-07 · Downloads: 3.40M · Likes: 579
  • Link: https://huggingface.co/google/gemma-4-E2B-it
  • Architecture: 2.3B effective (5.1B with PLE), vision + audio, 128K context.
  • Why MEDIUM: Same story as E4B but smaller. MMMU Pro 44.2%. Edge-class deploy candidate; would be the replacement for qwen35-08b-sft-merged if the SFT story ports cleanly.
  • Recommendation: Park for now; revisit after E4B probe lands.

5. nvidia/Cosmos-Reason2-2B — Relevance: MEDIUM (already in pipeline)

  • Released: 2026-04-30 (within window) · Downloads: 160k · Likes: 70
  • Link: https://huggingface.co/nvidia/Cosmos-Reason2-2B
  • Architecture: Qwen3-VL-2B fine-tune with Cosmos physical-AI post-training. qwen3_vl model_type — drop-in compatible with our existing Qwen3-VL training stack.
  • Status: Already trained on the sellability + SAM3.1 schema as run #740 on 2026-04-30 (per project_sellability_sam3_training memory). Eval comparison vs qwen3-vl-2b-sft-grpo-v9 (0.8948) is the open question.
  • Recommendation: Pull the eval results once #740 lands; no new action.

🪶 LOW — tangential or not in window

  • LiquidAI/LFM2.5-VL-450M — heavy fine-tune ecosystem this week (landslide, wildfire, VRSBench, methane), but the base was released 2026-04-08 (outside 7-day window). Edge-class only; we already have stronger small-model coverage.
  • Qwen3.6 family (Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B) — released 2026-04-24, outside the 7-day window today; covered as HIGH in yesterday's report (#23). Still our other top training candidate alongside Gemma 4.
  • OpenGVLab/InternVL3_5-{1B,2B,8B,14B} — last touched 2025-08-29; nothing new this week. We already have InternVL3-2B in the eval table at 0.7222–0.7271 — well below current SOTA.
  • tencent/HunyuanOCR — strong OCR-specialist VLM (lastModified 2026-01-13) but not fresh this week.
  • openbmb/MiniCPM-V-4_5 — lastModified 2026-03-10; outside window.
  • AllenAI/MolmoAct2-* — released 2026-05-05, but action/robotics specialists (LIBERO, DROID, BimanualYAM). Not classification-relevant.
  • Qwen/Qwen3-VL-Embedding-2B — only sentence-similarity pipeline; not for our extraction task.
  • Spam/community noise: A very large fraction of this week's image-text-to-text uploads are unofficial Qwen3.6 / Qwen3.5-VL "uncensored / heretic / abliterated" community merges (LuffyTheFox, llmfan46, dealignai, etc.) and a wave of Gemma4-{26B,31B}-MLX-Q{4..8} and GGUF re-quantizations released within hours of Google's launch. Skipping all of these — we benchmark official base models, not community quants.
  • No new fashion/garment-specific VLMs in the window.

Suggested next actions (priority order)

  1. Smoke-test google/gemma-4-31B-it and google/gemma-4-26B-A4B-it with our standard inference template on the 100-sample eval. Apply the PeakBench serve-script registration gates (literal CONFIG, one-script-per-model, banner — transformers first line) — Gemma 4 will need a new serve-script template since it's a brand-new arch family.
  2. If 100-sample is competitive, register both in PeakBench (don't run ad-hoc inference scripts) and queue the 3.5k hard eval as the gating signal.
  3. Kick off SFT runs on the 7,672-row apparel-capture-8k-train for whichever clears the bar. Use the standard pipeline: train β†’ eval on 3.5k hard β†’ update JSON/wiki β†’ upload to HF with full model card + charts.
  4. Plan a Qwen3.6-35B-A3B vs Gemma-4-26B-A4B-it MoE head-to-head once both have landed SFT runs — this is the next "what do we actually deploy" question.

Report generated 2026-05-07 by /hf-model-scout. Comparison baseline: 3,500-sample hard eval, weighted_score from wiki-models-contrib/models/eval_all_results.json.
