Daily Model Scout Report — 2026-05-07

#24 · opened by msudharsanan (Denali Advanced Integration org)

Scope: New VLMs released in the last ~7 days (after 2026-04-30) that could move the needle on our 3,500-sample hard eval (current best: qwen3-vl-8b-sft+grpo at weighted_score 0.9131).

Method: Searched HF API image-text-to-text, image-to-text, visual-question-answering pipeline tags + targeted searches for Gemma-4, Qwen3.6, Qwen3-VL, InternVL3.5, Florence, MiniCPM-V, Phi-4, Cosmos, LFM2.5-VL, PaliGemma, SmolVLM, Hunyuan, Molmo, Moondream, LLaVA, Idefics, fashion/garment/clothing fine-tunes.

TL;DR: Google released Gemma 4 today (2026-05-07). Four sizes, native multimodal, Apache 2.0, with reported vision benchmarks that dramatically exceed Gemma 3 (MMMU Pro 76.9% vs 49.7%, MATH-Vision 85.6% vs 46.0%). This is the most important VLM release for our pipeline since Qwen3-VL-8B-Instruct, and a serious challenger to the Qwen3.6 family flagged yesterday. Two new candidates are HIGH priority for benchmarking; everything else this week is incremental.


🔥 HIGH — benchmark immediately

1. google/gemma-4-31B-it — Relevance: HIGH

  • Released: 2026-05-07 (today) · Downloads (already): 8.59M · Likes: 2,552
  • Link: https://huggingface.co/google/gemma-4-31B-it
  • Architecture: Gemma4ForConditionalGeneration, dense 30.7B, 60 layers, hybrid local-sliding + global attention, p-RoPE on global layers, vision encoder ~550M params, 256K context window. Native system role. Configurable thinking mode.
  • License: Apache 2.0 (with Gemma usage policy) — clean for ReLo commercial.
  • Fits our hardware? Yes. BF16 ≈ 62 GB → comfortable on RTX PRO 6000 98 GB.
  • Why it might help:
    • Vision benchmarks reported by Google: MMMU Pro 76.9%, MATH-Vision 85.6%, OmniDocBench 0.131 edit distance (lower is better; Gemma 3 27B was 0.365). The OmniDocBench delta in particular is what matters for our hard fields (text-heavy: brand, size, defect callouts).
    • Capability list explicitly includes object detection, document/PDF parsing, screen/UI understanding, chart comprehension, multilingual OCR, handwriting, and pointing — all directly relevant to garment-attribute extraction.
    • Variable aspect ratio + variable resolution image processing → no forced square crops, which have hurt our brand and size accuracy on tall/wide tag photos.
  • Recommendation: Probe with our standard inference template on the 100-sample eval first (low cost, fast signal). If overall ≥ Qwen3-VL-8B-Instruct base (78.14%), kick off the standard SFT+GRPO pipeline on apparel-capture-8k-train (7,672 rows) and run the 3.5k hard eval.
  • Risk / unknowns: Brand-new arch (gemma4); Liger / Unsloth / vLLM / NVFP4 paths likely all need wiring up. Treat the first run as the integration shakedown — same gotcha story as Granite Vision 4.x.
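As a sanity check on the fit claim above, the weights-only arithmetic (2 bytes per BF16 parameter, decimal GB; activations and KV cache come on top) looks like:

```python
def bf16_weight_gb(params_billion: float) -> float:
    """Weights-only footprint in decimal GB at 2 bytes per BF16 parameter."""
    return params_billion * 2.0

# gemma-4-31B-it: 30.7B dense -> 61.4 GB of weights, consistent with the
# ~62 GB figure above; leaves ~36 GB on a 98 GB RTX PRO 6000 for activations
# and KV cache (a 256K context will eat into that headroom fast).
headroom_gb = 98 - bf16_weight_gb(30.7)
```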

2. google/gemma-4-26B-A4B-it — Relevance: HIGH

  • Released: 2026-05-07 (today) · Downloads: 6.83M · Likes: 902
  • Link: https://huggingface.co/google/gemma-4-26B-A4B-it
  • Architecture: MoE, 25.2B total / 3.8B active, 30 layers, 128 experts (8 routed + 1 shared), 256K context, vision encoder ~550M. Gemma4ForConditionalGeneration.
  • License: Apache 2.0.
  • Fits our hardware? Yes. BF16 ≈ 50 GB on disk; the full set of experts has to be resident, but well under 98 GB.
  • Why it might help: Same vision capability stack as the 31B dense, but inference cost ≈ a 4B dense model. For ReLo throughput at 8k+ images/day this is the more deployable shape. Reported MMMU Pro 73.8% / OmniDocBench 0.149 — close to the 31B and well above any of our current models on these proxies.
  • Recommendation: Pair this with #1 — if both train cleanly on the same recipe, the MoE almost certainly wins on $/throughput and is the better production candidate. This is the natural head-to-head with Qwen/Qwen3.6-35B-A3B (covered in yesterday's report #23).
  • Risk: MoE training in TRL/Unsloth has historically been the most painful path — expect to land a few patches. Cosmos-style downstream FP8/NVFP4 quant of the experts is also unproven.
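The MoE trade-off in #2 is worth making explicit: memory scales with total parameters (every expert must be resident), while per-token compute scales with active parameters. A rough sketch of the numbers above:

```python
BYTES_PER_PARAM_BF16 = 2.0

def moe_resident_gb(total_params_b: float) -> float:
    """All experts must be loaded, so memory follows TOTAL params (decimal GB)."""
    return total_params_b * BYTES_PER_PARAM_BF16

def active_compute_ratio(active_params_b: float, dense_params_b: float) -> float:
    """Per-token FLOPs proxy: active params vs a dense reference model."""
    return active_params_b / dense_params_b

# gemma-4-26B-A4B: 25.2B total -> ~50 GB resident (fits the 98 GB card),
# but only 3.8B active per token -> roughly 1/8 of the 31B dense's compute.
resident_gb = moe_resident_gb(25.2)
ratio = active_compute_ratio(3.8, 30.7)   # ~0.12
```

This is the arithmetic behind "inference cost ≈ a 4B dense model" despite a 31B-class memory footprint.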

🌟 MEDIUM — worth watching, run small probes

3. google/gemma-4-E4B-it — Relevance: MEDIUM

  • Released: 2026-05-07 · Downloads: 5.49M · Likes: 938
  • Link: https://huggingface.co/google/gemma-4-E4B-it
  • Architecture: 4.5B effective params (8B with PLE — Per-Layer Embeddings, used for lookups only), vision encoder ~150M, 128K context. Also includes audio.
  • Why MEDIUM: Reported MMMU Pro 52.6% / MATH-Vision 59.5% — well below the 26B/31B siblings and below Qwen3-VL-8B-Instruct base (78.14%). Probably won't win the 9-field eval as a base, but is the obvious replacement candidate for Qwen3.5-0.8B / 2B as our small-deploy class if SFT+GRPO closes the gap. Audio-capable variant could matter for future "describe-the-defect" voice annotation use cases.
  • Recommendation: 100-sample probe + light SFT compare against qwen3-vl-2b-sft-grpo-v9 (0.8948 weighted) before committing to a full pipeline.

4. google/gemma-4-E2B-it — Relevance: MEDIUM

  • Released: 2026-05-07 · Downloads: 3.40M · Likes: 579
  • Link: https://huggingface.co/google/gemma-4-E2B-it
  • Architecture: 2.3B effective (5.1B with PLE), vision + audio, 128K context.
  • Why MEDIUM: Same story as E4B but smaller. MMMU Pro 44.2%. Edge-class deploy candidate; would be the replacement for qwen35-08b-sft-merged if the SFT story ports cleanly.
  • Recommendation: Park for now; revisit after E4B probe lands.

5. nvidia/Cosmos-Reason2-2B — Relevance: MEDIUM (already in pipeline)

  • Released: 2026-04-30 (within window) · Downloads: 160k · Likes: 70
  • Link: https://huggingface.co/nvidia/Cosmos-Reason2-2B
  • Architecture: Qwen3-VL-2B fine-tune with Cosmos physical-AI post-training. qwen3_vl model_type — drop-in compatible with our existing Qwen3-VL training stack.
  • Status: Already trained on the sellability + SAM3.1 schema as run #740 on 2026-04-30 (per project_sellability_sam3_training memory). Eval comparison vs qwen3-vl-2b-sft-grpo-v9 (0.8948) is the open question.
  • Recommendation: Pull the eval results once #740 lands; no new action.

🪶 LOW — tangential or not in window

  • LiquidAI/LFM2.5-VL-450M — heavy fine-tune ecosystem this week (landslide, wildfire, VRSBench, methane), but the base was released 2026-04-08 (outside 7-day window). Edge-class only; we already have stronger small-model coverage.
  • Qwen3.6 family (Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B) — released 2026-04-24, outside the 7-day window today; covered as HIGH in yesterday's report (#23). Still our other top training candidate alongside Gemma 4.
  • OpenGVLab/InternVL3_5-{1B,2B,8B,14B} — last touched 2025-08-29; nothing new this week. We already have InternVL3-2B in the eval table at 0.7222–0.7271 — well below current SOTA.
  • tencent/HunyuanOCR — strong OCR-specialist VLM (lastModified 2026-01-13) but not fresh this week.
  • openbmb/MiniCPM-V-4_5 — lastModified 2026-03-10; outside window.
  • AllenAI/MolmoAct2-* — released 2026-05-05, but action/robotics specialists (LIBERO, DROID, BimanualYAM). Not classification-relevant.
  • Qwen/Qwen3-VL-Embedding-2B — only sentence-similarity pipeline; not for our extraction task.
  • Spam/community noise: A very large fraction of this week's image-text-to-text uploads are unofficial Qwen3.6 / Qwen3.5-VL "uncensored / heretic / abliterated" community merges (LuffyTheFox, llmfan46, dealignai, etc.) and a wave of Gemma4-{26B,31B}-MLX-Q{4..8} and GGUF re-quantizations released within hours of Google's launch. Skipping all of these — we benchmark official base models, not community quants.
  • No new fashion/garment-specific VLMs in the window.

Suggested next actions (priority order)

  1. Smoke-test google/gemma-4-31B-it and google/gemma-4-26B-A4B-it with our standard inference template on the 100-sample eval. Apply the PeakBench serve-script registration gates (literal CONFIG, one-script-per-model, banner — transformers first line) — Gemma 4 will need a new serve-script template since it's a brand-new arch family.
  2. If 100-sample is competitive, register both in PeakBench (don't run ad-hoc inference scripts) and queue the 3.5k hard eval as the gating signal.
  3. Kick off SFT runs on the 7,672-row apparel-capture-8k-train for whichever clears the bar. Use the standard pipeline: train β†’ eval on 3.5k hard β†’ update JSON/wiki β†’ upload to HF with full model card + charts.
  4. Plan a Qwen3.6-35B-A3B vs Gemma-4-26B-A4B-it MoE head-to-head once both have landed SFT runs — this is the next "what do we actually deploy" question.

Report generated 2026-05-07 by /hf-model-scout. Comparison baseline: 3,500-sample hard eval, weighted_score from wiki-models-contrib/models/eval_all_results.json.
