Daily Model Scout Report – 2026-05-07
Scope: New VLMs released in the last ~7 days (after 2026-04-30) that could move the needle on our 3,500-sample hard eval (current best: qwen3-vl-8b-sft+grpo at weighted_score 0.9131).
Method: Searched the HF API `image-text-to-text`, `image-to-text`, and `visual-question-answering` pipeline tags, plus targeted searches for Gemma-4, Qwen3.6, Qwen3-VL, InternVL3.5, Florence, MiniCPM-V, Phi-4, Cosmos, LFM2.5-VL, PaliGemma, SmolVLM, Hunyuan, Molmo, Moondream, LLaVA, Idefics, and fashion/garment/clothing fine-tunes.
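A minimal sketch of that search pass, assuming the standard `huggingface_hub` client; the sort key, result limit, and printed fields are illustrative and may need adjusting per `huggingface_hub` version:

```python
from datetime import datetime, timezone
from huggingface_hub import HfApi

api = HfApi()
cutoff = datetime(2026, 4, 30, tzinfo=timezone.utc)  # start of the ~7-day window

candidates = []
for tag in ("image-text-to-text", "image-to-text", "visual-question-answering"):
    # Pull the newest uploads per pipeline tag and keep only those inside the window.
    for m in api.list_models(pipeline_tag=tag, sort="createdAt", direction=-1, limit=500):
        if m.created_at and m.created_at >= cutoff:
            candidates.append((m.id, tag, m.downloads or 0, m.likes or 0))

# Rank by downloads so the obvious releases (e.g. the Gemma 4 drops) float to the top.
for model_id, tag, downloads, likes in sorted(candidates, key=lambda c: -c[2]):
    print(f"{model_id:60s} {tag:28s} downloads={downloads:>9} likes={likes}")
```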
TL;DR: Google released Gemma 4 today (2026-05-07). Four sizes, native multimodal, Apache 2.0, with reported vision benchmarks that dramatically exceed Gemma 3 (MMMU Pro 76.9% vs 49.7%, MATH-Vision 85.6% vs 46.0%). This is the most important VLM release for our pipeline since Qwen3-VL-8B-Instruct, and a serious challenger to the Qwen3.6 family flagged yesterday. Two new candidates are HIGH priority for benchmarking; everything else this week is incremental.
HIGH – benchmark immediately
1. google/gemma-4-31B-it – Relevance: HIGH
- Released: 2026-05-07 (today) · Downloads (already): 8.59M · Likes: 2,552
- Link: https://huggingface.co/google/gemma-4-31B-it
- Architecture: `Gemma4ForConditionalGeneration`, dense 30.7B, 60 layers, hybrid local-sliding + global attention, p-RoPE on global layers, vision encoder ~550M params, 256K context window. Native `system` role. Configurable thinking mode.
- License: Apache 2.0 (with Gemma usage policy) – clean for ReLo commercial use.
- Fits our hardware? Yes. BF16 ≈ 62 GB – comfortable on the RTX PRO 6000's 98 GB.
- Why it might help:
- Vision benchmarks reported by Google: MMMU Pro 76.9%, MATH-Vision 85.6%, OmniDocBench 0.131 edit distance (lower is better; Gemma 3 27B was 0.365). The OmniDocBench delta in particular is what matters for our hard fields (text-heavy: brand, size, defect callouts).
- Capability list explicitly includes object detection, document/PDF parsing, screen/UI understanding, chart comprehension, multilingual OCR, handwriting, and pointing – all directly relevant to garment-attribute extraction.
- Variable aspect ratio + variable resolution image processing – no forced square crops, which have hurt our brand and size accuracy on tall/wide tag photos.
- Recommendation: Probe with our standard inference template on the 100-sample eval first (low cost, fast signal); a probe sketch follows after this item. If overall ≥ Qwen3-VL-8B-Instruct base (78.14%), kick off the standard SFT+GRPO pipeline on `apparel-capture-8k-train` (7,672 rows) and run the 3.5k hard eval.
- Risk / unknowns: Brand-new arch (`gemma4`); Liger / Unsloth / vLLM / NVFP4 paths likely all need wiring up. Treat the first run as the integration shakedown – same gotcha story as Granite Vision 4.x.
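A minimal sketch of what that 100-sample probe could look like, assuming a `transformers` release that ships the `gemma4` architecture and loads through the generic `AutoModelForImageTextToText` / `AutoProcessor` classes; the prompt and file name below are placeholders, not our actual inference template or 9-field schema:

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "google/gemma-4-31B-it"  # swap in google/gemma-4-26B-A4B-it for the MoE probe

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # ~62 GB of weights, fits the 98 GB card
    device_map="cuda",
)

def extract_attributes(image_path: str, prompt: str) -> str:
    """Run one garment photo through the chat template and return the raw generation."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open(image_path)},
            {"type": "text", "text": prompt},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    return processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

# Placeholder prompt; the real template and field schema live in our eval harness.
print(extract_attributes("tag_photo_0001.jpg", "Extract brand, size, and visible defects as JSON."))
```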
2. google/gemma-4-26B-A4B-it – Relevance: HIGH
- Released: 2026-05-07 (today) · Downloads: 6.83M · Likes: 902
- Link: https://huggingface.co/google/gemma-4-26B-A4B-it
- Architecture: `Gemma4ForConditionalGeneration` MoE, 25.2B total / 3.8B active, 30 layers, 128 experts (8 routed + 1 shared), 256K context, vision encoder ~550M.
- License: Apache 2.0.
- Fits our hardware? Yes. BF16 ≈ 50 GB on disk; the full set of experts has to be resident, but that is well under 98 GB (see the arithmetic sketch after this item).
- Why it might help: Same vision capability stack as the 31B dense, but inference cost ≈ a 4B dense model. For ReLo throughput at 8k+ images/day this is the more deployable shape. Reported MMMU Pro 73.8% / OmniDocBench 0.149 – close to the 31B and well above any of our current models on these proxies.
- Recommendation: Pair this with #1 – if both train cleanly on the same recipe, the MoE almost certainly wins on $/throughput and is the better production candidate. This is the natural head-to-head with `Qwen/Qwen3.6-35B-A3B` (covered in yesterday's report #23).
- Risk: MoE training in TRL/Unsloth has historically been the most painful path – expect to land a few patches. Cosmos-style downstream FP8/NVFP4 quant of the experts is also unproven.
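A back-of-the-envelope sketch of the dense-vs-MoE trade, using only the parameter counts quoted above (BF16 weights only; KV cache, activations, and the vision encoder split are ignored, and the active-parameter ratio is a rough proxy for per-token compute):

```python
BYTES_PER_BF16 = 2
GB = 1e9  # decimal GB, matching the ~62 GB / ~50 GB figures quoted above

def bf16_weight_gb(params_billion: float) -> float:
    """Resident weight footprint for a BF16 checkpoint, weights only."""
    return params_billion * 1e9 * BYTES_PER_BF16 / GB

dense_total = dense_active = 30.7   # gemma-4-31B-it: every parameter fires on every token
moe_total, moe_active = 25.2, 3.8   # gemma-4-26B-A4B-it: all experts resident, few active

print(f"dense: ~{bf16_weight_gb(dense_total):.0f} GB weights, {dense_active}B active/token")
print(f"moe:   ~{bf16_weight_gb(moe_total):.0f} GB weights, {moe_active}B active/token")
print(f"active-compute ratio ~ {dense_active / moe_active:.1f}x in the MoE's favour")
```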
MEDIUM – worth watching, run small probes
3. google/gemma-4-E4B-it – Relevance: MEDIUM
- Released: 2026-05-07 · Downloads: 5.49M · Likes: 938
- Link: https://huggingface.co/google/gemma-4-E4B-it
- Architecture: 4.5B effective params (8B with PLE – Per-Layer Embeddings, used for lookups only), vision encoder ~150M, 128K context. Also includes audio.
- Why MEDIUM: Reported MMMU Pro 52.6% / MATH-Vision 59.5% – well below the 26B/31B siblings and below Qwen3-VL-8B-Instruct base (78.14%). Probably won't win the 9-field eval as a base, but it is the obvious replacement candidate for Qwen3.5-0.8B / 2B in our small-deploy class if SFT+GRPO closes the gap. The audio-capable variant could matter for future "describe-the-defect" voice annotation use cases.
- Recommendation: 100-sample probe + light SFT compare against `qwen3-vl-2b-sft-grpo-v9` (0.8948 weighted) before committing to a full pipeline.
4. google/gemma-4-E2B-it – Relevance: MEDIUM
- Released: 2026-05-07 · Downloads: 3.40M · Likes: 579
- Link: https://huggingface.co/google/gemma-4-E2B-it
- Architecture: 2.3B effective (5.1B with PLE), vision + audio, 128K context.
- Why MEDIUM: Same story as E4B but smaller. MMMU Pro 44.2%. Edge-class deploy candidate; would be the replacement for `qwen35-08b-sft-merged` if the SFT story ports cleanly.
- Recommendation: Park for now; revisit after the E4B probe lands.
5. nvidia/Cosmos-Reason2-2B – Relevance: MEDIUM (already in pipeline)
- Released: 2026-04-30 (within window) · Downloads: 160k · Likes: 70
- Link: https://huggingface.co/nvidia/Cosmos-Reason2-2B
- Architecture: Qwen3-VL-2B fine-tune with Cosmos physical-AI post-training. `qwen3_vl` model_type – drop-in compatible with our existing Qwen3-VL training stack.
- Status: Already trained on the sellability + SAM3.1 schema as run #740 on 2026-04-30 (per the `project_sellability_sam3_training` memory). Eval comparison vs `qwen3-vl-2b-sft-grpo-v9` (0.8948) is the open question.
- Recommendation: Pull the eval results once #740 lands; no new action. A sketch of that comparison pull follows after this item.
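A minimal sketch of pulling that comparison, assuming `eval_all_results.json` maps model names to entries carrying a `weighted_score` field; the challenger's entry name is a placeholder since run #740 hasn't landed yet:

```python
import json
from pathlib import Path

RESULTS = Path("wiki-models-contrib/models/eval_all_results.json")  # baseline file cited at the end of this report

BASELINE = "qwen3-vl-2b-sft-grpo-v9"        # current 2B-class best, 0.8948 weighted
CHALLENGER = "cosmos-reason2-2b-run740"     # placeholder name for run #740's entry

results = json.loads(RESULTS.read_text())

def weighted(name: str) -> float:
    # Assumes each entry exposes a top-level "weighted_score" on the 3.5k hard eval.
    return results[name]["weighted_score"]

base, chall = weighted(BASELINE), weighted(CHALLENGER)
print(f"{CHALLENGER}: {chall:.4f}  vs  {BASELINE}: {base:.4f}")
print("challenger wins the 2B class" if chall > base else "baseline holds")
```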
LOW – tangential or not in window
- LiquidAI/LFM2.5-VL-450M – heavy fine-tune ecosystem this week (landslide, wildfire, VRSBench, methane), but the base was released 2026-04-08, outside the 7-day window. Edge-class only; we already have stronger small-model coverage.
- Qwen3.6 family (Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B) – released 2026-04-24, now outside the 7-day window; covered as HIGH in yesterday's report (#23). Still our other top training candidate alongside Gemma 4.
- OpenGVLab/InternVL3_5-{1B,2B,8B,14B} – last touched 2025-08-29; nothing new this week. We already have InternVL3-2B in the eval table at 0.7222–0.7271, well below current SOTA.
- tencent/HunyuanOCR – strong OCR-specialist VLM (lastModified 2026-01-13) but not fresh this week.
- openbmb/MiniCPM-V-4_5 – lastModified 2026-03-10; outside the window.
- AllenAI/MolmoAct2-* – released 2026-05-05, but these are action/robotics specialists (LIBERO, DROID, BimanualYAM). Not classification-relevant.
- Qwen/Qwen3-VL-Embedding-2B – sentence-similarity pipeline only; not for our extraction task.
- Spam/community noise: A very large fraction of this week's `image-text-to-text` uploads are unofficial Qwen3.6 / Qwen3.5-VL "uncensored / heretic / abliterated" community merges (LuffyTheFox, llmfan46, dealignai, etc.) and a wave of `Gemma4-{26B,31B}-MLX-Q{4..8}` and GGUF re-quantizations released within hours of Google's launch. Skipping all of these – we benchmark official base models, not community quants.
- No new fashion/garment-specific VLMs in the window.
Suggested next actions (priority order)
- Smoke-test `google/gemma-4-31B-it` and `google/gemma-4-26B-A4B-it` with our standard inference template on the 100-sample eval. Apply the PeakBench serve-script registration gates (literal CONFIG, one-script-per-model, banner → `transformers` first line) – Gemma 4 will need a new serve-script template since it's a brand-new arch family; a skeleton sketch follows after this list.
- If the 100-sample run is competitive, register both in PeakBench (don't run ad-hoc inference scripts) and queue the 3.5k hard eval as the gating signal.
- Kick off SFT runs on the 7,672-row `apparel-capture-8k-train` for whichever clears the bar. Use the standard pipeline: train → eval on 3.5k hard → update JSON/wiki → upload to HF with full model card + charts.
- Plan a Qwen3.6-35B-A3B vs Gemma-4-26B-A4B-it MoE head-to-head once both have SFT runs landed – this is the next "what do we actually deploy" question.
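A skeleton sketch of what a new Gemma 4 serve script might look like under those gates; the gate list (literal CONFIG, one script per model, banner, `transformers` on the first line) is taken from above, while the file name, CONFIG keys, and entry-point shape are assumptions about our PeakBench conventions rather than the registered template:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor  # transformers import on line 1, per the gate
import torch

# serve_gemma_4_31b_it.py – one script per model, per the PeakBench registration gates

# Literal CONFIG, no indirection; keys are assumed, not the canonical PeakBench schema.
CONFIG = {
    "model_id": "google/gemma-4-31B-it",
    "dtype": "bfloat16",
    "max_new_tokens": 512,
}

print(f"[serve] {CONFIG['model_id']} dtype={CONFIG['dtype']}")  # banner

processor = AutoProcessor.from_pretrained(CONFIG["model_id"])
model = AutoModelForImageTextToText.from_pretrained(
    CONFIG["model_id"],
    torch_dtype=getattr(torch, CONFIG["dtype"]),
    device_map="cuda",
)

def serve(messages: list[dict]) -> str:
    """Single-request entry point; how PeakBench actually hooks this is assumed, not shown."""
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=CONFIG["max_new_tokens"], do_sample=False)
    return processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
```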
Report generated 2026-05-07 by /hf-model-scout. Comparison baseline: 3,500-sample hard eval, weighted_score from wiki-models-contrib/models/eval_all_results.json.