Daily Model Scout Report – 2026-05-05
Current Denali-AI baseline (3,500-sample hard eval, `_overall.weighted_score`)
| Model | Weighted score | Notes |
|---|---|---|
| `qwen3-vl-8b-sft+grpo` | 0.9131 | best overall |
| `qwen3-vl-2b-sft-grpo-v9` | 0.8948 | best small |
| `qwen3-vl-8b-sft-grpo-nvfp4` | 0.8945 | best quantized |
| `qwen3-vl-8b-instruct-base` | 0.8751 | |
| `qwen35-2b-base` | 0.8437 | |
Note: `granite4-vision-sft` shows `weighted_score=1.0144` in its eval JSON; the >1.0 value is a known scoring artifact and is excluded from the headline comparison.
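For reproducibility, a minimal sketch of how the table above can be assembled from the eval JSONs, dropping the >1.0 artifact rows per the note. The `evals/hard-eval-3500/` layout and one-JSON-per-model convention are assumptions; the nested `_overall.weighted_score` path is from the header above.

```python
# Sketch: build the headline table from per-model eval JSONs.
# Assumptions: one <model>.json per model in a hypothetical evals/ directory,
# with the score nested as {"_overall": {"weighted_score": ...}}.
import json
from pathlib import Path

rows = []
for path in Path("evals/hard-eval-3500").glob("*.json"):  # hypothetical layout
    score = json.loads(path.read_text())["_overall"]["weighted_score"]
    if score > 1.0:  # known scoring artifact, e.g. granite4-vision-sft at 1.0144
        continue
    rows.append((path.stem, score))

for model, score in sorted(rows, key=lambda r: r[1], reverse=True):
    print(f"| {model} | {score:.4f} |")
```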
Summary
Quiet day for our domain. Only one new official VLM from a major lab in the last 24h (allenai/Molmo2-ER), and it is specialized for embodied/spatial reasoning rather than JSON attribute extraction. The community continues to flood image-text-to-text with Qwen3.6 derivatives (heretic/distilled/abliterated) – none are signal.
Medium Relevance – worth watching
1. allenai/Molmo2-ER
- Released: 2026-05-04 (1 day ago) – 0 dl / 1 like at scrape time
- Size: 5B (Qwen3-4B-Instruct-2507 backbone + SigLIP2 vision encoder)
- License: Apache-2.0
- Link: allenai/Molmo2-ER
- Why it matters: Same backbone family as our top performer (Qwen3-VL); +17 pt over `Molmo2-4B` on AI2's 13-benchmark embodied reasoning suite (Point-Bench, RefSpatial, BLINK, CV-Bench, ERQA, EmbSpatial, MindCube, SAT, VSI-Bench). Reportedly competitive with Gemini Robot-ER 1.5 Thinking and GPT-5 on grounding tasks.
- Risk for our use case: Trained for pixel-accurate pointing and spatial reasoning, not structured JSON extraction (see the schema sketch after this entry). Likely lower out-of-the-box on our 9-field schema than `qwen3-vl-8b-instruct-base` (0.8751). Could help defect localization if we revisit SAM3.1's bbox/mask pipeline (currently a 226-tuple training set), but we already have a working pipeline there.
- Action: Skip benchmarking on hard-eval-3500 unless we want a SigLIP2-backbone reference point. If we revisit garment localization, this is the strongest candidate to bolt onto sellability scoring.
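For context on what "structured JSON extraction" means here, a hedged sketch of the kind of 9-field validity gate our eval applies. The field names below are hypothetical stand-ins; the real schema's fields are not enumerated in this report.

```python
# Sketch: validity gate for a 9-field garment JSON prediction.
# Field names are hypothetical placeholders, not our real schema.
import json

REQUIRED_FIELDS = {
    "category", "color", "material", "pattern", "sleeve",
    "neckline", "fit", "condition", "brand_visible",
}

def parse_prediction(raw: str) -> dict | None:
    """Return the parsed dict only if it is valid JSON covering all 9 fields."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not REQUIRED_FIELDS <= obj.keys():
        return None
    return obj
```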
2. atbender/Qwen3.6-VL-REAP-26B-A3B
- Released: 2026-04-19 (16 days ago) – 637 dl / 9 likes
- Size: 26B-A3B (REAP-pruned from the official `Qwen/Qwen3.6-35B-A3B`, ~26% reduction)
- License: Apache-2.0
- Link: atbender/Qwen3.6-VL-REAP-26B-A3B
- Why it matters: Smaller MoE footprint than the official 35B-A3B; if quality holds, ~3B active params at inference is very attractive for our RTX PRO 6000 (back-of-envelope memory numbers after this entry). Quantized variants (NVFP4, AWQ, GGUF) are already appearing in the community.
- Risk: Third-party prune, no published eval. Pruning typically hits language fluency before vision; we would need our own benchmark, not a community LLMArena number.
- Action: Hold. Re-evaluate if `Qwen/Qwen3.6` lands an official VL-tagged release, or if `atbender` publishes pruning eval data.
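Back-of-envelope weight-memory numbers behind the footprint claim above. The precision costs are rough assumptions (NVFP4 taken as ~4.5 bits/param including scales), and KV cache, activations, and the vision tower are excluded.

```python
# Sketch: weights-only VRAM estimates for the official MoE vs. the REAP prune.
# Bits/param values are assumptions; excludes KV cache, activations, vision tower.
def weight_gb(params_b: float, bits_per_param: float) -> float:
    # params_b * 1e9 params * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_b * bits_per_param / 8

for name, total_b in [("Qwen3.6-35B-A3B (official)", 35.0), ("REAP prune (~26% smaller)", 26.0)]:
    for fmt, bits in [("BF16", 16.0), ("NVFP4 ~4.5-bit", 4.5)]:
        print(f"{name:27s} {fmt:14s} ~{weight_gb(total_b, bits):5.1f} GB")
```

The ~3B active params drive per-token compute, but the full expert set must still reside in memory, which is why the quantized variants matter.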
Low Relevance – note and skip
- Community Qwen3.6 derivatives (`Claude-4.6/4.7-Opus-Distilled`, `heretic`, `abliterated`, `Wasserstein-NLS`, etc.) – uncensoring/distillation experiments on `Qwen3.6-27B` and `Qwen3.6-35B-A3B`. Not relevant to a structured-extraction task.
- Gemma-4 community fine-tunes (`spoomplesmaxx-gemma4-31B`, `gemma4-bangla-synthdog`, `RickGemma4b_fr`, `gemma-4-E2B-it-PARO`) – domain mismatch (uncensoring, Hindi/Bangla OCR, persona tuning).
- LightOnOCR-2 fine-tunes (`Phu-Hien/LightOnOCR-2-ft-04`, `AlgirdasV/LightOnOCR-2-ft-iam`) – pure OCR pipeline, not multi-attribute classification.
- `Anony100/FashionVLM` – no model card, no published architecture, no eval. Cannot evaluate.
- `google/gemma-4-*-it-assistant` variants (2026-04-23) – assistant fine-tunes of the Gemma-4 family covered in earlier reports. No evidence of garment-domain gains.
Notable absences (still no movement)
- No official `Qwen/Qwen3.6-VL` release. `Qwen3.6-27B` and `Qwen3.6-35B-A3B` (released 2026-04-15 and 2026-04-21, both Apache-2.0, both `image-text-to-text`-tagged) remain the only Qwen3.6 entry points. Worth re-confirming whether their image branch matches `Qwen3-VL` quality on a small sample before any hard-eval spend (see the smoke-test sketch after this list).
- No Florence-3. Florence-2 is still the only enc-dec contender; it remains unable to handle our 9-field JSON well (~0.39 weighted overall).
- No Phi-4-Multimodal update. Last meaningful release was the original.
- No new MiniCPM-V, InternVL4, or DeepSeek-VL release in the last 14 days.
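A hedged smoke-test sketch for the Qwen3.6 image-branch question above: run a handful of samples through the checkpoint before committing a hard-eval-3500 slot. This assumes the checkpoint loads via the transformers `image-text-to-text` pipeline; the sample paths and prompt are placeholders, not our real 9-field prompt.

```python
# Sketch: small-sample sanity check before any hard-eval spend.
# Assumes Qwen/Qwen3.6-27B is supported by the image-text-to-text pipeline;
# sample paths and the prompt are placeholders.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Qwen/Qwen3.6-27B", device_map="auto")

for image_path in ["samples/garment_001.jpg", "samples/garment_002.jpg"]:  # placeholders
    out = pipe(images=image_path, text="List this garment's color, material, and pattern.")
    print(image_path, "->", out[0]["generated_text"])
```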
Recommended actions
- No new benchmark spend required from today's scouting.
- Watch list: `allenai/Molmo2-ER` if defect localization comes back into scope; `atbender/Qwen3.6-VL-REAP-26B-A3B` pending an official Qwen3.6-VL release or third-party eval data.
- Open question for prioritization: should we spend a benchmark slot on `Qwen/Qwen3.6-27B` (1.4M dl, Apache-2.0, dense, fits in 98 GB at BF16; rough fit check below) before continuing iteration on `qwen3-vl-8b-sft+grpo`? It is `image-text-to-text`-tagged, but vision quality vs. `Qwen3-VL-8B` is unverified for our schema.
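A rough fit check for the "98 GB at BF16" claim above. The GQA config below is a guess for illustration, not a published Qwen3.6-27B spec.

```python
# Sketch: does a dense 27B fit in 98 GB at BF16 with KV-cache headroom?
weights_gb = 27e9 * 2 / 1e9  # BF16 = 2 bytes/param -> ~54 GB

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (BF16).
layers, kv_heads, head_dim = 64, 8, 128  # hypothetical GQA config
kv_gb = 2 * layers * kv_heads * head_dim * 2 * 32_768 / 1e9  # 32k-token context
print(f"weights ~{weights_gb:.0f} GB + 32k KV ~{kv_gb:.1f} GB = ~{weights_gb + kv_gb:.0f} GB of 98 GB")
```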