Daily Model Scout Report — 2026-05-05

Current Denali-AI baseline (3,500-sample hard eval, _overall.weighted_score)

Model                         Weighted score
qwen3-vl-8b-sft+grpo          0.9131  (best overall)
qwen3-vl-2b-sft-grpo-v9       0.8948  (best small)
qwen3-vl-8b-sft-grpo-nvfp4    0.8945  (best quantized)
qwen3-vl-8b-instruct-base     0.8751
qwen35-2b-base                0.8437

Note: granite4-vision-sft shows weighted_score=1.0144 in eval JSON; the >1.0 value is a known scoring artifact and is excluded from the headline comparison.
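
For context, a minimal sketch of how that exclusion is applied when building the headline table, assuming one eval JSON per run exposing the `_overall.weighted_score` path named above (the directory layout and file naming are assumptions):

```python
import json
from pathlib import Path

def headline_scores(results_dir: str) -> list[tuple[str, float]]:
    """Rank runs by _overall.weighted_score, dropping >1.0 artifacts."""
    scores = {}
    for path in Path(results_dir).glob("*.json"):
        with open(path) as f:
            run = json.load(f)
        score = run["_overall"]["weighted_score"]
        if score > 1.0:  # known scoring artifact, e.g. granite4-vision-sft at 1.0144
            continue
        scores[path.stem] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```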


Summary

Quiet day for our domain. Only one new official VLM from a major lab in the last 24h (allenai/Molmo2-ER), and it is specialized for embodied/spatial reasoning rather than JSON attribute extraction. The community continues to flood the image-text-to-text category with Qwen3.6 derivatives (heretic/distilled/abliterated) — none of them are signal.


Medium Relevance — worth watching

1. allenai/Molmo2-ER

  • Released: 2026-05-04 (1 day ago) — 0 dl / 1 like at scrape time
  • Size: 5B (Qwen3-4B-Instruct-2507 backbone + SigLIP2 vision encoder)
  • License: Apache-2.0
  • Link: allenai/Molmo2-ER
  • Why it matters: Same backbone family as our top performer (Qwen3-VL); +17pt over Molmo2-4B on AI2's 13-benchmark embodied reasoning suite (Point-Bench, RefSpatial, BLINK, CV-Bench, ERQA, EmbSpatial, MindCube, SAT, VSI-Bench). Reportedly competitive with Gemini Robot-ER 1.5 Thinking and GPT-5 on grounding tasks.
  • Risk for our use case: Trained for pixel-accurate pointing and spatial reasoning, not structured JSON extraction. Likely lower out-of-the-box on our 9-field schema than qwen3-vl-8b-instruct-base (0.8751). Could help defect localization if we revisit SAM3.1's bbox/mask pipeline (currently 226-tuple training set), but we already have a working pipeline there.
  • Action: Skip benchmark on hard-eval-3500 unless we want a SigLIP2-backbone reference point (smoke-test sketch below). If we revisit garment-localization, this is the strongest candidate to bolt onto sellability scoring.
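
If we do later want that reference point, a minimal smoke-test sketch, assuming Molmo2-ER keeps the original Molmo remote-code interface (processor.process + model.generate_from_batch); this is unverified, and the image path and prompt are placeholders:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

MODEL_ID = "allenai/Molmo2-ER"

# trust_remote_code mirrors how the original Molmo checkpoints load; whether
# Molmo2 keeps this interface is an assumption.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

inputs = processor.process(
    images=[Image.open("sample_garment.jpg")],  # placeholder image
    text="Return the 9-field garment JSON.",    # stand-in for our schema prompt
)
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

out = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=512, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
print(processor.tokenizer.decode(
    out[0, inputs["input_ids"].size(1):], skip_special_tokens=True
))
```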

2. atbender/Qwen3.6-VL-REAP-26B-A3B

  • Released: 2026-04-19 (16 days ago) — 637 dl / 9 likes
  • Size: 26B-A3B (REAP-pruned from official Qwen/Qwen3.6-35B-A3B, ~26% reduction)
  • License: Apache-2.0
  • Link: atbender/Qwen3.6-VL-REAP-26B-A3B
  • Why it matters: Smaller MoE footprint than the official 35B-A3B; if quality holds, ~3B active params at inference is very attractive for our RTX PRO 6000. Quantized variants (NVFP4, AWQ, GGUF) are already appearing in the community.
  • Risk: Third-party prune with no published eval. Pruning typically degrades language fluency before vision, so we would need our own benchmark, not a community LLMArena number.
  • Action: Hold. Re-evaluate if Qwen/Qwen3.6 lands an official VL-tagged release, or if atbender publishes pruning eval data.

Low Relevance — note and skip

  • Community Qwen3.6 derivatives (Claude-4.6/4.7-Opus-Distilled, heretic, abliterated, Wasserstein-NLS, etc.) — uncensoring/distillation experiments on Qwen3.6-27B and Qwen3.6-35B-A3B. Not relevant to a structured-extraction task.
  • Gemma-4 community fine-tunes (spoomplesmaxx-gemma4-31B, gemma4-bangla-synthdog, RickGemma4b_fr, gemma-4-E2B-it-PARO) — domain mismatch (uncensoring, Hindi/Bangla OCR, persona tuning).
  • LightOnOCR-2 fine-tunes (Phu-Hien/LightOnOCR-2-ft-04, AlgirdasV/LightOnOCR-2-ft-iam) — pure OCR pipeline, not multi-attribute classification.
  • Anony100/FashionVLM — no model card, no published architecture, no eval. Cannot evaluate.
  • google/gemma-4-*-it-assistant variants (2026-04-23) — assistant fine-tunes of the Gemma-4 family covered in earlier reports. No evidence of garment-domain gains.

Notable absences (still no movement)

  • No official Qwen/Qwen3.6-VL release. Qwen3.6-27B and Qwen3.6-35B-A3B (released 2026-04-15 and 2026-04-21, both Apache-2.0, both image-text-to-text-tagged) remain the only Qwen3.6 entry points. Worth re-confirming whether their image branch matches Qwen3-VL quality on a small sample before any hard-eval spend; see the sketch after this list.
  • No Florence-3. Florence-2 still the only enc-dec contender; remains unable to handle our 9-field JSON well (~0.39 weighted overall).
  • No Phi-4-Multimodal update; the last meaningful release is still the original.
  • No new MiniCPM-V, InternVL4, or DeepSeek-VL release in the last 14 days.
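
A minimal sketch of that small-sample re-confirmation, assuming the Qwen3.6 checkpoints load through the standard transformers image-text-to-text path (model id is the entry point named above; prompt and image are placeholders):

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "Qwen/Qwen3.6-27B"  # entry point named above; vision quality unverified

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        # placeholder image; the exact content key can vary by processor version
        {"type": "image", "image": "sample_garment.jpg"},
        {"type": "text", "text": "Return the 9-field garment JSON."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(
    out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
))
```

A few dozen samples scored against qwen3-vl-8b-instruct-base outputs should be enough to decide whether a full hard-eval slot is justified.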

Recommended actions

  1. No new benchmark spend required from today's scouting.
  2. Watch list: allenai/Molmo2-ER if defect-localization comes back into scope; atbender/Qwen3.6-VL-REAP-26B-A3B pending an official Qwen3.6-VL release or third-party eval data.
  3. Open question for prioritization: should we spend a benchmark slot on Qwen/Qwen3.6-27B (1.4M dl, Apache-2.0, dense, fits in 98GB at BF16; back-of-envelope check below) before continuing iteration on qwen3-vl-8b-sft+grpo? It is image-text-to-text-tagged, but its vision quality vs. Qwen3-VL-8B is unverified for our schema.
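
Supporting arithmetic for the fits-in-98GB claim in (3); actual headroom depends on KV cache, activations, and the vision tower:

```python
# 27B dense parameters at BF16 (2 bytes each) ~= 54 GB of weights,
# leaving roughly 44 GB of a 98 GB card for KV cache and activations.
params = 27e9
bytes_per_param = 2  # BF16
weights_gb = params * bytes_per_param / 1e9
print(f"weights ~{weights_gb:.0f} GB, headroom ~{98 - weights_gb:.0f} GB of 98 GB")
```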
