Daily Model Scout Report — 2026-05-05

Current Denali-AI baseline (3,500-sample hard eval, _overall.weighted_score)

Model                         Weighted score
qwen3-vl-8b-sft+grpo          0.9131  (best overall)
qwen3-vl-2b-sft-grpo-v9       0.8948  (best small)
qwen3-vl-8b-sft-grpo-nvfp4    0.8945  (best quantized)
qwen3-vl-8b-instruct-base     0.8751
qwen35-2b-base                0.8437

Note: granite4-vision-sft shows weighted_score=1.0144 in eval JSON; the >1.0 value is a known scoring artifact and is excluded from the headline comparison.
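
For context, a minimal sketch of how that exclusion is applied when building the headline table, assuming one eval JSON per run exposing the `_overall.weighted_score` path named above (the directory layout and file naming are assumptions):

```python
import json
from pathlib import Path

def headline_scores(results_dir: str) -> list[tuple[str, float]]:
    """Rank runs by _overall.weighted_score, dropping >1.0 artifacts."""
    scores = {}
    for path in Path(results_dir).glob("*.json"):
        with open(path) as f:
            run = json.load(f)
        score = run["_overall"]["weighted_score"]
        if score > 1.0:  # known scoring artifact, e.g. granite4-vision-sft at 1.0144
            continue
        scores[path.stem] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```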


Summary

Quiet day for our domain. Only one new official VLM from a major lab in the last 24h (allenai/Molmo2-ER), and it is specialized for embodied/spatial reasoning rather than JSON attribute extraction. The community continues to flood the image-text-to-text category with Qwen3.6 derivatives (heretic/distilled/abliterated) — none of them are signal.


Medium Relevance — worth watching

1. allenai/Molmo2-ER

  • Released: 2026-05-04 (1 day ago) — 0 dl / 1 like at scrape time
  • Size: 5B (Qwen3-4B-Instruct-2507 backbone + SigLIP2 vision encoder)
  • License: Apache-2.0
  • Link: allenai/Molmo2-ER
  • Why it matters: Same backbone family as our top performer (Qwen3-VL); +17pt over Molmo2-4B on AI2's 13-benchmark embodied reasoning suite (Point-Bench, RefSpatial, BLINK, CV-Bench, ERQA, EmbSpatial, MindCube, SAT, VSI-Bench). Reportedly competitive with Gemini Robot-ER 1.5 Thinking and GPT-5 on grounding tasks.
  • Risk for our use case: Trained for pixel-accurate pointing and spatial reasoning, not structured JSON extraction. Likely lower out-of-the-box on our 9-field schema than qwen3-vl-8b-instruct-base (0.8751). Could help defect localization if we revisit SAM3.1's bbox/mask pipeline (currently 226-tuple training set), but we already have a working pipeline there.
  • Action: Skip benchmark on hard-eval-3500 unless we want a SigLIP2-backbone reference point (smoke-test sketch below). If we revisit garment-localization, this is the strongest candidate to bolt onto sellability scoring.
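
If we do later want that reference point, a minimal smoke-test sketch, assuming Molmo2-ER keeps the original Molmo remote-code interface (processor.process + model.generate_from_batch); this is unverified, and the image path and prompt are placeholders:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

MODEL_ID = "allenai/Molmo2-ER"

# trust_remote_code mirrors how the original Molmo checkpoints load; whether
# Molmo2 keeps this interface is an assumption.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

inputs = processor.process(
    images=[Image.open("sample_garment.jpg")],  # placeholder image
    text="Return the 9-field garment JSON.",    # stand-in for our schema prompt
)
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

out = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=512, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
print(processor.tokenizer.decode(
    out[0, inputs["input_ids"].size(1):], skip_special_tokens=True
))
```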

2. atbender/Qwen3.6-VL-REAP-26B-A3B

  • Released: 2026-04-19 (16 days ago) — 637 dl / 9 likes
  • Size: 26B-A3B (REAP-pruned from official Qwen/Qwen3.6-35B-A3B, ~26% reduction)
  • License: Apache-2.0
  • Link: atbender/Qwen3.6-VL-REAP-26B-A3B
  • Why it matters: Smaller MoE footprint than the official 35B-A3B; if quality holds, ~3B active params at inference is very attractive for our RTX PRO 6000. Quantized variants (NVFP4, AWQ, GGUF) are already appearing in the community.
  • Risk: Third-party prune with no published eval. Pruning typically degrades language fluency before vision, so we would need our own benchmark, not a community LLMArena number.
  • Action: Hold. Re-evaluate if Qwen/Qwen3.6 lands an official VL-tagged release, or if atbender publishes pruning eval data.

Low Relevance — note and skip

  • Community Qwen3.6 derivatives (Claude-4.6/4.7-Opus-Distilled, heretic, abliterated, Wasserstein-NLS, etc.) — uncensoring/distillation experiments on Qwen3.6-27B and Qwen3.6-35B-A3B. Not relevant to a structured-extraction task.
  • Gemma-4 community fine-tunes (spoomplesmaxx-gemma4-31B, gemma4-bangla-synthdog, RickGemma4b_fr, gemma-4-E2B-it-PARO) — domain mismatch (uncensoring, Hindi/Bangla OCR, persona tuning).
  • LightOnOCR-2 fine-tunes (Phu-Hien/LightOnOCR-2-ft-04, AlgirdasV/LightOnOCR-2-ft-iam) — pure OCR pipeline, not multi-attribute classification.
  • Anony100/FashionVLM — no model card, no published architecture, no eval. Cannot evaluate.
  • google/gemma-4-*-it-assistant variants (2026-04-23) — assistant fine-tunes of the Gemma-4 family covered in earlier reports. No evidence of garment-domain gains.

Notable absences (still no movement)

  • No official Qwen/Qwen3.6-VL release. Qwen3.6-27B and Qwen3.6-35B-A3B (released 2026-04-15 and 2026-04-21, both Apache-2.0, both image-text-to-text-tagged) remain the only Qwen3.6 entry points. Worth re-confirming whether their image branch matches Qwen3-VL quality on a small sample before any hard-eval spend; see the sketch after this list.
  • No Florence-3. Florence-2 still the only enc-dec contender; remains unable to handle our 9-field JSON well (~0.39 weighted overall).
  • No Phi-4-Multimodal update; the last meaningful release is still the original.
  • No new MiniCPM-V, InternVL4, or DeepSeek-VL release in the last 14 days.
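
A minimal sketch of that small-sample re-confirmation, assuming the Qwen3.6 checkpoints load through the standard transformers image-text-to-text path (model id is the entry point named above; prompt and image are placeholders):

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "Qwen/Qwen3.6-27B"  # entry point named above; vision quality unverified

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        # placeholder image; the exact content key can vary by processor version
        {"type": "image", "image": "sample_garment.jpg"},
        {"type": "text", "text": "Return the 9-field garment JSON."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(
    out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
))
```

A few dozen samples scored against qwen3-vl-8b-instruct-base outputs should be enough to decide whether a full hard-eval slot is justified.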

Recommended actions

  1. No new benchmark spend required from today's scouting.
  2. Watch list: allenai/Molmo2-ER if defect-localization comes back into scope; atbender/Qwen3.6-VL-REAP-26B-A3B pending an official Qwen3.6-VL release or third-party eval data.
  3. Open question for prioritization: should we spend a benchmark slot on Qwen/Qwen3.6-27B (1.4M dl, Apache-2.0, dense, fits in 98GB at BF16; back-of-envelope check below) before continuing iteration on qwen3-vl-8b-sft+grpo? It is image-text-to-text-tagged, but its vision quality vs. Qwen3-VL-8B is unverified for our schema.
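
Supporting arithmetic for the fits-in-98GB claim in (3); actual headroom depends on KV cache, activations, and the vision tower:

```python
# 27B dense parameters at BF16 (2 bytes each) ~= 54 GB of weights,
# leaving roughly 44 GB of a 98 GB card for KV cache and activations.
params = 27e9
bytes_per_param = 2  # BF16
weights_gb = params * bytes_per_param / 1e9
print(f"weights ~{weights_gb:.0f} GB, headroom ~{98 - weights_gb:.0f} GB of 98 GB")
```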
