Daily Model Scout Report — 2026-04-23

Scope

Scan of HuggingFace for VLMs created or modified between 2026-04-16 and 2026-04-23, broad across architectures. Current baseline for comparison (weighted_score on our 3,500-sample hard eval):

Model                          Weighted Score
qwen3-vl-8b-sft+grpo           0.9131  (best overall)
qwen3-vl-2b-sft-grpo-v9        0.8948  (best small)
qwen3-vl-8b-sft-grpo-nvfp4     0.8945  (best quantized)
qwen35-2b-base                 0.8437  (best Qwen3.5 base)

Candidates

1. Qwen/Qwen3.6-27B — Relevance: HIGH

  • Link: https://huggingface.co/Qwen/Qwen3.6-27B
  • Released: 2026-04-16 (new this window — sibling of the 35B-A3B flagged last week)
  • Size: 27B dense, Causal Language Model with Vision Encoder
  • Pipeline: image-text-to-text — native multimodal (image + video + text)
  • Context: 262K native, extensible to 1M
  • License: Apache 2.0
  • VRAM: ~54 GB BF16, ~27 GB FP8 — fits comfortably on RTX PRO 6000 98GB
  • Downloads: 23,964 / month; 592 likes in first week
  • Reported benchmarks: MMMU 82.9, MMMU-Pro 75.8, MathVista mini 87.4, RealWorldQA 84.1, RefCOCO 92.5, CountBench 97.8

Why it may beat our best (0.9131):

  • Strongest reported MMMU of any open VLM this month (82.9) — ~6 points above Qwen3-VL-8B-Instruct and above even Gemma 4 31B (MMMU-Pro 76.9).
  • Dense 27B drops cleanly into our Qwen3-VL SFT+GRPO pipeline — same processor / chat template family as Qwen3-VL, so our reward engine and eval harness port with near-zero changes.
  • RefCOCO 92.5 and CountBench 97.8 suggest markedly stronger localization and counting, both relevant for closure/sleeve/neckline attributes where our current best tops out below 90.
  • Native function-calling for structured JSON output — may close the format gap without relying entirely on SFT.

Action: Benchmark zero-shot on the 3,500 eval set this week. If base ≥ 0.85 (above qwen35-2b-base), kick off a full SFT+GRPO run alongside the Qwen3.6-35B-A3B run from last week's scout.
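
A minimal zero-shot sketch for that benchmark run, assuming Qwen/Qwen3.6-27B loads through the standard transformers image-text-to-text interface like earlier Qwen VL checkpoints (AutoProcessor + AutoModelForImageTextToText); the image path, prompt text, and generation settings are placeholders, not our production eval harness:

```python
# Zero-shot sanity-check sketch. Assumes Qwen/Qwen3.6-27B follows the same
# transformers interface as prior Qwen VL models; verify against the model
# card before running. Image path and prompt are illustrative placeholders.
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "Qwen/Qwen3.6-27B"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("sample_garment.jpg")  # one image from the 3,500 hard eval
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Return the 9-field garment attribute JSON."},
    ],
}]
prompt = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```

The intent is that only the model id changes relative to a Qwen3-VL zero-shot pass; anything beyond that would weaken the near-zero-port argument above.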


2. fudan-generative-ai/Bard-VL-B4-Mask-8B-Instruct — Relevance: MEDIUM

  • Link: https://huggingface.co/fudan-generative-ai/Bard-VL-B4-Mask-8B-Instruct
  • Released: 2026-04-22 (1 day old)
  • Size: 9B (8B-class), BF16
  • Architecture: Novel — masked discrete-diffusion VLM, not autoregressive. Uses Progressive Block Merging (PBM), Stage-Wise Distillation (SWD), and Packed Multimodal Attention Mask.
  • License: MIT
  • Reported benchmarks: MMMU 54.6, MMMU-Pro 37.6, MME 2393, RealWorldQA 70.7, MMStar 65.0, AI2D 83.2, ChartQA 84.6

Why it matters:

  • First production-grade diffusion-style VLM we've seen on HF with open weights at 8B scale. Block-parallel decoding (block size 4, 4 denoising steps) could cut inference latency substantially vs. token-by-token autoregressive models.
  • Our 9-field JSON output is fixed-structure — diffusion decoding is natively suited to parallel structured generation, potentially eliminating the throughput gap between dense and quantized models.

Why to be cautious:

  • Benchmarks are weak relative to Qwen3-VL-8B (MMMU 54.6 vs. ~70+ for our base). Raw capability likely below our current best even after SFT.
  • Dependency on diffusers==0.36.0 and a custom inference path — our vLLM / NVFP4 quantization pipeline will not work out of the box.
  • No prior fashion / garment fine-tunes published; we'd be the first to report.

Action: Low-priority spike (1 day). Run zero-shot on the 3,500 set to confirm base quality. If ≥ 0.55, file for a future inference-speed-focused experiment rather than an accuracy run.
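
If the spike clears that bar, the follow-on experiment is a latency comparison rather than another accuracy run. A rough, model-agnostic timing sketch; the generate_fn callables and eval_samples are placeholders for whichever inference path each model actually needs (vLLM for our Qwen baselines, Bard-VL's custom diffusers==0.36.0 path):

```python
# Latency comparison sketch (model-agnostic). Each generate_fn is a placeholder
# callable that takes one eval sample and returns the model's 9-field JSON
# string via whatever inference stack that model requires.
import statistics
import time

def time_generation(generate_fn, samples, warmup=3):
    """Return per-sample wall-clock latencies in seconds."""
    for sample in samples[:warmup]:
        generate_fn(sample)  # warm up weights, caches, CUDA graphs
    latencies = []
    for sample in samples:
        start = time.perf_counter()
        generate_fn(sample)
        latencies.append(time.perf_counter() - start)
    return latencies

def report(name, latencies):
    ordered = sorted(latencies)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    print(f"{name}: mean {statistics.mean(latencies) * 1000:.0f} ms, "
          f"p95 {p95 * 1000:.0f} ms over {len(latencies)} samples")

# Hypothetical usage, once generate_qwen / generate_bard wrappers exist:
# report("qwen3-vl-8b-sft-grpo-nvfp4", time_generation(generate_qwen, eval_samples))
# report("Bard-VL-B4-Mask-8B-Instruct", time_generation(generate_bard, eval_samples))
```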


3. sabaridsnfuji/Qwen3-VL-4B-Spatial-Analysisv2 — Relevance: LOW

Why noted: Same base family as our stack, but task-orthogonal (spatial bounding-box reasoning, not attribute classification). Its training signal is unlikely to transfer to our 9-field schema, and the model card does not document the training data or evaluation.

Action: Skip. If we want a Qwen3-VL-4B base anchor, pull the clean Qwen/Qwen3-VL-4B-Instruct instead.


4. bravesoftware/Ocelot-1-VL — Relevance: LOW

  • Link: https://huggingface.co/bravesoftware/Ocelot-1-VL
  • Released: 2026-04-22
  • Base: Qwen3-VL-4B-Instruct + LoRA adapter
  • License: Apache 2.0
  • Purpose: Web page summarization for Brave's Leo AI — model card explicitly says "NOT designed for general-purpose chat, coding, reasoning, tool use, creative writing, or agentic workflows."

Why noted: Confirms Qwen3-VL-4B is a popular production base — interesting as a LoRA-on-Qwen3-VL-4B deployment reference (vLLM --enable-lora with --max-lora-rank 64), but the adapter itself is irrelevant to garment classification.

Action: Skip the weights. Worth noting the Brave vLLM LoRA deployment recipe — may be useful if we ever productionize a LoRA-per-retailer strategy rather than merging.
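
For the record, a minimal sketch of what a LoRA-per-retailer setup could look like through vLLM's offline Python API, mirroring the --enable-lora / --max-lora-rank 64 flags from the Brave recipe; the adapter paths and retailer names are hypothetical, and image inputs would go through vLLM's multimodal prompt format in practice:

```python
# LoRA-per-retailer serving sketch on a shared Qwen3-VL-4B base via vLLM.
# Adapter paths and retailer names are hypothetical; mirrors the card's
# --enable-lora and --max-lora-rank 64 CLI flags via the offline API.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="Qwen/Qwen3-VL-4B-Instruct",  # clean base checkpoint named in this report
    enable_lora=True,
    max_lora_rank=64,
)
sampling = SamplingParams(temperature=0.0, max_tokens=256)

# One adapter per retailer, selected per request instead of merging weights.
retailer_adapters = {
    "retailer_a": LoRARequest("retailer_a", 1, "/adapters/retailer_a"),
    "retailer_b": LoRARequest("retailer_b", 2, "/adapters/retailer_b"),
}

outputs = llm.generate(
    ["Return the 9-field garment attribute JSON for this product listing."],
    sampling,
    lora_request=retailer_adapters["retailer_a"],
)
print(outputs[0].outputs[0].text)
```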


Follow-ups from prior scouts

  • Qwen/Qwen3.6-35B-A3B (flagged HIGH on 2026-04-16): Confirm benchmark status. If not yet run, this is the single highest-priority item — the Qwen3.6-27B sibling results (candidate 1 above) will inform whether the MoE variant is worth the full SFT+GRPO budget.
  • google/gemma-4-E4B-it / gemma-4-E2B-it (flagged HIGH on 2026-04-16): Confirm zero-shot numbers. No new Gemma 4 checkpoints this week — the family remains open for us to evaluate first against a non-Qwen hard-eval baseline.
  • google/gemma-4-26B-A4B-it / gemma-4-31B-it (flagged MEDIUM on 2026-04-16): Unchanged recommendation — fold into the MoE-vs-MoE sweep with Qwen3.6-35B-A3B.

Skipped (surfaced but not relevant)

  • Huihui-Qwen3.6-27B-abliterated, Qwen3.6-27B-heretic, Qwen3.6-Queen-27B, Qwen3.6-27B-Uncensored-HauhauCS-Aggressive — community safety-tuning (abliteration / uncensoring) variants of Qwen3.6-27B. Same base weights, no upgrade for garment classification.
  • Qwen3.6-27B-MXFP4, Qwen3.6-27B-W4A16-G128, Qwen3.6-27B-GGUF, Qwen3.6-27B-MLX-{4bit,8bit}, Huihui-Qwen3.6-27B-abliterated-NVFP4 — quantizations of Qwen3.6-27B. Evaluate only after the BF16 base has been benchmarked.
  • Holo3-35B-A3B-{JANGTQ2,JANGTQ4,mxfp4}, Qwen3.6-27B-JANG_4M — community MoE quantizations; placeholder uploads with no published benchmarks.
  • Marchris/gemma-4-31B-it, ruygar/gemma-4-E{2,4}B-it-BB — community re-uploads / forks of Gemma 4, same weights.
  • DeepSeek V4 — still unreleased as of 2026-04-23 (Reuters reports launch "in the next few weeks" on Huawei chips). Watch for next week's scout.
  • No new InternVL4, Florence-3, MiniCPM-V5, SmolVLM3, Idefics4, Molmo2, Moondream3, or PaliGemma3 releases detected.
  • No new dedicated garment / fashion / apparel VLM releases this window — the Qwen3-VL-fashion-product-images fine-tunes flagged last week remain the only fashion-domain publications at our size tier.

Recommended Next Steps

  1. Zero-shot Qwen/Qwen3.6-27B on the 3,500 hard eval this week — same family as our champion, higher reported vision benchmarks than any open VLM this month, trivial pipeline port.
  2. Confirm status of last week's Qwen3.6-35B-A3B and Gemma 4 benchmarks. The 27B dense → 35B-A3B MoE comparison within Qwen3.6 is the cleanest architectural ablation available and should be run together.
  3. Spike Bard-VL-B4-Mask-8B-Instruct as a 1-day inference-latency experiment only — not an SFT candidate unless zero-shot clears 0.55.

Best current benchmark to beat: qwen3-vl-8b-sft+grpo at 0.9131 weighted.
