Daily Model Scout Report — 2026-04-23

Scope

Scan of HuggingFace for VLMs created or modified between 2026-04-16 and 2026-04-23, broad across architectures. Current baseline for comparison (weighted_score on our 3,500-sample hard eval):

Model                          Weighted Score
qwen3-vl-8b-sft+grpo           0.9131  (best overall)
qwen3-vl-2b-sft-grpo-v9        0.8948  (best small)
qwen3-vl-8b-sft-grpo-nvfp4     0.8945  (best quantized)
qwen35-2b-base                 0.8437  (best Qwen3.5 base)

Candidates

1. Qwen/Qwen3.6-27B — Relevance: HIGH

  • Link: https://huggingface.co/Qwen/Qwen3.6-27B
  • Released: 2026-04-16 (new this window — sibling of the 35B-A3B flagged last week)
  • Size: 27B dense, Causal Language Model with Vision Encoder
  • Pipeline: image-text-to-text — native multimodal (image + video + text)
  • Context: 262K native, extensible to 1M
  • License: Apache 2.0
  • VRAM: ~54 GB BF16, ~27 GB FP8 — fits comfortably on RTX PRO 6000 98GB
  • Downloads: 23,964 / month; 592 likes in first week
  • Reported benchmarks: MMMU 82.9, MMMU-Pro 75.8, MathVista mini 87.4, RealWorldQA 84.1, RefCOCO 92.5, CountBench 97.8

Why it may beat our best (0.9131):

  • Strongest reported MMMU of any open VLM this month (82.9) — ~6 points above Qwen3-VL-8B-Instruct and above even Gemma 4 31B (MMMU-Pro 76.9).
  • Dense 27B drops cleanly into our Qwen3-VL SFT+GRPO pipeline — same processor / chat template family as Qwen3-VL, so our reward engine and eval harness port with near-zero changes.
  • RefCOCO 92.5 and CountBench 97.8 suggest markedly stronger localization and counting, both relevant for closure/sleeve/neckline attributes where our current best tops out below 90.
  • Native function-calling for structured JSON output — may close the format gap without relying entirely on SFT.

Action: Benchmark zero-shot on the 3,500 eval set this week. If base ≥ 0.85 (above qwen35-2b-base), kick off a full SFT+GRPO run alongside the Qwen3.6-35B-A3B run from last week's scout.
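
A minimal zero-shot sketch for that benchmark run, assuming Qwen/Qwen3.6-27B loads through the standard transformers image-text-to-text interface like earlier Qwen VL checkpoints (AutoProcessor + AutoModelForImageTextToText); the image path, prompt text, and generation settings are placeholders, not our production eval harness:

```python
# Zero-shot sanity-check sketch. Assumes Qwen/Qwen3.6-27B follows the same
# transformers interface as prior Qwen VL models; verify against the model
# card before running. Image path and prompt are illustrative placeholders.
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "Qwen/Qwen3.6-27B"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("sample_garment.jpg")  # one image from the 3,500 hard eval
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Return the 9-field garment attribute JSON."},
    ],
}]
prompt = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```

The intent is that only the model id changes relative to a Qwen3-VL zero-shot pass; anything beyond that would weaken the near-zero-port argument above.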


2. fudan-generative-ai/Bard-VL-B4-Mask-8B-Instruct — Relevance: MEDIUM

  • Link: https://huggingface.co/fudan-generative-ai/Bard-VL-B4-Mask-8B-Instruct
  • Released: 2026-04-22 (1 day old)
  • Size: 9B (8B-class), BF16
  • Architecture: Novel — masked discrete-diffusion VLM, not autoregressive. Uses Progressive Block Merging (PBM), Stage-Wise Distillation (SWD), and Packed Multimodal Attention Mask.
  • License: MIT
  • Reported benchmarks: MMMU 54.6, MMMU-Pro 37.6, MME 2393, RealWorldQA 70.7, MMStar 65.0, AI2D 83.2, ChartQA 84.6

Why it matters:

  • First production-grade diffusion-style VLM we've seen on HF with open weights at 8B scale. Block-parallel decoding (block size 4, 4 denoising steps) could cut inference latency substantially vs. token-by-token autoregressive models.
  • Our 9-field JSON output is fixed-structure — diffusion decoding is natively suited to parallel structured generation, potentially eliminating the throughput gap between dense and quantized models.

Why to be cautious:

  • Benchmarks are weak relative to Qwen3-VL-8B (MMMU 54.6 vs. ~70+ for our base). Raw capability likely below our current best even after SFT.
  • Dependency on diffusers==0.36.0 and a custom inference path — our vLLM / NVFP4 quantization pipeline will not work out of the box.
  • No prior fashion / garment fine-tunes published; we'd be the first to report.

Action: Low-priority spike (1 day). Run zero-shot on the 3,500 set to confirm base quality. If ≥ 0.55, file for a future inference-speed-focused experiment rather than an accuracy run.
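
If the spike clears that bar, the follow-on experiment is a latency comparison rather than another accuracy run. A rough, model-agnostic timing sketch; the generate_fn callables and eval_samples are placeholders for whichever inference path each model actually needs (vLLM for our Qwen baselines, Bard-VL's custom diffusers==0.36.0 path):

```python
# Latency comparison sketch (model-agnostic). Each generate_fn is a placeholder
# callable that takes one eval sample and returns the model's 9-field JSON
# string via whatever inference stack that model requires.
import statistics
import time

def time_generation(generate_fn, samples, warmup=3):
    """Return per-sample wall-clock latencies in seconds."""
    for sample in samples[:warmup]:
        generate_fn(sample)  # warm up weights, caches, CUDA graphs
    latencies = []
    for sample in samples:
        start = time.perf_counter()
        generate_fn(sample)
        latencies.append(time.perf_counter() - start)
    return latencies

def report(name, latencies):
    ordered = sorted(latencies)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    print(f"{name}: mean {statistics.mean(latencies) * 1000:.0f} ms, "
          f"p95 {p95 * 1000:.0f} ms over {len(latencies)} samples")

# Hypothetical usage, once generate_qwen / generate_bard wrappers exist:
# report("qwen3-vl-8b-sft-grpo-nvfp4", time_generation(generate_qwen, eval_samples))
# report("Bard-VL-B4-Mask-8B-Instruct", time_generation(generate_bard, eval_samples))
```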


3. sabaridsnfuji/Qwen3-VL-4B-Spatial-Analysisv2 — Relevance: LOW

Why noted: Same base family as our stack, but task-orthogonal (spatial bounding-box reasoning, not attribute classification). Its training signal is unlikely to transfer to our 9-field schema, and the model card does not document the training data or evaluation.

Action: Skip. If we want a Qwen3-VL-4B base anchor, pull the clean Qwen/Qwen3-VL-4B-Instruct instead.


4. bravesoftware/Ocelot-1-VL — Relevance: LOW

  • Link: https://huggingface.co/bravesoftware/Ocelot-1-VL
  • Released: 2026-04-22
  • Base: Qwen3-VL-4B-Instruct + LoRA adapter
  • License: Apache 2.0
  • Purpose: Web page summarization for Brave's Leo AI — model card explicitly says "NOT designed for general-purpose chat, coding, reasoning, tool use, creative writing, or agentic workflows."

Why noted: Confirms Qwen3-VL-4B is a popular production base — interesting as a LoRA-on-Qwen3-VL-4B deployment reference (vLLM --enable-lora with --max-lora-rank 64), but the adapter itself is irrelevant to garment classification.

Action: Skip the weights. Worth noting the Brave vLLM LoRA deployment recipe — may be useful if we ever productionize a LoRA-per-retailer strategy rather than merging.
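
For the record, a minimal sketch of what a LoRA-per-retailer setup could look like through vLLM's offline Python API, mirroring the --enable-lora / --max-lora-rank 64 flags from the Brave recipe; the adapter paths and retailer names are hypothetical, and image inputs would go through vLLM's multimodal prompt format in practice:

```python
# LoRA-per-retailer serving sketch on a shared Qwen3-VL-4B base via vLLM.
# Adapter paths and retailer names are hypothetical; mirrors the card's
# --enable-lora and --max-lora-rank 64 CLI flags via the offline API.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="Qwen/Qwen3-VL-4B-Instruct",  # clean base checkpoint named in this report
    enable_lora=True,
    max_lora_rank=64,
)
sampling = SamplingParams(temperature=0.0, max_tokens=256)

# One adapter per retailer, selected per request instead of merging weights.
retailer_adapters = {
    "retailer_a": LoRARequest("retailer_a", 1, "/adapters/retailer_a"),
    "retailer_b": LoRARequest("retailer_b", 2, "/adapters/retailer_b"),
}

outputs = llm.generate(
    ["Return the 9-field garment attribute JSON for this product listing."],
    sampling,
    lora_request=retailer_adapters["retailer_a"],
)
print(outputs[0].outputs[0].text)
```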


Follow-ups from prior scouts

  • Qwen/Qwen3.6-35B-A3B (flagged HIGH on 2026-04-16): Confirm benchmark status. If not yet run, this is the single highest-priority item — the Qwen3.6-27B sibling results (candidate 1 above) will inform whether the MoE variant is worth the full SFT+GRPO budget.
  • google/gemma-4-E4B-it / gemma-4-E2B-it (flagged HIGH on 2026-04-16): Confirm zero-shot numbers. No new Gemma 4 checkpoints this week — the family remains open for us to evaluate first against a non-Qwen hard-eval baseline.
  • google/gemma-4-26B-A4B-it / gemma-4-31B-it (flagged MEDIUM on 2026-04-16): Unchanged recommendation — fold into the MoE-vs-MoE sweep with Qwen3.6-35B-A3B.

Skipped (surfaced but not relevant)

  • Huihui-Qwen3.6-27B-abliterated, Qwen3.6-27B-heretic, Qwen3.6-Queen-27B, Qwen3.6-27B-Uncensored-HauhauCS-Aggressive — community safety-tuning (abliteration / uncensoring) variants of Qwen3.6-27B. Same base weights, no upgrade for garment classification.
  • Qwen3.6-27B-MXFP4, Qwen3.6-27B-W4A16-G128, Qwen3.6-27B-GGUF, Qwen3.6-27B-MLX-{4bit,8bit}, Huihui-Qwen3.6-27B-abliterated-NVFP4 — quantizations of Qwen3.6-27B. Evaluate only after the BF16 base has been benchmarked.
  • Holo3-35B-A3B-{JANGTQ2,JANGTQ4,mxfp4}, Qwen3.6-27B-JANG_4M — community MoE quantizations; placeholder uploads with no published benchmarks.
  • Marchris/gemma-4-31B-it, ruygar/gemma-4-E{2,4}B-it-BB — community re-uploads / forks of Gemma 4, same weights.
  • DeepSeek V4 — still unreleased as of 2026-04-23 (Reuters reports launch "in the next few weeks" on Huawei chips). Watch for next week's scout.
  • No new InternVL4, Florence-3, MiniCPM-V5, SmolVLM3, Idefics4, Molmo2, Moondream3, or PaliGemma3 releases detected.
  • No new dedicated garment / fashion / apparel VLM releases this window — the Qwen3-VL-fashion-product-images fine-tunes flagged last week remain the only fashion-domain publications at our size tier.

Recommended Next Steps

  1. Zero-shot Qwen/Qwen3.6-27B on the 3,500 hard eval this week — same family as our champion, higher reported vision benchmarks than any open VLM this month, trivial pipeline port.
  2. Confirm status of last week's Qwen3.6-35B-A3B and Gemma 4 benchmarks. The 27B dense → 35B-A3B MoE comparison within Qwen3.6 is the cleanest architectural ablation available and should be run together.
  3. Spike Bard-VL-B4-Mask-8B-Instruct as a 1-day inference-latency experiment only — not an SFT candidate unless zero-shot clears 0.55.

Best current benchmark to beat: qwen3-vl-8b-sft+grpo at 0.9131 weighted.
