Daily Model Scout Report – 2026-04-23
Scope
Scan of HuggingFace for VLMs created or modified between 2026-04-16 and 2026-04-23, broad across architectures. Current baseline for comparison (weighted_score on our 3,500-sample hard eval):
| Model | Weighted Score | Note |
|---|---|---|
| qwen3-vl-8b-sft+grpo | 0.9131 | best overall |
| qwen3-vl-2b-sft-grpo-v9 | 0.8948 | best small |
| qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 | best quantized |
| qwen35-2b-base | 0.8437 | best Qwen3.5 base |
Candidates
1. Qwen/Qwen3.6-27B – Relevance: HIGH
- Link: https://huggingface.co/Qwen/Qwen3.6-27B
- Released: 2026-04-16 (new this window – sibling of the 35B-A3B flagged last week)
- Size: 27B dense, Causal Language Model with Vision Encoder
- Pipeline: image-text-to-text – native multimodal (image + video + text)
- Context: 262K native, extensible to 1M
- License: Apache 2.0
- VRAM: ~54 GB BF16, ~27 GB FP8 (27B params × 2 bytes ≈ 54 GB, × 1 byte ≈ 27 GB, before KV cache and activations) – fits comfortably on RTX PRO 6000 98GB
- Downloads: 23,964 / month; 592 likes in first week
- Reported benchmarks: MMMU 82.9, MMMU-Pro 75.8, MathVista mini 87.4, RealWorldQA 84.1, RefCOCO 92.5, CountBench 97.8
Why it may beat our best (0.9131):
- Strongest reported MMMU of any open VLM this month (82.9) – ~6 points above Qwen3-VL-8B-Instruct and above even Gemma 4 31B (MMMU-Pro 76.9).
- Dense 27B drops cleanly into our Qwen3-VL SFT+GRPO pipeline – same processor / chat template family as Qwen3-VL, so our reward engine and eval harness port with near-zero changes.
- RefCOCO 92.5 and CountBench 97.8 suggest markedly stronger localization and counting, both relevant for closure/sleeve/neckline attributes where our current best tops out below 90.
- Native function-calling for structured JSON output – may close the format gap without relying entirely on SFT.
Action: Benchmark zero-shot on the 3,500 eval set this week (a hedged loading sketch follows below). If the base scores ≥ 0.85 (above qwen35-2b-base), kick off a full SFT+GRPO run alongside the Qwen3.6-35B-A3B run from last week's scout.
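To make the zero-shot pass concrete, here is a minimal sketch, assuming Qwen3.6-27B loads through the standard transformers image-text-to-text path the same way current Qwen3-VL checkpoints do; the prompt wording, schema keys, and image handling are placeholders for the real eval harness.

```python
# Minimal zero-shot sketch (assumptions: standard transformers image-text-to-text
# loading path, greedy decoding; prompt text and schema keys are placeholders).
import json
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "Qwen/Qwen3.6-27B"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

PROMPT = (
    "Classify the garment in the image. Reply with JSON only, using our "
    "9-field schema (closure, sleeve, neckline, ...)."  # placeholder wording
)

def predict(image_path: str) -> dict | None:
    # Build the chat text with an image placeholder, then let the processor
    # pair it with the actual pixels.
    messages = [{"role": "user",
                 "content": [{"type": "image"}, {"type": "text", "text": PROMPT}]}]
    chat_text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[chat_text], images=[image], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    decoded = processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]
    try:
        return json.loads(decoded)   # weighted_score scoring happens downstream
    except json.JSONDecodeError:
        return None                  # counts as a format failure in the harness
```

If the Qwen3.6 chat template differs from Qwen3-VL's, only the message construction above should need to change.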
2. fudan-generative-ai/Bard-VL-B4-Mask-8B-Instruct – Relevance: MEDIUM
- Link: https://huggingface.co/fudan-generative-ai/Bard-VL-B4-Mask-8B-Instruct
- Released: 2026-04-22 (1 day old)
- Size: 9B (8B-class), BF16
- Architecture: Novel – masked discrete-diffusion VLM, not autoregressive. Uses Progressive Block Merging (PBM), Stage-Wise Distillation (SWD), and Packed Multimodal Attention Mask.
- License: MIT
- Reported benchmarks: MMMU 54.6, MMMU-Pro 37.6, MME 2393, RealWorldQA 70.7, MMStar 65.0, AI2D 83.2, ChartQA 84.6
Why it matters:
- First production-grade diffusion-style VLM we've seen on HF with open weights at 8B scale. Block-parallel decoding (block size 4, 4 denoising steps) could cut inference latency substantially vs. token-by-token autoregressive models.
- Our 9-field JSON output is fixed-structure – diffusion decoding is natively suited to parallel structured generation, potentially eliminating the throughput gap between dense and quantized models.
Why to be cautious:
- Benchmarks are weak relative to Qwen3-VL-8B (MMMU 54.6 vs. ~70+ for our base). Raw capability likely below our current best even after SFT.
- Dependency on diffusers==0.36.0 and a custom inference path – our vLLM / NVFP4 quantization pipeline will not work out of the box.
- No prior fashion / garment fine-tunes published; we'd be the first to report.
Action: Low-priority spike (1 day). Run zero-shot on the 3,500 set to confirm base quality. If ≥ 0.55, file for a future inference-speed-focused experiment rather than an accuracy run (latency harness sketched below).
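For the latency spike a model-agnostic harness is enough; a minimal sketch follows, assuming each candidate is wrapped in a `predict(image_path) -> dict` callable. The wrapper names and eval subset are placeholders, since we have not yet run Bard-VL's custom diffusers inference path.

```python
# Model-agnostic latency harness for the 1-day spike. Assumes each candidate is
# wrapped in a predict(image_path) -> dict callable; names below are placeholders.
import statistics
import time
from typing import Callable

def latency_profile(predict: Callable[[str], dict],
                    image_paths: list[str],
                    warmup: int = 3) -> dict:
    # Warm-up calls absorb one-time costs (weight paging, compile / CUDA graph caches).
    for path in image_paths[:warmup]:
        predict(path)
    times = []
    for path in image_paths:
        start = time.perf_counter()
        predict(path)
        times.append(time.perf_counter() - start)
    times.sort()
    return {
        "mean_s": statistics.mean(times),
        "p50_s": times[len(times) // 2],
        "p95_s": times[min(len(times) - 1, int(0.95 * len(times)))],
    }

# Usage (placeholders): same image subset, same fixed-structure JSON prompt.
# baseline = latency_profile(qwen3_vl_8b_predict, eval_subset)
# diffusion = latency_profile(bard_vl_predict, eval_subset)
```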
3. sabaridsnfuji/Qwen3-VL-4B-Spatial-Analysisv2 – Relevance: LOW
- Link: https://huggingface.co/sabaridsnfuji/Qwen3-VL-4B-Spatial-Analysisv2
- Released: 2026-04-23 (hours old)
- Base: Qwen3-VL-4B
- Purpose: Spatial reasoning / localization fine-tune (community, single-author)
Why noted: Same base family as our stack, but task-orthogonal (spatial bounding-box reasoning, not attribute classification). Its training signal is unlikely to transfer to our 9-field schema, and the model card gives no details on training data or evaluation.
Action: Skip. If we want a Qwen3-VL-4B base anchor, pull the clean Qwen/Qwen3-VL-4B-Instruct instead.
4. bravesoftware/Ocelot-1-VL – Relevance: LOW
- Link: https://huggingface.co/bravesoftware/Ocelot-1-VL
- Released: 2026-04-22
- Base: Qwen3-VL-4B-Instruct + LoRA adapter
- License: Apache 2.0
- Purpose: Web page summarization for Brave's Leo AI – model card explicitly says "NOT designed for general-purpose chat, coding, reasoning, tool use, creative writing, or agentic workflows."
Why noted: Confirms Qwen3-VL-4B is a popular production base – interesting as a LoRA-on-Qwen3-VL-4B deployment reference (vLLM --enable-lora with --max-lora-rank 64), but the adapter itself is irrelevant to garment classification.
Action: Skip the weights. Worth noting the Brave vLLM LoRA deployment recipe – it may be useful if we ever productionize a LoRA-per-retailer strategy rather than merging (sketched below).
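For reference, the LoRA-per-retailer idea looks roughly like this with vLLM's offline Python API (mirroring the --enable-lora / --max-lora-rank flags noted above). The adapter names, paths, image file, and prompt text are hypothetical; a real garment prompt would still need the model's image placeholder tokens via its chat template.

```python
# Sketch of LoRA-per-retailer serving on a shared Qwen3-VL-4B base with vLLM.
# Adapter names/paths, image file, and prompt text are hypothetical.
from PIL import Image
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="Qwen/Qwen3-VL-4B-Instruct",  # base weights loaded once
    enable_lora=True,
    max_lora_rank=64,
)
params = SamplingParams(temperature=0.0, max_tokens=256)

# One adapter per retailer, selected per request instead of merging into the base.
retailer_a = LoRARequest("retailer_a", 1, "/adapters/retailer_a")   # hypothetical path

image = Image.open("garment.jpg")                                   # hypothetical input
prompt = "<formatted chat prompt with image placeholder tokens>"    # placeholder

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
    sampling_params=params,
    lora_request=retailer_a,
)
print(outputs[0].outputs[0].text)
```

The server-side equivalent is the vllm serve flag set already noted in the deployment reference above.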
Follow-ups from prior scouts
- Qwen/Qwen3.6-35B-A3B (flagged HIGH on 2026-04-16): Confirm benchmark status. If not yet run, this is the single highest-priority item – the Qwen3.6-27B sibling results will inform whether the MoE variant is worth the full SFT+GRPO budget.
- google/gemma-4-E4B-it / gemma-4-E2B-it (flagged HIGH on 2026-04-16): Confirm zero-shot numbers. No new Gemma 4 checkpoints this week – the family remains open for us to evaluate first against a non-Qwen hard-eval baseline.
- google/gemma-4-26B-A4B-it / gemma-4-31B-it (flagged MEDIUM on 2026-04-16): Unchanged recommendation – fold into the MoE-vs-MoE sweep with Qwen3.6-35B-A3B.
Skipped (surfaced but not relevant)
- Huihui-Qwen3.6-27B-abliterated, Qwen3.6-27B-heretic, Qwen3.6-Queen-27B, Qwen3.6-27B-Uncensored-HauhauCS-Aggressive – community safety-tuning (abliteration / uncensoring) variants of Qwen3.6-27B. Same base weights, no upgrade for garment classification.
- Qwen3.6-27B-MXFP4, Qwen3.6-27B-W4A16-G128, Qwen3.6-27B-GGUF, Qwen3.6-27B-MLX-{4bit,8bit}, Huihui-Qwen3.6-27B-abliterated-NVFP4 – quantizations of Qwen3.6-27B. Evaluate only after the BF16 base has been benchmarked.
- Holo3-35B-A3B-{JANGTQ2,JANGTQ4,mxfp4}, Qwen3.6-27B-JANG_4M – community MoE quantizations; placeholder uploads with no published benchmarks.
- Marchris/gemma-4-31B-it, ruygar/gemma-4-E{2,4}B-it-BB – community re-uploads / forks of Gemma 4, same weights.
- DeepSeek V4 – still unreleased as of 2026-04-23 (Reuters reports launch "in the next few weeks" on Huawei chips). Watch for next week's scout.
- No new InternVL4, Florence-3, MiniCPM-V5, SmolVLM3, Idefics4, Molmo2, Moondream3, or PaliGemma3 releases detected.
- No new dedicated garment / fashion / apparel VLM releases this window – the Qwen3-VL-fashion-product-images fine-tunes flagged last week remain the only fashion-domain releases at our size tier.
Recommended Next Steps
- Zero-shot Qwen/Qwen3.6-27B on the 3,500 hard eval this week – same family as our champion, higher reported vision benchmarks than any open VLM this month, trivial pipeline port.
- Confirm status of last week's Qwen3.6-35B-A3B and Gemma 4 benchmarks. The 27B-dense vs. 35B-A3B-MoE comparison within Qwen3.6 is the cleanest architectural ablation available, so the two runs should go together.
- Spike Bard-VL-B4-Mask-8B-Instruct as a 1-day inference-latency experiment only – not an SFT candidate unless zero-shot clears 0.55.
Best current benchmark to beat: qwen3-vl-8b-sft+grpo at 0.9131 weighted.