Daily Model Scout Report — 2026-04-11
Scan window: 2026-04-04 → 2026-04-11. Sources: HF image-text-to-text and vision-language listings sorted by creation, plus targeted searches across Qwen3-VL, InternVL, GLM-4.5V, Gemma-4, Phi-4, Granite, Penguin-VL families.
Current internal baselines (3,500-sample hard eval, weighted_score):
- `qwen3-vl-8b-sft+grpo` — 0.9131 (best overall)
- `qwen3-vl-2b-sft-grpo-v9` — 0.8948 (best small)
- `qwen3-vl-8b-sft-grpo-nvfp4` — 0.8945 (best quantized)
- `qwen35-2b-base` — 0.8437 (best Qwen3.5 base)
High relevance — benchmark immediately
1. Tencent / Penguin-VL-8B — 9B, Apache-2.0 (`tencent/Penguin-VL-8B`)
- New architecture: vision encoder initialized from Qwen3-0.6B (LLM-based), bidirectional attention, 2D RoPE, lightweight MLP projector, Qwen3-8B backbone. Avoids the contrastive-objective mismatch of CLIP/SigLIP encoders.
- Beats Qwen3-VL on multiple text-in-image benchmarks: DocVQA 96.2, InfoVQA 86.8, MathVista 77.4, and a strong ChartQA result. The OCR/structured-text gains are exactly the regime where our 9-field JSON extraction lives.
- Drop-in 9B fits the RTX PRO 6000 budget. Apache-2.0 license is clean for production.
- Action: SFT + GRPO on our garment dataset using the existing Qwen3-VL-8B recipe — projector and tokenizer differ, but the pipeline should port with minor changes (see the wiring sketch below).
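To make the port concrete, here is a minimal sketch of the wiring described above (LLM-initialized vision tower → lightweight MLP projector → Qwen3 decoder), assuming a HF-style backbone that accepts `inputs_embeds`. Module names, dimensions, and the token-splicing step are illustrative assumptions, not the actual layout of `tencent/Penguin-VL-8B`; the bidirectional attention and 2D RoPE inside the vision tower are omitted.

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Lightweight two-layer MLP mapping vision hidden size to LLM hidden size (dims assumed)."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class PenguinVLSketch(nn.Module):
    """Toy wiring only: LLM-based vision tower -> projector -> decoder backbone."""

    def __init__(self, vision_tower: nn.Module, backbone: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Vision tower: per the card, a Qwen3-0.6B run bidirectionally with 2D RoPE
        # over patch embeddings; both details are omitted in this sketch.
        self.vision_tower = vision_tower
        self.projector = MLPProjector(vision_dim, llm_dim)
        self.backbone = backbone  # Qwen3-8B decoder in the real model

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        vision_tokens = self.vision_tower(pixel_values)   # (B, n_img, vision_dim)
        vision_embeds = self.projector(vision_tokens)     # (B, n_img, llm_dim)
        # Simplification: prepend the image tokens; the real model splices them
        # at an image-placeholder position inside the text sequence.
        fused = torch.cat([vision_embeds, text_embeds], dim=1)
        return self.backbone(inputs_embeds=fused)
```

If the real checkpoint follows this shape, the only recipe changes versus our Qwen3-VL-8B runs should be the projector parameters and the image-placeholder handling in the collator.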
2. Tencent / Penguin-VL-2B — 2B, Apache-2.0 (`tencent/Penguin-VL-2B`)
- Same architecture family as the 8B. Benchmark deltas vs Qwen3-VL-2B (the base behind our `qwen3-vl-2b-sft-grpo-v9` at 0.8948):
  - InfoVQA 77.8 vs 72.4 · ChartQA 86.6 vs 76.9 · OCRBench 858 vs 810 · CharXiv DQ 66.4 vs 62.3 · AI2D 80.7 vs 76.9 · MathVista 67.3 vs 61.3.
- Across-the-board base gains in the 2B class. Strong candidate to displace the small-model baseline.
- Action: prioritize this for the next 2B SFT+GRPO run; if it ports cleanly, this is the most direct path to beating 0.8948.
3. IBM / Granite-4.0-3B-Vision — ~4B (3.5B base + 0.5B LoRA), Apache-2.0, released 2026-03-27 (`ibm-granite/granite-4.0-3b-vision`)
- Purpose-built for structured data extraction (chart→CSV, tables→JSON/HTML/OTSL, semantic key-value pair extraction). Uses task tags like `<tables_json>`.
- VAREX zero-shot KVP exact-match 85.5%, ranked 3rd in the 2–4B class as of March 2026.
- Architecture: SigLIP2 vision encoder + Window Q-Former projectors + 8 vision-to-LLM injection points (Deepstack variant). The Deepstack injection scheme is specifically designed for fine-grained spatial reasoning, which matches our brand/closure/neckline failure modes.
- We already have `granite4-vision-sft` in eval (the JSON shows a 1.0144 entry that needs investigation — looks like a scoring bug or a different eval slice). This is a newer base than what we benchmarked; worth re-running the full pipeline against the new release.
- Action: SFT this on the garment dataset using its native task-tag format (`<json>` style), since our task is structurally identical to KVP extraction; see the record-format sketch below.
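As an illustration of the task-tag framing, here is a hedged sketch of one SFT record for our garment extraction in a `<json>`-tagged style. Only the `<json>` / `<tables_json>` tag names come from the notes above; the field list, prompt wording, and chat-record layout are assumptions about our internal schema and trainer format, not Granite's documented interface.

```python
import json

# Assumed 9-field schema; brand/closure/neckline/color/pattern appear in our eval
# notes, the remaining field names are placeholders.
GARMENT_FIELDS = [
    "brand", "category", "color", "pattern", "material",
    "closure", "neckline", "sleeve_length", "fit",
]


def build_sft_example(image_path: str, labels: dict) -> dict:
    """Return one chat-format SFT record pairing the product image with a tagged JSON target."""
    prompt = (
        "<json>Extract the following garment attributes as a JSON object: "
        + ", ".join(GARMENT_FIELDS)
    )
    target = json.dumps({k: labels.get(k) for k in GARMENT_FIELDS}, ensure_ascii=False)
    return {
        "images": [image_path],
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": target},
        ],
    }
```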
Medium relevance — worth watching
4. pingmong / Qwen3-VL-{8B,2B}-Instruct-fashion-product-images-small (`pingmong/Qwen3-VL-8B-Instruct-fashion-product-images-small` · `pingmong/Qwen3-VL-2B-Instruct-fashion-product-images-small`)
- Direct fashion-domain Qwen3-VL fine-tunes uploaded ~2 days ago. Same Qwen3-VL base we already use, so no architecture risk.
- No model card, no training data details, no license, ~12 downloads. Cannot trust without provenance.
- Action: download, run on the 3,500 hard eval as a no-finetune baseline only (minimal inference sketch below). Do not promote to a starting checkpoint without verifying the data isn't contaminated against our eval set.
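A minimal sketch of that eval-only run, assuming the pingmong checkpoints load through the same `AutoProcessor` / `AutoModelForImageTextToText` path as the Qwen3-VL bases we already evaluate; the model id matches the listing above, but the prompt, dtype, and generation settings are placeholders, and scoring against the 3,500-sample set stays in the existing harness.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "pingmong/Qwen3-VL-2B-Instruct-fashion-product-images-small"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cuda"
)


def predict(image, prompt: str) -> str:
    """One zero-shot extraction pass: no fine-tuning, baseline numbers only."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    new_tokens = out[0, inputs["input_ids"].shape[1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)
```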
5. Phi-4-reasoning-vision-15B (community GGUF) — `gaoqianshen/Phi-4-reasoning-vision-15B-Q8_0-GGUF`
- Quantized Phi-4 reasoning + vision variant, 15B. Microsoft has not posted an official `microsoft/Phi-4-reasoning-vision` repo (404 on the canonical path), so this is community-derived and provenance is unclear.
- Phi-4-Multimodal-SFT is currently weak in our eval (0.6513 weighted) — a reasoning variant might help on the harder fields (color/pattern/closure), but the base architecture has not historically been competitive on our task.
- Action: monitor for an official Microsoft release before investing benchmarking time.
6. xiao45791 / Qwen3-VL-8B-Instruct-SFT-Gemini-Distill-100k — 9B (`xiao45791/Qwen3-VL-8B-Instruct-SFT-Gemini-Distill-100k`)
- Qwen3-VL-8B SFT'd on 100k Gemini-distilled samples. Could carry useful general-purpose visual reasoning improvements.
- Action: low-priority eval; not garment-specific.
Low relevance — noted for completeness
- Gemma-4 multimodal community variants (gemma-4-31B-it, E2B variants, MLX/GGUF derivatives): no official Google Gemma-4 Vision release detected this week. All listings are community fine-tunes/quantizations. Wait for an official release.
- GLM-4.5V quantizations (mradermacher GGUFs, cyankiwi AWQ): only repackaging of the existing 107B base — too large for our 98GB budget at acceptable batch sizes, and no new architecture.
- InternVL activity: no InternVL3.5 / InternVL4 this week. Only community fine-tunes of InternVL2.5 and one InternVL3-14B reward-model variant. Skip.
- nanoVLM-222M: 0.2B is below our useful size range for hard-eval accuracy.
- wave-ui-3b, vittle-7b-{L,F}, Qianfan-OCR: domain-specific (UI navigation, OCR) — not garment-relevant.
Recommended actions this week
- Top priority — Penguin-VL-2B SFT+GRPO on the garment dataset. Highest expected upside on the small-model track; the architecture is close enough to Qwen3-VL that the existing pipeline should port (per-sample reward hook sketched after this list).
- Penguin-VL-8B base eval on the 3,500 hard set before committing to a full SFT+GRPO run — confirm the OCR/chart gains transfer to our domain.
- Granite-4.0-3B-Vision SFT using its native structured-extraction task tags. Investigate the existing 1.0144 anomaly in `eval_all_results.json` for `granite4-vision-sft` while we're in that codepath.
- Eval-only sweep of the `pingmong` fashion fine-tunes to rule them out (or surface them) without committing training compute.
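For the GRPO leg of the top-priority run, a rough sketch of the per-sample reward we would reuse, assuming it keeps the same weighted exact-match over the nine extracted fields that backs `weighted_score`; the field names and uniform weights are illustrative, not the production eval config.

```python
import json

# Assumed schema and weights; the real weights live in the eval config.
FIELD_WEIGHTS = {
    "brand": 1.0, "category": 1.0, "color": 1.0, "pattern": 1.0, "material": 1.0,
    "closure": 1.0, "neckline": 1.0, "sleeve_length": 1.0, "fit": 1.0,
}


def garment_reward(completion: str, gold: dict) -> float:
    """Weighted exact-match over the extracted JSON; malformed output scores 0."""
    try:
        pred = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(pred, dict):
        return 0.0
    total = sum(FIELD_WEIGHTS.values())
    hit = sum(
        w for field, w in FIELD_WEIGHTS.items()
        if str(pred.get(field, "")).strip().lower() == str(gold.get(field, "")).strip().lower()
    )
    return hit / total
```

Batched over completions, this should slot into the same reward hook the existing GRPO runs use, so swapping the base checkpoint is the only training-side change.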
Generated by /hf-model-scout · 2026-04-11