Daily Model Scout Report — 2026-04-11
Scan window: 2026-04-04 → 2026-04-11. Sources: HF image-text-to-text and vision-language listings sorted by creation, plus targeted searches across Qwen3-VL, InternVL, GLM-4.5V, Gemma-4, Phi-4, Granite, Penguin-VL families.
Current internal baselines (3,500-sample hard eval, weighted_score):
- `qwen3-vl-8b-sft+grpo` — 0.9131 (best overall)
- `qwen3-vl-2b-sft-grpo-v9` — 0.8948 (best small)
- `qwen3-vl-8b-sft-grpo-nvfp4` — 0.8945 (best quantized)
- `qwen35-2b-base` — 0.8437 (best Qwen3.5 base)
High relevance — benchmark immediately
1. Tencent / Penguin-VL-8B — 9B, Apache-2.0 (`tencent/Penguin-VL-8B`)
- New architecture: vision encoder initialized from Qwen3-0.6B (LLM-based), bidirectional attention, 2D RoPE, lightweight MLP projector, Qwen3-8B backbone. Avoids the contrastive-objective mismatch of CLIP/SigLIP encoders.
- Beats Qwen3-VL on multiple text-in-image benchmarks: DocVQA 96.2, InfoVQA 86.8, MathVista 77.4, and a strong ChartQA result. The OCR/structured-text gains are exactly the regime where our 9-field JSON extraction lives.
- Drop-in 9B fits the RTX PRO 6000 budget. Apache-2.0 license is clean for production.
- Action: SFT + GRPO on our garment dataset using the existing Qwen3-VL-8B recipe — projector and tokenizer differ, but the pipeline should port with minor changes (see the wiring sketch below).
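To make the port concrete, here is a minimal sketch of the wiring described above (LLM-initialized vision tower → lightweight MLP projector → Qwen3 decoder), assuming a HF-style backbone that accepts `inputs_embeds`. Module names, dimensions, and the token-splicing step are illustrative assumptions, not the actual layout of `tencent/Penguin-VL-8B`; the bidirectional attention and 2D RoPE inside the vision tower are omitted.

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Lightweight two-layer MLP mapping vision hidden size to LLM hidden size (dims assumed)."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class PenguinVLSketch(nn.Module):
    """Toy wiring only: LLM-based vision tower -> projector -> decoder backbone."""

    def __init__(self, vision_tower: nn.Module, backbone: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Vision tower: per the card, a Qwen3-0.6B run bidirectionally with 2D RoPE
        # over patch embeddings; both details are omitted in this sketch.
        self.vision_tower = vision_tower
        self.projector = MLPProjector(vision_dim, llm_dim)
        self.backbone = backbone  # Qwen3-8B decoder in the real model

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        vision_tokens = self.vision_tower(pixel_values)   # (B, n_img, vision_dim)
        vision_embeds = self.projector(vision_tokens)     # (B, n_img, llm_dim)
        # Simplification: prepend the image tokens; the real model splices them
        # at an image-placeholder position inside the text sequence.
        fused = torch.cat([vision_embeds, text_embeds], dim=1)
        return self.backbone(inputs_embeds=fused)
```

If the real checkpoint follows this shape, the only recipe changes versus our Qwen3-VL-8B runs should be the projector parameters and the image-placeholder handling in the collator.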
2. Tencent / Penguin-VL-2B — 2B, Apache-2.0 (`tencent/Penguin-VL-2B`)
- Same architecture family as the 8B. Benchmark deltas vs Qwen3-VL-2B (the base behind our `qwen3-vl-2b-sft-grpo-v9` at 0.8948):
  - InfoVQA 77.8 vs 72.4 · ChartQA 86.6 vs 76.9 · OCRBench 858 vs 810 · CharXiv DQ 66.4 vs 62.3 · AI2D 80.7 vs 76.9 · MathVista 67.3 vs 61.3.
- Across-the-board base gains in the 2B class. Strong candidate to displace the small-model baseline.
- Action: prioritize this for the next 2B SFT+GRPO run; if it ports cleanly, this is the most direct path to beating 0.8948.
3. IBM / Granite-4.0-3B-Vision — ~4B (3.5B base + 0.5B LoRA), Apache-2.0, released 2026-03-27 (`ibm-granite/granite-4.0-3b-vision`)
- Purpose-built for structured data extraction (chart→CSV, tables→JSON/HTML/OTSL, semantic key-value pair extraction). Uses task tags like `<tables_json>`.
- VAREX zero-shot KVP exact-match 85.5%, ranked 3rd in the 2–4B class as of March 2026.
- Architecture: SigLIP2 vision encoder + Window Q-Former projectors + 8 vision-to-LLM injection points (Deepstack variant). The Deepstack injection scheme is specifically designed for fine-grained spatial reasoning, which matches our brand/closure/neckline failure modes.
- We already have `granite4-vision-sft` in eval (the JSON shows a 1.0144 entry that needs investigation — looks like a scoring bug or a different eval slice). This is a newer base than what we benchmarked; worth re-running the full pipeline against the new release.
- Action: SFT this on the garment dataset using its native task-tag format (`<json>` style), since our task is structurally identical to KVP extraction; see the record-format sketch below.
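As an illustration of the task-tag framing, here is a hedged sketch of one SFT record for our garment extraction in a `<json>`-tagged style. Only the `<json>` / `<tables_json>` tag names come from the notes above; the field list, prompt wording, and chat-record layout are assumptions about our internal schema and trainer format, not Granite's documented interface.

```python
import json

# Assumed 9-field schema; brand/closure/neckline/color/pattern appear in our eval
# notes, the remaining field names are placeholders.
GARMENT_FIELDS = [
    "brand", "category", "color", "pattern", "material",
    "closure", "neckline", "sleeve_length", "fit",
]


def build_sft_example(image_path: str, labels: dict) -> dict:
    """Return one chat-format SFT record pairing the product image with a tagged JSON target."""
    prompt = (
        "<json>Extract the following garment attributes as a JSON object: "
        + ", ".join(GARMENT_FIELDS)
    )
    target = json.dumps({k: labels.get(k) for k in GARMENT_FIELDS}, ensure_ascii=False)
    return {
        "images": [image_path],
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": target},
        ],
    }
```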
Medium relevance — worth watching
4. pingmong / Qwen3-VL-{8B,2B}-Instruct-fashion-product-images-small (`pingmong/Qwen3-VL-8B-Instruct-fashion-product-images-small` · `pingmong/Qwen3-VL-2B-Instruct-fashion-product-images-small`)
- Direct fashion-domain Qwen3-VL fine-tunes uploaded ~2 days ago. Same Qwen3-VL base we already use, so no architecture risk.
- No model card, no training data details, no license, ~12 downloads. Cannot trust without provenance.
- Action: download, run on the 3,500 hard eval as a no-finetune baseline only (minimal inference sketch below). Do not promote to a starting checkpoint without verifying the data isn't contaminated against our eval set.
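A minimal sketch of that eval-only run, assuming the pingmong checkpoints load through the same `AutoProcessor` / `AutoModelForImageTextToText` path as the Qwen3-VL bases we already evaluate; the model id matches the listing above, but the prompt, dtype, and generation settings are placeholders, and scoring against the 3,500-sample set stays in the existing harness.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "pingmong/Qwen3-VL-2B-Instruct-fashion-product-images-small"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cuda"
)


def predict(image, prompt: str) -> str:
    """One zero-shot extraction pass: no fine-tuning, baseline numbers only."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    new_tokens = out[0, inputs["input_ids"].shape[1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)
```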
5. Phi-4-reasoning-vision-15B (community GGUF) — `gaoqianshen/Phi-4-reasoning-vision-15B-Q8_0-GGUF`
- Quantized Phi-4 reasoning + vision variant, 15B. Microsoft has not posted an official `microsoft/Phi-4-reasoning-vision` repo (404 on the canonical path), so this is community-derived and provenance is unclear.
- Phi-4-Multimodal-SFT is currently weak in our eval (0.6513 weighted) — a reasoning variant might help on the harder fields (color/pattern/closure), but the base architecture has not historically been competitive on our task.
- Action: monitor for an official Microsoft release before investing benchmarking time.
6. xiao45791 / Qwen3-VL-8B-Instruct-SFT-Gemini-Distill-100k — 9B (`xiao45791/Qwen3-VL-8B-Instruct-SFT-Gemini-Distill-100k`)
- Qwen3-VL-8B SFT'd on 100k Gemini-distilled samples. Could carry useful general-purpose visual reasoning improvements.
- Action: low-priority eval; not garment-specific.
Low relevance — noted for completeness
- Gemma-4 multimodal community variants (gemma-4-31B-it, E2B variants, MLX/GGUF derivatives): no official Google Gemma-4 Vision release detected this week. All listings are community fine-tunes/quantizations. Wait for an official release.
- GLM-4.5V quantizations (mradermacher GGUFs, cyankiwi AWQ): only repackaging of the existing 107B base — too large for our 98GB budget at acceptable batch sizes, and no new architecture.
- InternVL activity: no InternVL3.5 / InternVL4 this week. Only community fine-tunes of InternVL2.5 and one InternVL3-14B reward-model variant. Skip.
- nanoVLM-222M: 0.2B is below our useful size range for hard-eval accuracy.
- wave-ui-3b, vittle-7b-{L,F}, Qianfan-OCR: domain-specific (UI navigation, OCR) — not garment-relevant.
Recommended actions this week
- Top priority — Penguin-VL-2B SFT+GRPO on the garment dataset. Highest expected upside on the small-model track; the architecture is close enough to Qwen3-VL that the existing pipeline should port (per-sample reward hook sketched after this list).
- Penguin-VL-8B base eval on the 3,500 hard set before committing to a full SFT+GRPO run — confirm the OCR/chart gains transfer to our domain.
- Granite-4.0-3B-Vision SFT using its native structured-extraction task tags. Investigate the existing 1.0144 anomaly in `eval_all_results.json` for `granite4-vision-sft` while we're in that codepath.
- Eval-only sweep of the `pingmong` fashion fine-tunes to rule them out (or surface them) without committing training compute.
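For the GRPO leg of the top-priority run, a rough sketch of the per-sample reward we would reuse, assuming it keeps the same weighted exact-match over the nine extracted fields that backs `weighted_score`; the field names and uniform weights are illustrative, not the production eval config.

```python
import json

# Assumed schema and weights; the real weights live in the eval config.
FIELD_WEIGHTS = {
    "brand": 1.0, "category": 1.0, "color": 1.0, "pattern": 1.0, "material": 1.0,
    "closure": 1.0, "neckline": 1.0, "sleeve_length": 1.0, "fit": 1.0,
}


def garment_reward(completion: str, gold: dict) -> float:
    """Weighted exact-match over the extracted JSON; malformed output scores 0."""
    try:
        pred = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(pred, dict):
        return 0.0
    total = sum(FIELD_WEIGHTS.values())
    hit = sum(
        w for field, w in FIELD_WEIGHTS.items()
        if str(pred.get(field, "")).strip().lower() == str(gold.get(field, "")).strip().lower()
    )
    return hit / total
```

Batched over completions, this should slot into the same reward hook the existing GRPO runs use, so swapping the base checkpoint is the only training-side change.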
Generated by /hf-model-scout · 2026-04-11