Daily Model Scout Report — 2026-05-04
#21
by msudharsanan - opened
Window: 2026-04-27 → 2026-05-04 (last 7 days). Filtered for new VLM base/instruct releases (excluding GGUF quants, abliterated derivatives, reranker/embedding heads, and unrelated text-only LLMs).
Current Denali-AI baseline (3,500-sample hard eval, `_overall.weighted_score`)
| Model | Weighted score |
|---|---|
| qwen3-vl-8b-sft+grpo | 0.9131 (best overall) |
| qwen3-vl-2b-sft-grpo-v9 | 0.8948 (best small) |
| qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 (best quantized) |
| qwen3-vl-8b-instruct-base | 0.8751 |
| qwen35-2b-base | 0.8437 |
Note: `granite4-vision-sft` shows `weighted_score=1.0144` in `eval_all_results.json` — almost certainly an artifact (it exceeds the 1.0 cap) and should be re-verified before use as a comparison anchor.
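To catch this class of anomaly early, a minimal sanity check over the eval results could flag any score above the cap. This is a sketch under an assumed schema — `eval_all_results.json` mapping model names to metric dicts keyed by `_overall.weighted_score`; the real file layout may differ:

```python
def find_impossible_scores(results: dict, cap: float = 1.0) -> list[tuple[str, float]]:
    """Flag (model, weighted_score) pairs that exceed the documented cap.

    Schema assumption: model name -> {"_overall.weighted_score": float, ...}.
    """
    return [
        (model, metrics["_overall.weighted_score"])
        for model, metrics in results.items()
        if metrics.get("_overall.weighted_score", 0.0) > cap
    ]

# Values from the baseline table plus the suspect entry:
results = {
    "qwen3-vl-8b-sft+grpo": {"_overall.weighted_score": 0.9131},
    "granite4-vision-sft": {"_overall.weighted_score": 1.0144},
}
print(find_impossible_scores(results))  # → [('granite4-vision-sft', 1.0144)]
```

Running this over the full results file before each scout report would keep a bad anchor from silently skewing the comparisons.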
High Relevance — benchmark immediately
1. ibm-granite/granite-vision-4.1-4b
- Released: 2026-04-29 — 7,690 dl / 54 likes
- Size: ~4B params (2-shard safetensors), `granite4_vision` arch with `custom_code`
- Link: https://huggingface.co/ibm-granite/granite-vision-4.1-4b
- Why it matters: Direct successor to whatever Granite-4 base our `Granite4-Vision-SFT` was fine-tuned from. Granite-4.1 lands the same week as the new Granite-4.1-3b/8b/30b language line — sharing the new tokenizer + improved vision tower. Worth re-running our SFT recipe on this base.
- Risk: `custom_code` path; vLLM compat is already an issue for our existing granite4 SFT artifacts (`granite4-vision-sft-vllm` and `-deepstack` collapse to ~46% baseline in the 100-sample eval — the lift only appears in the HF-transformers path). Confirm vLLM/PeakBench serving works before training.
- Action: Register in PeakBench, run base eval on 3.5k-hard via `peakbench_start_benchmark`. If the lift is meaningful, queue SFT.
2. nvidia/Cosmos-Reason2-8B
- Released: 2026-04-30 — 221,405 dl / 175 likes
- Size: 8B (4-shard safetensors), `qwen3_vl` arch — fine-tune of `Qwen/Qwen3-VL-8B-Instruct`
- Link: https://huggingface.co/nvidia/Cosmos-Reason2-8B
- Why it matters: Exactly the same architecture as our champion `qwen3-vl-8b-sft+grpo` (so the PeakBench/vLLM path is already proven), but with NVIDIA's reasoning post-training. Could give us a stronger starting point for hard-sample garments where chain-of-thought helps disambiguate (closure type, fine pattern). Drop-in replacement candidate for the 8B base.
- Action: Register, run 3.5k-hard base eval. If `_overall.weighted_score` ≥ 0.88 zero-shot (vs 0.8751 for plain Qwen3-VL-8B-Instruct), it's our new SFT base.
3. ibm-granite/granite-4.0-3b-vision
- Released: 2026-04-30 — 162,908 dl / 109 likes
- Size: ~3B (2-shard safetensors + adapter shard), same `granite4_vision` arch
- Link: https://huggingface.co/ibm-granite/granite-4.0-3b-vision
- Why it matters: Smaller Granite variant (3B vs 4B). If 4.1-4b isn't enough of a lift, the 3B would be the small-model contender against `qwen3-vl-2b-sft-grpo-v9` (0.8948). Same vLLM caveat as #1.
- Action: Bench in the same pass as #1 — both share the load path.
Medium Relevance — worth watching
4. nvidia/Cosmos-Reason2-32B
- Released: 2026-04-30 — 788 dl / 7 likes
- Size: 32B (13 shards), Qwen3-VL-32B-Instruct fine-tune
- Link: https://huggingface.co/nvidia/Cosmos-Reason2-32B
- Why: Inference-only on RTX PRO 6000 98GB (BF16 won't fit, FP8/NVFP4 will). Useful as a quality ceiling reference, not a fine-tune target. Skip unless we need a teacher for distillation.
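The fit call above can be checked with weights-only arithmetic — a back-of-envelope sketch using the usual bytes-per-param figures (weights only; KV cache, activations, vision tower, and CUDA context come on top, which is presumably what rules BF16 out for serving on a 98 GB card):

```python
def weight_mem_gib(n_params: float, bytes_per_param: float) -> float:
    """Weight-only VRAM footprint in GiB; ignores all serving overhead."""
    return n_params * bytes_per_param / 2**30

for fmt, bpp in [("BF16", 2.0), ("FP8", 1.0), ("NVFP4", 0.5)]:
    print(f"{fmt:6s} ~{weight_mem_gib(32e9, bpp):5.1f} GiB")
# BF16 ~59.6 GiB, FP8 ~29.8 GiB, NVFP4 ~14.9 GiB against the 98 GB budget
```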
5. nvidia/Cosmos-Reason2-2B
- Released: 2026-04-30 — 144,674 dl / 70 likes
- Link: https://huggingface.co/nvidia/Cosmos-Reason2-2B
- Why: Already trained internally (job #740, sellability run). Pure-garment eval on this base hasn't been recorded in `eval_all_results.json` yet — worth a one-off PeakBench run for completeness.
6. lightonai/LightOnOCR-2-1B
- Released: 2026-05-04 — 784,707 dl / 677 likes (highest-traction VLM of the week)
- Size: 1B, single safetensors, `mistral3` arch
- Link: https://huggingface.co/lightonai/LightOnOCR-2-1B
- Why: Tagged `ocr` / `document-understanding` / `pdf` / `tables` / `forms` — primarily a document-OCR model. Garment attribute classification ≠ OCR, so the direct fit is weak. However, the brand field (currently 70% on our champion) could benefit from explicit OCR-tuned features, and at 1B it's our smallest viable fine-tune candidate. Lower priority unless we want to tackle brand recognition specifically.
7. sunjuice/Molmo2-8B
- Released: 2026-05-04 — 61 dl / 0 likes
- Size: 8B (8 shards), `molmo2` arch (OLMo backbone), uses official `allenai/Molmo2-*` datasets
- Link: https://huggingface.co/sunjuice/Molmo2-8B
- Why: Community port; allenai itself has only released `Molmo2-O-7B` and `Molmo2-4B` so far (Jan 2026). The architecture is novel for us — Molmo's pointing/grounding pretraining could help defect localization. Wait for an official 8B from allenai before investing.
8. hybridfree/HY-Embodied-0.5
- Released: 2026-05-04 — 13 dl / 0 likes
- Size: 2B, `hunyuan_vl_mot` arch (Mixture-of-Transformers)
- Link: https://huggingface.co/hybridfree/HY-Embodied-0.5
- Why: Brand-new architecture family from Tencent's Hunyuan VL line. Embodied/robotics framing, not classification. Low priority for garments but worth tracking the architecture.
Low Relevance — note and skip
- nvidia/nemotron-ocr-v2 (2026-04-28, 2,547 dl / 172 likes) — pure OCR pipeline, no classification head. https://huggingface.co/nvidia/nemotron-ocr-v2
- TP12123/Qwen3-VL-4B-Instruct — appears to be a re-upload of the existing Qwen3-VL-4B; 0 dl / 0 likes / no signal.
- llmvision/glimpse-v1 — Gemma-3-4B fine-tune for home security; wrong domain.
- FoolDev/janus-27b / FoolDev/janus — GGUF-only community uploads of `qwen3_6` arch; not a base for SFT.
- ADSKAILab/Zero-To-CAD-Qwen3-VL-2B — image-to-CAD task, irrelevant.
Notable absences
- No new Qwen3.5-VL or Qwen3.6-VL official base/instruct release. Qwen org released Qwen3.6-27B and Qwen3.6-35B-A3B (text) on 2026-04-24 — outside the 7-day window and text-only. Community uploads tagged "Qwen3.5-VL" / "Qwen3.6-VL" are MLX/AWQ quants of unreleased weights ("CRACK" suffixes), not legitimate first-party releases. Continue to monitor.
- No new InternVL, PaliGemma, Phi-Vision, or SmolVLM official releases this week.
- No fashion-/garment-/apparel-specific VLM hits in any search.
Recommended actions
- Register `ibm-granite/granite-vision-4.1-4b`, `ibm-granite/granite-4.0-3b-vision`, and `nvidia/Cosmos-Reason2-8B` in PeakBench; queue zero-shot 3.5k-hard benchmarks against our existing prompt set.
- If Cosmos-Reason2-8B beats `qwen3-vl-8b-instruct-base` (0.8751) zero-shot, it becomes the SFT base for the next 8B run — same recipe, single-line change in the train config.
- Re-verify the `granite4-vision-sft` 1.0144 weighted score in `eval_all_results.json` — that value is impossible under the documented scoring scheme and may be polluting our index ranking.
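The action items above can be sketched as a small queue-and-promote helper. Illustrative only: the job-spec shape and helper names are assumptions (the real registration goes through PeakBench); only the model names and thresholds come from this report:

```python
CANDIDATES = [
    "ibm-granite/granite-vision-4.1-4b",
    "ibm-granite/granite-4.0-3b-vision",
    "nvidia/Cosmos-Reason2-8B",
]
COSMOS_PROMOTION_BAR = 0.88   # zero-shot bar set for Cosmos-Reason2-8B above
PLAIN_8B_BASELINE = 0.8751    # qwen3-vl-8b-instruct-base, from the table

def benchmark_queue(candidates: list[str]) -> list[dict]:
    """One zero-shot 3.5k-hard job spec per candidate (hypothetical shape)."""
    return [{"model": m, "eval_set": "3.5k-hard", "mode": "zero-shot"}
            for m in candidates]

def promote_cosmos(zero_shot_score: float) -> bool:
    """Promotion rule: new 8B SFT base only if it clears the 0.88 bar."""
    return zero_shot_score >= COSMOS_PROMOTION_BAR

jobs = benchmark_queue(CANDIDATES)
print(len(jobs), "jobs queued")          # 3 jobs queued
print(promote_cosmos(0.8812))            # True
print(promote_cosmos(PLAIN_8B_BASELINE)) # False: beats old base, misses bar
```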
— Auto-generated scout (Claude Code, /hf-model-scout)