
Qwen3-VL-8B-Instruct (Base)

Zero-shot baseline of Qwen3-VL-8B-Instruct for garment classification. This is the base model before any Denali-AI fine-tuning. Ranked #3/21 on the Denali-AI eval_hard_3500 benchmark with 78.1% weighted score (zero-shot).

Model Details

| Property | Value |
|---|---|
| Architecture | Qwen3-VL |
| Parameters | 8B |
| Base Model | Qwen/Qwen3-VL-8B-Instruct |
| Training | None (zero-shot baseline) |
| Task | Garment attribute extraction (9-field JSON) |
| Output Format | Structured JSON |

Key Highlights

  • Zero-shot baseline: no task-specific fine-tuning applied
  • 100% JSON parse rate: the model produces valid structured JSON out of the box
  • #3/21 on the Denali-AI garment classification leaderboard
  • Strongest zero-shot base model tested: outperforms every fine-tuned 2B model except the SFT+GRPO variant
  • Throughput: 5.5 samples/s

Benchmark Results

Rank #3/21 on eval_hard_3500

| Metric | Score |
|---|---|
| Weighted Score | 78.1% |
| SBERT+NLI Combined | 75.6% |
| JSON Parse Rate | 100% |
| Throughput | 5.5 samples/s |
| Inference Time | 640s (3500 samples) |

Per-Field Scores

| Field | SBERT | NLI | Levenshtein | Token F1 | SBERT+NLI | Weight |
|---|---|---|---|---|---|---|
| type | 78.9% | 67.0% | 72.0% | 59.6% | 69.6% | 2.5x |
| color | 80.7% | 61.7% | 65.9% | 41.6% | 71.2% | 1.0x |
| pattern | 67.6% | 67.1% | 63.6% | 48.0% | 59.9% | 1.0x |
| closure | 43.2% | 34.6% | 41.0% | 29.1% | 35.4% | 1.0x |
| sleeve | 77.2% | 88.1% | 76.6% | 77.1% | 82.9% | 1.0x |
| neckline | 80.8% | 75.0% | 79.9% | 73.3% | 73.5% | 1.0x |
| defect | 96.1% | 96.1% | 95.9% | 95.5% | 96.0% | 2.0x |
| brand | 93.4% | 93.4% | 93.5% | 92.6% | 93.2% | 1.5x |
| size | 98.8% | 98.7% | 98.7% | 98.6% | 98.7% | 1.5x |
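
The two aggregate numbers in the Benchmark Results section can be reproduced from this table, assuming the SBERT+NLI Combined score is the plain mean of the SBERT+NLI column and the Weighted Score is its weight-normalized mean; this assumption reproduces the reported 75.6% and 78.1%:

```python
# Per-field SBERT+NLI scores and weights, copied from the table above.
scores = {
    "type": 69.6, "color": 71.2, "pattern": 59.9, "closure": 35.4,
    "sleeve": 82.9, "neckline": 73.5, "defect": 96.0,
    "brand": 93.2, "size": 98.7,
}
weights = {
    "type": 2.5, "color": 1.0, "pattern": 1.0, "closure": 1.0,
    "sleeve": 1.0, "neckline": 1.0, "defect": 2.0,
    "brand": 1.5, "size": 1.5,
}

# Unweighted mean of the SBERT+NLI column.
combined = sum(scores.values()) / len(scores)

# Weight-normalized mean: each field's score times its weight,
# divided by the total weight (12.5).
weighted = (sum(scores[f] * weights[f] for f in scores)
            / sum(weights.values()))

print(round(combined, 1))  # 75.6 (SBERT+NLI Combined)
print(round(weighted, 1))  # 78.1 (Weighted Score)
```

The gap between the two (+2.5pp) comes from the high-weight fields (type, defect, brand, size) scoring above the unweighted mean.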

Visualizations

(Charts: radar chart, leaderboard, per-metric comparison, throughput; images not included.)

Full Leaderboard

| Rank | Model | Weighted | SBERT+NLI | JSON Parse | Throughput | Inference |
|---|---|---|---|---|---|---|
| 1 | qwen3-vl-8b-sft+grpo | 80.9% | 78.7% | 100% | 7.5/s | 464s |
| 2 | qwen3-vl-2b-sft-grpo-v9 | 79.9% | 78.5% | 100% | 15.9/s | 220s |
| 3 | **qwen3-vl-8b-instruct-base** (this model) | 78.1% | 75.6% | 100% | 5.5/s | 640s |
| 4 | qwen3-vl-8b-instruct-nvfp4 | 77.8% | 75.0% | 100% | 8.2/s | 424s |
| 5 | qwen35-2b-base | 76.2% | 73.0% | 100% | 6.6/s | 534s |
| 6 | qwen3-vl-2b-sft-grpo-v9-nvfp4 | 74.6% | 74.1% | 100% | 17.2/s | 203s |
| 7 | qwen3-vl-2b-instruct-base | 68.0% | 66.7% | 100% | 15.1/s | 231s |
| 8 | internvl3-2b-grpo-gtpo-full | 67.5% | 64.3% | 100% | 11.8/s | 297s |
| 9 | internvl3-2b-grpo-gtpo-fp8 | 67.1% | 63.8% | 100% | 14.3/s | 244s |
| 10 | internvl3-2b-base | 66.8% | 63.7% | 100% | 11.8/s | 297s |
| 11 | moondream2-base | 63.8% | 61.8% | 100% | 1.4/s | 2416s |
| 12 | qwen35-2b-sft-grpo-gtpo-v8 | 60.7% | 60.1% | 100% | 11.3/s | 309s |
| 13 | qwen35-2b-sft-v7 | 58.6% | 58.9% | 100% | 11.6/s | 302s |
| 14 | qwen35-35b-a3b-gptq-int4 | 51.5% | 48.7% | 14% | 1.6/s | 2124s |
| 15 | qwen35-9b-nvfp4-v10 | 48.9% | 46.0% | 8% | 1.7/s | 2075s |
| 16 | qwen35-9b-sft-nvfp4-v11 | 48.3% | 45.5% | 8% | 1.7/s | 2023s |
| 17 | qwen35-2b-base-nvfp4-v10 | 45.9% | 42.9% | 0% | 4.0/s | 878s |
| 18 | qwen3.5-122b-a10b-nvfp4 | 45.9% | 42.9% | 0% | 1.2/s | 2893s |
| 19 | qwen35-2b-sft-nvfp4-v11 | 45.9% | 42.9% | 0% | 4.0/s | 876s |
| 20 | qwen35-2b-sft-grpo-gtpo-nvfp4 | 45.9% | 42.9% | 0% | 3.9/s | 907s |
| 21 | qwen3-vl-8b-sft-grpo | 0.0% | 0.0% | 100% | 0.0/s | 462s |

Comparative Analysis

  • vs qwen3-vl-2b-sft-grpo-v9: -1.8pp weighted score
    • type: +1.4pp
    • color: -5.8pp
    • pattern: -2.9pp
    • closure: -26.0pp
    • sleeve: -2.7pp
    • neckline: +3.8pp
    • defect: -1.2pp
    • brand: +3.9pp
    • size: +2.9pp
  • vs qwen3-vl-2b-instruct-base: +10.1pp weighted score
    • type: +1.3pp
    • color: +2.6pp
    • pattern: +7.1pp
    • closure: +3.0pp
    • sleeve: +4.8pp
    • neckline: +13.5pp
    • defect: +41.1pp
    • brand: +6.8pp
    • size: -0.2pp
  • vs qwen35-2b-base: +1.9pp weighted score
    • type: +1.3pp
    • color: +3.9pp
    • pattern: 0.0pp
    • closure: +1.4pp
    • sleeve: +2.4pp
    • neckline: +16.1pp
    • defect: 0.0pp
    • brand: -1.0pp
    • size: -0.6pp

Improvement Recommendations

  • Fine-tuning (SFT): Per the leaderboard, the 2B Qwen3-VL variant gained +11.9pp weighted (68.0% to 79.9%) from SFT+GRPO. Applying the same recipe to this 8B model could push it well above 80%.
  • Closure field: At 35.4% SBERT+NLI, closure is by far the weakest field; targeted data augmentation for closure types (zipper, button, snap, etc.) would yield the largest single-field gain.
  • GRPO/GTPO reinforcement: After SFT, reward-based RL (GRPO or GTPO) could further refine per-field accuracy, especially on the high-weight fields (type, defect).
  • Quantization: NVFP4 quantization could lift throughput from 5.5 to ~8 samples/s while largely preserving accuracy; the 8B NVFP4 entry on the leaderboard reaches 8.2/s at only -0.3pp weighted, unlike the Qwen3.5 NVFP4 variants, which degraded badly.

Alternative Models

  • Qwen3-VL-4B-Instruct: mid-point between 2B and 8B, worth testing for the speed/quality tradeoff
  • InternVL3-8B: competing 8B VLM architecture, could serve as a direct comparison
  • Qwen3.5-VL-8B: newer architecture revision, may have improved structured-output compliance
  • SmolVLM2-8B: alternative lightweight VLM worth benchmarking

Evaluation Methodology

Models are evaluated on the eval_hard_3500 benchmark using:

| Metric | Description |
|---|---|
| SBERT Cosine | Semantic similarity via sentence-transformers (all-MiniLM-L6-v2) |
| NLI Score | Natural language inference entailment scoring |
| Levenshtein Ratio | Fuzzy string matching |
| Token F1 | Token-level precision/recall |
| Weighted Score | Field-weighted aggregate (type=2.5x, defect=2.0x, brand/size=1.5x, all other fields 1.0x) |
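
The scoring code itself is not published in this card. As one plausible reading of the Token F1 row, here is a sketch of SQuAD-style token-level F1 over whitespace tokens; the tokenization and normalization choices are assumptions, not the official implementation:

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall
    over (lowercased, whitespace-split) token multisets."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(token_f1("short sleeve", "long sleeve"))  # 0.5
print(token_f1("crew neck", "crew neck"))       # 1.0
```

A metric of this shape rewards partial matches ("navy blue" vs "blue"), which is why Token F1 runs below the embedding-based SBERT scores on free-text fields like color.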

Citation

```bibtex
@misc{denali-ai-qwen3-vl-8b-instruct-base,
  title={Qwen3-VL-8B-Instruct (Base)},
  author={Denali AI},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/Denali-AI/qwen3-vl-8b-instruct-base}
}
```

License

This model is released under the Apache 2.0 License.
