---
title: Denali AI
short_description: VLMs for Garment Attribute Extraction
---

# Denali AI – Vision-Language Models for Garment Classification

<div align="center">

**Advancing structured attribute extraction from garment images through multi-stage reinforcement learning**

[![Models](https://img.shields.io/badge/Models-28-blue)](https://huggingface.co/Denali-AI)
[![Benchmark](https://img.shields.io/badge/Benchmark-3%2C500_samples-green)](https://huggingface.co/datasets/Denali-AI/eval-hard-3500)
[![License](https://img.shields.io/badge/License-Apache_2.0-orange)](https://www.apache.org/licenses/LICENSE-2.0)
[![Best Score](https://img.shields.io/badge/Best_Weighted_Score-91.3%25-brightgreen)](https://huggingface.co/Denali-AI/qwen3-vl-8b-garment-classifier)

</div>

---

## Abstract

Denali AI develops and benchmarks vision-language models (VLMs) for **structured garment attribute extraction**, the task of analyzing a garment image and producing a complete JSON object describing 9 key attributes: type, color, pattern, neckline, sleeve length, closure, brand, size, and defect type.

We systematically evaluate the impact of **supervised fine-tuning (SFT)**, **Group Relative Policy Optimization (GRPO)**, and **Group-relative Trajectory-based Policy Optimization (GTPO)** across multiple model architectures (Qwen3-VL, Qwen3.5-VL, InternVL3, Florence-2, Moondream2, Phi-4) and scales (1.6B to 122B parameters). Our best model, **[Qwen3-VL-8B SFT+GRPO](https://huggingface.co/Denali-AI/qwen3-vl-8b-garment-classifier)**, achieves **91.3% weighted score** with **100% JSON parse rate** on the eval_hard_3500 benchmark.

---

## Leaderboard

![Model Leaderboard](https://huggingface.co/Denali-AI/org-assets/resolve/main/leaderboard.png)

| Rank | Model | Architecture | Params | Training | Weighted | SBERT+NLI | JSON Parse | Throughput |
|:----:|-------|:------------:|:------:|:--------:|:--------:|:---------:|:----------:|:----------:|
| 1 | [Qwen3-VL-8B SFT+GRPO](https://huggingface.co/Denali-AI/qwen3-vl-8b-garment-classifier) | Qwen3-VL | 8B | SFT+GRPO | 91.3% | 78.7% | 100% | 7.5/s |
| 2 | [Qwen3-VL-2B-SFT-GRPO-v9](https://huggingface.co/Denali-AI/qwen3-vl-2b-sft-grpo-v9) | Qwen3-VL | 2B | SFT+GRPO | 89.5% | 78.5% | 100% | 15.9/s |
| 3 | [Qwen3-VL-8B SFT+GRPO NVFP4](https://huggingface.co/Denali-AI/qwen3-vl-8b-garment-classifier-nvfp4) | Qwen3-VL | 8B | SFT+GRPO | 89.5% | 77.0% | 100% | 12.1/s |
| 4 | [Qwen3-VL-8B-Instruct-Base](https://huggingface.co/Denali-AI/qwen3-vl-8b-instruct-base) | Qwen3-VL | 8B | Zero-shot | 87.5% | 75.6% | 100% | 5.5/s |
| 5 | [Qwen3-VL-8B-Instruct NVFP4](https://huggingface.co/Denali-AI/qwen3-vl-8b-instruct-nvfp4) | Qwen3-VL | 8B | Zero-shot | 87.2% | 75.0% | 100% | 8.2/s |
| 6 | [Qwen3.5-VL-2B Base](https://huggingface.co/Denali-AI/qwen35-2b-base) | Qwen3.5-VL | 2B | Zero-shot | 84.4% | 73.0% | 100% | 6.6/s |
| 7 | [Qwen3-VL-2B SFT+GRPO v9 NVFP4](https://huggingface.co/Denali-AI/qwen3-vl-2b-sft-grpo-v9-nvfp4) | Qwen3-VL | 2B | SFT+GRPO | 84.2% | 74.1% | 100% | 17.2/s |
| 8 | [Qwen3-VL-2B-Instruct Base](https://huggingface.co/Denali-AI/qwen3-vl-2b-instruct-base) | Qwen3-VL | 2B | Zero-shot | 76.4% | 66.7% | 100% | 15.1/s |
| 9 | [InternVL3-2B GRPO+GTPO Full](https://huggingface.co/Denali-AI/internvl3-2b-grpo-gtpo-full) | InternVL3 | 2B | GRPO+GTPO | 72.7% | 64.3% | 100% | 11.8/s |
| 10 | [InternVL3-2B GRPO+GTPO FP8](https://huggingface.co/Denali-AI/internvl3-2b-grpo-gtpo-fp8) | InternVL3 | 2B | GRPO+GTPO | 72.2% | 63.8% | 100% | 14.3/s |
| 11 | [InternVL3-2B Base](https://huggingface.co/Denali-AI/internvl3-2b-base) | InternVL3 | 2B | Zero-shot | 71.8% | 63.7% | 100% | 11.8/s |
| 12 | [Moondream2 Base](https://huggingface.co/Denali-AI/moondream2-base) | Moondream2 | 1.6B | Zero-shot | 69.8% | 61.8% | 100% | 1.4/s |
| 13 | [Qwen3.5-VL-2B SFT+GRPO+GTPO](https://huggingface.co/Denali-AI/qwen35-2b-sft-grpo-gtpo-merged) | Qwen3.5-VL | 2B | SFT+GRPO+GTPO | 65.3% | 60.1% | 100% | 11.3/s |
| 14 | [Qwen3.5-VL-2B SFT](https://huggingface.co/Denali-AI/qwen35-2b-sft-merged) | Qwen3.5-VL | 2B | SFT | 63.7% | 58.9% | 100% | 11.6/s |
| 15 | [Qwen3.5-VL-35B GPTQ-Int4](https://huggingface.co/Denali-AI/qwen35-35b-a3b-gptq-int4) | Qwen3.5-VL MoE | 35B (3B) | Zero-shot | 50.7% | 48.7% | 14% | 1.6/s |
| 16 | Qwen3.5-VL-9B NVFP4 | Qwen3.5-VL | 9B | Zero-shot | 47.0% | 46.0% | 8% | 1.7/s |
| 17 | [Qwen3.5-VL-9B SFT NVFP4](https://huggingface.co/Denali-AI/qwen35-9b-sft-nvfp4) | Qwen3.5-VL | 9B | SFT | 46.3% | 45.5% | 8% | 1.7/s |
| 18 | Qwen3.5-VL-2B Base NVFP4 | Qwen3.5-VL | 2B | Zero-shot | 42.9% | 42.9% | 0% | 4.0/s |
| 19 | [Qwen3.5-VL-122B NVFP4](https://huggingface.co/Denali-AI/qwen35-122b-a10b-nvfp4) | Qwen3.5-VL MoE | 122B (10B) | Zero-shot | 42.9% | 42.9% | 0% | 1.2/s |
| 20 | [Qwen3.5-VL-2B SFT NVFP4](https://huggingface.co/Denali-AI/qwen35-2b-sft-nvfp4) | Qwen3.5-VL | 2B | SFT | 42.9% | 42.9% | 0% | 4.0/s |
| 21 | Qwen3.5-VL-2B SFT+GRPO+GTPO NVFP4 | Qwen3.5-VL | 2B | SFT+GRPO+GTPO | 42.9% | 42.9% | 0% | 3.9/s |
| 22 | [Phi-4-Multimodal NVFP4](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) | Phi-4 | 5.6B | Zero-shot | 42.9% | 42.9% | 0% | – |

> **Note:** Models ranked 18-22 have 0% JSON parse rate under NVFP4 quantization, meaning they cannot produce valid structured output; their weighted scores reflect the 42.9% floor from partial field matches in malformed outputs. Fine-tuning is required to unlock their potential.

---

## Task Definition

Given a single garment image, the model must extract **9 structured attributes** as a valid JSON object:

```json
{
  "type": "t-shirt",
  "color": "navy blue",
  "pattern": "solid",
  "neckline": "crew neck",
  "sleeve_length": "short sleeve",
  "closure": "pullover",
  "brand": "Nike",
  "size": "M",
  "defect_type": "small hole on left shoulder"
}
```

### Field Importance Weights

Not all fields are equally important. The weighted score uses domain-specific multipliers:

![Field Weights](https://huggingface.co/Denali-AI/org-assets/resolve/main/field_weights.png)

| Field | Weight | Rationale |
|-------|:------:|-----------|
| **Type** | 2.5x | Critical for inventory routing and categorization |
| **Defect** | 2.0x | Directly impacts quality control and pricing |
| **Brand** | 1.5x | Essential for authentication and valuation |
| **Size** | 1.5x | Required for accurate listing and search |
| Color, Pattern, Neckline, Sleeve, Closure | 1.0x | Standard descriptive attributes |

---

## Key Results

### Per-Field Performance

![Radar Comparison](https://huggingface.co/Denali-AI/org-assets/resolve/main/radar_comparison.png)

![Performance Heatmap](https://huggingface.co/Denali-AI/org-assets/resolve/main/heatmap.png)

### Accuracy vs Throughput

![Throughput Analysis](https://huggingface.co/Denali-AI/org-assets/resolve/main/throughput_scatter.png)

**Key finding:** Qwen3-VL-2B v9 NVFP4 achieves the best accuracy-throughput trade-off at 84.2% weighted score and 17.2 samples/s, making it the Pareto-optimal choice for production deployment. For maximum accuracy, the Qwen3-VL-8B SFT+GRPO reaches 91.3% at 7.5 samples/s.

### Structured Output Reliability

![JSON Parse Rates](https://huggingface.co/Denali-AI/org-assets/resolve/main/json_parse.png)

Fine-tuned models achieve **100% JSON parse rate**, while zero-shot baselines (GPTQ, NVFP4) fail to produce valid JSON in 86-100% of cases. This demonstrates that **SFT is essential** for teaching structured output format, regardless of model scale.

### Impact of Training Stages

![Training Impact](https://huggingface.co/Denali-AI/org-assets/resolve/main/training_impact.png)

**Left panel:** Adding GRPO+GTPO to Qwen3.5-2B improves brand recognition from 15.6% to 24.8% and defect detection from 89.5% to 95.1%, with a +1.6% overall gain.

**Right panel:** FP8 quantization of InternVL3-2B shows <1% accuracy degradation across all fields while reducing memory footprint, confirming FP8 as a practical deployment optimization.

---

## Model Collections

### By Architecture

| Collection | Models | Description |
|------------|:------:|-------------|
| [**Qwen3-VL**](https://huggingface.co/collections/Denali-AI/qwen3-vl-models-69c70950fca01f437228c29b) | 7 | Top-performing Qwen3-VL based models (2B and 8B) |
| [**Qwen3.5-VL**](https://huggingface.co/collections/Denali-AI/qwen35-vl-models-69c70802ab21ae73a116cc92) | 10 | Qwen3.5-VL models (0.8B to 122B) |
| [**InternVL3**](https://huggingface.co/collections/Denali-AI/internvl3-models-69c70803ab21ae73a116cca2) | 6 | InternVL3 models (1B, 2B) |
| [**Florence-2**](https://huggingface.co/collections/Denali-AI/florence-2-models-69c70802f1456fd2264216e8) | 3 | Florence-2 encoder-decoder models |
| [**Benchmarks**](https://huggingface.co/collections/Denali-AI/benchmarks-and-datasets-69c708037d77aba79963c1a7) | 2 | Evaluation and training datasets |

---

## Training Pipeline

All fine-tuned models follow the **Denali-AI Multi-Stage RL Pipeline**:

```
                    ┌──────────────────────────────────────────────────┐
                    │           Denali-AI Training Pipeline            │
                    └──────────────────────────────────────────────────┘
                                          │
                    ┌─────────────────────┼─────────────────────┐
                    ▼                     ▼                     ▼
              ┌───────────┐       ┌──────────────┐      ┌──────────────┐
              │  Stage 1  │       │   Stage 2    │      │   Stage 3    │
              │    SFT    │──────▶│     GRPO     │─────▶│     GTPO     │
              │  (LoRA)   │       │  (Rewards)   │      │ (Trajectory) │
              └───────────┘       └──────────────┘      └──────────────┘
                    │                     │                     │
              JSON format          Field accuracy         Coherence &
              acquisition          optimization           regularization

### Stage 1: Supervised Fine-Tuning (SFT)

- **Method:** LoRA (r=16, alpha=32) on frozen base model
- **Data:** [train-10k-balanced-v3](https://huggingface.co/datasets/Denali-AI/train-10k-balanced-v3) – 10,000 curated samples
- **Objective:** Teach valid JSON output format and basic field extraction
- **Key outcome:** 100% JSON parse rate
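
Below is a minimal sketch of what the Stage 1 setup can look like with Hugging Face `peft`; only r=16 / alpha=32 and the dataset come from this card, while the base checkpoint name, target modules, and remaining hyperparameters are illustrative assumptions rather than our exact training configuration.

```python
# Stage 1 sketch -- assumptions: target modules, dtype, and dropout are
# illustrative; only r=16 and alpha=32 come from this card.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen3-VL-8B-Instruct"  # assumed base checkpoint
model = AutoModelForImageTextToText.from_pretrained(base, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(base)

lora = LoraConfig(
    r=16,                                                      # rank (this card)
    lora_alpha=32,                                             # alpha (this card)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumed
    lora_dropout=0.05,                                         # illustrative
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)   # base weights stay frozen; only adapters train
model.print_trainable_parameters()
# Standard next-token SFT then runs over the 10k JSON-labelled samples
# (e.g. with TRL's SFTTrainer).
```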

### Stage 2: Group Relative Policy Optimization (GRPO)

- **Method:** Reward-based RL without a critic model
- **Reward engine:** 3-layer scoring system
  - Layer 1: JSON validity gate (binary)
  - Layer 2: Structural correctness (20% weight)
  - Layer 3: Per-field content accuracy (80% weight)
- **Key outcome:** Improved field-level accuracy, especially for challenging fields
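
As a concrete sketch of how this 3-layer reward can be wired up (the structural check and the per-field scorer below are placeholders; only the validity gate and the 20%/80% split come from this card):

```python
import json

REQUIRED_FIELDS = [
    "type", "color", "pattern", "neckline", "sleeve_length",
    "closure", "brand", "size", "defect_type",
]

def reward(completion: str, ground_truth: dict, field_score) -> float:
    """3-layer reward: JSON validity gate, structure (20%), per-field content (80%).

    `field_score(pred, gold) -> float in [0, 1]` is a placeholder for the semantic
    scorer (e.g. the SBERT+NLI cascade described under Metrics below)."""
    # Layer 1: validity gate -- unparseable output earns nothing.
    try:
        pred = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0

    # Layer 2: structural correctness (fraction of required keys present), 20% weight.
    structure = sum(f in pred for f in REQUIRED_FIELDS) / len(REQUIRED_FIELDS)

    # Layer 3: per-field content accuracy, 80% weight.
    content = sum(
        field_score(str(pred.get(f, "")), str(ground_truth.get(f, "")))
        for f in REQUIRED_FIELDS
    ) / len(REQUIRED_FIELDS)

    return 0.2 * structure + 0.8 * content
```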

### Stage 3: Group-relative Trajectory-based Policy Optimization (GTPO)

- **Method:** Conflict-aware gradient optimization with entropy regularization
- **Key outcome:** Trajectory-level coherence and reduced field-level conflicts
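
The full GTPO objective is out of scope here, but the entropy-regularization component can be illustrated with a generic policy-gradient loss over sampled trajectories; everything below (tensor shapes, the `beta` coefficient, the omission of the conflict-aware gradient handling) is an illustrative assumption, not our training code.

```python
import torch
import torch.nn.functional as F

def entropy_regularized_policy_loss(logits, actions, advantages, beta=0.01):
    """Illustrates only the entropy-regularization term of a trajectory-level loss.

    logits:     [batch, seq, vocab] for the sampled completions
    actions:    [batch, seq] sampled token ids
    advantages: [batch] group-relative advantages
    beta:       illustrative entropy coefficient"""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # [batch, seq]
    # Policy-gradient term weighted by group-relative advantages.
    pg_loss = -(advantages.unsqueeze(-1) * token_logp).mean()
    # Entropy bonus discourages the policy from collapsing onto a single phrasing.
    entropy = -(logp.exp() * logp).sum(dim=-1).mean()
    return pg_loss - beta * entropy
```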

---

## Evaluation Methodology

### Benchmark

All models are evaluated on [**eval_hard_3500**](https://huggingface.co/datasets/Denali-AI/eval-hard-3500), a curated benchmark of 3,500 challenging garment images selected for diversity in:
- Garment type (tops, bottoms, dresses, outerwear, accessories)
- Visual complexity (patterns, prints, multi-color)
- Edge cases (ambiguous attributes, partially visible labels)

### Metrics

We employ a **comprehensive multi-metric evaluation framework** rather than relying on exact match. Each metric captures a different dimension of prediction quality:

| Metric | Scoring Model | Description |
|--------|-------|-------------|
| **SBERT Cosine** | all-MiniLM-L6-v2 | Semantic similarity via sentence embeddings |
| **NLI Score** | nli-MiniLM2-L6-H768 | Natural language inference entailment |
| **Levenshtein Ratio** | – | Fuzzy string matching distance |
| **Token F1** | – | Token-level precision and recall |
| **SBERT+NLI Combined** | – | Primary metric: cascaded combination of SBERT cosine and NLI |
| **Weighted Score** | – | Field-weighted aggregate (see weights above) |

<details>
<summary><b>Metric Definitions (click to expand)</b></summary>

#### SBERT Cosine Similarity
Measures how semantically close the predicted value is to the ground truth by encoding both strings into dense vector embeddings using the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) sentence-transformer model and computing their cosine similarity. A score of 1.0 means the embeddings are identical in direction (semantically equivalent), while 0.0 means they are orthogonal (unrelated). This captures meaning-level similarity: for example, "navy blue" and "dark blue" score high despite being different strings. Values are thresholded: scores above 0.85 map to full credit, scores below 0.50 map to zero, and values in between are linearly scaled.
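
An illustrative re-implementation of this thresholding (a sketch, not necessarily the exact evaluation code; the helper name `sbert_score` is ours):

```python
from sentence_transformers import SentenceTransformer, util

_sbert = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def sbert_score(pred: str, gold: str, hi: float = 0.85, lo: float = 0.50) -> float:
    """Cosine similarity mapped to [0, 1] with the thresholds described above."""
    emb = _sbert.encode([pred, gold], convert_to_tensor=True)
    cos = util.cos_sim(emb[0], emb[1]).item()
    if cos >= hi:
        return 1.0                    # full credit
    if cos <= lo:
        return 0.0                    # no credit
    return (cos - lo) / (hi - lo)     # linear scaling in between
```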

#### NLI Score (Natural Language Inference)
Uses a cross-encoder NLI model ([nli-MiniLM2-L6-H768](https://huggingface.co/cross-encoder/nli-MiniLM2-L6-H768)) to determine whether the predicted value *entails*, *contradicts*, or is *neutral* to the ground truth. The predicted and ground-truth values are evaluated as a premise-hypothesis pair (e.g., "the color is navy blue" vs "the color is dark blue"). Entailment probability above 0.6 yields a score of at least 0.8; contradiction probability above 0.6 heavily penalizes the score (scaled down to 30% of base). This metric is particularly valuable for detecting semantic contradictions that string-level metrics would miss; e.g., "long sleeve" vs "short sleeve" are textually similar but semantically opposite.

#### Levenshtein Ratio
Computes the normalized edit distance between the predicted and ground-truth strings (after lowercasing and stripping). The ratio is `1 - (edit_distance / max_length)`, ranging from 0.0 (completely different) to 1.0 (identical). This character-level metric catches minor spelling variations and typos: for example, "pullover" vs "pull-over" score nearly 1.0. It complements the semantic metrics by providing a surface-level similarity signal that is model-free and deterministic.
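
A minimal sketch, assuming the `python-Levenshtein` package (any edit-distance implementation works the same way):

```python
import Levenshtein  # pip install python-Levenshtein

def levenshtein_ratio(pred: str, gold: str) -> float:
    a, b = pred.strip().lower(), gold.strip().lower()
    if not a and not b:
        return 1.0  # two empty strings are identical
    dist = Levenshtein.distance(a, b)
    return 1.0 - dist / max(len(a), len(b))
```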

#### Token F1
Computes token-level precision and recall by treating the predicted and ground-truth strings as bags of whitespace-delimited tokens. Precision is the fraction of predicted tokens that appear in the ground truth; recall is the fraction of ground-truth tokens that appear in the prediction. F1 is their harmonic mean. This metric handles multi-word values well: "light blue cotton" vs "blue cotton" gets partial credit for the overlapping tokens, unlike exact match, which would score 0. Particularly useful for defect descriptions and color fields where partial matches are meaningful.
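
A short sketch of the computation (counting overlapping tokens with multiplicity, via `Counter`, is an implementation assumption):

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return 1.0 if pred_tokens == gold_tokens else 0.0
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```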

#### SBERT+NLI Combined
The **primary evaluation metric** used for ranking models. It combines SBERT cosine similarity and NLI scoring in a cascaded approach inspired by the training reward engine: first, the SBERT cosine score is mapped to a base score (1.0 if cosine >= 0.85, linearly scaled between 0.50-0.85, 0.0 below 0.50). Then, NLI adjusts this base: if the NLI model detects strong entailment (>0.6), the score is boosted to at least 0.8; if it detects strong contradiction (>0.6), the score is reduced to 30% of the base. This two-stage approach leverages both embedding similarity and logical inference for robust evaluation.
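
A sketch of this cascade, reusing `sbert_score` from the SBERT snippet above; the NLI label order and the premise/hypothesis phrasing are assumptions:

```python
import numpy as np
from sentence_transformers import CrossEncoder

_nli = CrossEncoder("cross-encoder/nli-MiniLM2-L6-H768")
_NLI_LABELS = ["contradiction", "entailment", "neutral"]  # assumed label order

def sbert_nli_combined(pred: str, gold: str) -> float:
    base = sbert_score(pred, gold)                      # SBERT-derived base score
    # CrossEncoder returns raw logits for the three NLI classes; softmax them.
    logits = _nli.predict([(f"the value is {gold}", f"the value is {pred}")])[0]
    exp = np.exp(logits - logits.max())
    probs = dict(zip(_NLI_LABELS, (exp / exp.sum()).tolist()))
    if probs["entailment"] > 0.6:
        return max(base, 0.8)    # strong entailment boosts the score to at least 0.8
    if probs["contradiction"] > 0.6:
        return 0.3 * base        # strong contradiction scales the base down to 30%
    return base
```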

#### Weighted Score
The **headline metric** for model comparison. It multiplies each field's SBERT+NLI Combined score by its domain-specific importance weight (type=2.5x, defect=2.0x, brand=1.5x, size=1.5x, others=1.0x) and normalizes by the total weight. This reflects real-world value: correctly identifying garment type and defects matters more than getting the closure style right. A hallucination (predicting a value when ground truth is null) incurs a -0.3 penalty to discourage false positives. The weighted score ranges from 0% to 100%, with our best model achieving 91.3%.
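
A sketch of the aggregation (the weights and the -0.3 penalty come from this card; the handling of the case where prediction and ground truth are both null is an assumption):

```python
FIELD_WEIGHTS = {
    "type": 2.5, "defect_type": 2.0, "brand": 1.5, "size": 1.5,
    "color": 1.0, "pattern": 1.0, "neckline": 1.0,
    "sleeve_length": 1.0, "closure": 1.0,
}

def weighted_score(pred: dict, gold: dict, field_score) -> float:
    """Field-weighted aggregate; `field_score` is the SBERT+NLI combined scorer."""
    total, total_weight = 0.0, sum(FIELD_WEIGHTS.values())
    for field, weight in FIELD_WEIGHTS.items():
        p, g = pred.get(field), gold.get(field)
        if g in (None, "", "null"):
            # Hallucination penalty when the model invents a value; full credit
            # when it correctly stays silent (assumption).
            score = -0.3 if p not in (None, "", "null") else 1.0
        else:
            score = field_score(str(p or ""), str(g))
        total += weight * score
    return total / total_weight
```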

#### JSON Parse Rate
The percentage of model outputs that are valid, parseable JSON objects. Fine-tuned models achieve 100%; zero-shot models often fail at 0-14%. This is a binary pass/fail gate: if the output cannot be parsed as JSON, all field scores for that sample are 0.

#### Throughput
End-to-end inference speed measured in samples per second, including network overhead, across 8 concurrent workers hitting a vLLM server. Higher throughput indicates better production viability. Measured on NVIDIA RTX PRO 6000 Blackwell (98 GB VRAM).

</details>

This multi-metric approach captures semantic similarity rather than requiring exact string matches, which is critical for fields like color ("navy blue" vs "dark blue") and defect descriptions.

### Evaluation Protocol

- **Inference:** 8 concurrent workers via OpenAI-compatible API (vLLM)
- **Samples:** All 3,500 samples, no subsampling
- **Compute:** NVIDIA RTX PRO 6000 Blackwell (98 GB VRAM)
- **Reproducibility:** Fixed prompts, deterministic sampling (temperature=0)
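
The protocol can be reproduced with a minimal client sketch like the one below (the prompt text, model name, image list, and `max_tokens` are illustrative; only temperature=0, the 8 workers, and the OpenAI-compatible vLLM endpoint come from this card):

```python
import base64
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM server
image_paths = ["examples/garment_001.jpg"]  # illustrative; eval uses all 3,500 samples

def classify(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="Denali-AI/qwen3-vl-8b-garment-classifier",
        temperature=0,      # deterministic sampling, as in the protocol
        max_tokens=512,     # illustrative
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text",
                 "text": "Extract the 9 garment attributes as a JSON object."},
            ],
        }],
    )
    return resp.choices[0].message.content

# 8 concurrent workers, matching the protocol above.
with ThreadPoolExecutor(max_workers=8) as pool:
    outputs = list(pool.map(classify, image_paths))
```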

---

## Key Findings

1. **Qwen3-VL-8B SFT+GRPO is the new champion at 91.3%.** Fine-tuning the 8B model with SFT+GRPO surpasses the previous best (2B v9 at 89.5%) while maintaining 100% JSON parse rate.

2. **Architecture matters more than scale.** The 2B Qwen3-VL (89.5%) outperforms the 35B Qwen3.5 MoE (50.7%) by a wide margin, and even the zero-shot Qwen3-VL-8B (87.5%) outperforms all fine-tuned Qwen3.5-VL models.

3. **SFT is non-negotiable for structured output.** All fine-tuned models achieve 100% JSON parse rate; all zero-shot NVFP4/GPTQ models fail at 0-14%. No amount of model scale compensates for the lack of format training.

4. **NVFP4 quantization preserves accuracy for Qwen3-VL.** The 8B NVFP4 variant loses only 1.8pp (91.3% vs 89.5%) while gaining 61% throughput (7.5 vs 12.1 samples/s). The 2B NVFP4 loses 5.3pp but gains 8% throughput.

5. **FP8 quantization is effectively free.** InternVL3-2B loses <1% accuracy with FP8 while gaining 21% throughput (11.8 vs 14.3 samples/s).

6. **Qwen3-VL dominates all scales.** The top 8 models are all Qwen3-VL variants. Even zero-shot Qwen3-VL-8B (87.5%) outperforms all fine-tuned InternVL3 and Qwen3.5-VL models.

7. **RL provides meaningful but modest gains.** GRPO+GTPO adds +1.6% weighted score over SFT-only for Qwen3.5-2B, with the largest gains on brand (+9.2pp) and defect (+5.6pp).

---

## Research Directions & Future Work

### Near-Term Improvements

| Direction | Expected Impact | Effort |
|-----------|:--------------:|:------:|
| **GTPO on Qwen3-VL-8B SFT+GRPO** | +1-3pp weighted (add trajectory optimization to the #1 model) | Low |
| **GTPO on Qwen3-VL-2B v9** | +2-4pp weighted (currently SFT+GRPO only) | Low |
| **SFT on Qwen3-VL-8B from zero-shot** | Push past 91.3% with better starting point | Low |
| **QLoRA on Qwen3.5-35B GPTQ** | JSON parse 14% -> 100%, weighted 50% -> ~80%+ | Low |
| **OCR pre-processing pipeline** | Fix brand/size for Qwen3.5 models (+30-60pp on those fields) | Medium |
| **Higher LoRA rank (r=32/64)** | +1-3pp from increased adapter capacity | Low |
| **Guided JSON decoding** | Force 100% JSON parse on zero-shot models without training | Low |

### Architecture Exploration

Models we haven't tested but are strong candidates:

| Model | Parameters | Why Promising |
|-------|:----------:|---------------|
| **[InternVL3-4B](https://huggingface.co/OpenGVLab/InternVL3-4B)** | 4B | Mid-range InternVL; may close the gap to Qwen3-VL |
| **[SmolVLM2-2.2B](https://huggingface.co/HuggingFaceTB/SmolVLM2-2.2B-Instruct)** | 2.2B | HuggingFace's efficient VLM with strong structured output |
| **[PaliGemma2-3B](https://huggingface.co/google/paligemma2-3b-pt-448)** | 3B | Google VLM with excellent OCR; may solve brand/size |
| **[MiniCPM-V-2.6](https://huggingface.co/openbmb/MiniCPM-V-2_6)** | 2.8B | Strong small VLM with good OCR capabilities |
| **[Qwen3-VL-32B](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct)** | 32B | Largest Qwen3-VL; given the 8B's dominance, it could push past 95% |

### Long-Term Research

1. **Ensemble routing:** Use a lightweight classifier to route each field to the best-performing model (e.g., Qwen3-VL for visual attributes, InternVL3 for brand/size)
2. **Curriculum learning:** Progressive-difficulty training (easy garments first, hard edge cases last)
3. **Synthetic data generation:** Use large VLMs (122B) to generate training labels for unlabeled garment images at scale
4. **Multi-image input:** Leverage front + back + tag images simultaneously for higher accuracy
5. **Active learning:** Identify samples where models disagree most and prioritize annotation of those

### Key Open Questions

- Why does Qwen3-VL dramatically outperform Qwen3.5-VL at the same scale? Is it the vision encoder, the cross-attention mechanism, or training data?
- Can RL gains be amplified beyond +1.8pp on the 8B model? Current GRPO hyperparameters may be suboptimal
- Is there a parameter count sweet spot between 8B and 32B where accuracy saturates?
- Would instruction-tuned base models (vs base models) yield better SFT starting points?

---

## Datasets

| Dataset | Samples | Purpose | Link |
|---------|:-------:|---------|------|
| **eval_hard_3500** | 3,500 | Evaluation benchmark (hard subset) | [Link](https://huggingface.co/datasets/Denali-AI/eval-hard-3500) |
| **train_10k_balanced_v3** | 10,000 | Training data (balanced sampling) | [Link](https://huggingface.co/datasets/Denali-AI/train-10k-balanced-v3) |

---

## Citation

```bibtex
@misc{denali-ai-2026,
  title={Structured Garment Attribute Extraction via Multi-Stage Reinforcement Learning},
  author={Denali AI},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/Denali-AI}
}
```

## License

All models and datasets are released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

## Contact

- **Organization:** [Denali Advanced Integration](https://denaliai.com)
- **Issues:** [GitHub](https://github.com/Denali-AI)
- **HuggingFace:** [Denali-AI](https://huggingface.co/Denali-AI)