msudharsanan committed on
Commit dccd089 · verified · 1 Parent(s): 20903ae

Update README.md

Files changed (1)
  1. README.md +241 -7
README.md CHANGED
@@ -1,10 +1,244 @@
  ---
- title: README
- emoji: 🔥
- colorFrom: yellow
- colorTo: blue
- sdk: static
- pinned: false
  ---

- Edit this `README.md` markdown file to author your organization card.
  ---
+ title: Denali AI
+ short_description: Vision-Language Models for Garment Classification
  ---

+ # Denali AI: Vision-Language Models for Garment Classification
+
+ <div align="center">
+
+ **Advancing structured attribute extraction from garment images through multi-stage reinforcement learning**
+
+ [![Models](https://img.shields.io/badge/Models-16-blue)](https://huggingface.co/Denali-AI)
+ [![Benchmark](https://img.shields.io/badge/Benchmark-3%2C500_samples-green)](https://huggingface.co/datasets/Denali-AI/eval-hard-3500)
+ [![License](https://img.shields.io/badge/License-Apache_2.0-orange)](https://www.apache.org/licenses/LICENSE-2.0)
+ [![Best Score](https://img.shields.io/badge/Best_Weighted_Score-89.5%25-brightgreen)](https://huggingface.co/Denali-AI/qwen3-vl-2b-sft-grpo-v9)
+
+ </div>
+
+ ---
+
+ ## Abstract
+
+ Denali AI develops and benchmarks vision-language models (VLMs) for **structured garment attribute extraction**: the task of analyzing a garment image and producing a complete JSON object describing 9 key attributes (type, color, pattern, neckline, sleeve length, closure, brand, size, and defect type).
+
+ We systematically evaluate the impact of **supervised fine-tuning (SFT)**, **Group Relative Policy Optimization (GRPO)**, and **Group-relative Trajectory-based Policy Optimization (GTPO)** across multiple model architectures (Qwen3-VL, Qwen3.5-VL, InternVL3, Florence-2) and scales (0.8B to 122B parameters). Our best model, **Qwen3-VL-2B SFT+GRPO v9**, achieves **89.5% weighted score** with **100% JSON parse rate** on the eval_hard_3500 benchmark.
+
+ ---
+
+ ## Leaderboard
+
+ ![Model Leaderboard](https://huggingface.co/Denali-AI/org-assets/resolve/main/leaderboard.png)
+
+ | Rank | Model | Architecture | Params | Training | Weighted | SBERT+NLI | JSON% | Throughput (samples/s) |
+ |:----:|-------|-------------|:------:|----------|:--------:|:---------:|:-----:|:----------:|
+ | 1 | **[Qwen3-VL-2B SFT+GRPO v9](https://huggingface.co/Denali-AI/qwen3-vl-2b-sft-grpo-v9)** | Qwen3-VL | 2B | SFT+GRPO | **89.5%** | 78.5% | 100% | 15.9/s |
+ | 2 | [InternVL3-2B GRPO+GTPO Full](https://huggingface.co/Denali-AI/internvl3-2b-grpo-gtpo-full) | InternVL3 | 2B | GRPO+GTPO | **72.7%** | 64.3% | 100% | 11.8/s |
+ | 3 | [InternVL3-2B GRPO+GTPO FP8](https://huggingface.co/Denali-AI/internvl3-2b-grpo-gtpo-fp8) | InternVL3 | 2B | GRPO+GTPO | **72.2%** | 63.8% | 100% | 14.3/s |
+ | 4 | [Qwen3.5-2B SFT+GRPO+GTPO v8](https://huggingface.co/Denali-AI/qwen35-2b-sft-grpo-gtpo-merged) | Qwen3.5-VL | 2B | SFT+GRPO+GTPO | **65.3%** | 60.1% | 100% | 11.3/s |
+ | 5 | [Qwen3.5-2B SFT v7](https://huggingface.co/Denali-AI/qwen35-2b-sft-merged) | Qwen3.5-VL | 2B | SFT | **63.7%** | 58.9% | 100% | 11.6/s |
+ | 6 | [Qwen3.5-35B GPTQ-Int4](https://huggingface.co/Denali-AI/qwen35-35b-a3b-gptq-int4) | Qwen3.5 MoE | 35B (3B active) | Zero-shot | **50.7%** | 48.7% | 14% | 1.6/s |
+ | 7 | Qwen3.5-9B NVFP4 v10 | Qwen3.5-VL | 9B | Zero-shot | **47.0%** | 46.0% | 8% | 1.7/s |
+ | 8 | Qwen3.5-2B NVFP4 v10 | Qwen3.5-VL | 2B | Zero-shot | **42.9%** | 42.9% | 0% | 4.0/s |
+
+ ---
+
+ ## Task Definition
+
+ Given a single garment image, the model must extract **9 structured attributes** as a valid JSON object:
+
+ ```json
+ {
+   "type": "t-shirt",
+   "color": "navy blue",
+   "pattern": "solid",
+   "neckline": "crew neck",
+   "sleeve_length": "short sleeve",
+   "closure": "pullover",
+   "brand": "Nike",
+   "size": "M",
+   "defect_type": "small hole on left shoulder"
+ }
+ ```
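
As an illustration of the task contract above, the following minimal sketch checks whether a raw model output parses as JSON and carries all nine fields. The field names come from the example object; the helper name `parse_prediction` and its strictness rules (reject non-objects and missing keys) are assumptions made for this example, not the project's actual validation code.

```python
import json

# Field names taken from the example JSON object above.
REQUIRED_FIELDS = (
    "type", "color", "pattern", "neckline", "sleeve_length",
    "closure", "brand", "size", "defect_type",
)


def parse_prediction(raw):
    """Return the parsed attribute dict, or None if the output is unusable.

    Hypothetical helper: the rejection rules below are assumptions.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(obj, dict):
        return None  # valid JSON, but not an object
    if any(field not in obj for field in REQUIRED_FIELDS):
        return None  # one or more of the 9 attributes is missing
    return obj
```

A caller could count the share of non-`None` results over a batch of outputs to obtain a parse rate comparable in spirit to the JSON% column, assuming the same strictness.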
+
+ ### Field Importance Weights
+
+ Not all fields are equally important. The weighted score uses domain-specific multipliers:
+
+ ![Field Weights](https://huggingface.co/Denali-AI/org-assets/resolve/main/field_weights.png)
+
+ | Field | Weight | Rationale |
+ |-------|:------:|-----------|
+ | **Type** | 2.5x | Critical for inventory routing and categorization |
+ | **Defect** | 2.0x | Directly impacts quality control and pricing |
+ | **Brand** | 1.5x | Essential for authentication and valuation |
+ | **Size** | 1.5x | Required for accurate listing and search |
+ | Color, Pattern, Neckline, Sleeve, Closure | 1.0x | Standard descriptive attributes |
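
The table gives the multipliers but not the aggregation formula. One plausible reading, sketched below, is a weighted mean of per-field scores normalized by the sum of the weights (12.5). The normalization, the function name `weighted_score`, and the mapping of table rows onto JSON keys are assumptions for illustration only.

```python
# Multipliers from the table above, keyed by the JSON field names
# (the mapping "Defect" -> "defect_type", "Sleeve" -> "sleeve_length" is assumed).
FIELD_WEIGHTS = {
    "type": 2.5,
    "defect_type": 2.0,
    "brand": 1.5,
    "size": 1.5,
    "color": 1.0,
    "pattern": 1.0,
    "neckline": 1.0,
    "sleeve_length": 1.0,
    "closure": 1.0,
}


def weighted_score(field_scores):
    """Weighted mean of per-field scores in [0, 1].

    Assumption: scores are normalized by the total weight, and a field
    missing from `field_scores` counts as 0.
    """
    total_weight = sum(FIELD_WEIGHTS.values())
    weighted_sum = sum(
        weight * field_scores.get(field, 0.0)
        for field, weight in FIELD_WEIGHTS.items()
    )
    return weighted_sum / total_weight
```

Under this reading, a prediction that gets only `type` right scores 2.5 / 12.5 = 0.2, which shows how heavily a single high-weight field can move the aggregate.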
+
+ ---
+
+ ## Key Results
+
+ ### Per-Field Performance
+
+ ![Radar Comparison](https://huggingface.co/Denali-AI/org-assets/resolve/main/radar_comparison.png)
+
+ ![Performance Heatmap](https://huggingface.co/Denali-AI/org-assets/resolve/main/heatmap.png)
+
+ ### Accuracy vs Throughput
+
+ ![Throughput Analysis](https://huggingface.co/Denali-AI/org-assets/resolve/main/throughput_scatter.png)
+
+ **Key finding:** Qwen3-VL-2B v9 has both the highest weighted score (89.5%) and the highest throughput (15.9 samples/s) of all evaluated models, so it dominates the accuracy-throughput trade-off and is the Pareto-optimal choice for production deployment.
+
+ ### Structured Output Reliability
+
+ ![JSON Parse Rates](https://huggingface.co/Denali-AI/org-assets/resolve/main/json_parse.png)
+
+ Fine-tuned models achieve **100% JSON parse rate**, while zero-shot baselines (GPTQ, NVFP4) fail to produce valid JSON in 86-100% of cases. This demonstrates that **SFT is essential** for teaching structured output format, regardless of model scale.
+
+ ### Impact of Training Stages
+
+ ![Training Impact](https://huggingface.co/Denali-AI/org-assets/resolve/main/training_impact.png)
+
+ **Left panel:** Adding GRPO+GTPO to Qwen3.5-2B improves brand recognition from 15.6% to 24.8% and defect detection from 89.5% to 95.1%, for an overall gain of 1.6 percentage points in weighted score (63.7% to 65.3%).
+
+ **Right panel:** FP8 quantization of InternVL3-2B shows <1% accuracy degradation across all fields while reducing memory footprint, confirming FP8 as a practical deployment optimization.
+
+ ---
+
+ ## Model Collections
+
+ ### By Architecture
+
+ | Collection | Models | Description |
+ |------------|:------:|-------------|
+ | [**Qwen3-VL**](https://huggingface.co/collections/Denali-AI/qwen3-vl-models-69c70950fca01f437228c29b) | 1 | Top-performing Qwen3-VL based models |
+ | [**Qwen3.5-VL**](https://huggingface.co/collections/Denali-AI/qwen35-vl-models-69c70802ab21ae73a116cc92) | 7 | Qwen3.5-VL models (0.8B to 122B) |
+ | [**InternVL3**](https://huggingface.co/collections/Denali-AI/internvl3-models-69c70803ab21ae73a116cca2) | 5 | InternVL3 models (1B, 2B) |
+ | [**Florence-2**](https://huggingface.co/collections/Denali-AI/florence-2-models-69c70802f1456fd2264216e8) | 3 | Florence-2 encoder-decoder models |
+ | [**Benchmarks**](https://huggingface.co/collections/Denali-AI/benchmarks-and-datasets-69c708037d77aba79963c1a7) | 2 | Evaluation and training datasets |
+
+ ---
+
+ ## Training Pipeline
+
+ All fine-tuned models follow the **Denali-AI Multi-Stage RL Pipeline**:
+
+ ```
+ ┌────────────────────────────────────────────────────────┐
+ │              Denali-AI Training Pipeline               │
+ └────────────────────────────────────────────────────────┘
+                            │
+      ┌─────────────────────┼─────────────────────┐
+      ▼                     ▼                     ▼
+ ┌──────────┐        ┌──────────────┐      ┌──────────────┐
+ │ Stage 1  │        │   Stage 2    │      │   Stage 3    │
+ │   SFT    │───────▶│     GRPO     │─────▶│     GTPO     │
+ │  (LoRA)  │        │  (Rewards)   │      │ (Trajectory) │
+ └──────────┘        └──────────────┘      └──────────────┘
+      │                     │                     │
+ JSON format         Field accuracy          Coherence &
+ acquisition          optimization         regularization
+ ```
+
+ ### Stage 1: Supervised Fine-Tuning (SFT)
+
+ - **Method:** LoRA (r=16, alpha=32) on frozen base model
+ - **Data:** [train-10k-balanced-v3](https://huggingface.co/datasets/Denali-AI/train-10k-balanced-v3) (10,000 curated samples)
+ - **Objective:** Teach valid JSON output format and basic field extraction
+ - **Key outcome:** 100% JSON parse rate
+
+ ### Stage 2: Group Relative Policy Optimization (GRPO)
+
+ - **Method:** Reward-based RL without a critic model
+ - **Reward engine:** 3-layer scoring system
+   - Layer 1: JSON validity gate (binary)
+   - Layer 2: Structural correctness (20% weight)
+   - Layer 3: Per-field content accuracy (80% weight)
+ - **Key outcome:** Improved field-level accuracy, especially for challenging fields
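
The Stage 2 bullets above can be sketched roughly as follows. This is an illustrative reading, not the project's reward code: the structural score (share of expected keys present), the `field_scorer` callback, and the use of population standard deviation are all assumptions. The group-normalized advantage follows the usual GRPO formulation, in which each sampled completion is scored relative to the other completions for the same prompt, which is why no critic model is needed.

```python
import json
import statistics

REQUIRED_FIELDS = (
    "type", "color", "pattern", "neckline", "sleeve_length",
    "closure", "brand", "size", "defect_type",
)


def reward(raw_output, field_scorer):
    """Three-layer reward in [0, 1].

    `field_scorer` is a hypothetical callback mapping the parsed dict to a
    content-accuracy score in [0, 1] (for example, a field-weighted similarity).
    """
    # Layer 1: binary JSON validity gate.
    try:
        pred = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(pred, dict):
        return 0.0
    # Layer 2: structural correctness (assumed: share of expected keys present).
    structure = sum(f in pred for f in REQUIRED_FIELDS) / len(REQUIRED_FIELDS)
    # Layer 3: per-field content accuracy.
    content = field_scorer(pred)
    return 0.2 * structure + 0.8 * content


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption
    return [(r - mean) / (std + eps) for r in rewards]
```

Because Layer 1 is a gate rather than a weighted term, an unparseable completion always receives zero reward, so within a group it is pushed below any completion that at least produces valid JSON with some correct content.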
+
+ ### Stage 3: Group-relative Trajectory-based Policy Optimization (GTPO)
+
+ - **Method:** Conflict-aware gradient optimization with entropy regularization
+ - **Key outcome:** Trajectory-level coherence and reduced field-level conflicts
+
+ ---
+
+ ## Evaluation Methodology
+
+ ### Benchmark
+
+ All models are evaluated on [**eval_hard_3500**](https://huggingface.co/datasets/Denali-AI/eval-hard-3500), a curated benchmark of 3,500 challenging garment images selected for diversity in:
+ - Garment type (tops, bottoms, dresses, outerwear, accessories)
+ - Visual complexity (patterns, prints, multi-color)
+ - Edge cases (ambiguous attributes, partially visible labels)
+
+ ### Metrics
+
+ We use a **multi-metric evaluation framework** rather than relying on exact match:
+
+ | Metric | Model | Description |
+ |--------|-------|-------------|
+ | **SBERT Cosine** | all-MiniLM-L6-v2 | Semantic similarity via sentence embeddings |
+ | **NLI Score** | nli-MiniLM2-L6-H768 | Natural language inference entailment |
+ | **Levenshtein Ratio** | - | Fuzzy string similarity based on edit distance |
+ | **Token F1** | - | Token-level precision and recall |
+ | **SBERT+NLI Combined** | - | Primary metric: average of SBERT cosine and NLI |
+ | **Weighted Score** | - | Field-weighted aggregate (see weights above) |
+
+ This multi-metric approach captures semantic similarity rather than requiring exact string matches, which is critical for fields like color ("navy blue" vs "dark blue") and defect descriptions.
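
To make the string-level metrics concrete, here is a minimal standard-library sketch of Token F1, a Levenshtein-based similarity, and the SBERT+NLI average. The tokenization (lowercased whitespace split) and the normalization `1 - distance / max(len)` are assumptions; other Levenshtein "ratio" definitions exist and the project may use a different one. The SBERT and NLI scores themselves require the listed models and are taken here as given inputs.

```python
from collections import Counter


def token_f1(pred, ref):
    """Harmonic mean of token precision and recall (whitespace tokens, lowercased)."""
    pred_tokens = pred.lower().split()
    ref_tokens = ref.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def levenshtein_ratio(a, b):
    """Similarity in [0, 1]: 1 - edit_distance / max(len(a), len(b)) (assumed form)."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def sbert_nli_combined(sbert_cosine, nli_score):
    """Average of the two semantic scores, as described in the metrics table."""
    return (sbert_cosine + nli_score) / 2
```

For the color example in the text, `token_f1("navy blue", "dark blue")` is 0.5: exact match would score the pair as a miss, while token overlap gives partial credit and the embedding-based metrics are meant to capture the remaining semantic closeness.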
+
+ ### Evaluation Protocol
+
+ - **Inference:** 8 concurrent workers via OpenAI-compatible API (vLLM)
+ - **Samples:** All 3,500 samples, no subsampling
+ - **Compute:** NVIDIA RTX PRO 6000 Blackwell (98 GB VRAM)
+ - **Reproducibility:** Fixed prompts, greedy decoding (temperature=0)
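
The protocol above could be driven by a loop along these lines. This is a sketch only: `predict` is a hypothetical callable that would wrap one request to the OpenAI-compatible endpoint with temperature 0, and the function names are invented for the example. The request itself is left out so the snippet stays self-contained.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def evaluate_concurrently(samples, predict, max_workers=8):
    """Run `predict` over every sample with a fixed-size worker pool.

    `predict` is a hypothetical function (sample -> raw model output string).
    Results are returned in the same order as `samples`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(predict, samples))


def json_parse_rate(outputs):
    """Share of raw outputs that parse as JSON."""
    if not outputs:
        return 0.0
    parsed = 0
    for raw in outputs:
        try:
            json.loads(raw)
            parsed += 1
        except json.JSONDecodeError:
            pass
    return parsed / len(outputs)
```

Threads are a reasonable fit here because each worker spends most of its time waiting on the inference server, and `pool.map` keeps outputs aligned with the input order so they can be scored against references by index.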
+
+ ---
+
+ ## Key Findings
+
+ 1. **Architecture matters more than scale.** The 2B Qwen3-VL (89.5%) outperforms the 35B Qwen3.5 MoE (50.7%) by a wide margin, largely due to the zero-shot model's inability to produce valid JSON.
+
+ 2. **SFT is non-negotiable for structured output.** All fine-tuned models achieve 100% JSON parse rate, while the zero-shot models produce parseable JSON for only 0-14% of samples. No amount of model scale compensates for the lack of format training.
+
+ 3. **RL provides meaningful but modest gains.** GRPO+GTPO adds 1.6 percentage points of weighted score over SFT-only for Qwen3.5-2B (63.7% to 65.3%), with the largest gains on brand (+9.2pp) and defect (+5.6pp).
+
+ 4. **FP8 quantization is effectively free.** InternVL3-2B loses <1% accuracy with FP8 while gaining 21% throughput (14.3 vs 11.8 samples/s).
+
+ 5. **Brand and size are the hardest fields.** Even the best model (v9) achieves only 89.3% on brand and 95.8% on size, while defect detection reaches 97.2%.
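
The deltas quoted in the findings can be re-derived from the reported figures. The short sketch below distinguishes a relative gain (percent of the baseline) from a difference in percentage points; the helper names are introduced here for illustration only.

```python
def relative_gain_pct(new, old):
    """Relative change of `new` over `old`, in percent of `old`."""
    return (new - old) / old * 100


def point_gain(new, old):
    """Absolute difference between two percentages, in percentage points."""
    return new - old


# FP8 vs full-precision InternVL3-2B throughput (samples/s): roughly a 21% gain.
fp8_throughput_gain = relative_gain_pct(14.3, 11.8)

# Qwen3.5-2B, SFT-only vs SFT+GRPO+GTPO, in percentage points.
brand_gain = point_gain(24.8, 15.6)     # about 9.2
defect_gain = point_gain(95.1, 89.5)    # about 5.6
weighted_gain = point_gain(65.3, 63.7)  # about 1.6
```

Keeping the two kinds of gain separate matters when comparing findings 3 and 4: the RL improvement is an absolute difference in score, whereas the FP8 figure is a relative speed-up.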
+
+ ---
+
+ ## Datasets
+
+ | Dataset | Samples | Purpose | Link |
+ |---------|:-------:|---------|------|
+ | **eval_hard_3500** | 3,500 | Evaluation benchmark (hard subset) | [Link](https://huggingface.co/datasets/Denali-AI/eval-hard-3500) |
+ | **train_10k_balanced_v3** | 10,000 | Training data (balanced sampling) | [Link](https://huggingface.co/datasets/Denali-AI/train-10k-balanced-v3) |
+
+ ---
+
+ ## Citation
+
+ ```bibtex
+ @misc{denali-ai-2026,
+   title={Structured Garment Attribute Extraction via Multi-Stage Reinforcement Learning},
+   author={Denali AI},
+   year={2026},
+   publisher={HuggingFace},
+   url={https://huggingface.co/Denali-AI}
+ }
+ ```
+
+ ## License
+
+ All models and datasets are released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
+
+ ## Contact
+
+ - **Organization:** [Denali Advanced Integration](https://denaliai.com)
+ - **Issues:** [GitHub](https://github.com/Denali-AI)
+ - **HuggingFace:** [Denali-AI](https://huggingface.co/Denali-AI)