Spaces:

Denali-AI
/

README

Configuration error

App Files Files Community

msudharsanan commited on 8 days ago

Commit

dccd089

verified ·

1 Parent(s): 20903ae

Update README.md

Browse files

Files changed (1) hide show

README.md +241 -7

README.md CHANGED Viewed

@@ -1,10 +1,244 @@
 ---
-title: README
-emoji: 🔥
-colorFrom: yellow
-colorTo: blue
-sdk: static
-pinned: false
 ---
-Edit this `README.md` markdown file to author your organization card.

 ---
+title: Denali AI
+short_description: Vision-Language Models for Garment Classification
 ---
+# Denali AI — Vision-Language Models for Garment Classification
+<div align="center">
+**Advancing structured attribute extraction from garment images through multi-stage reinforcement learning**
+[![Models](https://img.shields.io/badge/Models-16-blue)](https://huggingface.co/Denali-AI)
+[![Benchmark](https://img.shields.io/badge/Benchmark-3%2C500_samples-green)](https://huggingface.co/datasets/Denali-AI/eval-hard-3500)
+[![License](https://img.shields.io/badge/License-Apache_2.0-orange)](https://www.apache.org/licenses/LICENSE-2.0)
+[![Best Score](https://img.shields.io/badge/Best_Weighted_Score-89.5%25-brightgreen)](https://huggingface.co/Denali-AI/qwen3-vl-2b-sft-grpo-v9)
+</div>
+---
+## Abstract
+Denali AI develops and benchmarks vision-language models (VLMs) for **structured garment attribute extraction** — the task of analyzing a garment image and producing a complete JSON object describing 9 key attributes: type, color, pattern, neckline, sleeve length, closure, brand, size, and defect type.
+We systematically evaluate the impact of **supervised fine-tuning (SFT)**, **Group Relative Policy Optimization (GRPO)**, and **Group-relative Trajectory-based Policy Optimization (GTPO)** across multiple model architectures (Qwen3-VL, Qwen3.5-VL, InternVL3, Florence-2) and scales (0.8B to 122B parameters). Our best model, **Qwen3-VL-2B SFT+GRPO v9**, achieves **89.5% weighted score** with **100% JSON parse rate** on the eval_hard_3500 benchmark.
+---
+## Leaderboard
+![Model Leaderboard](https://huggingface.co/Denali-AI/org-assets/resolve/main/leaderboard.png)
+| Rank | Model | Architecture | Params | Training | Weighted | SBERT+NLI | JSON% | Throughput |
+|:----:|-------|-------------|:------:|----------|:--------:|:---------:|:-----:|:----------:|
+| 1 | **[Qwen3-VL-2B SFT+GRPO v9](https://huggingface.co/Denali-AI/qwen3-vl-2b-sft-grpo-v9)** | Qwen3-VL | 2B | SFT+GRPO | **89.5%** | 78.5% | 100% | 15.9/s |
+| 2 | [InternVL3-2B GRPO+GTPO Full](https://huggingface.co/Denali-AI/internvl3-2b-grpo-gtpo-full) | InternVL3 | 2B | GRPO+GTPO | **72.7%** | 64.3% | 100% | 11.8/s |
+| 3 | [InternVL3-2B GRPO+GTPO FP8](https://huggingface.co/Denali-AI/internvl3-2b-grpo-gtpo-fp8) | InternVL3 | 2B | GRPO+GTPO | **72.2%** | 63.8% | 100% | 14.3/s |
+| 4 | [Qwen3.5-2B SFT+GRPO+GTPO v8](https://huggingface.co/Denali-AI/qwen35-2b-sft-grpo-gtpo-merged) | Qwen3.5-VL | 2B | SFT+GRPO+GTPO | **65.3%** | 60.1% | 100% | 11.3/s |
+| 5 | [Qwen3.5-2B SFT v7](https://huggingface.co/Denali-AI/qwen35-2b-sft-merged) | Qwen3.5-VL | 2B | SFT | **63.7%** | 58.9% | 100% | 11.6/s |
+| 6 | [Qwen3.5-35B GPTQ-Int4](https://huggingface.co/Denali-AI/qwen35-35b-a3b-gptq-int4) | Qwen3.5 MoE | 35B (3B) | Zero-shot | **50.7%** | 48.7% | 14% | 1.6/s |
+| 7 | Qwen3.5-9B NVFP4 v10 | Qwen3.5-VL | 9B | Zero-shot | **47.0%** | 46.0% | 8% | 1.7/s |
+| 8 | Qwen3.5-2B NVFP4 v10 | Qwen3.5-VL | 2B | Zero-shot | **42.9%** | 42.9% | 0% | 4.0/s |
+---
+## Task Definition
+Given a single garment image, the model must extract **9 structured attributes** as a valid JSON object:
+```json
+{
+  "type": "t-shirt",
+  "color": "navy blue",
+  "pattern": "solid",
+  "neckline": "crew neck",
+  "sleeve_length": "short sleeve",
+  "closure": "pullover",
+  "brand": "Nike",
+  "size": "M",
+  "defect_type": "small hole on left shoulder"
+}
+```
+### Field Importance Weights
+Not all fields are equally important. The weighted score uses domain-specific multipliers:
+![Field Weights](https://huggingface.co/Denali-AI/org-assets/resolve/main/field_weights.png)
+| Field | Weight | Rationale |
+|-------|:------:|-----------|
+| **Type** | 2.5x | Critical for inventory routing and categorization |
+| **Defect** | 2.0x | Directly impacts quality control and pricing |
+| **Brand** | 1.5x | Essential for authentication and valuation |
+| **Size** | 1.5x | Required for accurate listing and search |
+| Color, Pattern, Neckline, Sleeve, Closure | 1.0x | Standard descriptive attributes |
+---
+## Key Results
+### Per-Field Performance
+![Radar Comparison](https://huggingface.co/Denali-AI/org-assets/resolve/main/radar_comparison.png)
+![Performance Heatmap](https://huggingface.co/Denali-AI/org-assets/resolve/main/heatmap.png)
+### Accuracy vs Throughput
+![Throughput Analysis](https://huggingface.co/Denali-AI/org-assets/resolve/main/throughput_scatter.png)
+**Key finding:** Qwen3-VL-2B v9 achieves the best accuracy-throughput trade-off at 89.5% weighted score and 15.9 samples/s — making it the Pareto-optimal choice for production deployment.
+### Structured Output Reliability
+![JSON Parse Rates](https://huggingface.co/Denali-AI/org-assets/resolve/main/json_parse.png)
+Fine-tuned models achieve **100% JSON parse rate**, while zero-shot baselines (GPTQ, NVFP4) fail to produce valid JSON in 86-100% of cases. This demonstrates that **SFT is essential** for teaching structured output format, regardless of model scale.
+### Impact of Training Stages
+![Training Impact](https://huggingface.co/Denali-AI/org-assets/resolve/main/training_impact.png)
+**Left panel:** Adding GRPO+GTPO to Qwen3.5-2B improves brand recognition from 15.6% to 24.8% and defect detection from 89.5% to 95.1%, with a +1.6% overall gain.
+**Right panel:** FP8 quantization of InternVL3-2B shows <1% accuracy degradation across all fields while reducing memory footprint, confirming FP8 as a practical deployment optimization.
+---
+## Model Collections
+### By Architecture
+| Collection | Models | Description |
+|------------|:------:|-------------|
+| [**Qwen3-VL**](https://huggingface.co/collections/Denali-AI/qwen3-vl-models-69c70950fca01f437228c29b) | 1 | Top-performing Qwen3-VL based models |
+| [**Qwen3.5-VL**](https://huggingface.co/collections/Denali-AI/qwen35-vl-models-69c70802ab21ae73a116cc92) | 7 | Qwen3.5-VL models (0.8B to 122B) |
+| [**InternVL3**](https://huggingface.co/collections/Denali-AI/internvl3-models-69c70803ab21ae73a116cca2) | 5 | InternVL3 models (1B, 2B) |
+| [**Florence-2**](https://huggingface.co/collections/Denali-AI/florence-2-models-69c70802f1456fd2264216e8) | 3 | Florence-2 encoder-decoder models |
+| [**Benchmarks**](https://huggingface.co/collections/Denali-AI/benchmarks-and-datasets-69c708037d77aba79963c1a7) | 2 | Evaluation and training datasets |
+---
+## Training Pipeline
+All fine-tuned models follow the **Denali-AI Multi-Stage RL Pipeline**:
+```
+                    ┌─────────────────────────────────────────────────┐
+                    │           Denali-AI Training Pipeline            │
+                    └─────────────────────────────────────────────────┘
+                                          │
+                    ┌─────────────────────┼─────────────────────┐
+                    ▼                     ▼                     ▼
+              ┌──────────┐        ┌──────────────┐      ┌──────────────┐
+              │  Stage 1  │        │   Stage 2    │      │   Stage 3    │
+              │   SFT     │───────▶│    GRPO      │─────▶│    GTPO      │
+              │  (LoRA)   │        │  (Rewards)   │      │ (Trajectory) │
+              └──────────┘        └──────────────┘      └──────────────┘
+                    │                     │                     │
+              JSON format          Field accuracy         Coherence &
+              acquisition          optimization           regularization
+```
+### Stage 1: Supervised Fine-Tuning (SFT)
+- **Method:** LoRA (r=16, alpha=32) on frozen base model
+- **Data:** [train-10k-balanced-v3](https://huggingface.co/datasets/Denali-AI/train-10k-balanced-v3) — 10,000 curated samples
+- **Objective:** Teach valid JSON output format and basic field extraction
+- **Key outcome:** 100% JSON parse rate
+### Stage 2: Group Relative Policy Optimization (GRPO)
+- **Method:** Reward-based RL without a critic model
+- **Reward engine:** 3-layer scoring system
+  - Layer 1: JSON validity gate (binary)
+  - Layer 2: Structural correctness (20% weight)
+  - Layer 3: Per-field content accuracy (80% weight)
+- **Key outcome:** Improved field-level accuracy, especially for challenging fields
+### Stage 3: Group-relative Trajectory-based Policy Optimization (GTPO)
+- **Method:** Conflict-aware gradient optimization with entropy regularization
+- **Key outcome:** Trajectory-level coherence and reduced field-level conflicts
+---
+## Evaluation Methodology
+### Benchmark
+All models are evaluated on [**eval_hard_3500**](https://huggingface.co/datasets/Denali-AI/eval-hard-3500) — a curated benchmark of 3,500 challenging garment images selected for diversity in:
+- Garment type (tops, bottoms, dresses, outerwear, accessories)
+- Visual complexity (patterns, prints, multi-color)
+- Edge cases (ambiguous attributes, partially visible labels)
+### Metrics
+We employ a **comprehensive multi-metric evaluation framework** rather than relying on exact match:
+| Metric | Model | Description |
+|--------|-------|-------------|
+| **SBERT Cosine** | all-MiniLM-L6-v2 | Semantic similarity via sentence embeddings |
+| **NLI Score** | nli-MiniLM2-L6-H768 | Natural language inference entailment |
+| **Levenshtein Ratio** | — | Fuzzy string matching distance |
+| **Token F1** | — | Token-level precision and recall |
+| **SBERT+NLI Combined** | — | Primary metric: average of SBERT cosine and NLI |
+| **Weighted Score** | — | Field-weighted aggregate (see weights above) |
+This multi-metric approach captures semantic similarity rather than requiring exact string matches, which is critical for fields like color ("navy blue" vs "dark blue") and defect descriptions.
+### Evaluation Protocol
+- **Inference:** 8 concurrent workers via OpenAI-compatible API (vLLM)
+- **Samples:** All 3,500 samples, no subsampling
+- **Compute:** NVIDIA RTX PRO 6000 Blackwell (98 GB VRAM)
+- **Reproducibility:** Fixed prompts, deterministic sampling (temperature=0)
+---
+## Key Findings
+1. **Architecture matters more than scale.** The 2B Qwen3-VL (89.5%) outperforms the 35B Qwen3.5 MoE (50.7%) by a wide margin, largely due to the zero-shot model's inability to produce valid JSON.
+2. **SFT is non-negotiable for structured output.** All fine-tuned models achieve 100% JSON parse rate; all zero-shot models fail at 0-14%. No amount of model scale compensates for the lack of format training.
+3. **RL provides meaningful but modest gains.** GRPO+GTPO adds +1.6% weighted score over SFT-only for Qwen3.5-2B, with the largest gains on brand (+9.2pp) and defect (+5.6pp).
+4. **FP8 quantization is effectively free.** InternVL3-2B loses <1% accuracy with FP8, while gaining 21% throughput improvement (11.8 vs 14.3 samples/s).
+5. **Brand and size are the hardest fields.** Even the best model (v9) achieves only 89.3% on brand and 95.8% on size, while defect detection reaches 97.2%.
+---
+## Datasets
+| Dataset | Samples | Purpose | Link |
+|---------|:-------:|---------|------|
+| **eval_hard_3500** | 3,500 | Evaluation benchmark (hard subset) | [Link](https://huggingface.co/datasets/Denali-AI/eval-hard-3500) |
+| **train_10k_balanced_v3** | 10,000 | Training data (balanced sampling) | [Link](https://huggingface.co/datasets/Denali-AI/train-10k-balanced-v3) |
+---
+## Citation
+```bibtex
+@misc{denali-ai-2026,
+  title={Structured Garment Attribute Extraction via Multi-Stage Reinforcement Learning},
+  author={Denali AI},
+  year={2026},
+  publisher={HuggingFace},
+  url={https://huggingface.co/Denali-AI}
+}
+```
+## License
+All models and datasets are released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
+## Contact
+- **Organization:** [Denali Advanced Integration](https://denaliai.com)
+- **Issues:** [GitHub](https://github.com/Denali-AI)
+- **HuggingFace:** [Denali-AI](https://huggingface.co/Denali-AI)